If you are building a RAG (Retrieval-Augmented Generation) system, you have likely hit this wall: Everything works… until it doesn’t. General-purpose embedding models are trained to understand the internet; not your contracts, manufacturing logs, proprietary chemical formulations or internal taxonomy. They capture broad semantic similarity, but they do not understand the fine-grained distinctions that matter in your domain. Fine-tuning an embedding model can improve the performance of your retrieval pipeline when off-the-shelf models fail to effectively capture domain-specific nuances. Despite how critical embeddings are to RAG performance, the process remains surprisingly fragmented, the skills required are specialized, and the time investment is daunting.
With a single GPU and less than a day of training time, you can transform a general-purpose embedding model into one that truly understands your domain, no manual labeling required. To help you hit the ground running, we are also releasing a ready-to-use synthetic training dataset generated from NVIDIA's public documentation using this exact pipeline. Using this data and the recipe, we saw over 10% improvement in both Recall@10 and NDCG@10. Atlassian applied this recipe to fine-tune on their JIRA dataset, increasing Recall@60 from 0.751 to 0.951, a 26% improvement - on a single GPU.
By the end of this post, you’ll know how to answer:
📄 Generate training data from domain documents without labeled data
🎯 Use hard negative mining for effective contrastive training
🔗 Improve embedding quality with multi-hop queries
⚙️ Fine-tune a bi-encoder embedding model
📊 Evaluate whether fine-tuning improves retrieval
🚀 Deploy the fine-tuned model in your pipeline
In this tutorial, we will finetune the base model Llama-Nemotron-Embed-1B-v2 - a 1-billion-parameter embedding model that balances quality and inference cost. To get started, follow this setup guide.
Fine-tuning an embedding model requires thousands of (query, relevant document) pairs. Most use cases don’t have this data readily available. Creating it manually is expensive, slow, and often biased by the annotator’s personal interpretation of what’s “relevant.”
Instead of labeling data by hand, you can use an LLM (nvidia/nemotron-3-nano-30b-a3b) to read your documents and automatically generate high-quality synthetic question–answer pairs.
nemotron embed sdg -c default corpus_dir=./data/my_domain_docs
Behind the scenes, this runs a four-stage synthetic data generation (SDG) pipeline powered by NeMo Data Designer:
Source document chunk:
The thermal design power (TDP) of the H100 GPU is 700W in SXM form factor. The cooling solution must maintain junction temperature below 83°C under sustained workloads. Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.
Generated QA pairs:
{
"question": "What cooling approach is recommended when deploying more than 4 H100 GPUs per server node?",
"answer": "Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.",
"query_type": "contextual",
"reasoning_type": "factual",
"question_complexity": 3,
"segment_ids": [1],
"quality_score": 8.5
}
{
"question": "How does the 700W TDP of the H100 SXM constrain the choice between air and liquid cooling in multi-GPU configurations?",
"answer": "The 700W TDP generates substantial heat that must be dissipated to keep junction temperatures below 83°C. In dense configurations exceeding 4 GPUs per node, air cooling in standard 2U chassis cannot handle this thermal load, making liquid cooling necessary.",
"query_type": "multi_hop",
"reasoning_type": "causal",
"question_complexity": 4,
"segment_ids": [1, 2],
"hop_count": 2,
"quality_score": 9.0
}
Notice the difference: the first question is a simple factual lookup. The second requires multi-hop, causal reasoning. The pipeline generates both types, with configurable complexity levels (2–5) and hop counts (1–3). Each QA pair then undergoes quality evaluation, receiving sub-scores for relevance, accuracy, context support, and clarity, along with an overall score. Only pairs that meet the threshold are included in training.
If you train an embedding model with only positive pairs (query + correct document), it learns to distinguish obviously different documents but fails on the hard cases — passages that look relevant but are not the right answer. In a real retrieval system, these near-misses are exactly the documents that cause bad answers. Hard negative mining finds these confusing passages so the model can learn to tell them apart.
nemotron embed prep -c default
The above command runs three sub-steps automatically:
The generated QA pairs are split into training (80%) and test (20%) sets. The test set is formatted as a BEIR-compatible benchmark for standardized evaluation in Step 5.
Using the base embedding model, the pipeline:
The result: hard negatives are the most similar non-positive passages that still fall safely below the positive-score ceiling. They are passages the current model considers highly relevant but that are not the labeled answer.
Why this works: Training on easy negatives (completely unrelated passages) teaches the model nothing new. Training on hard negatives forces it to learn the subtle distinctions that matter in your domain. For example, in a medical corpus, a question about "metformin dosage for Type 2 diabetes" might have hard negatives about "metformin side effects" or "insulin dosage for Type 1 diabetes" — close but critically different. The 95% margin ceiling prevents the miner from selecting passages that are too close to the positive, which could actually be correct answers that simply weren't labeled during SDG.
Multi-hop questions reference multiple positive documents. For example, a question like "How does the thermal management system in Section 3.2 relate to the power constraints described in Section 5.1?" has two positive passages.
Unrolling creates one training example per (query, positive document) pair, so the contrastive loss sees each positive independently. A question with 2 positive documents becomes 2 training examples, each with the same hard negatives but a different positive.
The final output is a training-ready JSON file:
{
"question_id": "q42_0",
"question": "How does the 700W TDP of the H100 SXM constrain cooling choices in multi-GPU nodes?",
"pos_doc": [{"id": "d_a1b2c3"}],
"neg_doc": [{"id": "d_x7y8z9"}, {"id": "d_m4n5o6"}, {"id": "d_p1q2r3"}, {"id": "d_s4t5u6"}, {"id": "d_v7w8x9"}]
}
Standard embedding fine-tuning generates one question per passage and trains the model to match them. This works for simple factual lookups, but real users ask complex questions that span multiple documents or sections. If the model has only seen single-hop training data, it will struggle to retrieve all the relevant passages for these complex queries.
The SDG pipeline generates questions at 1 to 3 hops by default:
Each hop is tracked with its own context summary and segment IDs, so the training data preserves the full reasoning chain. After unrolling (Step 2c), each (question, relevant passage) pair becomes an independent training signal, teaching the model that all of these passages are relevant to the multi-hop query.
The fine-tuned model learns to retrieve contextually related documents, not just lexically similar ones.
nemotron embed finetune -c default
The training uses a biencoder architecture with contrastive loss.
The temperature of 0.02 is deliberately aggressive, it produces a very sharp probability distribution. This works well because the hard negatives from Step 2 are high-quality: they are genuinely confusing passages that the model needs strong gradients to learn to distinguish.
| Parameter | Default | Notes |
|---|---|---|
| Epochs | 3 | For large dataset, you may lower it to 2 or 1 |
| Learning rate | 1e-5 | Tuning: try double and half of the default value |
| Learning rate warmup steps | 5 | Set to 5-10% of total steps of finetune to have better early training stability |
| Global batch size | 128 | Auto-scaled down for small datasets |
| Passages per query | 5 | 1 positive + 4 hard negatives |
If your dataset has fewer than 2,000 training examples, the pipeline automatically:
This means you can start with a small corpus (50–100 documents) for a quick proof-of-concept and scale up later.
Did fine-tuning actually help? Let’s find out by running a standardized evaluation comparing the base model against the fine-tuned checkpoint on the held-out test set:
nemotron embed eval -c default
The evaluation uses the BEIR framework and computes four standard information retrieval metrics at k = 1, 5, 10, and 100:
A successful fine-tune typically results in a 15% improvement in nDCG@10 and Recall@10 within <1 day.
Results using Retrieval Synthetic NVDocs:
📊 Comparison (Base -> Fine-tuned)
============================================================
NDCG:
NDCG@1: 0.55178 → 0.60796 (+0.05618, +10.2%)
NDCG@5: 0.51894 → 0.57689 (+0.05795, +11.2%)
NDCG@10: 0.55506 → 0.61559 (+0.06053, +10.9%)
NDCG@100: 0.60617 → 0.66567 (+0.05950, +9.8%)
Recall:
Recall@1: 0.28478 → 0.31547 (+0.03069, +10.8%)
Recall@5: 0.54486 → 0.60288 (+0.05802, +10.6%)
Recall@10: 0.62979 → 0.69296 (+0.06317, +10.0%)
Recall@100: 0.81421 → 0.87020 (+0.05599, +6.9%)
The pipeline makes it easy to iterate:
This recipe has been validated on real enterprise data by Atlassian. They applied this pipeline to fine-tune Llama-Nemotron-Embed-1B-v2 on a public Jira dataset using a single NVIDIA A100 80GB GPU, following the same stages described above
Recall@60 jumped from 0.751 to 0.951 — a 26.7% gain.
The fine-tuned model retrieves the correct document within the top 60 results for 95.1% of queries, up from 75.1% with the base model. For a retrieval system underpinning Jira search, this directly translates into more relevant results for millions of users. Find more details in their blog post Advancing semantic search for millions of Rovo users.
A PyTorch checkpoint is great for evaluation but too slow for production. The final two stages convert the model and serve it behind an API.
nemotron embed export -c default
This exports the fine-tuned checkpoint to ONNX (opset 17). Optionally, it compiles a TensorRT engine for maximum inference throughput, with configurable optimization profiles for batch size (1–64) and sequence length (3–256):
# ONNX only (runs anywhere)
nemotron embed export -c default export_to_trt=false
# FP8 quantization for further speedup
nemotron embed export -c default quant_cfg=fp8
The exported model is deployed inside an NVIDIA NIM container — a production-ready inference microservice exposing an OpenAI-compatible /v1/embeddings endpoint:
nemotron embed deploy -c default
Once running, any client can call it:
curl -X POST http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": ["What cooling is needed for 8 H100 GPUs in a 2U chassis?"],
"model": "custom",
"input_type": "query"}'
Because NIM serves an OpenAI-compatible API, you can drop it into any existing RAG pipeline that uses the embeddings API format — no code changes needed.
The pipeline includes a NIM accuracy verification step that runs the same BEIR evaluation against the deployed endpoint:
nemotron embed eval -c default eval_nim=true eval_base=false
This catches any accuracy loss from the ONNX/TensorRT conversion. Metrics that match within tolerance (0.03 for @1, 0.01 for @5+) are marked with a check; deviations beyond conversion noise are flagged.
The full embedding fine-tuning pipeline can be run in six commands, from raw documents to a deployed model.
# 1. Generate synthetic training data from your documents
nemotron embed sdg -c default corpus_dir=./data/my_docs
# 2. Prepare the training data (split data, mine hard negatives, unroll)
nemotron embed prep -c default
# 3. Fine-tune the embedding model
nemotron embed finetune -c default
# 4. Evaluate the base vs. fine-tuned model
nemotron embed eval -c default
# 5. Export the optimized model
nemotron embed export -c default
# 6. Deploy the model
nemotron embed deploy -c default
| Stage | GPU Required? | Estimated Time | Notes |
|---|---|---|---|
| SDG | No (uses API) | ~1 hour | Varies by corpus size and API rate limit |
| Data Prep | Yes (40 GB VRAM) | ~5 min | Hard negative mining on GPU |
| Fine-Tune | Yes (80 GB VRAM) | ~1 hours | Varies by dataset size and epochs |
| Eval | Yes (40 GB VRAM) | ~5 min | |
| Export | Yes (40 GB VRAM) | ~5 min | TensorRT requires NGC container |
| Deploy | Yes (40 GB VRAM) | ~5 min | NIM container startup |
Total: under a day, with most time being hands-off training. For a small corpus (~500 documents), the entire pipeline completes in about 2–3 hours.
The pipeline can run end-to-end, but each stage can also be executed independently depending on your starting point. For example, if you have raw documents, you can begin with synthetic data generation (SDG), while datasets that already include hard negatives can skip earlier steps and go directly to fine-tuning. Since every stage uses standard formats such as JSON, BEIR, and ONNX, it’s easy to integrate custom components or reuse intermediate outputs in other workflows. The recipe is also flexible in how it runs, supporting execution on a local machine, inside Docker containers, or on Slurm-based clusters.
If you have domain documents and some time in your hand, you can generate your first batch of synthetic training data today! The full pipeline - from documents to a deployed, domain-adapted embedding model - runs in under a day on a single GPU. You can start with our ready-made nvidia/Retrieval-Synthetic-NVDocs-v1 dataset to try the pipeline right away. Let us know what you build.
Star the repos for Nemotron, NeMo Data Designer and NeMo Automodel if you find them useful.