Benchmark highlights
Building on Nemotron Nano V2 VL, Nemotron 3 Nano Omni delivers substantial visual gains and adds entirely new audio and video+audio capabilities, while also leading Qwen3-Omni, another open-weights omni model, in many domains.
| Task | Benchmark | Nemotron 3 Nano Omni | Nemotron Nano V2 VL | Qwen3-Omni 30B-A3B |
|---|---|---|---|---|
| Document understanding | OCRBenchV2-En | 65.8 | 61.2 | - |
| | MMLongBench-Doc | 57.5 | 38.0 | 49.5 |
| | CharXiv reasoning | 63.6 | 41.3 | 61.1 |
| GUI | ScreenSpot-Pro | 57.8 | 5.5 | 59.7 |
| | OSWorld | 47.4 | 11.0 | 29.0 |
| Video understanding | Video-MME | 72.2 | 63.0 | 70.5 |
| Video + Audio understanding | WorldSense | 55.4 | - | 54.0 |
| | DailyOmni | 74.1 | - | 73.6 |
| Voice interaction | VoiceBench | 89.4 | - | 88.8 |
| ASR | HF Open ASR (avg. WER, lower is better) | 5.95 | - | 6.55 |
Efficiency highlights
Compared to other open omni models with the same interactivity, Nemotron 3 Nano Omni delivers 7.4x higher system efficiency for multi-document use cases and 9.2x higher system efficiency for video use cases.
Figure 1. Total system throughput for multi-document and video use cases sustained by each model at a fixed per‑user interactivity threshold (tokens/sec/user)
At a high level, Nemotron 3 Nano Omni is aimed at five classes of workloads:
Document understanding is not only about OCR. The model is positioned for long, messy, high-value documents where understanding depends on layout, tables, figures, formulas, section structure, and cross-page references. Think contracts, technical papers, reports, manuals, multi-page forms, or compliance packets. The model can handle 100+ page documents.
Nemotron 3 Nano Omni includes strong speech understanding capabilities that enable high-quality transcription across diverse audio conditions. It handles long-form audio with varying speakers, accents, and background noise. These capabilities can be integrated into broader workflows, allowing spoken content to be transcribed, analyzed, and combined with other modalities for tasks like summarization, question answering, and cross-modal reasoning.
Many enterprise and developer workflows depend on mixed audio and visual evidence: screen recordings with narration, training videos, meetings with slides, tutorials, product demos, customer support captures, and long-form video archives. Nemotron 3 Nano Omni is built to reason over those inputs jointly.
The Nemotron 3 Nano Omni model is specifically trained for agentic computer use, enabling it to assist with tasks in graphical user interface (GUI) environments. Its capabilities include interpreting screenshots, monitoring the state of the user interface, grounding its reasoning in on-screen visuals, and helping with action selection or workflow automation.
The model is designed for more than perception. It excels at reasoning-intensive tasks that require synthesizing information across long context windows, multiple modalities, and structured or semi-structured evidence. It can carry out multi-step reasoning, perform calculations, and connect signals from text, images, tables, and other inputs to arrive at coherent, well-supported answers.
Nemotron 3 Nano Omni uses a unified encoder-projector-decoder design. The language backbone is Nemotron 3 Nano 30B-A3B, paired with the C-RADIOv4-H vision encoder and the Parakeet-TDT-0.6B-v2 audio encoder. The modality-specific encoders connect into the LLM backbone through lightweight projectors.
Figure 2. Model architecture of NVIDIA Nemotron 3 Nano Omni 30B-A3B
The model backbone interleaves three key components: 23 Mamba selective state-space layers for efficient long-context processing; 23 MoE layers with 128 experts, top-6 routing, and a shared expert for conditional capacity; and 6 grouped-query attention layers to preserve strong global interaction and expressivity.
Nemotron 3 Nano Omni combines state-space models, attention, and MoE in a unified design that maintains strong reasoning performance while remaining practical for long, multimodal contexts.
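To make the conditional-capacity idea concrete, here is a minimal PyTorch sketch of a top-k MoE layer with a shared expert, in the spirit of the configuration above (128 experts, top-6 routing, one always-on shared expert). The hidden sizes and routing details are illustrative placeholders, not the model's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Sparse MoE feed-forward layer: each token is routed to its top-k experts,
    and a shared expert additionally processes every token. Illustrative sketch."""

    def __init__(self, d_model=256, d_ff=512, num_experts=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # Shared expert: applied to all tokens regardless of the router's choice.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                                # x: (num_tokens, d_model)
        gate_logits = self.router(x)                     # (num_tokens, num_experts)
        topk_logits, topk_idx = gate_logits.topk(self.top_k, dim=-1)
        topk_weights = F.softmax(topk_logits, dim=-1)    # renormalize over selected experts

        out = self.shared_expert(x)
        for slot in range(self.top_k):
            for expert_id in topk_idx[:, slot].unique():
                mask = topk_idx[:, slot] == expert_id
                w = topk_weights[mask, slot].unsqueeze(-1)
                out[mask] = out[mask] + w * self.experts[int(expert_id)](x[mask])
        return out

tokens = torch.randn(8, 256)
print(TopKMoELayer()(tokens).shape)  # torch.Size([8, 256])
```

Only the selected experts (plus the shared expert) run for each token, which is how the 30B-A3B backbone keeps its active parameter count far below its total parameter count.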
On the vision side, Nemotron 3 Nano Omni replaces the tiling strategy used in the v2 model with dynamic-resolution processing at native aspect ratio. Each image is represented by a variable number of 16 x 16 patches, from a minimum of 1,024 to a maximum of 13,312 visual patches per image. For square images, this is roughly equivalent to 512 x 512 and 1,840 x 1,840 pixels, respectively.
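To make the patch budget concrete, here is a small sketch of how many 16 x 16 patches an image would contribute at native aspect ratio. The rounding and resize policy are illustrative; only the 1,024 to 13,312 budget comes from the description above.

```python
import math

PATCH = 16
MIN_PATCHES, MAX_PATCHES = 1024, 13312   # per-image visual patch budget

def estimate_patch_count(width, height, patch=PATCH):
    """Rough patch count at native aspect ratio, clamped to the model's budget.
    Illustrative only: the exact resize/snap policy may differ."""
    raw = math.ceil(width / patch) * math.ceil(height / patch)
    return max(MIN_PATCHES, min(MAX_PATCHES, raw))

# Square-image sanity checks from the text:
print(estimate_patch_count(512, 512))     # 1024  -> the minimum budget
print(estimate_patch_count(1840, 1840))   # 13225 -> just under the 13312 maximum
print(estimate_patch_count(3000, 1000))   # a wide document page at native aspect ratio
```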
That flexibility is critical for handling high-resolution, complex visual inputs such as OCR-heavy documents, financial tables, slides, research figures, screenshots, and GUI layouts—especially when both fine details and overall structure need to be understood together.
For video, Nemotron 3 Nano Omni uses a dedicated Conv3D tubelet embedding path. Instead of embedding each frame independently, every pair of consecutive frames is fused into a single "tubelet" before the ViT, halving the number of vision tokens the language model has to attend to. This lets us either double the number of frames within the same token budget, or halve the number of tokens for the same number of frames.
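The tubelet path can be sketched as a Conv3D whose temporal kernel and stride span two frames, so each output token covers a 2-frame x 16 x 16 tubelet. The channel and embedding sizes below are placeholders, not the model's actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative tubelet embedding: fuse every pair of consecutive frames into one
# set of patch tokens before the ViT. Embedding dim and patch size are placeholders.
tubelet_embed = nn.Conv3d(
    in_channels=3,
    out_channels=768,             # embedding dim (illustrative)
    kernel_size=(2, 16, 16),      # 2 frames x 16x16 spatial patch
    stride=(2, 16, 16),           # non-overlapping tubelets
)

video = torch.randn(1, 3, 16, 512, 512)   # (batch, channels, frames, H, W)
tokens = tubelet_embed(video)              # (1, 768, 8, 32, 32)

# 16 frames collapse to 8 temporal slots -> half as many vision tokens
# as frame-by-frame embedding at the same frame count.
print(tokens.shape)
```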
Efficient video sampling (EVS) is an important inference-time feature that drops redundant video tokens after the vision encoder, reducing latency and improving throughput while maintaining accuracy. The first frame of the video is kept entirely; for each subsequent frame, EVS keeps the “dynamic” tokens where the video is changing and drops the “static” ones where nothing has changed from the previous frame. We combine this with Conv3D for superior compression: Conv3D fuses tokens from pairs of frames into one, and then EVS prunes the redundant static information.
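A simplified sketch of that pruning logic is shown below. The similarity metric, threshold, and token bookkeeping are illustrative stand-ins, not the production EVS implementation.

```python
import torch

def efficient_video_sampling(frame_tokens, threshold=0.95):
    """Drop 'static' video tokens whose patch embedding barely changes from the
    previous frame. Simplified sketch: cosine similarity against a fixed
    threshold; the production criterion may differ.

    frame_tokens: (num_frames, num_patches, dim) tensor of post-encoder tokens.
    Returns a list of (frame_idx, patch_idx) pairs for the tokens that are kept.
    """
    kept = [(0, p) for p in range(frame_tokens.shape[1])]  # keep frame 0 entirely
    for t in range(1, frame_tokens.shape[0]):
        sim = torch.cosine_similarity(frame_tokens[t], frame_tokens[t - 1], dim=-1)
        dynamic = (sim < threshold).nonzero(as_tuple=True)[0]
        kept.extend((t, int(p)) for p in dynamic)
    return kept

frames = torch.randn(4, 1024, 768)   # 4 frames x 1024 patches x dim 768
frames[2] = frames[1]                # frame 2 is identical -> all of it is static
kept = efficient_video_sampling(frames)
print(len(kept), "of", 4 * 1024, "tokens kept")
```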
The audio side is powered by Parakeet-TDT-0.6B-v2, connected to the backbone through its own 2-layer MLP projector. Audio is sampled at 16 kHz, and the model is trained with inputs up to 1,200 seconds (20 minutes), while the LLM max context length supports 5+ hours.
This represents a shift from traditional VLM pipelines by enabling native audio processing within a shared multimodal sequence, allowing audio, visual, and text tokens to be jointly modeled. This is crucial for scenarios like narrated screen recordings, video Q&A where speech alters visual meaning, long-form instructional or meeting content, and tasks requiring temporally grounded multimodal reasoning.
Each encoder is connected to the LLM with a lightweight 2-layer MLP projector that maps encoder features into the shared embedding space. Once projected, vision, audio, and text tokens are interleaved and processed jointly.
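A 2-layer MLP projector of this kind is straightforward to sketch; the encoder and LLM dimensions below are placeholders rather than the model's actual sizes.

```python
import torch
import torch.nn as nn

# Minimal sketch of a 2-layer MLP projector mapping encoder features into the
# LLM's embedding space. Dimensions and activation are illustrative.
class ModalityProjector(nn.Module):
    def __init__(self, encoder_dim, llm_dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or llm_dim
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, encoder_features):       # (batch, seq, encoder_dim)
        return self.proj(encoder_features)     # (batch, seq, llm_dim)

vision_proj = ModalityProjector(encoder_dim=1280, llm_dim=2048)
audio_proj = ModalityProjector(encoder_dim=1024, llm_dim=2048)

vision_tokens = vision_proj(torch.randn(1, 1024, 1280))
audio_tokens = audio_proj(torch.randn(1, 250, 1024))
print(vision_tokens.shape, audio_tokens.shape)   # both end in the shared llm_dim
```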
This design keeps the overall system modular while still enabling genuine cross-modal reasoning inside the backbone itself.
The SFT stages are trained on NVIDIA H100 GPUs, scaling from 32 to 128 nodes depending on the stage. The stack uses Megatron-LM, Transformer Engine, and Megatron Energon, with tensor parallelism, expert parallelism, sequence parallelism, context parallelism for the long-context stages, online sequence packing, and selective activation recomputation.
Post-SFT reinforcement learning uses NeMo-RL and NeMo Gym with a Megatron backend. The RL infrastructure used a Ray-based distributed setup across B200 and H100 clusters, plus multimodal deduplication, so repeated rollouts do not multiply image, video, and audio memory.
We open-source substantial parts of our training code.
We introduce multi-environment text and omni RL training in Nemotron 3 Nano Omni. The text RL stage runs across diverse environments in NeMo Gym, which evaluate the model's ability to perform sequences of actions, such as tool calling, writing code, and multi-part planning, that satisfy verifiable criteria.
Omni RL trains the model to reason across images, video, audio, and text within a unified framework, covering tasks from single-modality to fully multimodal scenarios. A diverse verifier suite evaluates outputs across formats like multiple-choice, math, GUI grounding, and ASR, while intentionally including unanswerable cases to teach the model to abstain when evidence is insufficient rather than hallucinate.
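As an illustration of how such a verifier can reward abstention, here is a minimal sketch of a reward function for multiple-choice items that includes unanswerable cases; it is not the actual NeMo Gym verifier code.

```python
# Illustrative reward function in the spirit of the verifier suite described
# above: exact-match scoring for multiple-choice items, plus explicit credit
# for abstaining when the item is marked unanswerable.
def multiple_choice_reward(model_answer: str, gold: str, answerable: bool) -> float:
    prediction = model_answer.strip().lower()
    if not answerable:
        # Reward abstention instead of a confident guess on unanswerable items.
        return 1.0 if "not answerable" in prediction else 0.0
    return 1.0 if prediction == gold.strip().lower() else 0.0

print(multiple_choice_reward("B", "b", answerable=True))               # 1.0
print(multiple_choice_reward("Not answerable", "", answerable=False))  # 1.0
print(multiple_choice_reward("C", "", answerable=False))               # 0.0 (hallucinated guess)
```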
Nemotron 3 Nano Omni is trained on an enhanced dataset that emphasizes high-quality reasoning across multiple modalities. We significantly expand task coverage and introduce synthetic data for complex reasoning scenarios where public datasets are limited. To support this, we build task-specific, multi-stage pipelines for scalable synthetic data generation.
As one example, we generated approximately 11.4M synthetic QA pairs (~45B tokens) from a large corpus of real-world PDFs using NeMo Data Designer. This dataset is used to strengthen long-context document reasoning during post-training and delivers a 2.19× improvement in overall accuracy on MMLongBench-Doc.
We detail the full pipeline evolution, including failure analysis and key lessons learned, in our Data Designer developer note. The note also includes nine runnable pipeline recipes that can serve as starting points for building your own document-understanding datasets.
Nemotron 3 Nano Omni can analyze and reason over long documents such as financial reports, academic papers, and product manuals. The following example retrieves financial metrics across a 100+ page document to calculate another metric:
Figure 3. Stylized example from the MMLongBench-Doc benchmark
The model was prompted to answer the question as follows:
Extract information from all the given images, then answer the question using a single word or phrase. Return 'Not answerable' if the answer cannot be derived from the images.
The model is capable of long-context retrieval, structured extraction, table and chart reading, and multi-page reasoning, all in one go.
Nemotron 3 Nano Omni performs joint audio-visual analysis, both locally for specific scenes and globally across the entire video. This enables it to answer complex questions that require cross-modal reasoning—such as identifying specific visuals shown exactly when a certain topic is mentioned in the audio.
Video
Question
Watch the video and listen to the narration.
1. What structure is on fire as shown in the video, and how much money was being spent on its renovation project?
2. Briefly describe what visuals are shown when the eye-witness narrates her experience.
Answer
1. The structure on fire is the Notre Dame Cathedral, which was undergoing an almost $7 million renovation project.
2. When the eye-witness narrates her experience, the visuals show the cathedral engulfed in flames and thick smoke, scaffolding surrounding the burning roof, firefighters using ladders to spray water, a crowd of people watching from the riverbank, and views of the burning structure against the Paris skyline.
Nemotron 3 Nano Omni can be integrated into agentic computer-use systems to reason over user intents, analyze GUI elements, and execute actions to accomplish tasks.
The model was given access to a web environment with the following system prompt:
You are a GUI agent. You are given an instruction, a screenshot of the screen, and your previous interactions with the computer. You need to perform a series of actions to complete the task. The password of the computer is {password}.
For each step, provide your response in this format:
{thought}
## Action:
{action}
## Code:
{code}
In the code section, the code should be either pyautogui code or one of the following functions wrapped in the code block:
- {"name": "computer.wait", "description": "Make the computer wait for 20 seconds for installation, running code, etc.", "parameters": {"type": "object", "properties": {}, "required": []}}
- {"name": "computer.terminate", "description": "Terminate the current task and report its completion status", "parameters": {"type": "object", "properties": {"status": {"type": "string", "enum": ["success", "failure"], "description": "The status of the task"}, "answer": {"type": "string", "description": "The answer of the task"}}, "required": ["status"]}}
User Instruction
Find the Driver License Eligibility Requirements
Model Response:
The GIF above shows the actions that the agent takes to fulfill the user’s request:
- It clicks on the “License & IDs” tab.
- Then it clicks on “Driver’s License” from the dropdown menu.
- Then it clicks on “Driver’s License Eligibility” on the left.
- It scrolls down on this page and back up.

Finally, it produces the following response for the user:
Successfully found the Driver License Eligibility Requirements page on the Virginia DMV website. The page contains comprehensive information about driver license eligibility including:
- General Requirements
- First Time Drivers
- New Residents
- Military Personnel
- Out-of-State Students
- Non-Residents
- Identification and Residency Requirements
Additional resources available: interactive document guide, "Acceptable Documents for Obtaining a Driver's License or Photo ID Card" (DMV 141), and Spanish language version.
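The structured "## Action:" / "## Code:" response format from the system prompt makes each step easy to parse and dispatch programmatically. Here is a minimal sketch of that loop; the parsing and dispatch logic (including the JSON call format) are illustrative, not the harness used for these results.

```python
import json

def parse_agent_step(response: str):
    """Split one model turn into (thought, action, code) using the
    '## Action:' / '## Code:' markers from the system prompt above."""
    thought, _, rest = response.partition("## Action:")
    action, _, code = rest.partition("## Code:")
    return thought.strip(), action.strip(), code.strip().strip("`")

def dispatch(code: str):
    """Very simplified dispatcher: hand pyautogui snippets to an executor, or
    handle the computer.wait / computer.terminate tool calls."""
    if code.startswith("{"):                       # assumed JSON tool-call format
        call = json.loads(code)
        if call["name"] == "computer.terminate":
            return ("terminate", call.get("answer"))
        if call["name"] == "computer.wait":
            return ("wait", 20)
    # Otherwise treat the code block as pyautogui commands (not executed here).
    return ("pyautogui", code)

example = (
    "I should open the Licenses & IDs menu.\n"
    "## Action:\nClick the 'Licenses & IDs' tab\n"
    "## Code:\npyautogui.click(x=312, y=88)\n"
)
thought, action, code = parse_agent_step(example)
print(action)            # Click the 'Licenses & IDs' tab
print(dispatch(code))    # ('pyautogui', 'pyautogui.click(x=312, y=88)')
print(dispatch('{"name": "computer.terminate", "status": "success", "answer": "Done"}'))
```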
Nemotron 3 Nano Omni can analyze charts and figures shown in images alongside audio files, surfacing commonalities and discrepancies across the media.
Audio
Question
Use both the slide content and the spoken commentary.
List any extra points mentioned by the speaker that are not shown in the slide.
Answer
Based on the speaker's commentary and the visual content of the slide, here are the extra points mentioned that are not explicitly shown or detailed in the slide:
* **The Topic:** The speaker explicitly states she will be covering **multitask learning** with neural networks, while the slide title is simply "Some notation" (though it contains the mathematical definitions for tasks).
* **Nature of the Output:** She describes the output ($y$) as a **probability distribution** over possible labels (e.g., tiger, tiger cat, lynx, cat), whereas the slide only shows a bar chart with class labels without explicitly labeling the vector as a distribution.
* **Specific Task Example (Paper Titles):** She introduces a concrete example where the input is the **title of a paper** and the neural network is tasked with estimating the **length of the paper**. The slide shows "length of paper" near the bar chart, but does not visually distinguish how a paper title maps to these bars versus representing animal classes.
* **Optimization Techniques:** She mentions that the loss function is typically optimized using **stochastic gradient descent** or a **"favorite optimizer,"** which are procedural steps not written on the slide (which only defines the loss function $\mathcal{L}$ to minimize).
* **Input Diversity:** She specifies that inputs can be a **"wide range of unstructured inputs,"** specifically mentioning **images** and **text**. The slide visually displays a single image (a structured input) and uses a generic variable $x$ for the input class.
Audio
Question
What is producing the main sound in this audio, and what does the acoustic environment suggest about where it was recorded?
Answer
The main sound is a bird chirping, and the presence of multiple bird calls and a natural, open acoustic environment suggests the recording was made outdoors in a natural setting like a forest or park.
Nemotron 3 Nano Omni can interpret and reason about general audio, including soundscapes and environmental sounds.
Audio
Question
What's the overall vibe of this song, and what kind of scenario would it complement well?
Answer
The song has a calm, reflective, and slightly melancholic vibe, with a gentle piano melody and soft string accompaniment. It would complement a quiet, introspective scene such as a rainy evening, a peaceful walk, or a moment of personal reflection.
| Resource | Link |
|---|---|
| Hugging Face BF16 checkpoint | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 |
| Hugging Face FP8 checkpoint | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 |
| Hugging Face NVFP4 checkpoint | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 |
| Technical report / PDF | https://arxiv.org/abs/2604.24954 |
| Dataset / collection release | https://huggingface.co/datasets/nvidia/Nemotron-Image-Training-v3 |
| Megatron-Bridge | https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models/vlm/nemotron_3_omni |
| Nemo-RL | https://github.com/NVIDIA-NeMo/RL/blob/nano-v3-omni/docs/guides/nemotron-3-nano-omni.md |
| NeMo Data Designer SDG recipes | https://github.com/NVIDIA-NeMo/DataDesigner/tree/main/docs/assets/recipes/vlm_long_doc |
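As a starting point, the BF16 checkpoint can typically be loaded through Hugging Face Transformers along the lines below. The exact auto-classes, processor inputs for images, audio, and video, and the chat template are defined by the model card, so treat this as a sketch rather than the official usage snippet.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Sketch only: check the model card for the exact classes and input format.
model_id = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the attached report in three bullet points."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```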