Authors: Enzo Ruedas, Tess Boivin
Recent advances in Large Language Models have enabled the transition from text-only reasoning to multimodal systems: first with the integration of visual perception in Vision–Language Models (VLMs), and more recently with the generation of robot actions in Vision–Language–Action (VLA) models. Deploying these models on embedded robotic platforms nevertheless remains a challenge due to tight constraints on compute, memory, and power, as well as real-time control requirements.
In synchronous control pipelines, the arm sits idle awaiting commands while the VLA runs inference, which leads to oscillatory behavior and delayed corrections. Asynchronous inference tackles this by dissociating generation from execution, enabling smooth and continuous motion. To be effective, however, the end-to-end inference latency must remain shorter than the action execution duration. This temporal constraint therefore sets an upper bound on acceptable inference latency.
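This constraint can be made concrete with a small sketch. The chunk size and control rate below match the configuration reported later in this guide (100 actions per chunk at 60 FPS), and the latencies are the measured i.MX 95 results; treat the helper itself as illustrative:

```python
# Illustrative check of the asynchronous-inference constraint:
# a new action chunk must be ready before the current one finishes executing.
ACTIONS_PER_CHUNK = 100
CONTROL_FPS = 60

# Time the robot spends executing one chunk of actions (~1.67 s here).
execution_time_s = ACTIONS_PER_CHUNK / CONTROL_FPS

def fits_budget(inference_latency_s: float) -> bool:
    """True if end-to-end inference finishes within the execution window."""
    return inference_latency_s < execution_time_s

print(fits_budget(0.32))  # optimized ACT on i.MX 95 -> True
print(fits_budget(2.86))  # ONNX FP32 ACT on i.MX 95 -> False
```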
Bringing VLA models to embedded platforms is not merely a matter of model compression; it is a complex systems-engineering problem requiring architectural decomposition, latency-aware scheduling, and hardware-aligned execution. Addressing these challenges is essential to translate recent advances in multimodal foundation models into practical, deployable embedded robotic systems.
This guide presents NXP’s hands‑on best practices for recording reliable robotic datasets and fine‑tuning VLA policies (ACT and SmolVLA), and highlights the real-time performance that the NXP i.MX 95 SoC achieves after optimization.
High‑quality, consistent data beats “more but messy” data. This section turns hard‑earned lessons into concrete checklists and schemas.
In our case, we recorded a dataset for the task: "Put the tea bag in the mug."
Moving from scene‑only views to mixed viewpoints increases overall accuracy, but every additional camera also increases latency. You must therefore choose the right compromise. In our case, that balance was reached with 3 cameras:
| Top | Gripper | Left |
|---|---|---|
| ![]() | ![]() | ![]() |
| The global view of the whole scene. | The closest view for precise grasps and alignment. | Complements the top view for height and depth. |
We strongly recommend using a gripper-mounted camera. It consistently improves success rates on fine manipulation tasks by providing a close, task-relevant viewpoint. Importantly, it is also the camera that most effectively enforces correct data collection practices, allowing the operator to rely exclusively on the robot’s perception rather than observing the scene directly.
When installing a gripper camera, we recommend securing the cable with Velcro or a strain-relief guide to prevent it from obstructing the field of view or becoming disconnected during motion.
Simple hardware tweaks, such as heat‑shrink tubing over the gripper claws, increase friction, reduce surface roughness, and reduce slippage during episodes. This raises the task success rate (fewer “almost success” episodes) and improves policy-learning stability.
When recording a dataset, you should:
Vary the episode distribution: divide your workspace into starting-position clusters and record at least 10 episodes per cluster. Add diversity by changing the object's position and rotation.
e.g. we partitioned the robot arm’s reachable workspace into 11 clusters, each measuring 10 × 10 cm.
Differentiate training & validation sets: policies can easily overfit the training set, so make sure the validation set is unseen by the model.
e.g. we removed cluster 6 from the training set.
Record as many distinct movements as you can: small VLA models exhibit limited generalization to unseen motion, so record episodes that cover a wide range of each degree of freedom.
e.g. we grasped the tea bag in either a horizontal or a vertical position.
Anticipate failure: sometimes the policy will not reach the object on the first attempt and will have to "go back to it". We noticed that having roughly 20% of all episodes cover this recovery case helps the model improve its overall success rate.
e.g. around 20% of our training set corresponds to recovery episodes.
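The cluster bookkeeping described above can be sketched in a few lines. The row-major grid layout, the cluster numbering, and the `cols` parameter are illustrative assumptions, not our exact tooling:

```python
# Sketch: bucket each episode's starting position into a 10 x 10 cm cluster
# and hold one cluster out for validation (cluster 6 in our setup).
CLUSTER_SIZE_CM = 10.0
VALIDATION_CLUSTERS = {6}  # held out: never seen during training

def cluster_id(x_cm: float, y_cm: float, cols: int = 4) -> int:
    """Map a starting position (in cm) to a cluster index on a row-major grid."""
    return int(y_cm // CLUSTER_SIZE_CM) * cols + int(x_cm // CLUSTER_SIZE_CM)

def split_episodes(episodes):
    """episodes: iterable of (x_cm, y_cm, episode) tuples -> (train, val)."""
    train, val = [], []
    for x, y, ep in episodes:
        (val if cluster_id(x, y) in VALIDATION_CLUSTERS else train).append(ep)
    return train, val
```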
This mirrors best practices across VLA papers and community guides. Here are 3 examples of data diversity within the same cluster:
| Starting position 1 | Starting position 2 | Recovery episode |
|---|---|---|
| ![]() | ![]() | ![]() |
Starting positions 1 and 2 correspond to different positions within the same cluster. In contrast, during the recovery episode the robot does not begin in "starting mode": it is already near the mug and should proceed directly to retrieve the tea bag from that location.
What we did in practice:
For ACT (100 actions per chunk), the best trade-off between accuracy, generalization, and motion smoothness across both the training and validation sets was found within 100k–160k training steps. For SmolVLA (50 actions per chunk), the trade‑off appears after many more training steps. We found that continuing training slightly past the point where the model begins to overfit tends to improve overall accuracy.
Rule of thumb: choose the final checkpoint by evaluating success on both the training and validation sets, not by training loss.
The i.MX 95 SoC integrates 6× Arm Cortex‑A55 cores, Cortex‑M7 and Cortex‑M33 MCUs, a Mali GPU, a new NXP ISP, and the eIQ® Neutron NPU, targeting efficient, secure edge inference with multi‑camera support and strong I/O. [nxp.com]
Instead of running the models as one monolithic graph, we decompose the VLA graph into logical stages (encoders, decoders, and action experts), allowing each component to be optimized, scheduled, and deployed independently.
In practice, SmolVLA is partitioned into the following sub-blocks:
This separation enables per-block optimizations: the impact of quantizing each block can be measured to choose the best trade-off between latency and accuracy. Isolating the action expert from the VLM also made it possible to run it at a lower frequency.
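A structural sketch of this decomposition is shown below. The stage names follow the text; the callables are plain Python stand-ins for independently deployed (e.g. ONNX) sub-graphs:

```python
# Sketch: each VLA stage is an independently optimizable, schedulable unit.
# In practice each callable would wrap its own (possibly quantized) runtime
# session; here they are placeholders for illustration.
class DecomposedVLA:
    def __init__(self, vision_encoder, llm_prefill, action_expert):
        self.vision_encoder = vision_encoder  # per-camera image features
        self.llm_prefill = llm_prefill        # fuses features with the prompt
        self.action_expert = action_expert    # produces the action chunk

    def infer_chunk(self, images, prompt, robot_state):
        features = self.vision_encoder(images)
        context = self.llm_prefill(features, prompt)
        # Isolated from the VLM, the action expert can be scheduled at a
        # lower frequency and kept at higher numerical precision.
        return self.action_expert(context, robot_state)
```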
To optimize inference for the i.MX 95 SoC, we explored several quantization techniques on the different blocks. We found that quantizing the vision encoder and LLM prefill had limited impact on accuracy, whereas quantizing the denoising flow in the action expert significantly degraded performance.
This behaviour is expected, as quantization errors accumulate across iterative denoising steps.
We therefore kept this block at higher precision to preserve stability, while for the other blocks we explored various quantization configurations, from 8-bit mixed precision to 4-bit quantization, depending on the layer.
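Sketched as a configuration, the resulting per-block plan might look like this. The exact bit widths per block are illustrative, not our production settings:

```python
# Illustrative per-block quantization plan reflecting the findings above.
QUANT_PLAN = {
    "vision_encoder": "int8",          # quantizes well, little accuracy loss
    "llm_prefill": "mixed int8/int4",  # precision chosen per layer
    "action_expert": "fp16",           # kept at higher precision: errors
                                       # accumulate over denoising iterations
}
```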
In addition, we applied in-house optimizations to the different blocks. Results are shown in the table below, referred to as optimized models.
In a synchronous control loop, the pipeline operates as:
During step (2), the robot remains idle. If inference latency is non-negligible, this produces:
With Asynchronous Inference, action generation runs in parallel with execution:
This increases effective control frequency, reduces observation staleness, and improves recovery behavior.
On embedded platforms such as the i.MX 95 SoC, asynchronous inference is essential, but it is only effective if inference latency stays within the action-horizon budget: inference time < execution time.
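A minimal sketch of the action-queue logic, using the chunk-size threshold and a simple take on the `weighted_average` aggregate from our configuration; a production implementation differs in detail, and the blending weight is an assumption:

```python
# Sketch of the asynchronous action queue: trigger a new inference when the
# queue runs low, and blend overlapping actions from consecutive chunks.
import collections

CHUNK_SIZE = 100
THRESHOLD = 0.2  # refill once fewer than 20% of a chunk's actions remain

action_queue = collections.deque()

def needs_new_chunk() -> bool:
    """True when the queue is low enough that a new inference should start."""
    return len(action_queue) < THRESHOLD * CHUNK_SIZE

def merge_chunk(new_chunk, w_new=0.7):
    """Weighted-average the still-queued actions with the overlapping head of
    the new chunk, then append the new chunk's remaining tail."""
    overlap = len(action_queue)
    blended = [(1 - w_new) * old + w_new * new
               for old, new in zip(action_queue, new_chunk)]
    action_queue.clear()
    action_queue.extend(blended)
    action_queue.extend(new_chunk[overlap:])
```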
| | Synchronous inference | Asynchronous inference |
|---|---|---|
| Actions per chunk | 100 | 100 |
| FPS | 60 | 60 |
| Chunk size threshold | N/A | 0.2 |
| Aggregate function | N/A | weighted_average |
| Action queue evolution | ![]() | ![]() |

Results
Setup
| Platform (CPU) | Policy | Format | Inference latency | Accuracy, test set (20 eps) | Accuracy, validation set (10 eps) | Global accuracy (30 eps) |
|---|---|---|---|---|---|---|
| i.MX 95 | ACT | ONNX FP32 | 2.86 s | 1.00 | 0.90 | 0.96 |
| i.MX 95 | ACT | Optimized | 0.32 s | 1.00 | 0.60 | 0.89 |
| i.MX 95 | SmolVLA | ONNX FP32 | 29.1 s | 0.50 | 0.40 | 0.47 |
Our immediate objective is to improve task accuracy with SmolVLA (ONNX FP32). We have already established a baseline and measured an optimized on-board inference latency of 6.15 s.
The next phase will focus on deeper optimizations targeting our NPUs. In parallel, we aim to move from a single-task setup toward longer-horizon and more complex scenarios. To do that, we will introduce:
The goal is to move from a single validated manipulation task toward a reproducible methodology for deploying VLA policies on embedded robotic systems.
Recording
Training
Deployment on i.MX 95 SoC