In July 2025, Capgemini published a global report projecting that AI agents will deliver more than $450B in economic value by 2028, with 96% of surveyed organizations already experimenting with agentic systems. Industry momentum is unmistakable: enterprises are rapidly shifting from passive large language model (LLM) chatbots to active, tool-using AI agents that can retrieve data, execute workflows, and make decisions.
NVIDIA’s position aligns with this shift. In its paper “Small Language Models Are the Future of Agentic AI,” NVIDIA argues that specialized Small Language Models (SLMs) – not monolithic generalist LLMs – will form the backbone of enterprise agents.
But the most valuable agents are not limited to text. Real-world deployments increasingly require multimodal understanding, particularly in robotics and autonomous systems where perception drives action. Here, NVIDIA’s Cosmos Stack provides the foundation:
- Cosmos Reason 1 (7B) for physical, common-sense reasoning across time and space, and
- Cosmos-Embed1 for efficient video and multimodal retrieval.
Yet both SLMs and small vision language models (VLMs) for physical applications carry known limitations.
- SLMs, despite their efficiency, often show reduced tool-calling accuracy, especially for vision-heavy tasks.
- Cosmos Reason 1, while strong on short-form video and physical reasoning, struggles with long-horizon temporal understanding due to its training distribution.
Our project, a collaboration between NVIDIA and Clemson University, set out to evaluate and patch these weaknesses. We pursued a two-part investigation:
- Finetune Qwen2.5-7B-Instruct to improve multimodal tool calling for enterprise video retrieval, demonstrating that small models can serve as reliable agent controllers.
- Compensate for Cosmos Reason 1’s short-form training bias by pairing it with Cosmos Embed 1 inside a RAG + agentic architecture, enabling long-form video understanding without retraining the underlying model.
Together, these efforts demonstrate a central thesis: Small Language Models, when paired with retrieval and targeted finetuning, can deliver competitive agentic performance across both enterprise and physical AI domains.
This post details how we built, evaluated, and validated these improvements.
Approach
To evaluate and extend the capabilities of small multimodal models, we built our system with NVIDIA Metropolis VSS and NVIDIA NeMo Agent Toolkit (NAT) – an agent coordination framework designed to orchestrate tool-using LLMs with minimal overhead. NAT integrates cleanly with existing agent ecosystems such as LangChain, while providing several advantages critical to this project: built-in agent evaluation, prompt optimization utilities, structured tool-trajectory visualization, and a lightweight UI for rapid iteration.
We hosted all experimentation on NVIDIA Brev, deploying multiple A100 instances to support both model finetuning and long-video RAG indexing workloads. From this shared infrastructure, the collaboration proceeded along two parallel technical paths:
- Finetuning Path: Improving SLM Tool Calling
This path focused on enhancing the reliability of small language models as agent orchestrators. We investigated whether targeted optimization and reinforcement-based finetuning could improve structured tool invocation, positioning small models as viable controllers for enterprise agent workflows.
- Physical AI Path: Patching Short-Form Video Bias
The physical AI path focused on extending short-form reasoning models to better understand long-form videos. Instead of scaling model size or increasing context length, we approached temporal reasoning as a coordination problem. NAT provided the structure we needed to treat long-video understanding as an agentic workflow, where retrieval and reasoning are composed iteratively rather than executed in a single pass.
Within NAT, long videos are broken down into more manageable chunks that can be indexed, retrieved, and reasoned over using a ReAct agent to orchestrate these steps. This design allows Cosmos Reason 1 to keep its strengths in short-term physics and spatial reasoning while still being effective over an extended period of time.
Methods
Consistent with this two-track approach, each experiment kept its own system architecture on its own Brev instance.
- Finetuning Path: Improving SLM Tool Calling
We implemented a ReAct-based agent configured to interact with a set of Metropolis VSS-style tools designed to emulate warehouse monitoring analytics APIs. The tool suite included functions such as get_all_sensor_ids, list_incidents, and get_fov_object_counts, each requiring structured arguments and returning deterministic outputs. Tool schemas and response formats were fixed across all experiments to ensure reproducibility.
The primary model evaluated was Qwen2.5-7B-Instruct. For reference, Nemotron-Super-49B and GPT-4.1 were evaluated under the same agent configurations. All models used identical prompts, tool definitions, inference parameters, and agent logic.
To benchmark full agent trajectories from these models, we used an evaluation dataset of natural-language queries paired with true tool-call traces. Each trace specifies the expected sequence of tool calls and their corresponding arguments.
The improvement procedure proceeded in two stages. First, NAT’s optimizer feature was used to tune agent-level hyperparameters and system prompts. This stage targeted improvements in tool selection consistency, argument formatting, and call sequencing.
Second, weight-level adaptation was performed using Agent Reinforcement Trainer (ART). ART applies reinforcement learning over complete agent trajectories, with rewards computed based on alignment between model-generated tool calls and true traces. Training rollouts were generated using a synthetic task generator that produced diverse natural-language queries paired with known tool-call patterns.
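To make the idea of a trajectory-level reward concrete, here is a minimal sketch that scores a rollout by how many positions in the generated tool-call sequence match the reference trace. The scoring rule, the `ToolCall` shape, and the `sensor_id` argument are assumptions made for illustration; this post does not specify the exact reward used with ART.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation (hypothetical shape): a name plus sorted (key, value) argument pairs."""
    name: str
    args: tuple = ()


def call(name: str, **kwargs) -> ToolCall:
    # Sorting the arguments makes two calls equal regardless of keyword order.
    return ToolCall(name, tuple(sorted(kwargs.items())))


def trajectory_reward(predicted: list, reference: list) -> float:
    """Fraction of positions where the predicted call equals the reference call.

    Dividing by the longer of the two traces penalizes both missing and extra calls.
    """
    if not predicted and not reference:
        return 1.0
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / max(len(predicted), len(reference))


# The sensor_id argument below is illustrative, not a documented tool parameter.
reference = [call("get_all_sensor_ids"), call("list_incidents", sensor_id="cam-01")]
rollout = [call("get_all_sensor_ids"), call("list_incidents", sensor_id="cam-02")]
print(trajectory_reward(rollout, reference))  # 0.5
```

A partial-credit reward like this gives the policy a learning signal even when only part of a trajectory is right, which a strict exact-match reward would not.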
Evaluation metrics included exact-match accuracy over full tool-call trajectories, as well as an efficiency-weighted accuracy metric. For each evaluation query, the dataset defined a set of acceptable tool-call traces with associated efficiency scores in the range [0,1], where a score of 1.0 corresponds to the most efficient valid trajectory. Model outputs that matched a correct but suboptimal trace were counted as correct under exact-match accuracy, while receiving a proportionally reduced score under the efficiency-weighted metric. All evaluations were conducted using the same query set and agent configuration to isolate the effects of optimization and finetuning.
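The two metrics can be sketched as follows. This is an illustration under stated assumptions, not our evaluation harness: the dataset layout (a dict from each query to its acceptable traces and their efficiency scores), the example queries, and the 0.5 efficiency score are all made up for the example.

```python
def evaluate(predictions, dataset):
    """Return (exact_match_accuracy, efficiency_weighted_accuracy).

    dataset:     query -> {acceptable_trace: efficiency_score in [0, 1]}
    predictions: query -> predicted trace
    A trace is a tuple of (tool_name, args) steps so it can be used as a dict key.
    """
    exact_hits = 0
    efficiency_total = 0.0
    for query, acceptable in dataset.items():
        predicted = predictions.get(query)
        if predicted in acceptable:
            exact_hits += 1                            # any acceptable trace counts as correct
            efficiency_total += acceptable[predicted]  # suboptimal traces earn less credit
    n = len(dataset)
    return exact_hits / n, efficiency_total / n


dataset = {
    "How many incidents were reported?": {
        (("list_incidents", ()),): 1.0,                              # most efficient trace
        (("get_all_sensor_ids", ()), ("list_incidents", ())): 0.5,   # valid but wasteful
    },
    "Which sensors are installed?": {
        (("get_all_sensor_ids", ()),): 1.0,
    },
}
predictions = {
    "How many incidents were reported?": (("get_all_sensor_ids", ()), ("list_incidents", ())),
    "Which sensors are installed?": (("list_incidents", ()),),       # wrong tool
}
exact, weighted = evaluate(predictions, dataset)
print(exact, weighted)  # 0.5 0.25
```

In this toy run, the first query is answered with a valid but suboptimal trace, so it counts fully toward exact-match accuracy and only half toward the efficiency-weighted score.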
- Physical AI Path: Patching Short-Form Video Bias
Over the course of 16 weeks, we primarily worked with two models:
- Cosmos Reason 1 (7B) is a model that specializes in understanding time, space, and physics, owing to its training on short-form cause-and-effect video content.
- Cosmos Embed 1 (448p) is a multimodal embedding model, curated for physical understanding, that converts video into dense vectors.
Cosmos Reason 1 serves as a foundational model within NVIDIA’s Metropolis vision for physical AI – artificial intelligence systems capable of comprehending fundamental physics, spatial relationships, and temporal dynamics to enable common-sense reasoning. The model achieves this capability through fine-tuning on short-form video data, enabling it to infer physical cause-and-effect relationships within brief temporal windows. However, this training methodology introduces a temporal bias that limits performance on long-form video understanding tasks.
To mitigate this limitation, we developed a ReAct-based agentic architecture within NVIDIA NeMo Agent Toolkit (NAT) that orchestrates three specialized retrieval tools, inspired by Microsoft’s Deep Video Discovery and adapted for our use:
- Global Tool: During video ingestion, Cosmos Reason 1 performs object tracking across temporal segments to construct a subject registry – a JSON-formatted index recording entry and exit timestamps for tracked entities. At query time, the Global Tool augments this registry with an event registry, leveraging physical reasoning capabilities to identify and temporally localize events involving tracked subjects.
- Clip Tool: Videos are preprocessed into 5-second clips sampled at 2 FPS. These clips are encoded using Cosmos Embed 1 and stored in a FAISS HNSW index for efficient similarity search, while corresponding metadata is maintained in a PostgreSQL database. At inference, the system performs cosine similarity retrieval biased toward identified subjects, extracts relevant clip metadata from PostgreSQL to determine the relevant timestamp of each clip, and generates candidate answers with associated temporal ranges.
- Frame Tool: When the Clip Tool produces insufficient results, the Frame Tool leverages the returned temporal range to extract a uniformly sampled set of 50 frames. These frames are processed by Cosmos Reason 1 for direct visual question answering.
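The mechanics behind the Clip Tool and Frame Tool can be sketched in plain Python. This is an illustration, not our pipeline: a brute-force cosine search stands in for the FAISS HNSW index, the subject-biased ranking and PostgreSQL metadata lookup are omitted, the toy vectors replace real Cosmos Embed 1 embeddings, and all function names are hypothetical. Whether frame sampling includes both endpoints of the range is also an assumption.

```python
import math

CLIP_SECONDS = 5.0                                 # clip length used during preprocessing
CLIP_FPS = 2.0                                     # sampling rate within each clip
FRAMES_PER_CLIP = int(CLIP_SECONDS * CLIP_FPS)     # 10 frames per clip
FRAME_BUDGET = 50                                  # frames passed on by the Frame Tool


def clip_windows(duration_s):
    """Split a video into consecutive (start, end) windows; the last one may be shorter."""
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + CLIP_SECONDS, duration_s)))
        start += CLIP_SECONDS
    return windows


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_clips(query_vec, clip_vecs, k=3):
    """Brute-force stand-in for the approximate index: indices of the k most similar clips."""
    ranked = sorted(range(len(clip_vecs)),
                    key=lambda i: cosine(query_vec, clip_vecs[i]),
                    reverse=True)
    return ranked[:k]


def uniform_timestamps(start_s, end_s, n=FRAME_BUDGET):
    """n evenly spaced timestamps covering [start_s, end_s], endpoints included."""
    if n == 1:
        return [start_s]
    step = (end_s - start_s) / (n - 1)
    return [start_s + i * step for i in range(n)]


windows = clip_windows(12.0)
best = top_k_clips([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
start, end = windows[best[0]]
frames = uniform_timestamps(start, end)
print(windows)       # [(0.0, 5.0), (5.0, 10.0), (10.0, 12.0)]
print(best)          # [1, 2]
print(len(frames))   # 50
```

The retrieved clip's time range is what links the two tools: the Clip Tool narrows a long video down to a few seconds, and the Frame Tool then spends its fixed frame budget only inside that range.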
A ReAct agent orchestrates these tools by evaluating query alignment, tool execution history, candidate answers, and tool-reported confidence scores to determine response completeness. Upon convergence, the agent invokes a “finish” tool to return the final answer to the user.
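The control flow can be approximated with a deterministic loop. In our system an LLM makes these decisions at each step; the sketch below swaps that for a fixed escalation order and a confidence threshold, both of which are assumptions, and uses stub tools with made-up answers in place of the real ones.

```python
from typing import NamedTuple


class ToolResult(NamedTuple):
    answer: str
    confidence: float  # tool-reported confidence in [0, 1]


def run_agent(query, tools, threshold=0.7):
    """Call tools in order until one reports enough confidence, then finish.

    tools is a list of (name, callable) pairs. Returns the final answer and the
    tool-execution history, ending with a "finish" step.
    """
    history = []
    best = ToolResult("", 0.0)
    for name, tool in tools:
        result = tool(query)
        history.append((name, result))
        if result.confidence > best.confidence:
            best = result               # keep the strongest candidate answer so far
        if best.confidence >= threshold:
            break                       # confident enough: stop escalating
    history.append(("finish", best))
    return best.answer, history


# Stub tools with illustrative outputs; real tools would query the registry, index, and frames.
tools = [
    ("global_tool", lambda q: ToolResult("a forklift appears", 0.4)),
    ("clip_tool", lambda q: ToolResult("forklift enters at 00:42", 0.8)),
    ("frame_tool", lambda q: ToolResult("unused in this run", 0.9)),
]
answer, history = run_agent("When does the forklift enter?", tools)
print(answer)                          # forklift enters at 00:42
print([name for name, _ in history])   # ['global_tool', 'clip_tool', 'finish']
```

Because the loop stops as soon as a candidate is confident enough, the more expensive Frame Tool runs only when the cheaper tools fall short.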
We evaluated this agentic workflow using LongVideoBench, a benchmark designed to assess both long-form video comprehension and multi-faceted reasoning capabilities including spatial, temporal, and event-based understanding. Results are presented below.

Our baseline configuration, Qwen2.5-7B-Instruct with NAT’s default ReAct agent prompt, achieved a raw accuracy of 45.70%. This gap illustrates a common challenge in agentic systems: even when models produce correct outputs, inefficient or unnecessary tool calls reduce their practical utility in enterprise settings.
Applying the hyperparameter values found by the optimizer, together with a prompt containing manually added tool-call examples, yielded only a small gain: a 2.87% increase in raw accuracy. This suggests that prompt-level interventions alone are insufficient to meaningfully improve agent reliability.
The largest improvement came from finetuning. The finetuned Qwen2.5-7B-Instruct achieved 65.70% raw accuracy, a 20-percentage-point improvement over the default baseline.

Our agentic RAG architecture demonstrated substantial improvements over the baseline Cosmos Reason 1 model, achieving an overall accuracy increase of 10.23% across the LongVideoBench evaluation suite.
Performance gains were particularly pronounced in reasoning tasks that require long-range temporal understanding and cross-modal association. The system achieved its strongest results in:
- Temporal-to-Action reasoning (+25.83% accuracy)
- Spatial-to-Event reasoning (+26.86% accuracy)
- Single-Scene Spatial retrieval & reasoning (+26.14% accuracy, showcasing the potential of Cosmos Embed 1)
- Temporal-to-Event reasoning (+18.51% accuracy)
These results indicate that our multi-tool architecture effectively compensates for the temporal limitations of short-form video fine-tuning, enabling the system to maintain physical reasoning capabilities while scaling to long-form video understanding tasks. The performance distribution across reasoning categories further suggests that semantic retrieval combined with explicit subject and event tracking creates complementary pathways for temporal understanding in physical AI systems.
Overall, these results validate our hypothesis that specialization methods such as agent workflows and finetuning can make small, computationally inexpensive models competitive with their larger counterparts.
Impact
- Physical AI & Robotics Impact
In our work, we show how agentic retrieval can extend long-video understanding in smaller reasoning models. Our architecture enables Cosmos Reason 1 to keep its strengths in short-term physics and spatial reasoning while mitigating its temporal limitations. This approach mirrors real-world robotic perception pipelines, where agents need to reason over an extended period of time.
We also contributed a video upload and storage pipeline integration to NVIDIA’s open-source NeMo Agent Toolkit.
This extension addresses a key infrastructure requirement for physical AI and Metropolis-style workflows: persistent access to video assets throughout the agent lifecycle.
The contribution introduces native support for video uploading and storage within NAT and NAT-UI, including:
- A new video library component in NAT UI’s sidebar for uploading and managing video assets
- Storage of uploaded videos in a local S3-compatible object store
This functionality enables video-based agent testing and evaluation, allowing agents to be validated against fixed video datasets. These additions directly benefit the Metropolis and robotics teams at NVIDIA, where developing and validating video-centric agents is critical.
Together, the long-video agent architecture and open-source contributions to NAT provide meaningful impact for physical AI applications.
- Broader Implication for Agent Applications
You don’t need a giant model if you combine the right specialized SLM, the right retrieval system, and the right agent tools.
Our results demonstrate that small, finetuned models can reliably function as enterprise-grade agents. By improving tool-calling accuracy rather than scaling model size, we achieved substantial gains in both correctness and efficiency.
For enterprises, this translates to:
- Lower inference and infrastructure costs
- Reduced latency for real-time analytics
- Improved flexibility for agent hosting
In enterprise agentic systems, specialization beats scale. A fine-tuned small model that knows how to use tools reliably is often more valuable than a larger model that tries to reason end-to-end, taking more time, money and compute.
Our work with Cosmos Reason 1 demonstrates a fundamental shift in how we approach scaling AI capabilities for domain-specific tasks. Rather than relying solely on parameter count and massive foundation models, we achieved competitive performance on long-form video understanding by orchestrating a specialized small language model with purpose-built retrieval infrastructure and task-specific tooling.
This research represents a scalable pathway for deploying AI across Metropolis applications – from vision tool calling to smart city infrastructure, where efficiency, interpretability, and task-specific performance are keystones.
Clemson x NVIDIA Partnership – Conclusion
As Clemson students, getting to be a part of the partnership between our university and NVIDIA has been nothing short of transformative. Due to the hard work of Carrie Russell from Clemson, as well as Karthick Iyer, Roopa Prabhu, Sujit Biswas, and Zac Wang from NVIDIA, we were given a platform to stand on as a means of chasing our ambitions. For students like us, passionate about pushing the boundaries of artificial intelligence and high-performance computing, the Clemson-NVIDIA collaboration hasn’t just been an opportunity – it’s been a launchpad, and it means the world that we’ve been able to learn and grow while proudly representing the Tiger spirit in the global tech landscape.
A special thanks to everyone involved! We really appreciated all of your support along the way.