The cheap, open giant: running NVIDIA's Nemotron 3 Ultra in Elyra

Every so often a model shows up that's less about topping a leaderboard and more about quietly changing the math of what's affordable. NVIDIA's Nemotron 3 Ultra is one of those. It's open, it's strikingly cheap, and it carries a one-million-token context window. Elyra already speaks to it. So let's talk about what it actually is, where it shines, where it doesn't, and the slightly surprising hardware story sitting behind it.

What Nemotron 3 Ultra actually is

The full name tells you a lot: nvidia/nemotron-3-ultra-550b-a55b.

550B total parameters, 55B active. It's a Mixture-of-Experts model. Think of a hospital with hundreds of specialists where, for any given patient, only the handful you actually need walk into the room. You get the breadth of a 550B model at roughly the running cost of a 55B one.
A reasoning model. It's built for the long loop — plan, call a tool, check itself, keep going — which is exactly the rhythm an agent works in.
One million tokens of context. You can hand it an enormous amount of material in a single pass.
Text only. No vision here. If you need images, reach for MiniMax M3 or a vision-capable Nemotron Nano instead.
Genuinely cheap. Around $0.50 per million input tokens and $2.50 per million output, with free, rate-limited tiers available on a couple of providers.

It's part of a tiered family — Nano, Super, and Ultra — and Ultra is the flagship.

Using it in Elyra

Because the providers are already in Elyra's model registry, there's nothing to install. Set a key and pick the model.

Through OpenRouter:

export OPENROUTER_API_KEY=sk-or-...
elyra

Then, inside the session:

/model

Search for Nemotron 3 Ultra, hit enter, and you're on it. Prefer Together AI? Use TOGETHER_API_KEY and the same model name. Just want to kick the tires for free? There's a :free route on OpenRouter and a free tier via OpenCode — lower limits, zero cost.

The part that makes this feel good in practice is that you don't have to choose a single model for the whole session. Plan with one, grind with another:

> Map how authentication flows through this service. Read everything under src/auth. ... (Elyra reads, traces, explains) /model # switch to Nemotron 3 Ultra

> Now read the entire src/ tree and list every place that touches the session token, then propose a single consolidated helper.

That second step is where the million-token context earns its keep. "Read the entire tree" isn't a parlor trick — it's an agent action, and Ultra can hold the whole thing in its head while it reasons, instead of forgetting the first file by the time it reaches the last.

You can also run it headless for a one-shot over a big pile:

elyra -p "Summarize the architecture and flag risky patterns" 

--provider openrouter 

--model nvidia/nemotron-3-ultra-550b-a55b 

--api-key sk-or-...

And since Elyra already has tiered routing built in, Ultra makes a natural "cheap, big-context workhorse" tier — let the expensive model architect, let Ultra chew through the volume.

The hardware story (and an honest correction)

Here's the strategically interesting bit: NVIDIA gives away the models to sell the silicon. The Nemotron family is open, downloadable, and tuned to run beautifully on NVIDIA's own stack. The models are the hook; the hardware is the business.

That hardware is real and increasingly desk-sized. NVIDIA's DGX Spark is a genuine desktop AI machine — a GB10 Grace Blackwell superchip, 128 GB of unified memory, up to a petaFLOP of FP4 compute, with the AI software stack preinstalled, sold through partners like Dell, Lenovo, HP, Asus, and Acer. It's a little box that runs serious models on your desk, with your data staying on your desk.

But here's the honest part, because we'd rather be useful than breathless: Nemotron 3 Ultra does not run on a DGX Spark. The numbers are public. A single Spark handles inference for models up to ~200B parameters; two of them linked together reach ~405B. Ultra is 550B. It's simply too big for the desktop — Ultra lives in the cloud or on data-center GPUs.

What does run locally is the rest of the family. The smaller Nemotron Nano and Super tiers fit comfortably on a DGX Spark or an RTX machine. So the open-weights, runs-on-your-hardware promise is true — just not for the 550B flagship.

A clean way to think about it

That gives you a tidy two-lane setup, both reachable from the same Elyra session:

Cloud lane — Nemotron 3 Ultra. A million tokens of context for pennies, perfect for large-scale reading, analysis, and reasoning over big inputs. Switch to it with /model whenever the job gets heavy.
Local lane — Nemotron Nano/Super. Stand it up on a DGX Spark or RTX box, expose an OpenAI-compatible endpoint, and point Elyra at it as a custom provider. Nothing leaves your network — the right answer for regulated work in law, healthcare, or finance.

Same agent, same commands, two very different cost-and-privacy profiles.

Where it falls short

No rose-tinted glasses. On the hardest single-shot reasoning, the top closed models — and our own preferred Claude — are still ahead. Ultra is text-only, so it's not your pick for anything visual. And while a million-token context is wonderful, holding information isn't the same as flawlessly reasoning over all of it, so keep checking the work on high-stakes output. Use Ultra where it's genuinely strong: big context, low cost, long agent loops.

The takeaway

Nemotron 3 Ultra isn't here to win the demo. It's here to do a lot of work, over a lot of context, for almost no money — and Elyra can drive it today, alongside a top-tier model, with a single /model switch. Open weights mean you can even keep the smaller siblings entirely in-house on NVIDIA hardware when privacy matters.

The flashiest model gets the applause. The cheap one that quietly handles the volume tends to get the job. It's nice when your agent can reach for both.

Happy building.