<p>Every so often a model shows up that's less about topping a leaderboard and more about quietly changing the math of what's affordable. NVIDIA's <strong>Nemotron 3 Ultra</strong> is one of those. It's open, it's strikingly cheap, and it carries a <strong>one-million-token</strong> context window. Elyra already speaks to it. So let's talk about what it actually is, where it shines, where it doesn't, and the slightly surprising hardware story sitting behind it.</p><h2>What Nemotron 3 Ultra actually is</h2><p>The full name tells you a lot: <code>nvidia/nemotron-3-ultra-550b-a55b</code>.</p><ul><li><p><strong>550B total parameters, 55B active.</strong> It's a Mixture-of-Experts model. Think of a hospital with hundreds of specialists where, for any given patient, only the handful you actually need walk into the room. You get the breadth of a 550B model at roughly the per-token compute cost of a 55B one.</p></li><li><p><strong>A reasoning model.</strong> It's built for the long loop (plan, call a tool, check itself, keep going), which is exactly the rhythm an agent works in.</p></li><li><p><strong>One million tokens of context.</strong> You can hand it an enormous amount of material in a single pass.</p></li><li><p><strong>Text only.</strong> No vision here. If you need images, reach for MiniMax M3 or a vision-capable Nemotron Nano instead.</p></li><li><p><strong>Genuinely cheap.</strong> Around <strong>$0.50 per million input tokens</strong> and <strong>$2.50 per million output tokens</strong>, with free, rate-limited tiers available on a couple of providers.</p></li></ul><p>It's part of a tiered family of Nano, Super, and Ultra, with Ultra as the flagship.</p><h2>Using it in Elyra</h2><p>Because the providers are already in Elyra's model registry, there's nothing to install. Set a key and pick the model.</p><p>Through OpenRouter:</p><pre><code class="language-bash">export OPENROUTER_API_KEY=sk-or-...
elyra
</code></pre><p>Then, inside the session:</p><pre><code>/model
</code></pre><p>Search for <code>Nemotron 3 Ultra</code>, hit enter, and you're on it. Prefer Together AI? Use <code>TOGETHER_API_KEY</code> and the same model name. Just want to kick the tires for free? There's a <code>:free</code> route on OpenRouter and a free tier via OpenCode — lower limits, zero cost.</p><p>The part that makes this feel good in practice is that you don't have to choose a single model for the whole session. Plan with one, grind with another:</p><pre><code>&gt; Map how authentication flows through this service. Read everything under src/auth.
  ... (Elyra reads, traces, explains)

/model            # switch to Nemotron 3 Ultra

&gt; Now read the entire src/ tree and list every place that touches the session token,
  then propose a single consolidated helper.
</code></pre><p>That second step is where the million-token context earns its keep. "Read the entire tree" isn't a parlor trick — it's an agent action, and Ultra can hold the whole thing in its head while it reasons, instead of forgetting the first file by the time it reaches the last.</p><p>You can also run it headless for a one-shot over a big pile:</p><pre><code class="language-bash">elyra -p "Summarize the architecture and flag risky patterns" \
  --provider openrouter \
  --model nvidia/nemotron-3-ultra-550b-a55b \
  --api-key "$OPENROUTER_API_KEY"
</code></pre><p>And since Elyra already has tiered routing built in, Ultra makes a natural "cheap, big-context workhorse" tier: let the expensive model architect, and let Ultra chew through the volume.</p><h2>The hardware story (and an honest correction)</h2><p>Here's the strategically interesting bit: <strong>NVIDIA gives away the models to sell the silicon.</strong> The Nemotron family is open, downloadable, and tuned to run beautifully on NVIDIA's own stack. The models are the hook; the hardware is the business.</p><p>That hardware is real and increasingly desk-sized. NVIDIA's <strong>DGX Spark</strong> is a genuine desktop AI machine: a GB10 Grace Blackwell superchip, <strong>128 GB</strong> of unified memory, and up to a petaFLOP of FP4 compute, with the AI software stack preinstalled and sold through partners like Dell, Lenovo, HP, Asus, and Acer. It's a little box that runs serious models on your desk, with your data staying on your desk.</p><p>But here's the honest part, because we'd rather be useful than breathless: <strong>Nemotron 3 Ultra does not run on a DGX Spark.</strong> The numbers are public. A single Spark handles inference for models up to ~200B parameters; two of them linked together reach ~405B. Ultra is 550B. The 55B-active figure doesn't rescue it, either: Mixture-of-Experts cuts the compute per token, but every expert still has to sit in memory, and even at 4-bit precision 550B parameters come to roughly 275 GB of weights against a Spark's 128 GB. It's simply too big for the desktop, so Ultra lives in the cloud or on data-center GPUs.</p><p>What <em>does</em> run locally is the rest of the family. The smaller <strong>Nemotron Nano and Super</strong> tiers fit comfortably on a DGX Spark or an RTX machine. So the open-weights, runs-on-your-hardware promise is true, just not for the 550B flagship.</p><h2>A clean way to think about it</h2><p>That gives you a tidy two-lane setup, both reachable from the same Elyra session:</p><ul><li><p><strong>Cloud lane: Nemotron 3 Ultra.</strong> A full million-token read costs around $0.50 in input tokens, which makes it a good fit for large-scale reading, analysis, and reasoning over big inputs. Switch to it with <code>/model</code> whenever the job gets heavy.</p></li><li><p><strong>Local lane: Nemotron Nano/Super.</strong> Stand it up on a DGX Spark or RTX box, expose an OpenAI-compatible endpoint, and point Elyra at it as a custom provider. Nothing leaves your network, which is the right answer for regulated work in law, healthcare, or finance.</p></li></ul><p>Same agent, same commands, two very different cost-and-privacy profiles.</p><h2>Where it falls short</h2><p>No rose-tinted glasses. On the hardest single-shot reasoning, the top closed models, our own preferred Claude included, are still ahead. Ultra is text-only, so it's not your pick for anything visual. And while a million-token context is wonderful, holding information isn't the same as flawlessly reasoning over all of it, so keep checking the work on high-stakes output. Use Ultra where it's genuinely strong: big context, low cost, long agent loops.</p><h2>The takeaway</h2><p>Nemotron 3 Ultra isn't here to win the demo. It's here to do a lot of work, over a lot of context, for almost no money, and Elyra can drive it today, alongside a top-tier model, with a single <code>/model</code> switch. Open weights mean you can even keep the smaller siblings entirely in-house on NVIDIA hardware when privacy matters.</p><p>The flashiest model gets the applause. The cheap one that quietly handles the volume tends to get the job. It's nice when your agent can reach for both.</p><p>Happy building.</p>
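<p>P.S. If you want to put a number on "almost no money", the approximate prices quoted above convert into a per-job estimate with a few lines of shell. Treat this as a back-of-the-envelope sketch: the helper name <code>estimate_cost</code> and the token counts are invented for illustration, and the rates are the rough figures from this post, not a provider's official price list.</p><pre><code class="language-bash"># Rough cost estimate using the approximate rates quoted in this post.
# estimate_cost is a hypothetical helper; actual provider pricing may differ.
estimate_cost() {
  # $1 = input tokens, $2 = output tokens
  awk -v in_tok="$1" -v out_tok="$2" 'BEGIN {
    in_rate = 0.50    # USD per million input tokens (approximate)
    out_rate = 2.50   # USD per million output tokens (approximate)
    printf "%.2f\n", in_tok / 1000000 * in_rate + out_tok / 1000000 * out_rate
  }'
}

# Example: an 800k-token read of a codebase with a 20k-token answer
estimate_cost 800000 20000
</code></pre><p>That example prints <code>0.45</code>, so about 45 cents for an 800k-token read with a 20k-token answer. Output tokens cost five times as much as input tokens at these rates, so long generated answers move the total faster than long inputs do.</p>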