How to Run Your First AI Models on the NVIDIA DGX Spark

A practical walkthrough for getting a 120B-parameter model running on NVIDIA's $3,000 desktop AI box, plus the unified-memory gotcha most guides miss.

AnIntent Editorial

Your first working model on the NVIDIA DGX Spark can be live in under an hour, and it does not need a single line of CUDA code. The fastest path is a containerized inference server pointed at a Hugging Face checkpoint, running on the same Grace Blackwell silicon NVIDIA ships in its data-center DGX systems. This walkthrough covers what that path actually looks like, where the unified memory architecture trips people up, and which playbook to start with depending on the model size.

The device you are configuring was announced as Project DIGITS at CES 2025 and later rebranded. According to NVIDIA's Project DIGITS announcement, the GB10 Superchip pairs a Blackwell GPU with an NVIDIA Grace CPU via NVLink-C2C chip-to-chip interconnect, and the desktop unit ships with 128GB of unified system memory and DGX OS preinstalled. That memory pool is the entire reason this box exists. Treat it as the constraint that shapes every decision below.

What Your DGX Spark Can Actually Run on Day One

A single unit handles models up to 200 billion parameters at FP4 precision, and two linked units extend that ceiling further. NVIDIA's product page states that the Spark supports AI models with up to 200 billion parameters on a single unit, and that two DGX Spark systems can be linked via NVIDIA ConnectX networking to run models of up to 405 billion parameters. The compute target NVIDIA quotes is up to 1 petaFLOP at FP4 precision, delivered by the same Grace Blackwell architecture used in enterprise DGX systems.

For a first run, ignore the 200B headline. Start at 20B parameters and prove the stack works before you commit to a 65GB download.

A realistic shortlist for the first session:

  • gpt-oss-20b for fast iteration and snappy chat latency
  • Llama 3 8B for fine-tuning experiments via LoRA
  • DeepSeek-V2-Lite if you want to test SGLang's serving stack
  • gpt-oss-120b once you trust your storage and cooling, around 65GB on disk

The NVIDIA Spark playbook for OpenClaw notes that gpt-oss-120b is roughly 65GB and may take longer on slower connections, while smaller variants like gpt-oss-20b run with plenty of memory headroom. That 65GB figure matters because it sets your minimum viable internet plan as much as your storage plan.

Get the Box on the Network and Verify the GPU Stack

Plug the Spark into a standard outlet, attach a monitor over the included display output, and connect Ethernet. eWeek reports that the system draws power from a standard electrical outlet, with no special 240V wiring or data-center power infrastructure required, which is why this hardware fits on a normal desk rather than in a server closet.

First boot brings you into DGX OS, NVIDIA's Linux distribution tuned for the GB10 platform. Open a terminal and confirm the GPU is visible before you do anything else:

nvcc --version    # expect CUDA 13.0
nvidia-smi        # expect GB10 listed with 128GB unified memory

The vLLM playbook in NVIDIA's official repository lists CUDA 13.0 and Docker as the baseline prerequisites for any container-based serving on Spark, including the ARM64 Blackwell target. If nvidia-smi does not return a populated table, stop. Nothing downstream will work until that command shows a GPU and a driver version.
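
Since everything below runs inside Docker, it is worth one more check that containers can reach the GPU through the NVIDIA Container Toolkit. A minimal sanity check, with the CUDA image tag left as a placeholder to fill in from the playbook:

docker run --rm --gpus all nvcr.io/nvidia/cuda:<tag> nvidia-smi

If that prints the same GB10 table as the host command, containerized workloads will find the GPU.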

Clone the playbook repo while you are in the terminal. The official NVIDIA DGX Spark Playbooks repository is a curated collection of step-by-step recipes covering ComfyUI, llama.cpp, LM Studio, JAX, vLLM, SGLang, Isaac Sim, and multi-Spark networking. It is the single most useful thing on the device after the OS itself.

git clone https://github.com/NVIDIA/dgx-spark-playbooks

Pick Your Serving Engine Before You Pick Your Model

The choice between LM Studio, Ollama, vLLM, and SGLang determines almost every other decision you will make this week. Each has a different ceiling.

LM Studio is the gentlest entry point. The DGX Spark hub on build.nvidia.com describes deploying LM Studio to serve LLMs on a Spark device, with a companion LM Link feature for accessing those models from another machine. Pick it if you want a chat window in under ten minutes.

vLLM is the production answer. Its core advantage is PagedAttention, a memory-efficient attention algorithm that handles long sequences without running out of GPU memory, with continuous batching and an OpenAI-compatible API. If you plan to put a real application behind your model, start here.

SGLang is the throughput-first option. The NVIDIA SGLang playbook describes it as a fast serving framework for LLMs and vision-language models and ships an optimized NGC container preconfigured for the Blackwell target. Reach for this when you have moved past single-user chat and want to benchmark concurrency.

For a first model on the DGX Spark, vLLM hits the sweet spot. The container is prebuilt, the API matches anything you have already written for OpenAI, and the surface area for misconfiguration is small.

The Quickest Path to a Live Inference Endpoint

Following the vLLM playbook prerequisites, pull the NVIDIA-built container and launch it against a model. The exact image tag updates over time; always check the playbook before pasting commands.

export HF_TOKEN="<your_huggingface_token>"

docker run --rm -it --gpus all \
  -e HF_TOKEN="$HF_TOKEN" \
  -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:<latest> \
  --model openai/gpt-oss-20b

Mount the Hugging Face cache to a host volume on the first run. Re-pulling a 40GB checkpoint because you destroyed an ephemeral container is the most common time-waster on this hardware.
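
If you would rather separate the download from the first launch, the checkpoint can be pulled into that same cache ahead of time with the Hugging Face CLI. This is an optional convenience rather than a playbook step:

pip install -U "huggingface_hub[cli]"
huggingface-cli download openai/gpt-oss-20b

The container then finds the weights in the mounted cache and skips the download entirely.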

Once the server reports Application startup complete, hit it from another shell:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"Say hi."}]}'

That is your first model. The same endpoint works from any OpenAI client library by changing the base URL. This is also exactly how you run an LLM locally on the DGX Spark for application development: keep the OpenAI shape, swap the host.
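
As a concrete example, the standard openai Python client can be pointed at the local server by overriding the base URL; the key can be any placeholder string, since vLLM does not check it unless you launch the server with an API key:

from openai import OpenAI

# Point the stock OpenAI client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize what a DGX Spark is in one sentence."}],
)
print(response.choices[0].message.content)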

The Unified Memory Trap Almost Nobody Warns You About

DGX Spark does not behave like a discrete GPU rig, and the assumption that it does will burn an afternoon. The SGLang playbook says it directly: DGX Spark uses a Unified Memory Architecture that enables dynamic memory sharing between the GPU and CPU, and with many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark.

In plain terms: a framework written for a 24GB RTX card may treat the Spark's 128GB pool as if it were still partitioned into discrete VRAM and system RAM. Out-of-memory errors can appear at allocations that look small relative to the box's total capacity. The fix is rarely "buy more memory." It is almost always a smaller context window, a smaller model, or a build of the framework that understands UMA.

The OpenClaw playbook gives the practical recipe for OOM: try a smaller context such as 16384, switch to a smaller model like gpt-oss-20b, and monitor memory with nvidia-smi while the model is loaded. Run nvidia-smi -l 1 in a second terminal during your first heavy load. Watch the pool fill in real time. You will learn more about how UMA behaves on this hardware in five minutes than any spec sheet will tell you.
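
Applied to the vLLM launch above, that recipe amounts to capping the context length at startup. Here is a sketch of the same command with vLLM's --max-model-len flag added; confirm current flag names against the playbook's container documentation before relying on it:

docker run --rm -it --gpus all \
  -e HF_TOKEN="$HF_TOKEN" \
  -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:<latest> \
  --model openai/gpt-oss-20b \
  --max-model-len 16384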

This is the gap between Spark and a workstation with a single high-end consumer card. eWeek notes that a PC with an RTX 5090 can overlap on many local LLM tasks at a similar price point, which makes the value proposition strongest for users who need the full 128GB unified-memory pool rather than what fits within GPU VRAM limits. If your workload fits in 32GB of VRAM, the Spark is the wrong tool. If it does not, nothing else at this price runs it on a desk.

Move From Inference to Fine-Tuning Without Buying More Hardware

The setup path for this hardware, going back to its Project DIGITS days, used to stop at inference. It does not anymore. The same 128GB pool that loads a 120B model also fits a parameter-efficient fine-tuning run on smaller checkpoints.

The official benchmarking guide in the playbooks repo documents two reference workloads: LoRA fine-tuning for parameter-efficient adaptation of Llama 3 8B, and qLoRA fine-tuning for memory-efficient fine-tuning of Llama 3 70B. Eight billion parameters with full LoRA and seventy billion parameters with quantized LoRA on a single desktop unit is the kind of envelope that previously required a rented A100.

For faster iteration, the Unsloth playbook wires up an optimized fine-tuning stack. NVIDIA's documentation states that Unsloth on Spark targets up to 2x faster training speeds with reduced memory usage through parameter-efficient methods like LoRA and QLoRA. For a first fine-tune, this is the path. The boilerplate is short and the failure modes are well-documented.
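
As a rough sketch of what that first fine-tune looks like, here is a minimal LoRA run following Unsloth's documented pattern. The model ID, dataset path, and hyperparameters are placeholders to adapt from the playbook's own notebook, not settings validated on Spark, and exact SFTTrainer arguments shift between trl releases, so match the versions the playbook pins:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load Llama 3 8B in 4-bit through Unsloth's optimized loader.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Placeholder dataset: each JSONL record needs a "text" field holding one training example.
dataset = load_dataset("json", data_files="my_dataset.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()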

The broader software stack is the same one NVIDIA describes for the Spark: PyTorch, Python, Jupyter Notebooks, the NeMo framework for fine-tuning, and the RAPIDS libraries for data science. If you have used any of those on a cloud DGX, the imports do not change.
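
A two-line check from a Python or Jupyter session confirms that stack is wired to the GPU before you commit to a long training run:

import torch
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # should report the GB10 device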

A second unit doubles the parameter ceiling and the cost. eWeek reports that the $3,000 entry price for a single unit doubles to $6,000+ for the dual-unit 405B-parameter configuration. That is not a small commitment for a hobbyist, and it is the cost barrier most reviewers flagged at launch.

The wiring is straightforward but not trivial. The two-Spark playbook describes high-bandwidth, low-latency interconnects over QSFP ports, with bandwidth validated at both the NCCL collective layer and at the raw RDMA layer over RoCE on CX-7 NICs. Plan an afternoon for the network configuration and another for distributed inference debugging.

The honest test for whether you need two: open nvidia-smi on a single Spark while running your largest realistic workload at the longest realistic context. If memory pressure is comfortable, a second unit buys you nothing but parameter count. If you are pinning the pool, the second unit is the only path that does not involve cloud bills. For background on why local capacity is a moving target right now, our coverage of AI infrastructure tracks the trade-offs in detail.
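
A practical way to run that test is to watch the pool from both sides while the workload is live; on a unified-memory machine the GPU view and the system view describe the same 128GB:

nvidia-smi -l 1        # GPU-side view, refreshed every second
watch -n 1 free -h     # system-side view, in a second terminal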

What to Build First, and Where the Cloud Path Goes

The single highest-leverage first project is a private retrieval-augmented chat against your own documents using a 20B model and vLLM. It exercises the full stack, fits in memory with comfortable headroom, and produces something you will actually use. From there, swap in a 120B model only after the 20B version is stable.
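
A minimal sketch of that project, using sentence-transformers for retrieval and the local vLLM endpoint for generation; the embedding model, document chunks, and question are stand-ins for whatever you actually index:

import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Local vLLM endpoint from the earlier launch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Replace with chunks of your own documents.
docs = [
    "The DGX Spark ships with 128GB of unified memory.",
    "vLLM exposes an OpenAI-compatible API on port 8000.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def answer(question: str, k: int = 2) -> str:
    # Retrieve the k most similar chunks (dot product equals cosine on normalized vectors).
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q_vec)[::-1][:k]
    context = "\n".join(docs[i] for i in top)
    response = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("How much memory does the Spark have?"))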

The deployment story is what makes this device different from any other desktop. NVIDIA states that developers can prototype on DGX Spark running DGX OS, then deploy to NVIDIA DGX Cloud or other accelerated cloud infrastructure using the same architecture. A model that runs on your desk runs unmodified on a rented Grace Blackwell node. That continuity is the real product. For more on how local hardware is reshaping AI development workflows, see our Personal AI Hardware and Developer Tools coverage.

Next step: open the playbooks repo, pick the framework that matches your workload, and run the example end-to-end before you customize anything.
