
docs(devnotes): add Nemotron-Personas dev note #611

Draft
3mei wants to merge 2 commits into main from yev/nemotron_personas_dev_note

Conversation


@3mei 3mei commented May 7, 2026

📋 Summary

Adds the Inside Nemotron-Personas dev note covering how the multi-locale Nemotron-Personas HF collection is built (a 4-stage compound-AI pipeline) and how it's used as a seeding primitive across Nemotron training (long-context, tool-use, formal logic, safety refusals, instruction-following). It ships alongside a runnable Tutorial 7 demonstrating reproduction and customization, plus a Colab variant.

🔗 Related Issue

N/A

🔄 Changes

✨ Added

  • docs/devnotes/posts/nemotron-personas.md — new dev note
  • docs/devnotes/posts/assets/nemotron-personas/ — four images: three pipeline-stage diagrams from the partner repo plus a black-background Nemotron-Personas world-map hero
  • docs/notebook_source/7-nemotron-personas.py — jupytext source for the Reproducing & Customizing Nemotron-Personas tutorial
  • docs/colab_notebooks/7-nemotron-personas.ipynb — committed Colab variant

🔧 Changed

  • docs/scripts/generate_colab_notebooks.py — adds an ADDITIONAL_SETUP_CELLS map paralleling ADDITIONAL_DEPENDENCIES; injects NGC CLI install + NGC_API_KEY cells. Future devnote-paired tutorials needing extra Colab bootstrap can register one-line entries in the same map.
  • mkdocs.yml — adds Reproducing & Customizing Nemotron-Personas under the Tutorials nav
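The registration pattern described for generate_colab_notebooks.py can be sketched roughly like this (the map name follows the PR description, but the cell contents and the lookup helper are placeholders, not the actual script code):

```python
# Sketch of the registration pattern described above. ADDITIONAL_SETUP_CELLS
# follows the PR description; the cell contents and setup_cells_for() are
# placeholders, not the real generate_colab_notebooks.py code.
ADDITIONAL_SETUP_CELLS: dict[str, list[str]] = {
    # tutorial stem -> extra Colab bootstrap cells (cell source strings)
    "7-nemotron-personas": [
        "# cell 1: install the NGC CLI",
        "# cell 2: read NGC_API_KEY from Colab secrets or prompt for it",
    ],
}

def setup_cells_for(notebook_stem: str) -> list[str]:
    """Return any extra Colab bootstrap cells registered for a tutorial."""
    return ADDITIONAL_SETUP_CELLS.get(notebook_stem, [])
```

A future devnote-paired tutorial would register its bootstrap cells as one more entry in the map, mirroring how ADDITIONAL_DEPENDENCIES is consumed.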

🧪 Testing

  • make test passes
  • Notebook runs end-to-end via jupytext --to ipynb --execute
  • make generate-colab-notebooks regenerates the Colab .ipynb cleanly with the NGC setup cells in the expected position
  • Unit tests added/updated (N/A — this PR is docs + tutorial assets; no engine code changed)
  • E2E tests added/updated (N/A — Tutorial 7 is opt-in via make convert-execute-notebooks and gated on NVIDIA_API_KEY + on-disk NGC dataset, matching how Tutorials 5/6 are gated on OPENROUTER_API_KEY)

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (N/A — no architectural changes)

3mei added 2 commits May 7, 2026 02:04
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>

github-actions Bot commented May 7, 2026

Docs preview: https://06744256.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@3mei 3mei changed the title from "Nemotron-Personas Dev Note" to "docs(devnotes): add Nemotron-Personas dev note" on May 7, 2026
@danecor danecor left a comment


Looks good! Some possible issues / suggestions attached.

from google.colab import userdata

def create_colab_setup_cells(additional_dependencies: str) -> list[NotebookNode]:
    try:
This shows up in the notebook as a completely independent block from COLAB_API_KEY_CELL, with duplicated imports. Can we just append the try-except to that block conditionally?

import getpass
import os

from google.colab import userdata

try:
    os.environ["NVIDIA_API_KEY"] = userdata.get("NVIDIA_API_KEY")
except userdata.SecretNotFoundError:
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter your NVIDIA API key: ")

...

import getpass
import os

from google.colab import userdata

try:
    os.environ["NGC_API_KEY"] = userdata.get("NGC_API_KEY")
except userdata.SecretNotFoundError:
    os.environ["NGC_API_KEY"] = getpass.getpass("Enter your NGC API key: ")

        print(f"  - {p.name}")
else:
    print(f"Nemotron-Personas-{personas_locale} not found. Downloading via the Data Designer CLI...")
    subprocess.run(

This fails for me even with a populated NGC_API_KEY:

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In[5], line 12
      8     for p in existing:
      9         print(f"  - {p.name}")
     10 else:
     11     print(f"Nemotron-Personas-{personas_locale} not found. Downloading via the Data Designer CLI...")
---> 12     subprocess.run(
     13         shlex.split(f"data-designer download personas --locale {personas_locale}"),
     14         check=True,
     15     )

File ~/.local/share/uv/python/cpython-3.11.9-macos-aarch64-none/lib/python3.11/subprocess.py:571, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    569     retcode = process.poll()
    570     if check and retcode:
--> 571         raise CalledProcessError(retcode, process.args,
    572                                  output=stdout, stderr=stderr)
    573 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['data-designer', 'download', 'personas', '--locale', 'en_US']' returned non-zero exit status 1.

A similar issue was raised by codex in review:

Avoid interactive NGC downloads in executed notebooks — The notebook cell depends on data-designer being runnable from the notebook kernel, ngc being on that kernel’s PATH, NGC_API_KEY being in that kernel’s environment, and the interactive “Proceed with download?” prompt being answerable.

My problem was that I didn't have ngc installed locally yet. This might all be fine in Colab, if that's the expected or enforced route, but it's confusing locally given that the error message is uninformative.
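One way to make the local failure mode clearer would be a fail-fast preflight check before invoking the download (a sketch only; `ensure_ngc_ready` is a hypothetical helper, not part of the tutorial):

```python
import os
import shutil

def ensure_ngc_ready() -> None:
    """Hypothetical preflight check: fail with an informative error before
    attempting the NGC download, instead of surfacing a bare
    CalledProcessError from subprocess.run."""
    if shutil.which("ngc") is None:
        raise RuntimeError(
            "The `ngc` CLI is not on this kernel's PATH. Install it and "
            "restart the kernel before running the download cell."
        )
    if not os.environ.get("NGC_API_KEY"):
        raise RuntimeError(
            "NGC_API_KEY is not set in this kernel's environment. Export it "
            "before running the download cell."
        )
```

Calling this at the top of the download cell would turn the non-zero exit status into a message that names the missing prerequisite.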


# Add specific personal detail columns -- included in the public release
config_builder.add_column(dd.ExpressionColumnConfig(name="sex", expr="{{ person.sex }}"))
config_builder.add_column(dd.ExpressionColumnConfig(name="age", expr="{{ person.age }}"))

So colab suggests to me that we need to cast this to age in order to avoid breaking the age > 6 limit checks below, i.e.

config_builder.add_column(dd.ExpressionColumnConfig(name="age", expr="{{ person.age }}", dtype="int"))

But the checks work for me even though the underlying value appears to be a string. Not sure why; something in Jinja? So I guess this cast is optional.
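For what it's worth, one plausible explanation is Jinja's native-types mode: a default `Environment` always renders to a string, while a `NativeEnvironment` preserves the underlying Python type. Whether Data Designer actually uses `NativeEnvironment` is an assumption here, but it would explain the checks passing without the cast:

```python
from jinja2 import Environment
from jinja2.nativetypes import NativeEnvironment

person = {"age": 42}

# A default Jinja Environment always renders the template to a string...
assert Environment().from_string("{{ person.age }}").render(person=person) == "42"

# ...while NativeEnvironment evaluates to the native Python type, so a
# downstream `age > 6` check would see an int rather than a string.
assert NativeEnvironment().from_string("{{ person.age }}").render(person=person) == 42
```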

# > ⚠️ **Note**: To run this notebook, follow the setup instructions in the [Quick Start](https://nvidia-nemo.github.io/DataDesigner/quick-start/), make sure you have generated an API key for accessing models on [build.nvidia.com](https://build.nvidia.com), and that you've set the `NVIDIA_API_KEY` environment variable. The next section also walks through downloading the NGC-hosted Nemotron-Personas dataset.
#
# <div align="center">
# <img src="https://raw.githubusercontent.com/NVIDIA-NeMo/DataDesigner/yev/nemotron_personas_dev_note/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd.png" alt="Nemotron Personas pipeline overview" width="600" />

This image link points to the branch - will it update automatically to main when the branch is merged and deleted?


It's easy to underestimate what a really good persona seed buys you. Three angles worth keeping in mind:

1. **Distributional faithfulness for sovereign AI.** Models trained on synthetic data that doesn't reflect the actual demographics of a region inherit subtle biases — over-representing some groups, under-representing others, getting cultural context wrong. For sovereign-AI work, that's not a rounding error; it's the whole problem. Grounding personas in census + administrative data closes that gap before the LLM ever sees the data.

This is fine, but does read a bit LLM-y (em-dashes, "that's not a x; it's a why").


3. **Reusable seed material.** Once a persona has a name, a demographic profile, an OCEAN vector, and a coherent backstory, *any* downstream pipeline can attach to it: a tool-use environment, a long-context construction, a safety-refusal template, a roleplay scenario. The collection acts as a library — generate the personas once, reuse them across training stages.

That last point is the bridge to the rest of this post.

Seems like an unnecessary sentence.


## **Nemotron-Personas inside Nemotron training**

The [Nemotron 3 Super Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf) shows just how foundational these personas have become. They're not a side-quest dataset; they're a *seeding primitive* used across many post-training stages.

I would just say "They're a seeding primitive used across many post-training stages." without the side-quest bit.


A closely related approach was used to build **Nemotron-Nano-9B-v2-Japanese**, NVIDIA's Japanese small language model that ranks **#1 on the Nejumi LLM Leaderboard**. The Japanese instruction-following + general-chat data was seeded by [Nemotron-Personas-Japan](https://huggingface.co/datasets/nvidia/Nemotron-Personas-Japan), with prompts and assistant responses anchored to Japanese-grounded personas. That's the multi-locale story turning into a multi-locale model story: a Japanese persona collection, generated by a localized DD pipeline, becomes the seeding layer for a Japanese model that beats the leaderboard.

The same template is being used across the family — instruction-following and general-chat data going into Nemotron Nano v3 (and from there into Super v3) follows the same persona-seeded recipe.

Nitpick - each place we say "and from there into Super v3" (here and once below) I would just say "and Super v3". Current phrasing sounds a bit like Super v3 inherited from the Nano v3 model, rather than just using the same training dataset.


### Stage 1 — OCEAN Big-Five sampling

OCEAN ([Big Five personality traits](https://en.wikipedia.org/wiki/Big_Five_personality_traits)) is the most empirically grounded model of human personality. For each persona we sample five trait T-scores (\(\mu = 50\), \(\sigma = 10\), clipped to \([20, 80]\)), bucket each into a coarse label, and attach a prose description grounded in the personality literature. Working at the description level (rather than raw scores) is what makes the downstream LLM stages produce nuanced, internally-consistent narratives — "highly conscientious" vs "highly extraverted" reads very differently to an LLM than `t_score=72`.

\mu and \sigma don't render as latex for me, might want to insert the actual unicode characters.
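The Stage 1 sampling that paragraph describes can be sketched as follows (only the μ = 50, σ = 10 parameters and the [20, 80] clipping come from the text; the bucket thresholds and label names are illustrative assumptions):

```python
import random

TRAITS = ("openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism")

def sample_ocean(rng: random.Random) -> dict[str, dict]:
    """Sample a T-score per Big Five trait (mu=50, sigma=10, clipped to
    [20, 80]) and bucket it into a coarse label, per the stage description.
    The 40/60 bucket boundaries here are assumptions, not the pipeline's."""
    persona = {}
    for trait in TRAITS:
        t = min(max(rng.gauss(50, 10), 20.0), 80.0)
        label = "low" if t < 40 else "high" if t > 60 else "average"
        persona[trait] = {"t_score": round(t, 1), "label": label}
    return persona
```

The coarse label (plus a prose description keyed to it) is what the downstream LLM stages consume, rather than the raw T-score.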
