## Conversation
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Docs preview: https://06744256.dd-docs-preview.pages.dev
**danecor** left a comment:
Looks good! Some possible issues / suggestions attached.
```python
from google.colab import userdata
```

```python
def create_colab_setup_cells(additional_dependencies: str) -> list[NotebookNode]:
    try:
```
This shows up in the notebook as a completely independent block from COLAB_API_KEY_CELL, with duplicated imports. Can we just append the try-except to that block conditionally?
```python
import getpass
import os
from google.colab import userdata

try:
    os.environ["NVIDIA_API_KEY"] = userdata.get("NVIDIA_API_KEY")
except userdata.SecretNotFoundError:
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter your NVIDIA API key: ")

...

import getpass
import os
from google.colab import userdata

try:
    os.environ["NGC_API_KEY"] = userdata.get("NGC_API_KEY")
except userdata.SecretNotFoundError:
    os.environ["NGC_API_KEY"] = getpass.getpass("Enter your NGC API key: ")
```
| print(f" - {p.name}") | ||
| else: | ||
| print(f"Nemotron-Personas-{personas_locale} not found. Downloading via the Data Designer CLI...") | ||
| subprocess.run( |
This fails for me even with a populated NGC_API_KEY:
```
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In[5], line 12
      8 for p in existing:
      9     print(f" - {p.name}")
     10 else:
     11     print(f"Nemotron-Personas-{personas_locale} not found. Downloading via the Data Designer CLI...")
---> 12 subprocess.run(
     13     shlex.split(f"data-designer download personas --locale {personas_locale}"),
     14     check=True,
     15 )

File ~/.local/share/uv/python/cpython-3.11.9-macos-aarch64-none/lib/python3.11/subprocess.py:571, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    569     retcode = process.poll()
    570     if check and retcode:
--> 571         raise CalledProcessError(retcode, process.args,
    572                                  output=stdout, stderr=stderr)
    573     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['data-designer', 'download', 'personas', '--locale', 'en_US']' returned non-zero exit status 1.
```
A similar issue was raised by codex in review:

> Avoid interactive NGC downloads in executed notebooks — The notebook cell depends on `data-designer` being runnable from the notebook kernel, `ngc` being on that kernel's PATH, `NGC_API_KEY` being in that kernel's environment, and the interactive "Proceed with download?" prompt being answerable.

My problem was that I didn't have `ngc` installed locally yet. This might all be fine in Colab, if that's the expected or enforced route, but it's a bit confusing locally given that the error message is uninformative.
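One way to make the local failure mode clearer (a sketch using only the stdlib; the error messages are mine, not the CLI's):

```python
import os
import shlex
import shutil
import subprocess

personas_locale = "en_US"  # as in the notebook cell above

# Fail fast with actionable messages instead of a bare CalledProcessError.
if shutil.which("ngc") is None:
    raise RuntimeError(
        "The `ngc` CLI is not on this kernel's PATH; install it before downloading personas."
    )
if not os.environ.get("NGC_API_KEY"):
    raise RuntimeError("NGC_API_KEY is not set in this kernel's environment.")

subprocess.run(
    shlex.split(f"data-designer download personas --locale {personas_locale}"),
    check=True,
)
```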
```python
# Add specific personal detail columns -- included in the public release
config_builder.add_column(dd.ExpressionColumnConfig(name="sex", expr="{{ person.sex }}"))
config_builder.add_column(dd.ExpressionColumnConfig(name="age", expr="{{ person.age }}"))
```
So Colab suggests to me that we need to cast this to an int in order to avoid breaking the `age > 6` limit checks below, i.e.

```python
config_builder.add_column(dd.ExpressionColumnConfig(name="age", expr="{{ person.age }}", dtype="int"))
```

But the checks work for me, even though the underlying value appears to be a string. Not sure why; something in Jinja, maybe? So I guess this cast is optional.
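If it helps narrow that down, here's a quick sanity check on the generated column (a hedged sketch; it assumes the generated records end up in a pandas DataFrame, which is an assumption about the tutorial's flow):

```python
import pandas as pd

def check_age_column(df: pd.DataFrame) -> None:
    # object dtype usually means the values are Python strings, not ints
    print(df["age"].dtype)
    # show exactly which Python types appear in the column
    print(df["age"].map(type).value_counts())
```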
```python
# > ⚠️ **Note**: To run this notebook, follow the setup instructions in the [Quick Start](https://nvidia-nemo.github.io/DataDesigner/quick-start/), make sure you have generated an API key for accessing models on [build.nvidia.com](https://build.nvidia.com), and that you've set the `NVIDIA_API_KEY` environment variable. The next section also walks through downloading the NGC-hosted Nemotron-Personas dataset.
#
# <div align="center">
#   <img src="https://raw.githubusercontent.com/NVIDIA-NeMo/DataDesigner/yev/nemotron_personas_dev_note/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd.png" alt="Nemotron Personas pipeline overview" width="600" />
```
This image link points to the branch - will it update automatically to main when the branch is merged and deleted?
> It's easy to underestimate what a really good persona seed buys you. Three angles worth keeping in mind:
> 1. **Distributional faithfulness for sovereign AI.** Models trained on synthetic data that doesn't reflect the actual demographics of a region inherit subtle biases — over-representing some groups, under-representing others, getting cultural context wrong. For sovereign-AI work, that's not a rounding error; it's the whole problem. Grounding personas in census + administrative data closes that gap before the LLM ever sees the data.
This is fine, but does read a bit LLM-y (em-dashes, "that's not an X; it's a Y").
> 3. **Reusable seed material.** Once a persona has a name, a demographic profile, an OCEAN vector, and a coherent backstory, *any* downstream pipeline can attach to it: a tool-use environment, a long-context construction, a safety-refusal template, a roleplay scenario. The collection acts as a library — generate the personas once, reuse them across training stages.
> That last point is the bridge to the rest of this post.
Seems like an unnecessary sentence.
> ## **Nemotron-Personas inside Nemotron training**
> The [Nemotron 3 Super Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf) shows just how foundational these personas have become. They're not a side-quest dataset; they're a *seeding primitive* used across many post-training stages.
I would just say "They're a seeding primitive used across many post-training stages." without the side-quest bit.
> A closely related approach was used to build **Nemotron-Nano-9B-v2-Japanese**, NVIDIA's Japanese small language model that ranks **#1 on the Nejumi LLM Leaderboard**. The Japanese instruction-following + general-chat data was seeded by [Nemotron-Personas-Japan](https://huggingface.co/datasets/nvidia/Nemotron-Personas-Japan), with prompts and assistant responses anchored to Japanese-grounded personas. That's the multi-locale story turning into a multi-locale model story: a Japanese persona collection, generated by a localized DD pipeline, becomes the seeding layer for a Japanese model that beats the leaderboard.
> The same template is being used across the family — instruction-following and general-chat data going into Nemotron Nano v3 (and from there into Super v3) follows the same persona-seeded recipe.
Nitpick - each place we say "and from there into Super v3" (here and once below) I would just say "and Super v3". Current phrasing sounds a bit like Super v3 inherited from the Nano v3 model, rather than just using the same training dataset.
> ### Stage 1 — OCEAN Big-Five sampling
> OCEAN ([Big Five personality traits](https://en.wikipedia.org/wiki/Big_Five_personality_traits)) is the most empirically grounded model of human personality. For each persona we sample five trait T-scores (\(\mu = 50\), \(\sigma = 10\), clipped to \([20, 80]\)), bucket each into a coarse label, and attach a prose description grounded in the personality literature. Working at the description level (rather than raw scores) is what makes the downstream LLM stages produce nuanced, internally-consistent narratives — "highly conscientious" vs "highly extraverted" reads very differently to an LLM than `t_score=72`.
\mu and \sigma don't render as LaTeX for me; might want to insert the actual Unicode characters (μ and σ).
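For what it's worth, the sampling described in that paragraph is roughly the following (a minimal numpy sketch of the stated μ=50, σ=10, clip-to-[20, 80] scheme; the bucket boundaries are illustrative assumptions, not the pipeline's actual values):

```python
# Sketch of Stage 1 as described: sample five OCEAN T-scores
# (mean 50, SD 10, clipped to [20, 80]) and bucket them into coarse labels.
import numpy as np

TRAITS = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]

def bucket(t: float) -> str:
    # T-scores: ~50 is average; roughly one SD out reads as "high"/"low".
    # These edges are illustrative only.
    if t >= 60:
        return "high"
    if t <= 40:
        return "low"
    return "average"

def sample_ocean(rng: np.random.Generator) -> dict[str, tuple[float, str]]:
    scores = np.clip(rng.normal(loc=50, scale=10, size=len(TRAITS)), 20, 80)
    return {trait: (round(float(t), 1), bucket(t)) for trait, t in zip(TRAITS, scores)}

rng = np.random.default_rng(seed=0)
print(sample_ocean(rng))  # five (T-score, label) pairs
```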
## 📋 Summary

Adds the Inside Nemotron-Personas dev note covering how the multi-locale Nemotron-Personas HF collection is built (4-stage compound-AI pipeline) and how it's used as a seeding primitive across Nemotron training (long-context, tool-use, formal logic, safety refusals, instruction-following). Ships alongside a runnable Tutorial 7 demonstrating reproduction + customization, plus a Colab variant.

## 🔗 Related Issue

N/A
## 🔄 Changes

### ✨ Added

- `docs/devnotes/posts/nemotron-personas.md` — new dev note
- `docs/devnotes/posts/assets/nemotron-personas/` — four images: three pipeline-stage diagrams from the partner repo plus a black-background Nemotron-Personas world-map hero
- `docs/notebook_source/7-nemotron-personas.py` — jupytext source for the Reproducing & Customizing Nemotron-Personas tutorial
- `docs/colab_notebooks/7-nemotron-personas.ipynb` — committed Colab variant

### 🔧 Changed

- `docs/scripts/generate_colab_notebooks.py` — adds an `ADDITIONAL_SETUP_CELLS` map paralleling `ADDITIONAL_DEPENDENCIES`; injects NGC CLI install + `NGC_API_KEY` cells. Future devnote-paired tutorials needing extra Colab bootstrap can register one-line entries in the same map.
- `mkdocs.yml` — adds Reproducing & Customizing Nemotron-Personas under the Tutorials nav
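For future tutorials, that registration might look something like this (hypothetical sketch; `ADDITIONAL_SETUP_CELLS` is named in this PR, but the map shape and cell contents here are assumptions, not the script's actual internals):

```python
# Hypothetical sketch of the one-line registration pattern described above.
NGC_SETUP_CELL_SOURCE = """\
import getpass
import os

if "NGC_API_KEY" not in os.environ:
    os.environ["NGC_API_KEY"] = getpass.getpass("Enter your NGC API key: ")
"""

# Keyed by notebook name, paralleling ADDITIONAL_DEPENDENCIES.
ADDITIONAL_SETUP_CELLS: dict[str, list[str]] = {
    "7-nemotron-personas": [NGC_SETUP_CELL_SOURCE],
}
```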
## 🧪 Testing

- `make test` passes
- `jupytext --to ipynb --execute`
- `make generate-colab-notebooks` regenerates the Colab `.ipynb` cleanly with the NGC setup cells in the expected position
- `make convert-execute-notebooks`, gated on `NVIDIA_API_KEY` + on-disk NGC dataset (matching how Tutorials 5/6 are gated on `OPENROUTER_API_KEY`)

## ✅ Checklist