## Conversation
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Docs preview: https://06744256.dd-docs-preview.pages.dev
**danecor** left a comment:
Looks good! Some possible issues / suggestions attached.
```python
from google.colab import userdata
```

```python
def create_colab_setup_cells(additional_dependencies: str) -> list[NotebookNode]:
    try:
```
This shows up in the notebook as a completely independent block from COLAB_API_KEY_CELL, with duplicated imports. Can we just append the try-except to that block conditionally?
```python
import getpass
import os
from google.colab import userdata

try:
    os.environ["NVIDIA_API_KEY"] = userdata.get("NVIDIA_API_KEY")
except userdata.SecretNotFoundError:
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter your NVIDIA API key: ")

...

import getpass
import os
from google.colab import userdata

try:
    os.environ["NGC_API_KEY"] = userdata.get("NGC_API_KEY")
except userdata.SecretNotFoundError:
    os.environ["NGC_API_KEY"] = getpass.getpass("Enter your NGC API key: ")
```
| print(f" - {p.name}") | ||
| else: | ||
| print(f"Nemotron-Personas-{personas_locale} not found. Downloading via the Data Designer CLI...") | ||
| subprocess.run( |
This fails for me even with a populated NGC_API_KEY:
```
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In[5], line 12
      8 for p in existing:
      9     print(f" - {p.name}")
     10 else:
     11     print(f"Nemotron-Personas-{personas_locale} not found. Downloading via the Data Designer CLI...")
---> 12 subprocess.run(
     13     shlex.split(f"data-designer download personas --locale {personas_locale}"),
     14     check=True,
     15 )

File ~/.local/share/uv/python/cpython-3.11.9-macos-aarch64-none/lib/python3.11/subprocess.py:571, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    569     retcode = process.poll()
    570     if check and retcode:
--> 571         raise CalledProcessError(retcode, process.args,
    572                                  output=stdout, stderr=stderr)
    573     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['data-designer', 'download', 'personas', '--locale', 'en_US']' returned non-zero exit status 1.
```
A similar issue was raised by codex in review:

> Avoid interactive NGC downloads in executed notebooks — The notebook cell depends on `data-designer` being runnable from the notebook kernel, `ngc` being on that kernel's PATH, `NGC_API_KEY` being in that kernel's environment, and the interactive "Proceed with download?" prompt being answerable.

My problem was that I didn't have `ngc` installed locally yet. This might all be fine in Colab, if that's the expected or enforced route, but it's a bit confusing locally given that the error message is uninformative.
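One way to make the local failure mode clearer (a sketch using only the stdlib; the error messages are mine, not the CLI's):

```python
import os
import shlex
import shutil
import subprocess

personas_locale = "en_US"  # as in the notebook cell above

# Fail fast with actionable messages instead of a bare CalledProcessError.
if shutil.which("ngc") is None:
    raise RuntimeError(
        "The `ngc` CLI is not on this kernel's PATH; install it before downloading personas."
    )
if not os.environ.get("NGC_API_KEY"):
    raise RuntimeError("NGC_API_KEY is not set in this kernel's environment.")

subprocess.run(
    shlex.split(f"data-designer download personas --locale {personas_locale}"),
    check=True,
)
```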
```python
# Add specific personal detail columns -- included in the public release
config_builder.add_column(dd.ExpressionColumnConfig(name="sex", expr="{{ person.sex }}"))
config_builder.add_column(dd.ExpressionColumnConfig(name="age", expr="{{ person.age }}"))
```
So Colab suggests to me that we need to cast this to an int in order to avoid breaking the `age > 6` limit checks below, i.e.

```python
config_builder.add_column(dd.ExpressionColumnConfig(name="age", expr="{{ person.age }}", dtype="int"))
```

But the checks work for me, even though the underlying value appears to be a string. Not sure why; something in Jinja, maybe? So I guess this cast is optional.
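If it helps narrow that down, here's a quick sanity check on the generated column (a hedged sketch; it assumes the generated records end up in a pandas DataFrame, which is an assumption about the tutorial's flow):

```python
import pandas as pd

def check_age_column(df: pd.DataFrame) -> None:
    # object dtype usually means the values are Python strings, not ints
    print(df["age"].dtype)
    # show exactly which Python types appear in the column
    print(df["age"].map(type).value_counts())
```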
```python
# > ⚠️ **Note**: To run this notebook, follow the setup instructions in the [Quick Start](https://nvidia-nemo.github.io/DataDesigner/quick-start/), make sure you have generated an API key for accessing models on [build.nvidia.com](https://build.nvidia.com), and that you've set the `NVIDIA_API_KEY` environment variable. The next section also walks through downloading the NGC-hosted Nemotron-Personas dataset.
#
# <div align="center">
#   <img src="https://raw.githubusercontent.com/NVIDIA-NeMo/DataDesigner/yev/nemotron_personas_dev_note/docs/devnotes/posts/assets/nemotron-personas/nemotron_persona_via_ndd.png" alt="Nemotron Personas pipeline overview" width="600" />
```
This image link points to the branch - will it update automatically to main when the branch is merged and deleted?
> It's easy to underestimate what a really good persona seed buys you. Three angles worth keeping in mind:
> 1. **Distributional faithfulness for sovereign AI.** Models trained on synthetic data that doesn't reflect the actual demographics of a region inherit subtle biases — over-representing some groups, under-representing others, getting cultural context wrong. For sovereign-AI work, that's not a rounding error; it's the whole problem. Grounding personas in census + administrative data closes that gap before the LLM ever sees the data.
This is fine, but does read a bit LLM-y (em-dashes, "that's not an X; it's a Y").
> 3. **Reusable seed material.** Once a persona has a name, a demographic profile, an OCEAN vector, and a coherent backstory, *any* downstream pipeline can attach to it: a tool-use environment, a long-context construction, a safety-refusal template, a roleplay scenario. The collection acts as a library — generate the personas once, reuse them across training stages.
> That last point is the bridge to the rest of this post.
Seems like an unnecessary sentence.
> ## **Nemotron-Personas inside Nemotron training**
> The [Nemotron 3 Super Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf) shows just how foundational these personas have become. They're not a side-quest dataset; they're a *seeding primitive* used across many post-training stages.
I would just say "They're a seeding primitive used across many post-training stages." without the side-quest bit.
> A closely related approach was used to build **Nemotron-Nano-9B-v2-Japanese**, NVIDIA's Japanese small language model that ranks **#1 on the Nejumi LLM Leaderboard**. The Japanese instruction-following + general-chat data was seeded by [Nemotron-Personas-Japan](https://huggingface.co/datasets/nvidia/Nemotron-Personas-Japan), with prompts and assistant responses anchored to Japanese-grounded personas. That's the multi-locale story turning into a multi-locale model story: a Japanese persona collection, generated by a localized DD pipeline, becomes the seeding layer for a Japanese model that beats the leaderboard.
> The same template is being used across the family — instruction-following and general-chat data going into Nemotron Nano v3 (and from there into Super v3) follows the same persona-seeded recipe.
Nitpick - each place we say "and from there into Super v3" (here and once below) I would just say "and Super v3". Current phrasing sounds a bit like Super v3 inherited from the Nano v3 model, rather than just using the same training dataset.
> ### Stage 1 — OCEAN Big-Five sampling
> OCEAN ([Big Five personality traits](https://en.wikipedia.org/wiki/Big_Five_personality_traits)) is the most empirically grounded model of human personality. For each persona we sample five trait T-scores (\(\mu = 50\), \(\sigma = 10\), clipped to \([20, 80]\)), bucket each into a coarse label, and attach a prose description grounded in the personality literature. Working at the description level (rather than raw scores) is what makes the downstream LLM stages produce nuanced, internally-consistent narratives — "highly conscientious" vs "highly extraverted" reads very differently to an LLM than `t_score=72`.
\mu and \sigma don't render as LaTeX for me; might want to insert the actual Unicode characters (μ and σ).
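For what it's worth, the sampling described in that paragraph is roughly the following (a minimal numpy sketch of the stated μ=50, σ=10, clip-to-[20, 80] scheme; the bucket boundaries are illustrative assumptions, not the pipeline's actual values):

```python
# Sketch of Stage 1 as described: sample five OCEAN T-scores
# (mean 50, SD 10, clipped to [20, 80]) and bucket them into coarse labels.
import numpy as np

TRAITS = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]

def bucket(t: float) -> str:
    # T-scores: ~50 is average; roughly one SD out reads as "high"/"low".
    # These edges are illustrative only.
    if t >= 60:
        return "high"
    if t <= 40:
        return "low"
    return "average"

def sample_ocean(rng: np.random.Generator) -> dict[str, tuple[float, str]]:
    scores = np.clip(rng.normal(loc=50, scale=10, size=len(TRAITS)), 20, 80)
    return {trait: (round(float(t), 1), bucket(t)) for trait, t in zip(TRAITS, scores)}

rng = np.random.default_rng(seed=0)
print(sample_ocean(rng))  # five (T-score, label) pairs
```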
## 📋 Summary

Adds the Inside Nemotron-Personas dev note covering how the multi-locale Nemotron-Personas HF collection is built (4-stage compound-AI pipeline) and how it's used as a seeding primitive across Nemotron training (long-context, tool-use, formal logic, safety refusals, instruction-following). Ships alongside a runnable Tutorial 7 demonstrating reproduction + customization, plus a Colab variant.

## 🔗 Related Issue

N/A
## 🔄 Changes

### ✨ Added

- `docs/devnotes/posts/nemotron-personas.md` — new dev note
- `docs/devnotes/posts/assets/nemotron-personas/` — four images: three pipeline-stage diagrams from the partner repo plus a black-background Nemotron-Personas world-map hero
- `docs/notebook_source/7-nemotron-personas.py` — jupytext source for the Reproducing & Customizing Nemotron-Personas tutorial
- `docs/colab_notebooks/7-nemotron-personas.ipynb` — committed Colab variant

### 🔧 Changed

- `docs/scripts/generate_colab_notebooks.py` — adds an `ADDITIONAL_SETUP_CELLS` map paralleling `ADDITIONAL_DEPENDENCIES`; injects NGC CLI install + `NGC_API_KEY` cells. Future devnote-paired tutorials needing extra Colab bootstrap can register one-line entries in the same map.
- `mkdocs.yml` — adds Reproducing & Customizing Nemotron-Personas under the Tutorials nav
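For future tutorials, that registration might look something like this (hypothetical sketch; `ADDITIONAL_SETUP_CELLS` is named in this PR, but the map shape and cell contents here are assumptions, not the script's actual internals):

```python
# Hypothetical sketch of the one-line registration pattern described above.
NGC_SETUP_CELL_SOURCE = """\
import getpass
import os

if "NGC_API_KEY" not in os.environ:
    os.environ["NGC_API_KEY"] = getpass.getpass("Enter your NGC API key: ")
"""

# Keyed by notebook name, paralleling ADDITIONAL_DEPENDENCIES.
ADDITIONAL_SETUP_CELLS: dict[str, list[str]] = {
    "7-nemotron-personas": [NGC_SETUP_CELL_SOURCE],
}
```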
## 🧪 Testing

- `make test` passes
- `jupytext --to ipynb --execute`
- `make generate-colab-notebooks` regenerates the Colab `.ipynb` cleanly with the NGC setup cells in the expected position
- `make convert-execute-notebooks`, gated on `NVIDIA_API_KEY` + on-disk NGC dataset (matching how Tutorials 5/6 are gated on `OPENROUTER_API_KEY`)

## ✅ Checklist