
Add notebook fine tuning and trainer quickstart#98

Open
typhoonzero wants to merge 6 commits into master from add_fine_tuning_and_trainer_quickstart

Conversation

@typhoonzero
Contributor

@typhoonzero typhoonzero commented Feb 11, 2026

Summary by CodeRabbit

  • Documentation
    • Rebranded fine-tuning guide to "Workbench": updated workflow language, configurable model/dataset/output placeholders, container/runtime and PVC guidance, MLflow experiment tracking, security and resource tips, non‑NVIDIA GPU notes, and revised inference publishing steps.
    • Added PyTorch DDP notebook demonstrating distributed Fashion MNIST training with Kubeflow Trainer.
    • Added Kubeflow Trainer v2 quick-start for distributed PyTorch on Kubernetes.

@coderabbitai

coderabbitai bot commented Feb 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Rebrands fine-tuning docs to "Alauda AI Workbench", replaces fixed endpoints with placeholders, switches Dockerfile→Containerfile, adds env/PVC/security/MLFlow/runtime changes for Workbench-driven VolcanoJob flows, and adds two Kubeflow Trainer how-tos (PyTorch DDP notebook and ClusterTrainingRuntime quick-start).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Fine-tuning doc (Workbench)**<br>`docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` | Rebranded to "Alauda AI Workbench"; replaced hard-coded model/dataset endpoints with placeholders; Dockerfile → Containerfile terminology; added HF_HOME, DO_MERGE, BASE_MODEL_URL, DATASET_URL, OUTPUT_MODEL_URL env vars; renamed PVC/resource keys (e.g., wy-model-cache, sc-topolvm); tightened requests/limits and added explicit securityContext and shm handling; updated VolcanoJob task YAML and submission script; clarified MLflow exposure and inference publishing; expanded non-NVIDIA GPU guidance and NFS mount notes. |
| **Kubeflow Trainer — notebook**<br>`docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb` | New Jupyter notebook showing PyTorch DDP training (model, init, DistributedSampler, DDP), local execution via Kubeflow TrainerClient, runtime discovery, TrainJob submission/monitoring, log streaming, and cleanup notes. |
| **Kubeflow Trainer — quick start**<br>`docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` | New quick-start for building/using a torch-distributed runtime image (Containerfile), creating a ClusterTrainingRuntime YAML with security/shm settings, example TrainJob for MNIST, and notes on networking, runtimes, and next steps. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Workbench
    participant VolcanoJob
    participant ModelRegistry
    participant MLFlow
    User->>Workbench: create fine-tune task (workbench UI / notebook)
    Workbench->>ModelRegistry: resolve BASE_MODEL_URL / placeholders
    Workbench->>VolcanoJob: submit task (env vars, securityContext, PVCs)
    VolcanoJob->>ModelRegistry: download base model & dataset (HF_HOME / CACHE)
    VolcanoJob->>MLFlow: log experiment/metrics
    VolcanoJob-->>Workbench: task status/logs
    Workbench->>MLFlow: surface UI links / publish inference metadata
```

```mermaid
sequenceDiagram
    participant User
    participant Notebook
    participant TrainerClient
    participant ClusterTrainingRuntime
    participant WorkerNodes
    participant PVC
    User->>Notebook: prepare distributed training (DDP)
    Notebook->>TrainerClient: submit TrainJob (num_nodes, runtime image)
    TrainerClient->>ClusterTrainingRuntime: schedule job
    ClusterTrainingRuntime->>WorkerNodes: start containers (shm, securityContext)
    WorkerNodes->>PVC: read/write dataset & model cache
    WorkerNodes-->>TrainerClient: stream logs/status
    TrainerClient-->>Notebook: propagate logs/results
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • zhaomingkun1030

Poem

🐇 I hopped through docs and nudged each line with care,
Replaced fixed links with placeholders fair.
Containerfiles baked, PVCs set in rows,
Kubeflow trains hum where distributed learning grows,
Hop — Workbench gardens bloom where pipelines flow.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding documentation for notebook-based fine-tuning and a trainer quickstart guide, which aligns with the three modified/new documentation files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Line 440: Replace the unhyphenated phrase "fine tuning" with the hyphenated
form "fine-tuning" in the sentence that begins with "model cache PVC
(optional):" (the phrase appears as "fine tuning experiments" and should be
"fine-tuning experiments") so the compound modifier is consistent with the rest
of the document.
- Line 102: Update the Containerfile fenced code block so it includes a language
identifier by replacing the opening triple backticks with ```dockerfile (and
keep the matching closing triple backticks), i.e., locate the Containerfile code
block in the fine_tunning_using_notebooks.mdx content and add the dockerfile
language tag to the opening fence so syntax highlighting is enabled.
- Around line 298-312: The main-container shell uses "set -ex" which will print
commands (including the sensitive git push with ${gitauth}) to logs; change the
shell behavior to avoid tracing secrets by removing or limiting "-x" (use "set
-e" only) or temporarily disabling xtrace around the sensitive operation (turn
off xtrace before the git push that uses ${gitauth} and re-enable afterward).
Locate the main container block where "set -ex" is set and the git push that
embeds "${gitauth}" and update it so command tracing does not expose
credentials.
- Around line 228-238: The script currently uses "set -ex" which prints every
command (including the gitauth/GIT_USER:GIT_TOKEN used in the git clone command
and BASE_MODEL_URL) to logs; change this to avoid leaking credentials by
removing the -x (use "set -e") or by disabling tracing around the sensitive
commands: keep tracing off when building gitauth and when running the git clone
and lfs pull (use "set +x" before the git clone
"https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}" and "set -x" after), or replace
inline credentials with a git credential helper; update occurrences of set -ex,
gitauth, and the git clone / lfs pull invocations accordingly.
- Line 80: Replace the full-width right parenthesis used in the example call to
datasets.load_dataset (the symbol shown as )) with the standard ASCII right
parenthesis ) so the snippet "ds = datasets.load_dataset(<dataset directory>)"
becomes valid Python; locate the example line in the
fine_tunning_using_notebooks.mdx file and correct the punctuation for the
datasets.load_dataset invocation.
- Line 515: Fix the typos in the Note sentence starting with "**Note:** The
following steps can also be adapt to LLM pre-training and traditional ML
senarios." by changing "adapt" to "adapted" and "senarios" to "scenarios" so the
sentence reads "...can also be adapted to LLM pre-training and traditional ML
scenarios." Update that exact string in the document.
- Around line 243-251: The two unconditional deletions ("rm -rf ${DATASET_NAME}"
and "rm -rf data") make the subsequent existence check around ${DATASET_NAME}
dead code so the dataset is always re-cloned; remove those rm -rf lines (or move
them to run only when you explicitly want to force-refresh) so the if [ -d
${DATASET_NAME} ] { echo "dataset ${DATASET_NAME} already exists skipping
download" } else { git ... clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" }
branch can correctly skip cloning when the dataset is present.
- Line 377: The shell conditional uses an unquoted variable (DO_MERGE) which
will break if unset or empty; fix the conditional that currently reads the test
against DO_MERGE by quoting the variable reference and using a POSIX-compatible
string comparison operator (or switch to a bash [[ ... ]] test) so the test
becomes robust when DO_MERGE is empty or unset; update the conditional that
contains DO_MERGE to use the quoted form and a proper "=" comparison.
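The xtrace fixes suggested above follow a standard shell pattern. As a minimal sketch (the user, token, and registry host are placeholders, and the actual `git clone` is left commented out):

```shell
#!/bin/sh
set -e                               # fail fast, but do not trace by default

GIT_USER="alice"                     # placeholder
GIT_TOKEN="s3cret"                   # placeholder

set -x                               # trace the non-sensitive steps as before
echo "downloading base model" >/dev/null

{ set +x; } 2>/dev/null              # stop tracing; the redirect silences this line itself
gitauth="${GIT_USER}:${GIT_TOKEN}"   # built while tracing is off, so never logged
# git clone "https://${gitauth}@<model-registry-host>/org/model.git"
set -x                               # resume tracing for the rest of the script

echo "post-clone steps" >/dev/null
```

With this pattern the pod log still shows every ordinary command, but the lines embedding `${gitauth}` never appear in the trace.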

In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb`:
- Around line 295-301: Replace or remove the notebook cell output that exposes
the internal registry URL in the Runtime object's image field; locate the
runtime printout (Runtime(name='torch-distributed', trainer=RuntimeTrainer(...),
image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2'))
and either clear the cell output or sanitize the image value to a public example
such as 'alaudadockerhub/torch-distributed:v2.9.1-aml2' so the notebook no
longer reveals the internal registry.
- Around line 314-339: The narrative and saved outputs describe training on 3
PyTorch nodes but the code passed to client.train uses CustomTrainer(...,
func=train_fashion_mnist, num_nodes=1) which is inconsistent; either change
num_nodes to 3 in the CustomTrainer invocation to match the markdown/outputs
(and optionally enable GPU resource lines) or update the markdown text and saved
outputs to reflect num_nodes=1 (clear or re-run cells so outputs no longer show
3 nodes); look for the client.train call and the CustomTrainer(num_nodes=...)
setting to make the change.
- Around line 160-167: Fix the typo in the inline comment above the DataLoader
instantiation: change "Shard the dataset accross workers." to "Shard the dataset
across workers." in the notebook cell where train_loader is created (referencing
DataLoader, dataset, batch_size, and DistributedSampler).

In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`:
- Line 114: Fix the typo in the sentence starting "For more informatoin about
how to use **Kubeflow Trainer v2**..." by replacing "informatoin" with
"information" and add a trailing period at the end of the sentence so it reads:
"For more information about how to use **Kubeflow Trainer v2**, please refer to
[Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)."
- Line 15: Replace the grammatical error in the sentence "Create a
`torch_distributed.Containerfile` from below contents and build a image. Or you
can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`" by
changing "a image" to "an image" (and optionally combine the two fragments into
a single sentence like "build an image, or you can use pre-built image
`alaudadockerhub/torch-distributed:v2.9.1-aml2`") so the text reads correctly.
- Around line 17-37: Add the Dockerfile language identifier to the fenced code
block that starts with "FROM python:3.13-trixie" so the block becomes
```dockerfile; edit the markdown section containing the Dockerfile snippet (the
fenced block showing FROM python:3.13-trixie, ARG USERNAME, RUN pip install ...,
etc.) and prepend "dockerfile" immediately after the opening triple backticks to
satisfy markdownlint MD040 and enable proper syntax highlighting.


```dockerfile
FROM nvcr.io/nvidia/pytorch:24.12-py3
```

⚠️ Potential issue | 🟡 Minor

Specify a language for the fenced code block.

Same as in the quick-start doc, the Containerfile code block should have a language identifier (e.g., dockerfile).

-```
+```dockerfile
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` at line 102,
Update the Containerfile fenced code block so it includes a language identifier
by replacing the opening triple backticks with ```dockerfile (and keep the
matching closing triple backticks), i.e., locate the Containerfile code block in
the fine_tunning_using_notebooks.mdx content and add the dockerfile language tag
to the opening fence so syntax highlighting is enabled.

Comment on lines +228 to +238
```shell
set -ex
cd /mnt/models
BASE_MODEL_NAME=$(basename ${BASE_MODEL_URL})
# Download base model
gitauth="${GIT_USER}:${GIT_TOKEN}"
BASE_MODEL_URL_NO_HTTPS="${BASE_MODEL_URL//https:\/\/}"
if [ -d ${BASE_MODEL_NAME} ]; then
    echo "${BASE_MODEL_NAME} dir already exists, skip downloading"
else
    GIT_LFS_SKIP_SMUDGE=1 git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}"
    (cd ${BASE_MODEL_NAME} && git -c http.sslVerify=false -c lfs.activitytimeout=36000 lfs pull)
```

⚠️ Potential issue | 🟠 Major

Credentials may leak to logs via set -ex.

set -ex causes every command to be printed before execution, including the git clone "https://${gitauth}@..." commands on lines 237, 238, and 250. This will expose GIT_USER:GIT_TOKEN in pod logs.

Consider either:

  • Using set -e instead of set -ex (remove trace), or
  • Wrapping sensitive commands with set +x / set -x, or
  • Using git credential helpers instead of inline URL credentials.
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 228 -
238, The script currently uses "set -ex" which prints every command (including
the gitauth/GIT_USER:GIT_TOKEN used in the git clone command and BASE_MODEL_URL)
to logs; change this to avoid leaking credentials by removing the -x (use "set
-e") or by disabling tracing around the sensitive commands: keep tracing off
when building gitauth and when running the git clone and lfs pull (use "set +x"
before the git clone "https://${gitauth}@${BASE_MODEL_URL_NO_HTTPS}" and "set
-x" after), or replace inline credentials with a git credential helper; update
occurrences of set -ex, gitauth, and the git clone / lfs pull invocations
accordingly.

Comment on lines +243 to +251

```shell
rm -rf ${DATASET_NAME}
rm -rf data

if [ -d ${DATASET_NAME} ]; then
    echo "dataset ${DATASET_NAME} already exists skipping download"
else
    git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
fi
```

⚠️ Potential issue | 🔴 Critical

Bug: rm -rf makes the subsequent existence check dead code.

Lines 244–245 unconditionally delete ${DATASET_NAME} and data, so the if [ -d ${DATASET_NAME} ] check on line 247 will never be true. This means the "already exists, skipping download" path is unreachable, and the dataset is always re-downloaded.

It looks like the intent was to skip downloading if already cached (matching the model logic above). Remove the rm -rf lines, or restructure:

Proposed fix
-                rm -rf ${DATASET_NAME}
-                rm -rf data
-                 
                 if [ -d ${DATASET_NAME} ]; then
                     echo "dataset ${DATASET_NAME} already exists skipping download"
                 else
                     git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
                 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-rm -rf ${DATASET_NAME}
-rm -rf data
 if [ -d ${DATASET_NAME} ]; then
     echo "dataset ${DATASET_NAME} already exists skipping download"
 else
     git -c http.sslVerify=false -c lfs.activitytimeout=36000 clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}"
 fi
```
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 243 -
251, The two unconditional deletions ("rm -rf ${DATASET_NAME}" and "rm -rf
data") make the subsequent existence check around ${DATASET_NAME} dead code so
the dataset is always re-cloned; remove those rm -rf lines (or move them to run
only when you explicitly want to force-refresh) so the if [ -d ${DATASET_NAME} ]
{ echo "dataset ${DATASET_NAME} already exists skipping download" } else { git
... clone "https://${gitauth}@${DATASET_URL_NO_HTTPS}" } branch can correctly
skip cloning when the dataset is present.
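The cache-or-clone flow this fix describes can be sketched as follows (a self-contained sketch: `mkdir` stands in for the real `git clone`, and the dataset name is a placeholder):

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

DATASET_NAME="my-dataset"   # placeholder

download_if_missing() {
    if [ -d "${DATASET_NAME}" ]; then
        echo "dataset ${DATASET_NAME} already exists, skipping download"
    else
        mkdir -p "${DATASET_NAME}"   # stand-in for: git clone "https://.../${DATASET_NAME}"
        echo "dataset ${DATASET_NAME} downloaded"
    fi
}

download_if_missing   # first run: downloads
download_if_missing   # second run: hits the cache and skips
```

The unconditional `rm -rf` in the reviewed script defeats exactly this check, which is why the review flags the skip branch as dead code.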

Comment on lines +298 to +312
```yaml
- bash
- -c
- |
  set -ex
  echo "job workers list: ${VC_WORKER_HOSTS}"
  if [ "${VC_WORKER_HOSTS}" != "" ]; then
      export N_RANKS=$(echo "${VC_WORKER_HOSTS}" |awk -F',' '{print NF}')
      export RANK=$VC_TASK_INDEX
      export MASTER_HOST=$(echo "${VC_WORKER_HOSTS}" |awk -F',' '{print $1}')
      export RANK=$RANK
      export WORLD_SIZE=$N_RANKS
      export NNODES=$N_RANKS
      export NODE_RANK=$RANK
      export MASTER_ADDR=${MASTER_HOST}
      export MASTER_PORT="8888"
```

⚠️ Potential issue | 🟠 Major

Same credential leakage concern in the main container's set -ex block.

Lines 301 and 410 have the same issue as the init container — set -ex will trace the git push command (line 410) that embeds ${gitauth} in the URL, leaking credentials to pod logs.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 298 -
312, The main-container shell uses "set -ex" which will print commands
(including the sensitive git push with ${gitauth}) to logs; change the shell
behavior to avoid tracing secrets by removing or limiting "-x" (use "set -e"
only) or temporally disabling xtrace around the sensitive operation (turn off
xtrace before the git push that uses ${gitauth} and re-enable afterward). Locate
the main container block where "set -ex" is set and the git push that embeds
"${gitauth}" and update it so command tracing does not expose credentials.
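The VolcanoJob snippet reviewed above derives the torchrun rendezvous settings from Volcano's injected environment. A standalone sketch of that derivation (here `VC_WORKER_HOSTS` and `VC_TASK_INDEX` are hard-coded for illustration; in a real pod Volcano sets them):

```shell
#!/bin/sh
set -e

VC_WORKER_HOSTS="job-worker-0.job,job-worker-1.job,job-worker-2.job"  # injected by Volcano
VC_TASK_INDEX=1                                                       # injected by Volcano

NNODES=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print NF}')      # count comma-separated hosts
MASTER_ADDR=$(echo "${VC_WORKER_HOSTS}" | awk -F',' '{print $1}') # first host acts as master
NODE_RANK=${VC_TASK_INDEX}
MASTER_PORT=8888

echo "NNODES=${NNODES} NODE_RANK=${NODE_RANK} MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT}"
# NNODES=3 NODE_RANK=1 MASTER_ADDR=job-worker-0.job MASTER_PORT=8888
```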

Comment on lines +295 to +301
```json
{
  "name": "stdout",
  "output_type": "stream",
  "text": [
    "Runtime(name='torch-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None)\n"
  ]
}
```

⚠️ Potential issue | 🟡 Minor

Output contains an internal registry URL.

The output at line 299 exposes 152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2, which appears to be an internal/private registry address. For public-facing documentation, consider clearing this output or replacing it with a sanitized example using the public image reference (alaudadockerhub/torch-distributed:v2.9.1-aml2).

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb` around lines 295 -
301, Replace or remove the notebook cell output that exposes the internal
registry URL in the Runtime object's image field; locate the runtime printout
(Runtime(name='torch-distributed', trainer=RuntimeTrainer(...),
image='152-231-registry.alauda.cn:60070/mlops/torch-distributed:v2.9.1-aml2'))
and either clear the cell output or sanitize the image value to a public example
such as 'alaudadockerhub/torch-distributed:v2.9.1-aml2' so the notebook no
longer reveals the internal registry.

Comment on lines +314 to +339
```json
    "## Run the Distributed TrainJob\n",
    "\n",
    "Kubeflow TrainJob will train the above model on 3 PyTorch nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "job_name = client.train(\n",
    "    trainer=CustomTrainer(\n",
    "        func=train_fashion_mnist,\n",
    "        # Set how many PyTorch nodes you want to use for distributed training.\n",
    "        num_nodes=1,\n",
    "        # Set the resources for each PyTorch node.\n",
    "        resources_per_node={\n",
    "            \"cpu\": 2,\n",
    "            \"memory\": \"8Gi\",\n",
    "            # Uncomment this to distribute the TrainJob using GPU nodes.\n",
    "            # \"nvidia.com/gpu\": 1,\n",
    "        },\n",
    "    ),\n",
    "    runtime=torch_runtime,\n",
    ")"
```

⚠️ Potential issue | 🟠 Major

Inconsistency: markdown says "3 PyTorch nodes" but num_nodes=1.

The markdown cells at lines 316, 350, and 404 all reference 3 nodes/GPUs (e.g., "train the above model on 3 PyTorch nodes", "distributed training across 3 nodes", "we run training on 3 GPUs"). However, the code at line 329 sets num_nodes=1. The saved outputs (lines 382–387) also show 3 nodes, suggesting the notebook was run with a different configuration than what's in the code cells.

Either update num_nodes to 3 to match the narrative and outputs, or update the markdown text and clear/update the outputs to match num_nodes=1.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_mnist.ipynb` around lines 314 -
339, The narrative and saved outputs describe training on 3 PyTorch nodes but
the code passed to client.train uses CustomTrainer(...,
func=train_fashion_mnist, num_nodes=1) which is inconsistent; either change
num_nodes to 3 in the CustomTrainer invocation to match the markdown/outputs
(and optionally enable GPU resource lines) or update the markdown text and saved
outputs to reflect num_nodes=1 (clear or re-run cells so outputs no longer show
3 nodes); look for the client.train call and the CustomTrainer(num_nodes=...)
setting to make the change.


## Prepare Runtime Image

Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.

⚠️ Potential issue | 🟡 Minor

Minor grammar: "a image" → "an image".

-Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
+Create a `torch_distributed.Containerfile` from the contents below and build an image. Or you can use the pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.

Suggested change
Create a `torch_distributed.Containerfile` from below contents and build a image. Or you can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
Create a `torch_distributed.Containerfile` from the contents below and build an image. Or you can use the pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`.
🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 15, Replace
the grammatical error in the sentence "Create a
`torch_distributed.Containerfile` from below contents and build a image. Or you
can use pre-built image `alaudadockerhub/torch-distributed:v2.9.1-aml2`" by
changing "a image" to "an image" (and optionally combine the two fragments into
a single sentence like "build an image, or you can use pre-built image
`alaudadockerhub/torch-distributed:v2.9.1-aml2`") so the text reads correctly.

Comment on lines +17 to +37
```
FROM python:3.13-trixie
ARG USERNAME=appuser
ARG USER_UID=1000
ARG USER_GID=$USER_UID

RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources
RUN apt-get update && \
apt-get install -y build-essential

RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip && \
pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu126 \
"torch==2.9.1" \
"torchvision==0.24.1"

RUN groupadd --gid $USER_GID $USERNAME \
&& useradd --uid $USER_UID --gid $USER_GID -m $USERNAME

WORKDIR /workspace
RUN chown $USERNAME:$USER_GID /workspace
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```shell
find . -type f -name "kubeflow_trainer_quick_start.md" | head -20
```

Repository: alauda/aml-docs

Length of output: 117


🌐 Web query:

PyTorch 2.9.1 torchvision compatibility matrix versions

💡 Result:

For PyTorch 2.9.1, the officially paired torchvision version is:

| torch | torchvision |
| --- | --- |
| 2.9.1 | 0.24.1 |
| 2.9.0 | 0.24.0 |
| 2.8.0 | 0.23.0 |

Sources: PyTorch “Previous Versions” install commands (these reflect the supported matching domain-library versions). [1]


🏁 Script executed:

```shell
sed -n '17,37p' ./docs/en/workbench/how_to/kubeflow_trainer_quick_start.md | cat -A
```

Repository: alauda/aml-docs

Length of output: 696


Add language identifier (dockerfile) to the fenced code block per markdownlint MD040.

The code block requires a language identifier for proper syntax highlighting and markdown linting compliance.

Proposed fix
-```
+```dockerfile
 FROM python:3.13-trixie

Note: The torch and torchvision versions are correctly paired (torch 2.9.1 officially matches torchvision 0.24.1 per PyTorch's official release matrix).


Suggested change
```dockerfile
FROM python:3.13-trixie
ARG USERNAME=appuser
ARG USER_UID=1000
ARG USER_GID=$USER_UID
RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources
RUN apt-get update && \
apt-get install -y build-essential
RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip && \
pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cu126 \
"torch==2.9.1" \
"torchvision==0.24.1"
RUN groupadd --gid $USER_GID $USERNAME \
&& useradd --uid $USER_UID --gid $USER_GID -m $USERNAME
WORKDIR /workspace
RUN chown $USERNAME:$USER_GID /workspace
```
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 17-17: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` around lines 17 -
37, Add the Dockerfile language identifier to the fenced code block that starts
with "FROM python:3.13-trixie" so the block becomes ```dockerfile; edit the
markdown section containing the Dockerfile snippet (the fenced block showing
FROM python:3.13-trixie, ARG USERNAME, RUN pip install ..., etc.) and prepend
"dockerfile" immediately after the opening triple backticks to satisfy
markdownlint MD040 and enable proper syntax highlighting.


Download [kubeflow_trainer_mnist.ipynb](./kubeflow_trainer_mnist.ipynb) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.

For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)

⚠️ Potential issue | 🟡 Minor

Typo: "informatoin" → "information", and add a trailing period.

-For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)
+For more information about how to use **Kubeflow Trainer v2**, please refer to the [Kubeflow Documentation](https://www.kubeflow.org/docs/components/trainer/).

Suggested change
For more informatoin about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow Document](https://www.kubeflow.org/docs/components/trainer/)
For more information about how to use **Kubeflow Trainer v2**, please refer to the [Kubeflow Documentation](https://www.kubeflow.org/docs/components/trainer/).
🧰 Tools
🪛 LanguageTool

[grammar] ~114-~114: Ensure spelling is correct
Context: ...t a TrainJob using pytorch. For more informatoin about how to use *Kubeflow Trainer v2...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 114, Fix
the typo in the sentence starting "For more informatoin about how to use
**Kubeflow Trainer v2**..." by replacing "informatoin" with "information" and
add a trailing period at the end of the sentence so it reads: "For more
information about how to use **Kubeflow Trainer v2**, please refer to [Kubeflow
Document](https://www.kubeflow.org/docs/components/trainer/)."


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Line 185: Replace the personal PVC name and output model URL that use the
`wy-` prefix with generic, documentation-friendly placeholders: change the
claimName value `wy-model-cache` to `model-cache` or `<your-model-cache-pvc>`
(refer to the claimName key), and update the output model URL `wy-sft-output` to
a generic name like `sft-output` or `<your-sft-output>` (refer to the output
model URL string). Ensure both occurrences (the claimName and the output model
URL) are consistent with other placeholders such as
`<your-model-registry-git-address>`.
- Around line 456-478: Update the StorageClass example to fix the likely typo in
mountPermissions and avoid hardcoded IPs: change mountPermissions from "0757" to
a more typical value (e.g., "0755" or "0775") in the StorageClass snippet
(referencing mountPermissions and the ai-nfs StorageClass) and replace the
hardcoded server IP (server: 192.168.17.28) with a placeholder such as
<your-nfs-server-ip> so the example is consistent and safe to reuse.
- Around line 113-115: The shallow clone uses "git clone --depth 1" which may
not include the tag referenced by LLAMA_FACTORY_VERSION, causing "git checkout
${LLAMA_FACTORY_VERSION}" to fail; update the clone invocation that contains
"git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git" to either
include the tag/branch with "--branch ${LLAMA_FACTORY_VERSION} --depth 1" or
perform a full clone (remove "--depth 1") so that "git checkout
${LLAMA_FACTORY_VERSION}" succeeds.
- Line 407: The git lfs track command currently uses an unquoted glob (the
pattern *.safetensors) which can be expanded by the shell; update the occurrence
of the git lfs track invocation so the glob pattern is quoted (surround the
*.safetensors pattern with quotes) to ensure the tracking rule is registered
instead of being expanded to filenames.
- Around line 638-647: The YAML example in the non-Nvidia GPU section
incorrectly uses Nvidia-specific resource names (nvidia.com/gpualloc,
nvidia.com/gpucores, nvidia.com/gpumem); update those keys to vendor-agnostic
placeholders (e.g., <vendor>.com/gpualloc, <vendor>.com/gpucores,
<vendor>.com/gpumem or a single generic accelerator resource like
<vendor>.com/accelerator) to match the guidance on line 521 and avoid
Nvidia-specific terminology in this section.
- Around line 305-307: Remove the redundant self-assignment of the environment
variable: delete the line that does "export RANK=$RANK" (the no-op after "export
RANK=$VC_TASK_INDEX"); leave the "export RANK=$VC_TASK_INDEX" and "export
MASTER_HOST=..." lines intact so RANK is set only once from VC_TASK_INDEX.

cloudflare-workers-and-pages bot commented Feb 11, 2026

Deploying alauda-ai with Cloudflare Pages

Latest commit: 0fa921a
Status: ✅  Deploy successful!
Preview URL: https://8e38e0e4.alauda-ai.pages.dev
Branch Preview URL: https://add-fine-tuning-and-trainer.alauda-ai.pages.dev

View logs

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Around line 291-296: The MLFLOW_TRACKING_URI value is inconsistent across
examples; update both occurrences of the MLFLOW_TRACKING_URI environment
variable (the value currently set to
"http://mlflow-tracking-server.kubeflow:5000" and the other FQDN variant) to the
same consistent placeholder (e.g.,
"http://<mlflow-tracking-server-endpoint>:5000") so users aren’t confused by
different namespaces/DNS formats; ensure the MLFLOW_TRACKING_URI value is
identical in the example blocks that include the MLFLOW_TRACKING_URI and keep
MLFLOW_EXPERIMENT_NAME unchanged.
- Around line 102-116: The ARG LLAMA_FACTORY_VERSION is declared before FROM so
${LLAMA_FACTORY_VERSION} is empty in later RUN steps; move or re-declare ARG
LLAMA_FACTORY_VERSION immediately after the FROM line so it is available to
subsequent RUN commands, and update the git clone invocation in the RUN that
currently uses "git clone --depth 1" to use "git clone --branch
${LLAMA_FACTORY_VERSION} --depth 1" (or otherwise fetch the specific tag)
instead of relying on a later git checkout, and ensure the documentation code
fence uses the "dockerfile" language identifier; locate the relevant FROM, ARG
LLAMA_FACTORY_VERSION, and RUN lines and the git clone/git checkout commands to
apply these changes.

In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md`:
- Line 112: Update the link text and branding in the sentence that begins
"Download `kubeflow_trainer_mnist.ipynb`..." by capitalizing "GitHub" and
replacing the vague link label "github workbench howtos" with a more descriptive
label (for example: "AML Workbench how-tos on GitHub" or "the AML Workbench
how-tos repository on GitHub"), keeping the same URL; locate the exact string
"Download `kubeflow_trainer_mnist.ipynb` from [github workbench
howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to)"
and modify only the link text and capitalization.

Comment on lines +102 to +116
```diff
+ARG LLAMA_FACTORY_VERSION="v0.9.4"
 FROM python:3.13-trixie

-RUN sed -i 's@//.*archive.ubuntu.com@//mirrors.ustc.edu.cn@g' /etc/apt/sources.list && \
-    sed -i 's/security.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && \
+RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list.d/debian.sources && \
     apt-get update && \
     export DEBIAN_FRONTEND=noninteractive && \
-    apt-get install -yq --no-install-recommends git git-lfs unzip curl ffmpeg && \
+    apt-get install -yq --no-install-recommends git git-lfs unzip curl ffmpeg default-libmysqlclient-dev build-essential pkg-config && \
     apt clean && rm -rf /var/lib/apt/lists/*

-RUN cd /opt && \
+RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
+    cd /opt && \
     git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
-    cd LLaMA-Factory && \
-    pip install --no-cache-dir -e ".[torch,metrics,deepspeed,awq,modelscope]" -i https://pypi.tuna.tsinghua.edu.cn/simple && \
-    pip install --no-cache-dir "transformers==4.51.1" "tokenizers==0.21.1" -i https://pypi.tuna.tsinghua.edu.cn/simple
-
-RUN apt-get update && apt-get install -y default-libmysqlclient-dev build-essential pkg-config && \
-    pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple -U pip setuptools && \
-    pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple \
-        "sqlalchemy==2.0.30" "pymysql==1.1.1" "loguru==0.7.2" "mysqlclient==2.2.7" "mlflow"
+    cd LLaMA-Factory && git checkout ${LLAMA_FACTORY_VERSION} && \
+    pip install --no-cache-dir -e ".[torch,metrics,deepspeed,awq,modelscope]" -i https://pypi.tuna.tsinghua.edu.cn/simple
```

⚠️ Potential issue | 🔴 Critical

Bug: ARG declared before FROM is not available in subsequent RUN instructions.

In a Containerfile/Dockerfile, an ARG before FROM is only usable within FROM directives. After FROM, all build args are reset. So ${LLAMA_FACTORY_VERSION} on line 115 will resolve to an empty string, and git checkout will fail.

Move the ARG after FROM, or re-declare it:

Proposed fix
```diff
-ARG LLAMA_FACTORY_VERSION="v0.9.4"
 FROM python:3.13-trixie
+ARG LLAMA_FACTORY_VERSION="v0.9.4"
```

Additionally, as noted in a previous review, this code block should have a dockerfile language identifier, and git clone --depth 1 won't fetch the tag — use --branch ${LLAMA_FACTORY_VERSION} instead.
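
As a self-contained illustration of the scoping rule (the base-image tag and clone destination here are placeholders, not the exact Containerfile from the doc), an `ARG` consumed by `RUN` has to be declared after `FROM`:

```dockerfile
# A build-arg declared before FROM exists only for the FROM line itself.
ARG BASE_TAG="3.13-trixie"
FROM python:${BASE_TAG}

# Re-declare after FROM so later RUN steps can see it; the default is optional.
ARG LLAMA_FACTORY_VERSION="v0.9.4"

# --branch accepts tags too, so a depth-1 clone lands directly on the right ref
# and no separate `git checkout` is needed.
RUN git clone --branch ${LLAMA_FACTORY_VERSION} --depth 1 \
      https://github.com/hiyouga/LLaMA-Factory.git /opt/LLaMA-Factory
```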

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 102 -
116, The ARG LLAMA_FACTORY_VERSION is declared before FROM so
${LLAMA_FACTORY_VERSION} is empty in later RUN steps; move or re-declare ARG
LLAMA_FACTORY_VERSION immediately after the FROM line so it is available to
subsequent RUN commands, and update the git clone invocation in the RUN that
currently uses "git clone --depth 1" to use "git clone --branch
${LLAMA_FACTORY_VERSION} --depth 1" (or otherwise fetch the specific tag)
instead of relying on a later git checkout, and ensure the documentation code
fence uses the "dockerfile" language identifier; locate the relevant FROM, ARG
LLAMA_FACTORY_VERSION, and RUN lines and the git clone/git checkout commands to
apply these changes.

Comment on lines +291 to +296
```diff
 # Change the MLFLOW_TRACKING_URI to your actual mlflow service endpoint
 - name: MLFLOW_TRACKING_URI
-  value: 'http://mlflow-tracking-server.aml-system.svc.cluster.local:5000'
+  value: "http://mlflow-tracking-server.kubeflow:5000"
 # Set MLFLOW_EXPERIMENT_NAME to your namespace or your own experiment name
 - name: MLFLOW_EXPERIMENT_NAME
-  value: kubeflow-admin-cpaas-io
+  value: mlops-demo-ai-test
```

⚠️ Potential issue | 🟡 Minor

Inconsistent MLFlow tracking URIs between examples.

Line 293 uses http://mlflow-tracking-server.kubeflow:5000 while line 626 uses http://mlflow-tracking-server.aml-system.svc.cluster.local:5000. These point to different namespaces (kubeflow vs aml-system) and use different DNS formats (short vs FQDN). This could confuse users copying the examples.

Consider using a consistent placeholder like http://<mlflow-tracking-server-endpoint>:5000 in both, or at minimum use the same namespace and format.

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 291 -
296, The MLFLOW_TRACKING_URI value is inconsistent across examples; update both
occurrences of the MLFLOW_TRACKING_URI environment variable (the value currently
set to "http://mlflow-tracking-server.kubeflow:5000" and the other FQDN variant)
to the same consistent placeholder (e.g.,
"http://<mlflow-tracking-server-endpoint>:5000") so users aren’t confused by
different namespaces/DNS formats; ensure the MLFLOW_TRACKING_URI value is
identical in the example blocks that include the MLFLOW_TRACKING_URI and keep
MLFLOW_EXPERIMENT_NAME unchanged.


> **Note: You need internet access to run below example notebook, since you need to install python packages, download datasets in this notebook.**

Download `kubeflow_trainer_mnist.ipynb` from [github workbench howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.

⚠️ Potential issue | 🟡 Minor

Capitalize "GitHub" and improve link text.

"github" should be capitalized per its official branding. The link text "github workbench howtos" is also vague — consider a more descriptive label.

```diff
-Download `kubeflow_trainer_mnist.ipynb` from [github workbench howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using pytorch.
+Download `kubeflow_trainer_mnist.ipynb` from the [GitHub workbench how-to directory](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to) and drag-and-drop the file into your notebook instance. Follow the guide in this notebook to start a `TrainJob` using PyTorch.
```
🧰 Tools
🪛 LanguageTool

[uncategorized] ~112-~112: The official name of this software platform is spelled with a capital “H”.
Context: ...ad kubeflow_trainer_mnist.ipynb from [github workbench howtos](https://github.com/al...

(GITHUB)

🤖 Prompt for AI Agents
In `@docs/en/workbench/how_to/kubeflow_trainer_quick_start.md` at line 112, Update
the link text and branding in the sentence that begins "Download
`kubeflow_trainer_mnist.ipynb`..." by capitalizing "GitHub" and replacing the
vague link label "github workbench howtos" with a more descriptive label (for
example: "AML Workbench how-tos on GitHub" or "the AML Workbench how-tos
repository on GitHub"), keeping the same URL; locate the exact string "Download
`kubeflow_trainer_mnist.ipynb` from [github workbench
howtos](https://github.com/alauda/aml-docs/tree/master/docs/en/workbench/how_to)"
and modify only the link text and capitalization.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (12)
docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx (12)

291-296: Inconsistent MLFlow tracking URIs between the two example blocks.

Line 293 uses http://mlflow-tracking-server.kubeflow:5000 while line 626 uses http://mlflow-tracking-server.aml-system.svc.cluster.local:5000 — different namespaces and DNS formats. Still unresolved from the previous review. Use a consistent placeholder such as http://<mlflow-tracking-server-endpoint>:5000 in both locations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 291 -
296, The MLFLOW_TRACKING_URI examples are inconsistent across the two example
blocks; locate the occurrences of the environment variable MLFLOW_TRACKING_URI
in both example blocks (the YAML env var entries named MLFLOW_TRACKING_URI) and
replace the hard-coded endpoints with a consistent placeholder such as
http://<mlflow-tracking-server-endpoint>:5000 so both blocks use the same
placeholder format.

185-185: PVC name wy-model-cache looks like a personal/dev artifact.

Still unaddressed from the previous review. Use a generic placeholder such as <your-model-cache-pvc> to match the style of other placeholders in the document.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` at line 185,
Replace the hard-coded PVC name "wy-model-cache" used in the Kubernetes manifest
(the claimName field) with a generic placeholder to match the docs style—e.g.,
change claimName: wy-model-cache to claimName: <your-model-cache-pvc> (or
similar) wherever the manifest shows claimName so readers supply their own PVC
name.

407-407: Unquoted glob *.safetensors will be shell-expanded before git lfs track.

If matching files exist in the directory the shell expands the glob to individual filenames instead of registering the tracking rule. Still unresolved from the previous review. Quote the pattern: git lfs track "*.safetensors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` at line 407,
Replace the unquoted glob in the git LFS tracking instruction so the shell
doesn't expand it; change the command string "git lfs track *.safetensors" to
use a quoted pattern (e.g., git lfs track "*.safetensors") in the
docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx so the tracking rule
is registered rather than expanded by the shell.
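
For readers verifying this locally, a minimal shell sketch (using `echo` as a stand-in for `git lfs track`, so no Git LFS install is needed) shows the expansion difference:

```shell
# Stand-in demo: the shell, not the command, decides what an unquoted glob means.
cd "$(mktemp -d)"
touch a.safetensors b.safetensors

unquoted=$(echo *.safetensors)     # glob is expanded before echo ever runs
quoted=$(echo "*.safetensors")     # literal pattern reaches the command

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

With matching files present, the unquoted form passes the individual filenames instead of the pattern, which is exactly why the tracking rule never gets registered.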

456-478: mountPermissions: "0757" is likely a typo, and the NFS server IP is hardcoded.

Both issues from the previous review are still present: 0757 gives world-writable access while restricting the group (unusual); and 192.168.17.28 should be a placeholder like <your-nfs-server-ip>.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 456 -
478, In the StorageClass YAML, correct the erroneous mountPermissions and remove
the hardcoded NFS IP: change mountPermissions: "0757" to a safer standard like
mountPermissions: "0755" (or another org-approved mask) and replace server:
192.168.17.28 with a placeholder such as server: "<your-nfs-server-ip>" so users
supply their own NFS server; update the share value only if needed for clarity.

440-440: "fine tuning" still unhyphenated on line 440.

The previous review marked this as addressed (commit d5a5ee2), but the current code still reads "fine tuning experiments". LanguageTool continues to flag this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` at line 440,
Update the phrase "fine tuning experiments" to use the hyphenated form
"fine-tuning experiments" in the sentence that begins with "model cache PVC
(optional):" so the documentation uses "fine-tuning" consistently; locate the
exact string "fine tuning experiments" and replace it with "fine-tuning
experiments".

638-647: Non-Nvidia GPU section requests nvidia.com/* resources.

The section is explicitly for NPU/Gaudi/AMD etc., yet the example YAML limits nvidia.com/gpualloc, nvidia.com/gpucores, and nvidia.com/gpumem. Still unresolved from the previous review. Replace with vendor-agnostic placeholders.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 638 -
647, The YAML in the Non-Nvidia GPU section incorrectly uses NVIDIA-specific
resource keys (nvidia.com/gpualloc, nvidia.com/gpucores, nvidia.com/gpumem);
update those limits to vendor-agnostic placeholders (e.g., replace
nvidia.com/gpualloc with <vendor.com/gpualloc> or generic gpu.alloc,
nvidia.com/gpucores with <vendor.com/gpucores> or gpu.cores, and
nvidia.com/gpumem with <vendor.com/gpumem> or gpu.mem) so the example applies to
NPU/Gaudi/AMD and other vendors while keeping the same numeric values (1, 50,
8192) and retaining the surrounding limits/requests keys.

228-257: set -ex leaks Git credentials, and rm -rf makes the dataset existence check dead code.

Both issues from the previous review remain:

  1. set -ex (line 228) traces every command; the git clone "https://${gitauth}@…" lines on 237–238 and 250 will print GIT_USER:GIT_TOKEN in pod logs.
  2. Lines 244–245 unconditionally rm -rf ${DATASET_NAME} and rm -rf data before the if [ -d ${DATASET_NAME} ] check on line 247, making the "already exists" branch permanently unreachable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 228 -
257, The script leaks credentials and incorrectly removes the dataset before
checking existence: remove or change the "set -ex" usage to avoid echoing
commands with secrets (use "set -e" or disable xtrace around the git clone), and
stop unconditionally deleting the dataset by removing the "rm -rf
${DATASET_NAME}" and "rm -rf data" lines so the "if [ -d ${DATASET_NAME} ]"
branch can work; additionally, ensure git clone commands that use gitauth (the
"git clone \"https://${gitauth}@${DATASET_URL_NO_HTTPS}\"" and similar for
BASE_MODEL_URL) do not print credentials (mask or avoid embedding
GIT_USER:GIT_TOKEN directly, use credential helpers or environment-based auth
and wrap any sensitive commands with xtrace off).

275-275: wy-sft-output in OUTPUT_MODEL_URL is a personal/dev artifact.

Still unaddressed. Replace with a generic placeholder consistent with the rest of the document (e.g., <your-sft-output>).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` at line 275, The
literal personal artifact "wy-sft-output" used in the OUTPUT_MODEL_URL value
should be replaced with a generic placeholder; update the value that contains
OUTPUT_MODEL_URL so it reads something like
"https://<your-model-registry-git-address>/mlops-demo-ai-test/amlmodels/<your-sft-output>"
(replace "wy-sft-output" with "<your-sft-output>") so it matches the rest of the
document's placeholder convention.

515-515: Typos "adapt" → "adapted" and "senarios" → "scenarios" on line 515.

Still unresolved from the previous review. LanguageTool continues to flag "senarios".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` at line 515, In
the sentence starting with "**Note:** The following steps can also be adapt to
LLM pre-training and traditional ML senarios." replace "adapt" with "adapted"
and "senarios" with "scenarios" so the corrected text reads "**Note:** The
following steps can also be adapted to LLM pre-training and traditional ML
scenarios." This fixes the two typos flagged by LanguageTool.

301-312: set -ex in the main container will trace git push with embedded credentials.

set -ex on line 301 means git push -u ${PUSH_URL} sft-${push_branch} on line 410 will print the full URL including GIT_USER:GIT_TOKEN to pod logs. Still unresolved from the previous review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 301 -
312, The script currently enables shell xtrace with "set -ex", which will print
commands and reveal sensitive values when running "git push -u ${PUSH_URL}
sft-${push_branch}" (and any use of GIT_USER/GIT_TOKEN/PUSH_URL); change it so
the tracer is disabled before any push: either use "set -e" (drop -x) or wrap
the push with "set +x" before the git command and "set -x" after, or explicitly
unset or mask GIT_TOKEN/GIT_USER/PUSH_URL before logging/exec; update the block
containing "set -ex" and the code that performs "git push -u ${PUSH_URL}
sft-${push_branch}" to ensure credentials are never printed.

305-307: Redundant self-assignment export RANK=$RANK on line 307.

Line 305 already sets RANK=$VC_TASK_INDEX; line 307 re-exports it to itself — a no-op. Still unresolved from the previous review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 305 -
307, Remove the redundant self-assignment "export RANK=$RANK" (it does nothing
because RANK is already set from VC_TASK_INDEX); either delete that line or
replace it with the intended variable export if it was meant to export a
different env var (e.g., WORLD_SIZE or another value). Update the block that
sets RANK, MASTER_HOST, and VC_WORKER_HOSTS accordingly so only meaningful
exports remain.

102-116: ARG before FROM is invisible in RUN, missing dockerfile fence identifier, and shallow clone cannot resolve the tag.

These three issues from the previous review remain unaddressed:

  1. ARG LLAMA_FACTORY_VERSION declared before FROM is reset after FROM; ${LLAMA_FACTORY_VERSION} is empty in the subsequent RUN, so git checkout silently checks out nothing.
  2. The fenced code block has no language identifier (dockerfile is needed for syntax highlighting).
  3. git clone --depth 1 fetches only the default branch tip; git checkout v0.9.4 will fail if that tag is not the tip.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 102 -
116, The Dockerfile fragment has three issues: make LLAMA_FACTORY_VERSION
visible to RUN by declaring ARG LLAMA_FACTORY_VERSION after the FROM (or
re-declare ARG LLAMA_FACTORY_VERSION immediately after the FROM instruction) so
${LLAMA_FACTORY_VERSION} is populated for the git checkout step; mark the fenced
block with the dockerfile language identifier (use ```dockerfile) for proper
highlighting; and avoid shallow-clone failure by changing the git clone
invocation (e.g., remove --depth 1 or use --branch ${LLAMA_FACTORY_VERSION}
--single-branch or clone then git fetch --tags) so git checkout
${LLAMA_FACTORY_VERSION} can find the tag (refer to LLAMA_FACTORY_VERSION, the
RUN block that does git clone, and the git checkout command).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Around line 324-363: The heredoc used to write lf-sft.yaml (via <<EOL) is
indented causing every YAML line to have leading spaces which breaks root-level
YAML parsing; remove the indentation so the heredoc body starts at column 0 (or
generate the file with printf) and keep using an unquoted <<EOL (not <<'EOL') so
${BASE_MODEL_NAME}/${DATASET_NAME} still expand; apply the same dedent/fix to
the lf-merge-config.yaml heredoc as well.

---

Duplicate comments:
In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx`:
- Around line 291-296: The MLFLOW_TRACKING_URI examples are inconsistent across
the two example blocks; locate the occurrences of the environment variable
MLFLOW_TRACKING_URI in both example blocks (the YAML env var entries named
MLFLOW_TRACKING_URI) and replace the hard-coded endpoints with a consistent
placeholder such as http://<mlflow-tracking-server-endpoint>:5000 so both blocks
use the same placeholder format.
- Line 185: Replace the hard-coded PVC name "wy-model-cache" used in the
Kubernetes manifest (the claimName field) with a generic placeholder to match
the docs style—e.g., change claimName: wy-model-cache to claimName:
<your-model-cache-pvc> (or similar) wherever the manifest shows claimName so
readers supply their own PVC name.
- Line 407: Replace the unquoted glob in the git LFS tracking instruction so the
shell doesn't expand it; change the command string "git lfs track *.safetensors"
to use a quoted pattern (e.g., git lfs track "*.safetensors") in the
docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx so the tracking rule
is registered rather than expanded by the shell.
- Around line 456-478: In the StorageClass YAML, correct the erroneous
mountPermissions and remove the hardcoded NFS IP: change mountPermissions:
"0757" to a safer standard like mountPermissions: "0755" (or another
org-approved mask) and replace server: 192.168.17.28 with a placeholder such as
server: "<your-nfs-server-ip>" so users supply their own NFS server; update the
share value only if needed for clarity.
- Line 440: Update the phrase "fine tuning experiments" to use the hyphenated
form "fine-tuning experiments" in the sentence that begins with "model cache PVC
(optional):" so the documentation uses "fine-tuning" consistently; locate the
exact string "fine tuning experiments" and replace it with "fine-tuning
experiments".
- Around line 638-647: The YAML in the Non-Nvidia GPU section incorrectly uses
NVIDIA-specific resource keys (nvidia.com/gpualloc, nvidia.com/gpucores,
nvidia.com/gpumem); update those limits to vendor-agnostic placeholders (e.g.,
replace nvidia.com/gpualloc with <vendor.com/gpualloc> or generic gpu.alloc,
nvidia.com/gpucores with <vendor.com/gpucores> or gpu.cores, and
nvidia.com/gpumem with <vendor.com/gpumem> or gpu.mem) so the example applies to
NPU/Gaudi/AMD and other vendors while keeping the same numeric values (1, 50,
8192) and retaining the surrounding limits/requests keys.
- Around line 228-257: The script leaks credentials and incorrectly removes the
dataset before checking existence: remove or change the "set -ex" usage to avoid
echoing commands with secrets (use "set -e" or disable xtrace around the git
clone), and stop unconditionally deleting the dataset by removing the "rm -rf
${DATASET_NAME}" and "rm -rf data" lines so the "if [ -d ${DATASET_NAME} ]"
branch can work; additionally, ensure git clone commands that use gitauth (the
"git clone \"https://${gitauth}@${DATASET_URL_NO_HTTPS}\"" and similar for
BASE_MODEL_URL) do not print credentials (mask or avoid embedding
GIT_USER:GIT_TOKEN directly, use credential helpers or environment-based auth
and wrap any sensitive commands with xtrace off).
- Line 275: The literal personal artifact "wy-sft-output" used in the
OUTPUT_MODEL_URL value should be replaced with a generic placeholder; update the
value that contains OUTPUT_MODEL_URL so it reads something like
"https://<your-model-registry-git-address>/mlops-demo-ai-test/amlmodels/<your-sft-output>"
(replace "wy-sft-output" with "<your-sft-output>") so it matches the rest of the
document's placeholder convention.
- Line 515: In the sentence starting with "**Note:** The following steps can
also be adapt to LLM pre-training and traditional ML senarios." replace "adapt"
with "adapted" and "senarios" with "scenarios" so the corrected text reads
"**Note:** The following steps can also be adapted to LLM pre-training and
traditional ML scenarios." This fixes the two typos flagged by LanguageTool.
- Around line 301-312: The script currently enables shell xtrace with "set -ex",
which will print commands and reveal sensitive values when running "git push -u
${PUSH_URL} sft-${push_branch}" (and any use of GIT_USER/GIT_TOKEN/PUSH_URL);
change it so the tracer is disabled before any push: either use "set -e" (drop
-x) or wrap the push with "set +x" before the git command and "set -x" after, or
explicitly unset or mask GIT_TOKEN/GIT_USER/PUSH_URL before logging/exec; update
the block containing "set -ex" and the code that performs "git push -u
${PUSH_URL} sft-${push_branch}" to ensure credentials are never printed.
- Around line 305-307: Remove the redundant self-assignment "export RANK=$RANK"
(it does nothing because RANK is already set from VC_TASK_INDEX); either delete
that line or replace it with the intended variable export if it was meant to
export a different env var (e.g., WORLD_SIZE or another value). Update the block
that sets RANK, MASTER_HOST, and VC_WORKER_HOSTS accordingly so only meaningful
exports remain.
- Around line 102-116: The Dockerfile fragment has three issues: make
LLAMA_FACTORY_VERSION visible to RUN by declaring ARG LLAMA_FACTORY_VERSION
after the FROM (or re-declare ARG LLAMA_FACTORY_VERSION immediately after the
FROM instruction) so ${LLAMA_FACTORY_VERSION} is populated for the git checkout
step; mark the fenced block with the dockerfile language identifier (use
```dockerfile) for proper highlighting; and avoid shallow-clone failure by
changing the git clone invocation (e.g., remove --depth 1 or use --branch
${LLAMA_FACTORY_VERSION} --single-branch or clone then git fetch --tags) so git
checkout ${LLAMA_FACTORY_VERSION} can find the tag (refer to
LLAMA_FACTORY_VERSION, the RUN block that does git clone, and the git checkout
command).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0fa921a and e6ba459.

📒 Files selected for processing (1)
  • docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx

Comment on lines +324 to +363
```shell
cat >lf-sft.yaml <<EOL
model_name_or_path: /mnt/models/${BASE_MODEL_NAME}

stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 8
lora_alpha: 16
lora_dropout: 0.1

dataset: identity_alauda
dataset_dir: /mnt/models/${DATASET_NAME}
template: qwen
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 8

output_dir: output_models
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

# global batch size: 8
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 2.0e-4
num_train_epochs: 4.0
bf16: false
fp16: true
ddp_timeout: 180000000

val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
report_to: mlflow
EOL
```

⚠️ Potential issue | 🔴 Critical

Heredoc produces invalid YAML — training and merge configs will fail to parse.

<<EOL (without -) preserves all leading whitespace verbatim. Because the cat command is indented ~16 spaces inside the shell script, every line written to lf-sft.yaml and lf-merge-config.yaml will begin with those same 16 spaces. Root-level YAML keys must not be indented; PyYAML (used by LLaMA-Factory) will raise a ScannerError at parse time, causing llamafactory-cli train / llamafactory-cli export to fail immediately.

🐛 Proposed fix — dedent the heredoc body or use a `printf`-style approach

The cleanest fix is to simply not indent the heredoc body lines (or use a helper that strips indentation). Bash's <<-EOL only strips tabs, not spaces, so it is not sufficient here. The simplest correction:

```diff
-                cat >lf-sft.yaml <<EOL
-                model_name_or_path: /mnt/models/${BASE_MODEL_NAME}
-
-                stage: sft
-                do_train: true
-                ...
-                EOL
+cat >lf-sft.yaml <<'EOL'
+model_name_or_path: /mnt/models/${BASE_MODEL_NAME}
+
+stage: sft
+do_train: true
+...
+EOL
```

Note: switching to <<'EOL' (quoted) disables variable substitution inside the heredoc body; if you need ${BASE_MODEL_NAME} / ${DATASET_NAME} to expand, keep <<EOL but still move the body lines to column 0.

Apply the same fix to the lf-merge-config.yaml heredoc at lines 379–390.

Also applies to: 379-390
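
To see the failure mode in isolation (file names here are arbitrary demo choices), compare an indented heredoc body with one that starts at column 0:

```shell
cd "$(mktemp -d)"

# Indented body: every leading space is written to the file verbatim.
    cat > indented.yaml <<EOL
    stage: sft
    do_train: true
EOL

# Flush-left body: root-level keys land at column 0, as YAML requires.
cat > flush.yaml <<EOL
stage: sft
do_train: true
EOL

echo "indented first byte: '$(head -c 1 indented.yaml)'"   # a space: broken YAML
echo "flush first byte:    '$(head -c 1 flush.yaml)'"      # 's' from "stage"
```

The indented variant produces keys with leading whitespace, which a YAML parser rejects at the root level, matching the failure the comment describes.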

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/workbench/how_to/fine_tunning_using_notebooks.mdx` around lines 324 -
363, The heredoc used to write lf-sft.yaml (via <<EOL) is indented causing every
YAML line to have leading spaces which breaks root-level YAML parsing; remove
the indentation so the heredoc body starts at column 0 (or generate the file
with printf) and keep using an unquoted <<EOL (not <<'EOL') so
${BASE_MODEL_NAME}/${DATASET_NAME} still expand; apply the same dedent/fix to
the lf-merge-config.yaml heredoc as well.

