Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
📝 WalkthroughWalkthroughThese changes add TensorBoard observability support to the speculative decoding example, complementing existing Weights & Biases integration. The training pipeline conditionally initializes TensorBoard logging through SummaryWriter, refactors metric reporting with mode-qualified naming, and updates shell script configuration to enable TensorBoard via CLI flags. Changes
Sequence Diagram(s)sequenceDiagram
participant Trainer
participant EagleTrainingPlot as EagleTrainingPlot<br/>(Callback)
participant TensorBoard as TensorBoard<br/>SummaryWriter
participant WandB as Weights & Biases
Trainer->>EagleTrainingPlot: on_log (training step)
EagleTrainingPlot->>EagleTrainingPlot: _report_stats(state, eval_mode=False)
EagleTrainingPlot->>TensorBoard: add_scalar(parallel_i_step_j_train_acc)
EagleTrainingPlot->>WandB: log parallel_i_step_j_train_acc
Trainer->>EagleTrainingPlot: on_evaluate (evaluation)
EagleTrainingPlot->>EagleTrainingPlot: _report_stats(state, eval_mode=True)
EagleTrainingPlot->>TensorBoard: add_scalar(estimated_eval_ar)
EagleTrainingPlot->>WandB: log estimated_eval_ar
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/speculative_decoding/eagle_utils.py (1)
74-74:⚠️ Potential issue | 🟠 MajorSecurity:
torch.loadwithoutweights_only=Trueon potentially user-supplied files.This line loads
.ptfiles fromoffline_data_path, which could be user-supplied. Per coding guidelines,torch.loadshould useweights_only=Trueto prevent arbitrary code execution from malicious checkpoints. Ifweights_only=Falseis required, an inline comment must justify why it is safe.Since this is pre-existing code and not part of this PR's changes, consider addressing it in a follow-up, but it warrants attention.
🛡️ Recommended fix
- offline_data = torch.load(self.dumped_files[i]) + offline_data = torch.load(self.dumped_files[i], weights_only=True)If the offline data files contain custom objects that require
weights_only=False, add an inline comment explaining why this is safe (e.g., files are internally-generated by trusted scripts).As per coding guidelines: "Do not use
torch.load(..., weights_only=False)unless a documented exception is provided. Always useweights_only=Trueto prevent arbitrary code execution from malicious checkpoints."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/speculative_decoding/eagle_utils.py` at line 74, The call to torch.load in the line that assigns offline_data = torch.load(self.dumped_files[i]) is unsafe for potentially user-supplied files; change it to torch.load(self.dumped_files[i], weights_only=True) to avoid arbitrary code execution, or if the code genuinely needs non-weight objects, add an inline justification comment next to this call explaining why weights_only=False is safe (e.g., files are generated by a trusted internal process), and keep the rest of the logic that uses offline_data and self.dumped_files[i] unchanged.
🧹 Nitpick comments (2)
examples/speculative_decoding/launch_train.sh (1)
223-227: Consider using a more explicit boolean check.The condition
!= "False"means any value other than literal "False" enables TensorBoard (e.g.,--tensorboard=True,--tensorboard=1, or even typos like--tensorboard=Flase). This is consistent with other flags in the script, but a stricter check could prevent accidental enablement.♻️ Optional: Stricter boolean check
-if [[ "$ENABLE_TENSORBOARD" != "False" ]]; then +if [[ "$ENABLE_TENSORBOARD" == "True" ]]; then OBSERVABILITY_ARGS="--report_to tensorboard" else OBSERVABILITY_ARGS="" fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/speculative_decoding/launch_train.sh` around lines 223 - 227, The current conditional uses ENABLE_TENSORBOARD != "False" which treats any non-"False" value as true; change it to perform an explicit boolean check on ENABLE_TENSORBOARD (e.g., test for "True" or "1", or a case-insensitive match like "true"/"1") before setting OBSERVABILITY_ARGS, so only intended truthy values enable "--report_to tensorboard"; update the conditional that sets OBSERVABILITY_ARGS (the ENABLE_TENSORBOARD variable and OBSERVABILITY_ARGS assignment) to use the stricter check.examples/speculative_decoding/eagle_utils.py (1)
239-249: Address or remove the TODO comment.The TODO at line 240 asks "What are in 'kwargs.logs'?" - this should either be investigated and documented, or removed if it's no longer relevant. Leaving unresolved TODOs can cause confusion for future maintainers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/speculative_decoding/eagle_utils.py` around lines 239 - 249, Remove or replace the TODO asking "What are in 'kwargs.logs'?" by either deleting it if obsolete or adding a concise explanatory comment stating that kwargs.logs is the source of the per-draft/per-step metrics used to compute average_acc (i.e., the logged metrics aggregated into average_acc), and note any expected keys/structure; update the nearby comment above the self.tb_writer block (which writes average_acc and est_ar using mode_id and state.global_step) and, if applicable, add a short note in the function/class docstring that describes kwargs.logs structure so future maintainers understand how average_acc and est_ar are derived.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@examples/speculative_decoding/eagle_utils.py`:
- Line 74: The call to torch.load in the line that assigns offline_data =
torch.load(self.dumped_files[i]) is unsafe for potentially user-supplied files;
change it to torch.load(self.dumped_files[i], weights_only=True) to avoid
arbitrary code execution, or if the code genuinely needs non-weight objects, add
an inline justification comment next to this call explaining why
weights_only=False is safe (e.g., files are generated by a trusted internal
process), and keep the rest of the logic that uses offline_data and
self.dumped_files[i] unchanged.
---
Nitpick comments:
In `@examples/speculative_decoding/eagle_utils.py`:
- Around line 239-249: Remove or replace the TODO asking "What are in
'kwargs.logs'?" by either deleting it if obsolete or adding a concise
explanatory comment stating that kwargs.logs is the source of the
per-draft/per-step metrics used to compute average_acc (i.e., the logged metrics
aggregated into average_acc), and note any expected keys/structure; update the
nearby comment above the self.tb_writer block (which writes average_acc and
est_ar using mode_id and state.global_step) and, if applicable, add a short note
in the function/class docstring that describes kwargs.logs structure so future
maintainers understand how average_acc and est_ar are derived.
In `@examples/speculative_decoding/launch_train.sh`:
- Around line 223-227: The current conditional uses ENABLE_TENSORBOARD !=
"False" which treats any non-"False" value as true; change it to perform an
explicit boolean check on ENABLE_TENSORBOARD (e.g., test for "True" or "1", or a
case-insensitive match like "true"/"1") before setting OBSERVABILITY_ARGS, so
only intended truthy values enable "--report_to tensorboard"; update the
conditional that sets OBSERVABILITY_ARGS (the ENABLE_TENSORBOARD variable and
OBSERVABILITY_ARGS assignment) to use the stricter check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 37c34d1f-fb87-4239-869c-e70befcdb60e
📒 Files selected for processing (4)
examples/speculative_decoding/README.mdexamples/speculative_decoding/eagle_utils.pyexamples/speculative_decoding/launch_train.shexamples/speculative_decoding/main.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1070 +/- ##
=======================================
Coverage 70.30% 70.30%
=======================================
Files 227 227
Lines 25857 25857
=======================================
Hits 18179 18179
Misses 7678 7678 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
What does this PR do?
Type of change: Adds support for Tensorboard as a logging backend.
Also updates logging to support eval mode (which will be re-enabled in a separate PR)
Usage
Add
--tensorboard Truetolaunch_train.shTesting
I ran it to make sure the logs are reasonable. Looks fine to me
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: N/ASummary by CodeRabbit
Release Notes
New Features
Documentation