Merged
24 changes: 14 additions & 10 deletions .claude/skills/trtllm-model-onboard-multimodal/SKILL.md
@@ -31,7 +31,11 @@ metadata:
asyncio.gather then decodes all media for one request in parallel.

[2] Input pipeline (asyncio.to_thread, off the event loop)
-BaseMultimodalInputProcessor.__call__:
+BaseMultimodalInputProcessor.__call__ dispatches by input shape: a text
+prompt goes to the per-model call_with_text_prompt; prompt_token_ids +
+mm_data goes to the base-class call_with_token_ids fast path (or is
+detokenized back to call_with_text_prompt when the model opts out). The
+per-model HF processing lives in call_with_text_prompt:
HF AutoProcessor → pixel_values + token_ids
mm-token layout (positions / lengths / special_token_offsets)
(mRoPE) mrope_position_ids + deltas computed on CPU
@@ -79,7 +83,7 @@ metadata:
When `@support_multimodal_disaggregated` is set and the deployment uses `TLLM_MULTIMODAL_DISAGGREGATED=1`:

- **Encoder worker:** runs as a standalone `MultimodalEncoder` (`mm_encoder_only=True`). It executes only the multimodal encoder and ships `mm_embeddings` (+ mRoPE position ids/deltas) to prefill+decode workers as shared-tensor handles.
-- **Prefill+decode worker:** the model's `__init__` skips constructing `self.mm_encoder` when `_is_disagg()` is true; the input processor's `attach_multimodal_embeddings()` override binds the encoder handles into the request. For context-only requests, the engine re-clones mrope tensors so IPC handles outlive the encoder worker's freed memory — replicate that pattern for any new GPU-resident mm tensors.
+- **Prefill+decode worker:** the model's `__init__` skips constructing `self.mm_encoder` when `_is_disagg()` is true; the input processor's `_attach_multimodal_embeddings_impl()` override binds the encoder handles into the request (the base `attach_multimodal_embeddings` wrapper detokenizes tokenized inputs for non-fast-path VLMs, then delegates to your impl). For context-only requests, the engine re-clones mrope tensors so IPC handles outlive the encoder worker's freed memory — replicate that pattern for any new GPU-resident mm tensors.

### Templates to study

@@ -159,11 +163,11 @@ Three-arg **`torch.where(cond, x, y)`** is fine when **`cond`** is built only on

CPU-bound work (decode / resize / normalize / mel-spectrogram / frame extraction) must not compete with GPU work, block the request loop, or serialize across requests.

-- HF AutoProcessor + image_processor + tokenizer run inside `BaseMultimodalInputProcessor.__call__` — *not* in the model worker.
+- HF AutoProcessor + image_processor + tokenizer run inside the input processor's `call_with_text_prompt` (dispatched from `__call__`) — *not* in the model worker.
- URL/bytes media goes through `async_load_image` / `async_load_video` / `async_load_audio` (all wrap blocking decode in `asyncio.to_thread`). Never call `PIL.Image.open(...).load()` / `cv2.VideoCapture` / `soundfile.read` synchronously on the request hot path.
- Pin host tensors before H2D with `prefer_pinned()` (False under Confidential Compute (CC), True otherwise). The engine pins `multimodal_data` automatically via `to_device(..., pin_memory=prefer_pinned())`.
- **Declare `multimodal_data_device_paths`** on the model — list of dotted paths (e.g. `["image.pixel_values", "image.image_grid_thw", "video.pixel_values_videos", "video.video_grid_thw", "multimodal_embedding"]`) telling the engine which fields go to CUDA. Anything not listed stays on CPU.
-- Optional (refactor pending): `get_text_with_mm_placeholders` + `expand_prompt_token_ids_for_mm` enable the tokenized+MM fast path (`tokenized_multimodal_process`), skipping redundant detokenization. A cleaner alternative is being designed — skip unless you have a specific need.
+- Optional tokenized+MM fast path: set `supports_token_id_mm_expansion = True` (a `ClassVar`, default `False`) and implement `get_text_with_mm_placeholders` + `expand_prompt_token_ids_for_mm`. The base-class `__call__` then routes `prompt_token_ids + multi_modal_data` (no `prompt`) requests through `call_with_token_ids`, skipping redundant detokenization. When the flag is `False` (most VLMs), the base class detokenizes `prompt_token_ids → prompt` and re-runs `call_with_text_prompt`, so token-ID inputs still work — just less efficiently. Only LlavaNext + NanoV2VL opt in today.
- Forward `mm_processor_kwargs` from `inputs.get("mm_processor_kwargs", {})` to the HF processor (callers tune things like video sample rate via this).
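
The `asyncio.to_thread` rule in the list above can be illustrated with a minimal sketch that uses only the Python standard library. The names `blocking_decode`, `async_decode`, and `decode_all` are hypothetical stand-ins invented for this illustration. They are not the real `async_load_image` / `async_load_video` / `async_load_audio` helpers, whose signatures are not shown in this diff.

```python
import asyncio
import time
from typing import List


def blocking_decode(payload: bytes) -> int:
    """Hypothetical stand-in for a CPU-bound decode (PIL, cv2, soundfile...)."""
    time.sleep(0.01)  # simulate blocking work
    return len(payload)


async def async_decode(payload: bytes) -> int:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_decode, payload)


async def decode_all(payloads: List[bytes]) -> List[int]:
    # Decode every media item of one request concurrently; gather keeps order.
    return await asyncio.gather(*(async_decode(p) for p in payloads))


def run_demo() -> List[int]:
    return asyncio.run(decode_all([b"abc", b"de", b""]))
```

Calling `blocking_decode` directly inside a coroutine would stall every other request on the same loop for the duration of the decode, which is the failure mode the bullet warns about.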

### Contract 3 — Large media via shared tensors, never raw pickle
@@ -302,7 +306,7 @@ class {Name}Model(PreTrainedModel):

Subclass **both** `BaseMultimodalInputProcessor` (drives every real request) and `BaseMultimodalDummyInputsBuilder` (drives engine warmup / profiling — the base shrinks dummy image resolution until the synthetic prompt fits `input_seq_len`). Colocate in the modeling file. Reference: `Qwen3VLInputProcessorBase`.

-`__call__(inputs, sampling_params)` does:
+Implement `call_with_text_prompt(inputs, sampling_params)` — the per-model text-prompt path. **Don't override `__call__`**: the base class's concrete `__call__` dispatches here for text prompts, and also detokenizes `prompt_token_ids → prompt` and falls through to here for non-fast-path VLMs. `call_with_text_prompt` does:

1. Pull `text_prompt`, `mm_data`, `mm_processor_kwargs` from `inputs`.
2. `_preprocess(...)` — HF processor produces `pixel_values` / `pixel_values_videos` / `*_grid_thw` / `input_ids`.
@@ -311,9 +315,9 @@ Subclass **both** `BaseMultimodalInputProcessor` (drives every real request) and
5. `_postprocess(input_ids)` rewrites HF's `image_token_id` / `video_token_id` to `tllm_multimodal_token_id = vocab_size + 1` (the OOV sentinel). Skip when `mm_data` is empty.
6. Return `(prompt_token_ids_list, {"multimodal_data": multimodal_data})`.
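
Step 5 of the list above can be sketched on plain Python lists. This is an illustration under stated assumptions, not the real `_postprocess`, which may well operate on tensors. The function `rewrite_mm_token_ids`, its parameters, and the token ids used in the usage note are hypothetical. Only the sentinel formula `vocab_size + 1` comes from the text.

```python
from typing import Iterable, List


def rewrite_mm_token_ids(
    input_ids: List[int],
    mm_token_ids: Iterable[int],
    vocab_size: int,
) -> List[int]:
    """Replace HF image/video placeholder ids with one out-of-vocabulary sentinel.

    Hypothetical helper: mm_token_ids plays the role of
    {image_token_id, video_token_id} in the text above.
    """
    sentinel = vocab_size + 1  # OOV sentinel, as stated in step 5
    mm_ids = set(mm_token_ids)
    if not mm_ids:
        # Mirrors "Skip when mm_data is empty": nothing to rewrite.
        return list(input_ids)
    return [sentinel if tok in mm_ids else tok for tok in input_ids]
```

For example, with a made-up `vocab_size` of 10 and placeholder ids `{5, 7}`, the input `[1, 5, 5, 2, 7]` becomes `[1, 11, 11, 2, 11]`, and ordinary text tokens are left untouched.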

-**Optional overrides (refactor pending; skip unless needed):** `get_text_with_mm_placeholders(mm_counts)` + `expand_prompt_token_ids_for_mm(prompt_token_ids, num_mm_tokens, ...)` enable the tokenized fast path. A cleaner replacement is being designed.
+**Optional tokenized+MM fast path (skip unless needed):** set `supports_token_id_mm_expansion = True` (`ClassVar`) and implement `get_text_with_mm_placeholders(mm_counts)` + `expand_prompt_token_ids_for_mm(prompt_token_ids, num_mm_tokens, ...)`. The base-class `call_with_token_ids` then builds dummy placeholder text, runs `call_with_text_prompt` on it, expands the real token IDs, and merges any returned `mm_data_updates` (e.g. video `evs_ids`) into `multimodal_data`. Leave the flag `False` and the base class just detokenizes token-ID inputs and re-runs `call_with_text_prompt`. Only LlavaNext + NanoV2VL opt in today.
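
The opt-in dispatch described above might look roughly like the toy sketch below. It is not the real `BaseMultimodalInputProcessor`: the classes `ToyInputProcessor` and `FastPathProcessor`, the whitespace "tokenizer", and the returned route label are all hypothetical, and the real methods also take `sampling_params`. Only the method names, the `supports_token_id_mm_expansion` flag, and the three routes are taken from the text.

```python
from typing import Any, ClassVar, Dict, List, Tuple


class ToyInputProcessor:
    """Hypothetical stand-in that only models the dispatch in __call__."""

    # Opt-in flag named in the text; False means "no token-ID fast path".
    supports_token_id_mm_expansion: ClassVar[bool] = False

    def __call__(self, inputs: Dict[str, Any]) -> Tuple[List[int], str]:
        if "prompt" in inputs:
            return self.call_with_text_prompt(inputs), "text"
        if self.supports_token_id_mm_expansion:
            return self.call_with_token_ids(inputs), "fast"
        # Fallback: detokenize the ids, then reuse the text-prompt path.
        prompt = self.detokenize(inputs["prompt_token_ids"])
        return self.call_with_text_prompt({**inputs, "prompt": prompt}), "detokenized"

    def detokenize(self, token_ids: List[int]) -> str:
        # Toy tokenizer: ids joined by spaces.
        return " ".join(str(t) for t in token_ids)

    def call_with_text_prompt(self, inputs: Dict[str, Any]) -> List[int]:
        # Per-model path; here it just parses the toy prompt back into ids.
        return [int(t) for t in inputs["prompt"].split()]

    def call_with_token_ids(self, inputs: Dict[str, Any]) -> List[int]:
        # Fast path: use the caller's ids without a detokenize round trip.
        return list(inputs["prompt_token_ids"])


class FastPathProcessor(ToyInputProcessor):
    supports_token_id_mm_expansion = True
```

Because subclasses only flip a class-level flag and implement the per-model methods, `__call__` stays in the base class, which matches the advice not to override it.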

-**EPD override (if `@support_multimodal_disaggregated`):** `attach_multimodal_embeddings(inputs, multimodal_embedding, sampling_params)` consumes encoder outputs in the prefill+decode worker.
+**EPD override (if `@support_multimodal_disaggregated`):** override `_attach_multimodal_embeddings_impl(inputs, multimodal_embedding, sampling_params)` — **not** the `attach_multimodal_embeddings` wrapper — to consume encoder outputs in the prefill+decode worker. The base wrapper detokenizes tokenized inputs for non-fast-path VLMs before delegating to your impl.

**Decorator stack** — bottom-up application; `register_vision_encoder` requires `register_auto_model` to have run first:

@@ -429,9 +433,9 @@ Follow `CONTRIBUTING.md`. Title `[JIRA/NVBUG/None][type] description`, `git comm

**Input processor**
- [ ] Subclasses both `BaseMultimodalInputProcessor` and `BaseMultimodalDummyInputsBuilder`.
-- [ ] `__call__` runs HF AutoProcessor + tokenizer, builds `multimodal_data` by modality, computes `mrope_config` on CPU, `_postprocess`-rewrites mm token ids to the OOV sentinel.
-- [ ] `mm_processor_kwargs` flow-through preserved. (Tokenized fast-path overrides — `get_text_with_mm_placeholders` / `expand_prompt_token_ids_for_mm` — are optional; a cleaner replacement is being designed.)
-- [ ] `attach_multimodal_embeddings` implemented if `@support_multimodal_disaggregated`.
+- [ ] `call_with_text_prompt` (not `__call__` — that's the base-class dispatcher) runs HF AutoProcessor + tokenizer, builds `multimodal_data` by modality, computes `mrope_config` on CPU, `_postprocess`-rewrites mm token ids to the OOV sentinel.
+- [ ] `mm_processor_kwargs` flow-through preserved. (Tokenized fast path is optional: set `supports_token_id_mm_expansion = True` + implement `get_text_with_mm_placeholders` / `expand_prompt_token_ids_for_mm`; otherwise the base class detokenizes token-ID inputs automatically.)
+- [ ] `_attach_multimodal_embeddings_impl` implemented (not the `attach_multimodal_embeddings` wrapper) if `@support_multimodal_disaggregated`.

**Performance contracts**
- [ ] Grep clean for Contract 1 bans (`.item()` / `.cpu()` / `.tolist()` / `torch.nonzero` / single-arg `torch.where` / value-dependent `if`, etc.) in modeling `forward` paths — elementwise `torch.where(cond, x, y)` with GPU-only `cond` is fine.
38 changes: 19 additions & 19 deletions security_scanning/examples/auto_deploy/poetry.lock
