
Optimize Qwen3.5 #4434

Merged

lvhan028 merged 30 commits into InternLM:main from lzhangzz:optimize-qwen3.5 on Mar 21, 2026
Conversation

@lzhangzz (Collaborator)

This PR improves TurboMind inference performance for Qwen3.5 models with recurrent/linear attention layers (GatedDeltaNet).

Bug Fixes

  • Fix number of KV-cached layers and MoE router type for Qwen3.5
  • Fix linear attention state management across requests
  • Guard stateful features to avoid incorrect behavior

Kernel Optimizations

  • Implement persistent kernel for GatedDeltaNet
  • Optimize conv1d and recurrent gated delta rule kernels
  • Add a serial chunked GDR kernel with benchmark utilities
  • Refactor invokeRMSNormGated to use Tensor references

Scheduling & State Management

  • Improve scheduling strategy for recurrent states
  • Batch GDN (GatedDeltaNet) execution
  • Refactor recurrent state management and lifecycle
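For readers unfamiliar with GatedDeltaNet, the decode-time recurrence these kernels optimize can be sketched per token. The following is a minimal single-head, pure-Python reference of the standard gated delta rule (decay alpha, write strength beta), not the PR's CUDA implementation:

```python
# Minimal pure-Python reference for one decode step of the gated delta rule
# (recurrent form used by GatedDeltaNet). S is a d_k x d_v state per head;
# this sketch handles a single head and is NOT the PR's CUDA code.

def gated_delta_step(S, q, k, v, alpha, beta):
    d_k, d_v = len(S), len(S[0])
    # Decay the state and apply the rank-1 delta-rule correction:
    #   S <- alpha * (I - beta * k k^T) S + beta * k v^T
    kTS = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]  # k^T S
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] = alpha * (S[i][j] - beta * k[i] * kTS[j]) + beta * k[i] * v[j]
    # Read out with the query: o = S^T q
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return o, S
```

A second write with the same key replaces the stored value (the delta-rule "overwrite" behavior), which is what makes the state usable as a fast-weight memory across a long sequence.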

@tuilakhanh (Contributor)

Failed to load the 27B model when running with --tp 2.
[TM][FATAL] models/llama/gated_delta_net_kernels.cu(1249): Check failed: conv_dim % ch_per_blk == 0
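The failed check partitions the conv1d channels across thread blocks, ch_per_blk at a time, and requires an exact split; under --tp 2 the per-rank conv_dim is halved, which is presumably what broke the invariant. A hypothetical sketch of the guard (names mirror the log; the numbers are illustrative, not taken from the kernel source):

```python
# Hypothetical sketch of the invariant behind the fatal check: a conv1d
# kernel that assigns ch_per_blk channels to each thread block needs
# conv_dim to divide evenly into blocks.
def check_channel_split(conv_dim, ch_per_blk):
    if conv_dim % ch_per_blk != 0:
        raise ValueError(f"Check failed: conv_dim % ch_per_blk == 0 "
                         f"({conv_dim} % {ch_per_blk} != 0)")
    return conv_dim // ch_per_blk  # number of channel blocks
```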

Copilot AI left a comment

Pull request overview

This PR targets higher TurboMind inference throughput for Qwen3.5 models by fixing linear-attention/MoE configuration details, introducing new persistent/batched CUDA kernels for GatedDeltaNet, and refactoring state/cache management to support the updated execution/scheduling flow.

Changes:

  • Add GatedDeltaNet batched/persistent kernels (conv1d+SiLU, recurrent rule v2/v3, chunked prefill) and update call sites to use Tensor/Buffer refs.
  • Move GatedDeltaNet persistent state from per-request storage to sequence-managed pooled state slots; add cache/state invalidation guards.
  • Update Qwen3.5 export/model metadata and KV-cache layer indexing to account for mixed layer types.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/turbomind/turbomind.cc: Sets the linear attention state dtype and blocks prefix caching when linear attention is present.
  • src/turbomind/models/llama/unified_attention_layer.h: Adds cache_layer_ids_ for remapping layer IDs used by KV cache logic.
  • src/turbomind/models/llama/unified_attention_layer.cc: Builds the layer-id remap and uses it in attention decode/prefill params.
  • src/turbomind/models/llama/moe_ffn_layer.cc: Adjusts routing-path selection logic for MoE gating.
  • src/turbomind/models/llama/llama_params.h: Adds linear_state_dtype and the helper HasLinearAttention.
  • src/turbomind/models/llama/gated_delta_net_kernels.h: Refactors kernel APIs to Tensor/Buffer reference-based interfaces and adds new batched launchers.
  • src/turbomind/models/llama/gated_delta_net_kernels.cu: Implements new v2/v3 recurrent kernels, a chunked prefill kernel, and persistent conv1d+SiLU, and refactors helper kernels.
  • src/turbomind/models/llama/bench_gated_delta_net.cc: Adds a benchmark/correctness comparison utility for Gated Delta Rule kernels.
  • src/turbomind/models/llama/bench_conv1d_silu.cc: Adds a benchmark plus a CPU reference correctness checker for the conv1d+SiLU kernel.
  • src/turbomind/models/llama/SequenceManager.h: Adds sequence-owned linear attention state fields and pooled-slot bookkeeping.
  • src/turbomind/models/llama/SequenceManager.cc: Implements pooled slot allocation and cache/state invalidation, and adjusts cache-layer accounting for linear layers.
  • src/turbomind/models/llama/GatedDeltaNetWeight.h: Updates the conv1d weight layout comment.
  • src/turbomind/models/llama/GatedDeltaNetWeight.cc: Builds the fused projection weight and transposes conv1d weights to the kernel-preferred layout.
  • src/turbomind/models/llama/GatedDeltaNetLayer.h: Extends per-phase data to include offsets/state pointer arrays and adds dual-stream execution resources.
  • src/turbomind/models/llama/GatedDeltaNetLayer.cc: Switches to pooled sequence states and launches the new batched/persistent kernels with mixed decode/prefill scheduling.
  • src/turbomind/models/CMakeLists.txt: Adds CUDA compile flags and registers new benchmark executables under BUILD_TEST.
  • src/turbomind/kernels/gemm/test/testbed_v3.h: Updates LlamaLinear construction usage in tests.
  • src/turbomind/kernels/gemm/test/test_utils.cu: Extends FastCompare dispatch/instantiations to support float.
  • src/turbomind/kernels/attention/test_attention.cu: Adds is_share_kv() to satisfy block layout interface expectations.
  • src/turbomind/kernels/attention/CMakeLists.txt: Fixes test target linkage to depend on models.
  • src/turbomind/engine/request.h: Removes per-request linear attention state fields from RequestCache.
  • src/turbomind/engine/engine.cc: Wires SequenceManager ctor changes, adds a stateless guard for linear attention, and integrates pooled state slot acquisition/invalidation.
  • lmdeploy/turbomind/deploy/source_model/qwen.py: Adjusts exported Qwen3.5 MoE routing metadata and exports linear attention parameters.
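One detail worth making concrete from the changes above: with mixed layer types, only full-attention layers occupy KV-cache slots, so global layer ids must be remapped to compact cache-layer ids (which appears to be the role of cache_layer_ids_). A hypothetical sketch using the 3:1 linear/full pattern shown in the Qwen3.5 config later in this thread:

```python
# Hypothetical sketch: map global layer ids to compact KV-cache slots,
# since linear_attention layers keep recurrent state instead of KV cache.
layer_types = (["linear_attention"] * 3 + ["full_attention"]) * 16  # 64 layers

cache_layer_ids = {}
for layer_id, layer_type in enumerate(layer_types):
    if layer_type == "full_attention":
        cache_layer_ids[layer_id] = len(cache_layer_ids)  # next free slot
```

Under this pattern a 64-layer model needs only 16 KV-cached layers, which is why the KV-cache layer accounting had to change.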


@@ -367,7 +367,7 @@ def model_info(self):
info['inter_size'] = shared_expert_size
info['moe_shared_gate'] = True
# Qwen3.5 uses sigmoid MoE routing (not softmax)

Copilot AI Mar 20, 2026


The inline comment says Qwen3.5 uses sigmoid MoE routing, but the code sets info['scoring_func'] = 'softmax'. Please either update the comment to match the implementation, or switch the value back to 'sigmoid' if that is the intended router behavior (and ensure the TurboMind MoE gate path supports it).

Suggested change
# Qwen3.5 uses sigmoid MoE routing (not softmax)
# Qwen3.5 uses softmax MoE routing

@@ -448,7 +448,7 @@ def model_info(self):
info['inter_size'] = cfg.get('shared_expert_intermediate_size', 0)
info['moe_shared_gate'] = True
# Qwen3.5 uses sigmoid MoE routing (not softmax)

Copilot AI Mar 20, 2026


The inline comment says Qwen3.5 uses sigmoid MoE routing, but the code sets info['scoring_func'] = 'softmax'. Please either update the comment to match the implementation, or switch the value back to 'sigmoid' if that is the intended router behavior (and ensure the TurboMind MoE gate path supports it).

Suggested change
# Qwen3.5 uses sigmoid MoE routing (not softmax)
# Qwen3.5 MoE routing uses softmax scoring

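For context on the two review comments above: sigmoid and softmax scoring normalize the router logits differently, but both are monotone, so top-k expert selection is unchanged; only the routing weights differ (and with norm_topk_prob they are renormalized anyway). A small illustrative sketch, not TurboMind's gate code:

```python
import math

# Illustrative MoE router scoring functions (not TurboMind's gate path).
def softmax_scores(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_scores(logits):
    # Per-expert independent scores in (0, 1); they need not sum to 1.
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]
```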
// Gated DeltaNet linear attention persistent states (e.g. Qwen3.5-MoE).
// Allocated on first request, preserved across requests for the same session,
// and freed automatically when the sequence is erased from the SequenceManager.
// conv_states: (num_linear_layers, conv_dim, d_conv) — per-channel rolling conv history

Copilot AI Mar 20, 2026


The comment describing conv_states shape doesn't match the actual allocation in SequenceManager (pooled_conv_states_ is sized as [max_batch_size, num_linear_layers, d_conv, conv_dim], so per-sequence it is [num_linear_layers, d_conv, conv_dim]). Please update the comment to reflect the correct dimension order to avoid misuse by future callers.

Suggested change
// conv_states: (num_linear_layers, conv_dim, d_conv) — per-channel rolling conv history
// conv_states: (num_linear_layers, d_conv, conv_dim) — per-channel rolling conv history

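To make the disputed layout concrete: per layer the pooled conv history is d_conv time steps of conv_dim channels, rolled forward by one step per decoded token. A minimal sketch assuming that (d_conv, conv_dim) layout; the names are illustrative and not the PR's API:

```python
# Hedged sketch of the per-layer rolling conv history in the
# (d_conv, conv_dim) layout the review comment argues for:
# d_conv rows (oldest first), each holding conv_dim channels.
def push_token(conv_state, channels):
    """Roll the (d_conv, conv_dim) history by one time step."""
    conv_state.pop(0)                 # drop the oldest step
    conv_state.append(list(channels)) # append the newest token's channels
    return conv_state
```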
@lzhangzz (Collaborator, Author)

@tuilakhanh

Failed to load the 27B model when running with --tp 2.
[TM][FATAL] models/llama/gated_delta_net_kernels.cu(1249): Check failed: conv_dim % ch_per_blk == 0

fixed in c83d2d7

@lingyezhixing

Cannot load Qwen3.5-27B-AWQ on a single V100-SXM2-32G card; it always runs into GPU memory overflow, even with --cache-max-entry-count 0.1 --session-len 2048.

[TM][WARNING] [TM] `max_context_token_num` is not set, default to 2048.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
2026-03-20 16:33:46,556 - lmdeploy - WARNING - turbomind.py:246 - get 1197 model params
[TM][ERROR] CUDA runtime error: out of memory D:\LLM\lmdeploy\src\turbomind\core\allocator.cc:49

@lzhangzz (Collaborator, Author)

lzhangzz commented Mar 20, 2026

@lingyezhixing

Cannot load Qwen3.5-27B-AWQ on a single V100-SXM2-32G card; it always runs into GPU memory overflow, even with --cache-max-entry-count 0.1 --session-len 2048.

Try reducing --max-batch-size; currently, all linear states for the max batch size are allocated at once. --log-level INFO will print the memory usage of the linear states and KV cache.
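The per-slot cost can be estimated from the model shapes. The sketch below assumes fp16 states and uses the Qwen3.5-27B dimensions that appear in the INFO log posted later in this thread (which reports conv 3.75 MB + recurrent 72 MB per slot, multiplied by --max-batch-size):

```python
# Back-of-envelope linear-state memory per batch slot, assuming fp16
# (2 bytes/element). Dimensions are Qwen3.5-27B's, taken from the
# GatedDeltaNetLayer log line: num_k=16 num_v=48 conv_dim=10240 d_conv=4.
num_linear_layers = 48
conv_dim, d_conv = 10240, 4                          # rolling conv history
num_v_heads, head_k_dim, head_v_dim = 48, 128, 128   # recurrent state S

bytes_per_elem = 2  # fp16
conv_mb = num_linear_layers * d_conv * conv_dim * bytes_per_elem / 2**20
recur_mb = (num_linear_layers * num_v_heads * head_k_dim * head_v_dim
            * bytes_per_elem / 2**20)
print(f"per slot: conv {conv_mb:.2f} MB + recurrent {recur_mb:.2f} MB")
```

Multiplying the ~75.75 MB per-slot total by a large --max-batch-size explains the OOM on a 32 GB card once weights and KV cache are also resident.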

@tuilakhanh (Contributor)

@tuilakhanh

Failed to load the 27B model when running with --tp 2.
[TM][FATAL] models/llama/gated_delta_net_kernels.cu(1249): Check failed: conv_dim % ch_per_blk == 0

fixed in c83d2d7

Fixed. Successfully ran 35B-A3B, 122B-A10B-AWQ, and 27B on a V100. Performance is also much improved compared with the current master.

@lvhan028 (Collaborator)

Cannot load Qwen3.5-27B-AWQ on a single V100-SXM2-32G card, it always runs into GPU memory overflow, even with --cache-max-entry-count 0.1 --session-len 2048.


You may try the following command:

 lmdeploy serve api_server QuantTrio/Qwen3.5-27B-AWQ --tp 1 --log-level INFO --backend turbomind --max-batch-size 1 --max-prefill-token-num 2048 --cache-max-entry-count 0.75 --session-len 64000

@lingyezhixing

Confirmed that it can run, but there is no speed improvement compared to the current main branch. Could this be a Windows-specific issue?

@echo off
set CUDA_VISIBLE_DEVICES=1
chcp 65001

conda activate lmdeploy && lmdeploy serve api_server E:\models\LLM\Qwen3.5-27B-AWQ --tp 1 --log-level INFO --server-name 0.0.0.0 --server-port 8080 --model-name Qwen3.5-27B --backend turbomind --max-batch-size 1 --max-prefill-token-num 2048 --cache-max-entry-count 0.7 --session-len 40960 --tool-call-parser qwen3coder --reasoning-parser qwen-qwq
Log
Active code page: 65001
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin, please note cuda version should >= 11.3 when compiled with cuda 11
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type qwen3_5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
2026-03-20 18:29:03,469 - lmdeploy - INFO - async_engine.py:105 - input backend=turbomind, backend_config=TurbomindEngineConfig(dtype='auto', model_format=None, tp=1, dp=1, cp=1, device_num=None, attn_tp_size=None, attn_cp_size=None, attn_dp_size=None, mlp_tp_size=None, mlp_dp_size=None, outer_dp_size=None, nnodes=1, node_rank=0, dist_init_addr=None, devices=None, session_len=40960, max_batch_size=1, cache_max_entry_count=0.7, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=2048, num_tokens_per_iter=0, max_prefill_iters=1, async_=1, empty_init=False, communicator='nccl', hf_overrides=None, enable_metrics=True)
2026-03-20 18:29:03,469 - lmdeploy - INFO - async_engine.py:106 - speculative_config=None
`torch_dtype` is deprecated! Use `dtype` instead!
2026-03-20 18:29:05,473 - lmdeploy - INFO - turbomind.py:264 - turbomind model config:

{
  "model_config": {
    "model_name": "",
    "chat_template": "",
    "model_arch": "Qwen3_5ForConditionalGeneration",
    "head_num": 24,
    "kv_head_num": 4,
    "hidden_units": 5120,
    "vocab_size": 248320,
    "embedding_size": 248320,
    "num_layer": 64,
    "inter_size": [
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408,
      17408
    ],
    "norm_eps": 1e-06,
    "attn_bias": 0,
    "mlp_bias": false,
    "window_size": [],
    "attn_sink": false,
    "qk_norm": true,
    "size_per_head": 256,
    "group_size": 128,
    "data_type": "float16",
    "weight_type": "float16",
    "expert_weight_type": "int4",
    "ffn_weight_type": "int4",
    "session_len": 40960,
    "attn_tp_size": 1,
    "attn_cp_size": 1,
    "mlp_tp_size": 1,
    "model_format": "awq",
    "expert_num": [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ],
    "expert_router_bias": false,
    "expert_inter_size": 0,
    "experts_per_token": 0,
    "activation_type": "",
    "moe_shared_gate": true,
    "norm_topk_prob": true,
    "routed_scale": 1.0,
    "topk_group": 1,
    "topk_method": "greedy",
    "moe_group_num": 1,
    "scoring_func": "softmax",
    "router_n_groups": -1,
    "q_lora_rank": 0,
    "kv_lora_rank": 0,
    "qk_rope_dim": 0,
    "v_head_dim": 0,
    "layer_types": [
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention",
      "linear_attention",
      "linear_attention",
      "linear_attention",
      "full_attention"
    ],
    "linear_key_head_dim": 128,
    "linear_value_head_dim": 128,
    "linear_conv_kernel_dim": 4,
    "linear_num_key_heads": 16,
    "linear_num_value_heads": 48,
    "attn_output_gate": true,
    "unquantized_expert_layers": [
      0
    ],
    "tune_layer_num": 1
  },
  "attention_config": {
    "softmax_scale": 0.0,
    "cache_block_seq_len": 64,
    "use_logn_attn": 0,
    "max_position_embeddings": 262144,
    "rope_param": {
      "type": "mrope",
      "base": 10000000.0,
      "dim": 64,
      "factor": 1.0,
      "max_position_embeddings": null,
      "attention_factor": 1.0,
      "beta_fast": 32,
      "beta_slow": 1,
      "low_freq_factor": null,
      "high_freq_factor": null,
      "original_max_position_embeddings": null,
      "mrope_section": [
        11,
        11,
        10
      ]
    }
  },
  "lora_config": {
    "lora_policy": "",
    "lora_r": 0,
    "lora_scale": 0.0,
    "lora_max_wo_r": 0,
    "lora_rank_pattern": "",
    "lora_scale_pattern": ""
  },
  "engine_config": {
    "dtype": "auto",
    "model_format": "awq",
    "tp": 1,
    "dp": 1,
    "cp": 1,
    "device_num": 1,
    "attn_tp_size": 1,
    "attn_cp_size": 1,
    "attn_dp_size": 1,
    "mlp_tp_size": 1,
    "mlp_dp_size": 1,
    "outer_dp_size": 1,
    "nnodes": 1,
    "node_rank": 0,
    "dist_init_addr": null,
    "devices": [
      0
    ],
    "session_len": 40960,
    "max_batch_size": 1,
    "cache_max_entry_count": 0.7,
    "cache_chunk_size": -1,
    "cache_block_seq_len": 64,
    "enable_prefix_caching": false,
    "quant_policy": 0,
    "rope_scaling_factor": 0.0,
    "use_logn_attn": false,
    "download_dir": null,
    "revision": null,
    "max_prefill_token_num": 2048,
    "num_tokens_per_iter": 0,
    "max_prefill_iters": 1,
    "async_": 1,
    "empty_init": false,
    "communicator": "nccl",
    "hf_overrides": null,
    "enable_metrics": true
  }
}
[TM][WARNING] [TM] `max_context_token_num` is not set, default to 40960.
2026-03-20 18:29:06,193 - lmdeploy - WARNING - turbomind.py:246 - get 1197 model params
[TM][INFO] GatedDeltaNetLayer: num_k=16 num_v=48 k_dim=2048 v_dim=6144 conv_dim=10240 d_conv=4 num_linear_layers=48
[TM][INFO] [SeqMgr] linear-state slot pool initialized: 1 slots
[TM][INFO] [SeqMgr] linear-state per slot: conv 3.75 MB + recurrent 72.00 MB = 75.75 MB
[TM][INFO] [SeqMgr] linear-state combined total: 75.75 MB
[TM][INFO] [SeqMgr] Adjusting block_count: free_before 3062.31 MB, linear 75.75 MB, target 2143.62 MB
[TM][INFO] [SeqMgr] Adjusted block_count to 517
[TM][INFO] [BlockManager] block_size = 4.000 MB
[TM][INFO] [BlockManager] max_block_count = 516
[TM][INFO] [BlockManager] chunk_size = 516
[TM][WARNING] [SegMgr] prefix caching is disabled
[TM][INFO] max cached tokens: 33024
[TM][WARNING] `session_len` truncated to 33024 due to limited KV cache memory
[TM][INFO] set threshold 1 -> 1
[TM][INFO] [Engine] Warm-up lengths: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 2049
[TM][INFO] [WarmUp] 8
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 16
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 32
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 48
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 64
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 96
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 128
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 192
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 256
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 384
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 512
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 768
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 1024
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 1536
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 2048
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] 2049
[TM][INFO] [SeqMgr][Create] ID 0
[TM][INFO] [WarmUp] Warm-up finished in 0.46 seconds.
[TM][INFO] set threshold 1 -> 1
2026-03-20 18:29:40,666 - lmdeploy - INFO - async_engine.py:133 - updated backend_config=TurbomindEngineConfig(dtype='auto', model_format='awq', tp=1, dp=1, cp=1, device_num=1, attn_tp_size=1, attn_cp_size=1, attn_dp_size=1, mlp_tp_size=1, mlp_dp_size=1, outer_dp_size=1, nnodes=1, node_rank=0, dist_init_addr=None, devices=[0], session_len=40960, max_batch_size=1, cache_max_entry_count=0.7, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=2048, num_tokens_per_iter=0, max_prefill_iters=1, async_=1, empty_init=False, communicator='nccl', hf_overrides=None, enable_metrics=True)
2026-03-20 18:29:41,065 - lmdeploy - INFO - async_engine.py:185 - enable metrics, with dp: 1 dp_rank: 0
HINT:    Please open http://0.0.0.0:8080 in a browser for detailed api usage!!!
INFO:     Started server process [28312]
INFO:     Waiting for application startup.
2026-03-20 18:29:41,110 - lmdeploy - INFO - metrics_processor.py:31 - Metrics handler task started.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
2026-03-20 18:29:45,772 - lmdeploy - INFO - session_manager.py:208 - [SessionManager] session 1 not found. Creating...
INFO:     192.168.50.10:34414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2026-03-20 18:29:45,783 - lmdeploy - INFO - logger.py:45 - session=Session(id=1, step=0), adapter_name=None, input_tokens=1438, gen_config=GenerationConfig(n=1, max_new_tokens=None, do_sample=True, top_p=1.0, top_k=40, min_p=0.0, temperature=1.0, repetition_penalty=1.0, ignore_eos=False, ...), prompt=...
148192, 97235, 104062, 3709, 129555, 96905, 112209, 114503, 98290, 97887, 140092, 109185, 99445, 1710, 198, 17, 13, 2972, 103823, 109212, 101928, 332, 4960, 109842, 113149, 109454, 3709, 116956, 105491, 96402, 100089, 5205, 101878, 96348, 98874, 99445, 1710, 198, 18, 13, 2972, 97816, 96280, 100552, 332, 4960, 96604, 98647, 96905, 112209, 3709, 99575, 96402, 97034, 98287, 100552, 98026, 96280, 24178, 99547, 103823, 96373, 100089, 111653, 104204, 148192, 107305, 3709, 110237, 95762, 98290, 109454, 103864, 98287, 100089, 1710, 271, 2, 220, 99445, 96402, 97816, 104698, 271, 13962, 220, 16, 13, 220, 100089, 96205, 318, 21003, 8, 198, 12, 2972, 99547, 110689, 110852, 332, 4960, 129555, 97237, 98290, 108632, 2005, 98151, 828, 88786, 98287, 96933, 100089, 828, 88786, 97239, 100089, 828, 96024, 109454, 95865, 3709, 97558, 109185, 100089, 99445, 1710, 271, 13962, 220, 17, 13, 220, 97034, 98287, 95999, 97075, 318, 5793, 7304, 8, 198, 12, 2972, 105916, 96280, 96341, 115843, 332, 4960, 96129, 96442, 100552, 96189, 105916, 95911, 96280, 9616, 96031, 97899, 97280, 5205, 105119, 96848, 5205, 101901, 97905, 96024, 7313, 95865, 3709, 332, 97240, 332, 96218, 109185, 96341, 99445, 100552, 99927, 101901, 96341, 96348, 130863, 97622, 96341, 3709, 123118, 9616, 96031, 120891, 96280, 7313, 121152, 98287, 101834, 95789, 1710, 198, 12, 2972, 100915, 99581, 97075, 332, 4960, 109185, 1510, 1773, 25130, 63, 220, 100552, 142093, 95946, 3709, 332, 103823, 332, 96725, 108965, 125328, 123058, 1710, 97240, 109289, 96019, 108598, 127883, 97989, 3709, 109185, 1510, 9353, 2809, 63, 220, 100552, 109007, 99492, 96621, 109717, 96335, 124603, 1710, 271, 13962, 220, 18, 13, 220, 103711, 95999, 98874, 97792, 318, 2010, 29913, 8, 198, 12, 2972, 110248, 104983, 332, 4960, 95772, 101776, 103711, 96040, 5205, 96335, 122971, 96348, 99975, 101842, 97792, 95865, 3709, 97240, 109185, 1510, 9951, 4000, 63, 220, 99445, 95772, 12654, 220, 111937, 96335, 104983, 3709, 112035, 97989, 99547, 110248, 9616, 97120, 4960, 
113914, 12654, 220, 97427, 97247, 71748, 271, 13962, 220, 19, 13, 220, 101878, 103910, 95999, 98428, 318, 1841, 27212, 8, 198, 12, 2972, 101878, 103910, 27718, 18779, 4794, 63, 31230, 4960, 198, 220, 471, 220, 99162, 96919, 97240, 112782, 103926, 96111, 97447, 101751, 97502, 3709, 145290, 101617, 5205, 102137, 96128, 96588, 100553, 9616, 96647, 96237, 132896, 99172, 98607, 71748, 198, 220, 471, 220, 125383, 129933, 3709, 95753, 95804, 228, 96238, 110204, 108674, 24178, 120306, 99975, 100653, 3709, 98445, 99172, 96019, 100120, 109446, 100553, 1710, 198, 220, 471, 220, 97273, 134710, 3709, 95979, 109054, 96719, 95865, 101908, 96402, 99986, 1710, 198, 12, 2972, 101878, 98428, 27718, 3468, 4794, 63, 31230, 4960, 198, 220, 471, 220, 97240, 110076, 101804, 95852, 99172, 332, 96442, 103723, 104450, 108813, 101625, 95882, 96206, 98223, 221794, 198, 220, 471, 220, 112787, 99162, 96919, 95789, 99989, 332, 2005, 97559, 101878, 109571, 96983, 96985, 97898, 102069, 173464, 1710, 248046, 198, 248045, 846, 198, 109266, 248046, 198, 248045, 74455, 198, 248068, 198]
2026-03-20 18:29:45,784 - lmdeploy - INFO - async_engine.py:382 - session=1, history_tokens=0, input_tokens=1438, max_new_tokens=39522, seq_start=True, seq_end=True, step=0, prep=True
2026-03-20 18:29:45,784 - lmdeploy - INFO - turbomind.py:687 - [async_stream_infer] session 1 start
[TM][INFO] [SeqMgr][Create] ID 1
[TM][WARNING] [ProcessInferRequests] [1] total sequence length (1438 + 39522) exceeds `session_len` (33024), `max_new_tokens` is truncated to 31586
[2026-03-20 18:29:51 DP0] Avg thr (in/out): 143.2 / 2.6 tokens/s, API server (completed/routed/waiting): 0 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 4.5%,
[2026-03-20 18:30:01 DP0] Avg thr (in/out): 0.0 / 6.4 tokens/s, API server (completed/routed/waiting): 0 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 4.7%,
2026-03-20 18:30:02,404 - lmdeploy - INFO - turbomind.py:797 - [async_stream_infer] session 1 done
2026-03-20 18:30:02,404 - lmdeploy - INFO - async_engine.py:505 - session 1 finished, reason "stop", input_tokens 1438, output_tokens 98
INFO:     192.168.50.10:56832 - "POST /v1/chat/completions HTTP/1.1" 404 Not Found
INFO:     192.168.50.10:56834 - "POST /v1/chat/completions HTTP/1.1" 404 Not Found
[2026-03-20 18:30:11 DP0] Avg thr (in/out): 0.0 / 0.9 tokens/s, API server (completed/routed/waiting): 1 / 0 / 0, Engine (running/waiting): 1 / 0, KV cache: 4.8%,
2026-03-20 18:30:53,392 - lmdeploy - INFO - session_manager.py:208 - [SessionManager] session 2 not found. Creating...
INFO:     192.168.50.10:55218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2026-03-20 18:30:53,396 - lmdeploy - INFO - logger.py:45 - session=Session(id=2, step=0), adapter_name=None, input_tokens=1497, gen_config=GenerationConfig(n=1, max_new_tokens=None, do_sample=True, top_p=1.0, top_k=40, min_p=0.0, temperature=1.0, repetition_penalty=1.0, ignore_eos=False, random_seed=None, stop_words=None, bad_words=None, stop_token_ids=None, bad_token_ids=None, min_new_tokens=None, skip_special_tokens=False, spaces_between_special_tokens=True, logprobs=None, response_format=None, logits_processors=None, output_logits=None, output_last_hidden_state=None, include_stop_str_in_output=False, with_cache=False, preserve_cache=False, migration_request=None, return_routed_experts=False, repetition_ngram_size=0, repetition_ngram_threshold=0), prompt='<|im_start|>system\n# Tools\n\nYou have access to the following functions:\n\n<tools>\n{"description": "Get the current Unix timestamp in seconds.", "name": "get_current_timestamp", "parameters": {"properties": {}, "type": "object"}}\n{"description": "Get the current Unix timestamp, optionally adjusted by days, weeks, months, or years.\\nUse this to calculate timestamps for date filtering in search functions.\\nExamples: \\"last week\\" = weeks_ago=1, \\"3 days ago\\" = days_ago=3, \\"a year ago\\" = years_ago=1", "name": "calculate_timestamp", "parameters": {"properties": {"days_ago": {"default": 0, "description": "Number of days to subtract from current time (default: 0)", "type": "integer"}, "weeks_ago": {"default": 0, "description": "Number of weeks to subtract from current time (default: 0)", "type": "integer"}, "months_ago": {"default": 0, "description": "Number of months to subtract from current time (default: 0)", "type": "integer"}, "years_ago": {"default": 0, "description": "Number of years to subtract from current time (default: 0)", "type": "integer"}}, "type": "object"}}\n{"description": "Search the user\'s notes by title and content.", "name": "search_notes", "parameters": {"properties": {"query": 
{"description": "The search query to find matching notes", "type": "string"}, "count": {"default": 5, "description": "Maximum number of results to return (default: 5)", "type": "integer"}, "start_timestamp": {"description": "Only include notes updated after this Unix timestamp (seconds)", "type": "integer"}, "end_timestamp": {"description": "Only include notes updated before this Unix timestamp (seconds)", "type": "integer"}}, "required": ["query"], "type": "object"}}\n{"description": "Get the full content of a note by its ID.", "name": "view_note", "parameters": {"properties": {"note_id": {"description": "The ID of the note to retrieve", "type": "string"}}, "required": ["note_id"], "type": "object"}}\n{"description": "Create a new note with the given title and content.", "name": "write_note", "parameters": {"properties": {"title": {"description": "The title of the new note", "type": "string"}, "content": {"description": "The markdown content for the note", "type": "string"}}, "required": ["title", "content"], "type": "object"}}\n{"description": "Update the content of a note. 
Use this to modify task lists, add notes, or update content.", "name": "replace_note_content", "parameters": {"properties": {"note_id": {"description": "The ID of the note to update", "type": "string"}, "content": {"description": "The new markdown content for the note", "type": "string"}, "title": {"description": "Optional new title for the note", "type": "string"}}, "required": ["note_id", "content"], "type": "object"}}\n</tools>\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>\n\n# 角色与目标\n你是一个具备多种工具调用能力的智能助手。你必须严格遵守以下工具使用规范。\n\n# 工具缺失(权限问题)\n部分工具默认关闭。如果评估任务需要使用某个工具但在列表中找不到它,你需要明确向用户说明需要开启哪个工具的权限,待用户回复开放后,重新检查列表并继续执行任务。\n\n# 核心基本原则\n1. **优先内部知识**:一般情况下,优先使用你自身的内部知识库完成对话,仅在知识缺失或有明确需求时才调用工具。\n2. **禁止过度响应**:如果没有明确的指令,绝不擅自使用笔记、图像或代码工具。\n3. **规范信息获取**:如果自身知识缺失,优先使用网络搜索获取公开信息;绝对禁止把笔记当作外部知识库进行搜索,除非有明确指令指定搜索笔记。\n\n# 工具使用规范指引\n\n### 1. 笔记管理 (Notes)\n- **绝对被动触发**:仅在用户明确发出“记录”、“搜索我的笔记”、“查看笔记”等指令时,才能调用笔记工具。\n\n### 2. 网络搜索与阅读 (Web Search)\n- **时效信息时间前置**:当需要获取强时效性信息(如最新新闻、前沿科技、实时政治等)时,**必须**先调用时间工具获取当前实时时间或换算目标时间,并将其(如年月信息)加入到搜索关键词中。\n- **强制深度阅读**:调用 `search_web` 获取搜索结果后,**禁止**仅依靠简述回答问题。必须挑选最相关的几条结果,调用 `fetch_url` 获取网页完整内容后再进行作答。\n\n### 3. 
数学与代码计算 (Code Execution)\n- **精确运算**:在解答数学题、进行数据处理或复杂逻辑计算时,必须调用 `execute_code` 工具在 Python 环境中进行运算,以确保结果绝对精确(注意:仅限 Python 标准库)。\n\n### 4. 图像生成与编辑 (Image Processing)\n- **图像生成 (`generate_image`)**:\n  - 提示词必须极具画面感且细节丰富,详细描述颜色、形状及重要元素(像给盲人描述一样)。\n  - 忠于上下文,不臆造不存在的信息;如遇复杂场景,集中描述最显著的核心元素。\n  - 支持中英文,无特定要求时默认使用中文。\n- **图像编辑 (`edit_image`)**:\n  - 必须极其精准地描述**需要修改的具体局部及其新样貌**。\n  - 必须在提示词中强调**“保持图像其余所有部分完全不变”**。<|im_end|>\n<|im_start|>user\n你好<|im_end|>\n<|im_start|>assistant\n你好!👋 很高兴见到你。\n\n有什么我可以帮助你的吗?无论是回答问题、协助写作、解答疑问,还是其他任何需求,随时告诉我!<|im_end|>\n<|im_start|>user\ngithubissue可展开markdown代码块怎么写?默认折叠,我要放日志<|im_end|>\n<|im_start|>assistant\n<think>\n', prompt_token_id=[248045, 8678, 198, 2, 13455, 271, 2523, 599, 2528, 310, 279, 2614, 5568, 25, 271, 27, 15449, 29, 198, 4754, 4532, 763, 328, 1882, 279, 1428, 45426, 11112, 303, 6283, 10152, 328, 591, 763, 328, 447, 10757, 22355, 487, 328, 13390, 763, 5046, 12811, 763, 15969, 328, 1267, 763, 328, 1640, 29958, 198, 4754, 4532, 763, 328, 1882, 279, 1428, 45426, 11112, 11, 44007, 22643, 539, 2756, 11, 5381, 11, 3818, 11, 466, 1578, 6889, 77, 9947, 411, 310, 10724, 47149, 364, 2321, 28701, 303, 2624, 5568, 6889, 77, 39044, 25, 7018, 4119, 1936, 2037, 283, 5381, 62, 6106, 28, 16, 11, 7018, 18, 2756, 3998, 2037, 283, 2756, 62, 6106, 28, 18, 11, 7018, 64, 1007, 3998, 2037, 283, 1578, 62, 6106, 28, 16, 487, 328, 591, 763, 328, 34420, 22355, 487, 328, 13390, 763, 5046, 12811, 763, 5046, 13382, 62, 6106, 763, 5046, 2186, 763, 220, 15, 11, 328, 4532, 763, 328, 2742, 314, 2756, 310, 31192, 494, 1428, 854, 318, 2186, 25, 220, 15, 11250, 328, 1267, 763, 328, 11326, 13933, 328, 78196, 62, 6106, 763, 5046, 2186, 763, 220, 15, 11, 328, 4532, 763, 328, 2742, 314, 5381, 310, 31192, 494, 1428, 854, 318, 2186, 25, 220, 15, 11250, 328, 1267, 763, 328, 11326, 13933, 328, 48041, 62, 6106, 763, 5046, 2186, 763, 220, 15, 11, 328, 4532, 763, 328, 2742, 314, 3818, 310, 31192, 494, 1428, 854, 318, 2186, 25, 220, 15, 11250, 328, 1267, 763, 328, 11326, 
13933, 328, 40342, 62, 6106, 763, 5046, 2186, 763, 220, 15, 11, 328, 4532, 763, 328, 2742, 314, 1578, 310, 31192, 494, 1428, 854, 318, 2186, 25, 220, 15, 11250, 328, 1267, 763, 328, 11326, 8934, 2069, 328, 1267, 763, 328, 1640, 29958, 198, 4754, 4532, 763, 328, 5708, 279, 1156, 579, 8129, 539, 2192, 321, 2144, 10152, 328, 591, 763, 328, 1773, 43632, 487, 328, 13390, 763, 5046, 12811, 763, 5046, 1574, 763, 5046, 4532, 763, 328, 760, 2624, 3134, 310, 1423, 12219, 8129, 487, 328, 1267, 763, 328, 889, 13933, 328, 1767, 763, 5046, 2186, 763, 220, 20, 11, 328, 4532, 763, 328, 26427, 1324, 314, 2961, 310, 460, 318, 2186, 25, 220, 20, 11250, 328, 1267, 763, 328, 11326, 13933, 328, 2388, 22355, 763, 5046, 4532, 763, 328, 7081, 2830, 8129, 5860, 1238, 411, 45426, 11112, 318, 16890, 11250, 328, 1267, 763, 328, 11326, 13933, 328, 400, 22355, 763, 5046, 4532, 763, 328, 7081, 2830, 8129, 5860, 1518, 411, 45426, 11112, 318, 16890, 11250, 328, 1267, 763, 328, 11326, 8934, 2069, 328, 6081, 763, 4241, 1574, 7664, 328, 1267, 763, 328, 1640, 29958, 198, 4754, 4532, 763, 328, 1882, 279, 2400, 2144, 314, 264, 5020, 539, 1141, 2937, 10152, 328, 591, 763, 328, 1015, 26331, 487, 328, 13390, 763, 5046, 12811, 763, 5046, 9679, 816, 763, 5046, 4532, 763, 328, 760, 2937, 314, 279, 5020, 310, 16672, 487, 328, 1267, 763, 328, 889, 8934, 2069, 328, 6081, 763, 4241, 9679, 816, 7664, 328, 1267, 763, 328, 1640, 29958, 198, 4754, 4532, 763, 328, 3886, 264, 491, 5020, 440, 279, 2574, 2192, 321, 2144, 10152, 328, 591, 763, 328, 4775, 26331, 487, 328, 13390, 763, 5046, 12811, 763, 5046, 2034, 763, 5046, 4532, 763, 328, 760, 2192, 314, 279, 491, 5020, 487, 328, 1267, 763, 328, 889, 13933, 328, 1733, 763, 5046, 4532, 763, 328, 760, 48794, 2144, 364, 279, 5020, 487, 328, 1267, 763, 328, 889, 8934, 2069, 328, 6081, 763, 4241, 2034, 487, 328, 1733, 7664, 328, 1267, 763, 328, 1640, 29958, 198, 4754, 4532, 763, 328, 4149, 279, 2144, 314, 264, 5020, 13, 5272, 411, 310, 5427, 3274, 11140, 11, 884, 8129, 11, 466, 
2560, 2144, 10152, 328, 591, 763, 328, 7899, 26331, 7260, 487, 328, 13390, 763, 5046, 12811, 763, 5046, 9679, 816, 763, 5046, 4532, 763, 328, 760, 2937, 314, 279, 5020, 310, 2560, 487, 328, 1267, 763, 328, 889, 13933, 328, 1733, 763, 5046, 4532, 763, 328, 760, 491, 48794, 2144, 364, 279, 5020, 487, 328, 1267, 763, 328, 889, 13933, 328, 2034, 763, 5046, 4532, 763, 328, 14863, 491, 2192, 364, 279, 5020, 487, 328, 1267, 763, 328, 889, 8934, 2069, 328, 6081, 763, 4241, 9679, 816, 487, 328, 1733, 7664, 328, 1267, 763, 328, 1640, 29958, 198, 510, 15449, 29, 271, 2592, 488, 4992, 310, 1562, 264, 709, 25835, 9559, 303, 279, 2614, 3443, 440, 5486, 19900, 25, 271, 248058, 198, 27, 1628, 28, 8422, 8901, 1224, 29, 198, 27, 15704, 28, 8422, 24109, 62, 16, 29, 198, 927, 62, 16, 198, 510, 15704, 29, 198, 27, 15704, 28, 8422, 24109, 62, 17, 29, 198, 1919, 369, 279, 869, 364, 279, 2018, 5555, 198, 8761, 628, 9111, 198, 34493, 4965, 198, 510, 15704, 29, 198, 510, 1628, 29, 198, 248059, 271, 27, 95328, 29, 198, 92065, 25, 198, 12, 5534, 6526, 26834, 1732, 279, 5024, 3443, 25, 449, 8906, 361, 1628, 28, 1076, 1419, 1628, 29, 2424, 1902, 381, 23283, 2785, 220, 248058, 248059, 11535, 9212, 198, 12, 12296, 4868, 26834, 381, 5024, 198, 12, 1394, 1189, 3300, 9801, 31626, 364, 678, 709, 1562, 303, 5629, 3992, 54588, 279, 709, 1562, 11, 694, 4045, 1238, 198, 12, 1368, 1017, 369, 874, 709, 1562, 2420, 11, 4087, 279, 3296, 1040, 4472, 440, 678, 1428, 6337, 321, 635, 524, 3184, 279, 1156, 883, 709, 6526, 198, 510, 95328, 29, 271, 2, 220, 100561, 95999, 97622, 198, 133653, 98897, 99163, 99445, 109185, 104426, 98409, 110257, 1710, 122117, 115334, 97155, 99445, 96402, 97816, 1710, 271, 2, 220, 99445, 112209, 9616, 104437, 96304, 7313, 198, 96985, 99445, 101908, 100663, 1710, 96604, 99973, 97995, 124699, 109300, 99445, 108750, 145211, 109604, 96540, 3709, 111083, 98290, 96127, 97237, 98419, 96442, 103750, 101857, 134496, 104437, 3709, 96739, 97237, 100535, 98865, 95946, 3709, 100544, 97552, 101831, 
96172, 97625, 97328, 97995, 1710, 271, 2, 220, 98144, 130303, 198, 16, 13, 2972, 99575, 98866, 96905, 332, 4960, 113228, 3709, 99575, 96402, 95933, 108920, 98866, 148192, 97235, 104062, 3709, 129555, 96905, 112209, 114503, 98290, 97887, 140092, 109185, 99445, 1710, 198, 17, 13, 2972, 103823, 109212, 101928, 332, 4960, 109842, 113149, 109454, 3709, 116956, 105491, 96402, 100089, 5205, 101878, 96348, 98874, 99445, 1710, 198, 18, 13, 2972, 97816, 96280, 100552, 332, 4960, 96604, 98647, 96905, 112209, 3709, 99575, 96402, 97034, 98287, 100552, 98026, 96280, 24178, 99547, 103823, 96373, 100089, 111653, 104204, 148192, 107305, 3709, 110237, 95762, 98290, 109454, 103864, 98287, 100089, 1710, 271, 2, 220, 99445, 96402, 97816, 104698, 271, 13962, 220, 16, 13, 220, 100089, 96205, 318, 21003, 8, 198, 12, 2972, 99547, 110689, 110852, 332, 4960, 129555, 97237, 98290, 108632, 2005, 98151, 828, 88786, 98287, 96933, 100089, 828, 88786, 97239, 100089, 828, 96024, 109454, 95865, 3709, 97558, 109185, 100089, 99445, 1710, 271, 13962, 220, 17, 13, 220, 97034, 98287, 95999, 97075, 318, 5793, 7304, 8, 198, 12, 2972, 105916, 96280, 96341, 115843, 332, 4960, 96129, 96442, 100552, 96189, 105916, 95911, 96280, 9616, 96031, 97899, 97280, 5205, 105119, 96848, 5205, 101901, 97905, 96024, 7313, 95865, 3709, 332, 97240, 332, 96218, 109185, 96341, 99445, 100552, 99927, 101901, 96341, 96348, 130863, 97622, 96341, 3709, 123118, 9616, 96031, 120891, 96280, 7313, 121152, 98287, 101834, 95789, 1710, 198, 12, 2972, 100915, 99581, 97075, 332, 4960, 109185, 1510, 1773, 25130, 63, 220, 100552, 142093, 95946, 3709, 332, 103823, 332, 96725, 108965, 125328, 123058, 1710, 97240, 109289, 96019, 108598, 127883, 97989, 3709, 109185, 1510, 9353, 2809, 63, 220, 100552, 109007, 99492, 96621, 109717, 96335, 124603, 1710, 271, 13962, 220, 18, 13, 220, 103711, 95999, 98874, 97792, 318, 2010, 29913, 8, 198, 12, 2972, 110248, 104983, 332, 4960, 95772, 101776, 103711, 96040, 5205, 96335, 122971, 96348, 99975, 101842, 
97792, 95865, 3709, 97240, 109185, 1510, 9951, 4000, 63, 220, 99445, 95772, 12654, 220, 111937, 96335, 104983, 3709, 112035, 97989, 99547, 110248, 9616, 97120, 4960, 113914, 12654, 220, 97427, 97247, 71748, 271, 13962, 220, 19, 13, 220, 101878, 103910, 95999, 98428, 318, 1841, 27212, 8, 198, 12, 2972, 101878, 103910, 27718, 18779, 4794, 63, 31230, 4960, 198, 220, 471, 220, 99162, 96919, 97240, 112782, 103926, 96111, 97447, 101751, 97502, 3709, 145290, 101617, 5205, 102137, 96128, 96588, 100553, 9616, 96647, 96237, 132896, 99172, 98607, 71748, 198, 220, 471, 220, 125383, 129933, 3709, 95753, 95804, 228, 96238, 110204, 108674, 24178, 120306, 99975, 100653, 3709, 98445, 99172, 96019, 100120, 109446, 100553, 1710, 198, 220, 471, 220, 97273, 134710, 3709, 95979, 109054, 96719, 95865, 101908, 96402, 99986, 1710, 198, 12, 2972, 101878, 98428, 27718, 3468, 4794, 63, 31230, 4960, 198, 220, 471, 220, 97240, 110076, 101804, 95852, 99172, 332, 96442, 103723, 104450, 108813, 101625, 95882, 96206, 98223, 221794, 198, 220, 471, 220, 112787, 99162, 96919, 95789, 99989, 332, 2005, 97559, 101878, 109571, 96983, 96985, 97898, 102069, 173464, 1710, 248046, 198, 248045, 846, 198, 109266, 248046, 198, 248045, 74455, 198, 109266, 6115, 9008, 239, 233, 220, 116577, 108888, 95933, 1710, 271, 98691, 95815, 101009, 97319, 98179, 10992, 108656, 123058, 5205, 101782, 108664, 5205, 101776, 100162, 3709, 96984, 96903, 97019, 97887, 3709, 100675, 110576, 6115, 248046, 198, 248045, 846, 198, 5039, 10835, 95824, 101614, 58046, 98874, 97047, 117668, 10992, 101908, 101052, 3709, 101660, 96186, 110509, 248046, 198, 248045, 74455, 198, 248068, 198]
2026-03-20 18:30:53,397 - lmdeploy - INFO - async_engine.py:382 - session=2, history_tokens=0, input_tokens=1497, max_new_tokens=39463, seq_start=True, seq_end=True, step=0, prep=True
2026-03-20 18:30:53,397 - lmdeploy - INFO - turbomind.py:687 - [async_stream_infer] session 2 start
[TM][INFO] [SeqMgr][Create] ID 2
[TM][WARNING] [ProcessInferRequests] [2] total sequence length (1497 + 39463) exceeds `session_len` (33024), `max_new_tokens` is truncated to 31527
[2026-03-20 18:31:01 DP0] Avg thr (in/out): 149.8 / 4.2 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 4.8%,
[2026-03-20 18:31:11 DP0] Avg thr (in/out): 0.0 / 6.4 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 5.0%,
[2026-03-20 18:31:21 DP0] Avg thr (in/out): 0.0 / 6.3 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 5.2%,
[2026-03-20 18:31:31 DP0] Avg thr (in/out): 0.0 / 6.4 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 5.4%,
[2026-03-20 18:31:41 DP0] Avg thr (in/out): 0.0 / 6.4 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 5.6%,
[2026-03-20 18:31:51 DP0] Avg thr (in/out): 0.0 / 6.4 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 5.8%,
[2026-03-20 18:32:01 DP0] Avg thr (in/out): 0.0 / 6.4 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 6.0%,
[2026-03-20 18:32:11 DP0] Avg thr (in/out): 0.0 / 6.4 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 6.2%,
[2026-03-20 18:32:21 DP0] Avg thr (in/out): 0.0 / 6.4 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 6.4%,
[2026-03-20 18:32:31 DP0] Avg thr (in/out): 0.0 / 6.4 tokens/s, API server (completed/routed/waiting): 1 / 1 / 0, Engine (running/waiting): 1 / 0, KV cache: 6.6%,

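For reference, the `max_new_tokens` truncation in the `[ProcessInferRequests]` warnings above is simple arithmetic: when history + input + requested new tokens would exceed `session_len`, the request's `max_new_tokens` is shrunk so the total fits. A minimal sketch (the function name and signature are illustrative, not TurboMind's actual code):

```python
# Hedged sketch, not TurboMind source: reproduces the truncation
# reported by the [ProcessInferRequests] warnings in the log above.
def clamp_max_new_tokens(history_len: int, input_len: int,
                         max_new_tokens: int, session_len: int) -> int:
    """Shrink max_new_tokens so history + input + output fits in session_len."""
    if history_len + input_len + max_new_tokens > session_len:
        max_new_tokens = max(session_len - history_len - input_len, 0)
    return max_new_tokens

# Matches the two warnings in the log:
print(clamp_max_new_tokens(0, 1438, 39522, 33024))  # 31586
print(clamp_max_new_tokens(0, 1497, 39463, 33024))  # 31527
```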
@tuilakhanh
Contributor

> Confirmed that it can run, but there's no speed improvement compared to the current main branch. Could this be a specific issue on the Windows platform?

Try the MoE model; in my testing, the dense model showed little improvement.

@lingyezhixing

> Confirmed that it can run, but there's no speed improvement compared to the current main branch. Could this be a specific issue on the Windows platform?
>
> Try the MoE model; in my testing, the dense model showed little improvement.

Indeed, the MoE model achieves a significant improvement.

lapy pushed a commit to lapy/lmdeploy that referenced this pull request Mar 20, 2026
@lvhan028 lvhan028 merged commit 764f35a into InternLM:main Mar 21, 2026
9 checks passed
