fix: implement lazy transpose to resolve slow Tensor::transpose #436
Closed
Aakash-Naidu wants to merge 2339 commits into paiml:main from
Conversation
…mlGH-279) Bartowski Qwen3 GGUFs set general.architecture="qwen2" but the model is actually Qwen3. The completeness gate then requires attn_q.bias (Qwen2-specific), which Qwen3 doesn't have, causing 8/38 test failures.
New verify_architecture_from_tensor_evidence() uses provable signals:
- attn_q_norm.weight present → Qwen3 (logical implication, not heuristic)
- attn_q.bias present → Qwen2 or Phi (never in Qwen3/LLaMA)
Called from both GGUF and SafeTensors import paths before name mapping. Also fixes infer_architecture_from_names() to detect Qwen3 via QK norm. 9 new Popperian tests (7 unit + 1 infer_arch + 1 contract integration).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
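A minimal sketch of the evidence-based check described above (function and architecture names are illustrative, not the actual importer code; tensor suffixes follow GGUF conventions):

```rust
// Hypothetical sketch: infer architecture from which tensors exist,
// rather than trusting the declared general.architecture field.
fn detect_architecture(tensor_names: &[&str]) -> Option<&'static str> {
    let has = |suffix: &str| tensor_names.iter().any(|n| n.ends_with(suffix));
    if has("attn_q_norm.weight") {
        // QK norm tensors exist only in Qwen3-family models.
        Some("qwen3")
    } else if has("attn_q.bias") {
        // Q/K/V biases appear in Qwen2 and Phi, never in Qwen3/LLaMA.
        Some("qwen2")
    } else {
        None // insufficient evidence; fall back to declared metadata
    }
}
```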
realizar 0.8.0 on crates.io lacks BF16/F16 embedding support (commit 573b64d). Path patch ensures apr-cli builds with the local realizar containing the fix until 0.8.1 is published. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…H-187) TensorDType::BF16 was encoded as byte 2, but realizar's APR loader uses GgmlQuantType::from_id() where byte 2 = Q4_0 (GGML standard). This caused BF16 embedding data to be misinterpreted as Q4_0 quantized blocks, failing with "Q4_0 data length is not a multiple of block size 18". Changed BF16 from 2 to 30 to match GGML standard (GgmlQuantType::BF16=30). No backward compat concern: BF16 passthrough was added recently and all existing APR files with BF16=2 produce garbage output anyway. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
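The collision can be seen by laying out the dtype bytes against the GGML quant-type table (enum name here is illustrative, not realizar's; the ID values are the GGML standard ones):

```rust
// Sketch of the dtype-ID alignment: APR's dtype byte must land on the
// same value realizar's GgmlQuantType::from_id() expects.
#[allow(non_camel_case_types)]
#[repr(u8)]
enum GgmlTypeId {
    F32 = 0,
    F16 = 1,
    Q4_0 = 2,  // byte 2 means Q4_0 in GGML — the old BF16=2 collided here
    BF16 = 30, // GGML's BF16 id; APR now writes this byte for BF16 tensors
}
```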
…) (Refs paimlGH-356) download_file() mapped all non-401 HTTP errors to CliError::NetworkError, making it impossible for callers to distinguish "file not found" (404) from actual network failures. This caused optional companion file downloads (tokenizer.json, tokenizer.model) to report misleading errors instead of cleanly skipping missing files.
Changes:
- Add CliError::HttpNotFound variant with exit code 11
- Handle ureq::Error::Status(404, _) in download_file()
- Update download_companion_files() to match HttpNotFound specifically, printing "not found in repo" for optional files and a clear 404 message for required files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
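An illustrative shape of the change, assuming ureq 2.x (the enum and message text are taken from the commit message; the surrounding apr-cli code will differ):

```rust
// Sketch: map HTTP 404 to a dedicated variant so callers can skip
// optional companion files instead of reporting a network failure.
enum CliError {
    NetworkError(String),
    HttpNotFound(String), // maps to exit code 11
}

fn classify(err: ureq::Error) -> CliError {
    match err {
        ureq::Error::Status(404, _) => CliError::HttpNotFound("not found in repo".into()),
        other => CliError::NetworkError(other.to_string()),
    }
}
```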
…on rounds (Refs PMAT-342) Major update to apr-checkpoints.md:
- §7 rewritten: full command-checkpoint interaction matrix covering all 51+ apr CLI subcommands across 12 categories (was only 4 tools)
- §13 synced with apr-checkpoint-v1.yaml (F-CKPT-001..018)
- §17 falsification log: 60 issues across 4 rounds
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…CKPT-016) Adds filtered reader that skips __training__.* tensors at load time, enabling inference readers to load .ckpt.apr files without pulling optimizer state into memory. Refs PMAT-342 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
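A minimal sketch of the filter, assuming training state is namespaced by a `__training__.` prefix as the commit message indicates (the reader API name is hypothetical):

```rust
// Skip optimizer/training tensors at load time so inference readers
// never materialize them in memory.
const TRAINING_PREFIX: &str = "__training__.";

fn inference_tensors<'a>(names: &'a [String]) -> impl Iterator<Item = &'a String> {
    names.iter().filter(|n| !n.starts_with(TRAINING_PREFIX))
}
```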
Refs PMAT-342 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implemented: F-CKPT-001, 002, 004, 005, 006, 007, 010, 011, 015, 016, 017
Deferred: F-CKPT-003, 008, 009, 013, 014, 018
N/A: F-CKPT-012 (SafeTensors-specific)
Refs PMAT-342 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- F-CKPT-009: AprWriter::write() uses tmp+fsync+rename for crash safety
- F-CKPT-013: AprReader::read_tensor_f32_checked() rejects NaN/Inf
- F-CKPT-014: AprReader::validate_tensor_shape() checks element counts
Refs PMAT-342 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
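For reference, the tmp+fsync+rename pattern behind F-CKPT-009 looks roughly like this (a standalone sketch, not AprWriter itself; the `.tmp` naming is illustrative). A crash at any point leaves either the old file or the new file on disk, never a torn one:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

fn atomic_write(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?;           // fsync: data durable before the rename
    fs::rename(&tmp, path)?; // atomic replace on POSIX filesystems
    Ok(())
}
```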
10 new FALSIFY tests covering:
- F-CKPT-009: atomic write (no orphan .tmp)
- F-CKPT-011: CRC32 corruption rejection
- F-CKPT-013: NaN/Inf detection (3 tests)
- F-CKPT-014: shape validation (2 tests)
- F-CKPT-015: canonical ordering
- F-CKPT-016: training tensor filtering
- F-CKPT-018: write→read→write bit-identical round-trip
Refs PMAT-342 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mlGH-366) Parse tokenizer.model (protobuf) as fallback when tokenizer.json is absent. Extract vocabulary + scores and embed in APR metadata for Viterbi encoding. Also fixes BF16 dtype test (BF16=30 from prior paimlGH-187 fix). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark all 6 previously-deferred contracts as Done:
- F-CKPT-003: No training state in adapter
- F-CKPT-008: Write shape validation
- F-CKPT-009: Atomic writes (tmp+fsync+rename)
- F-CKPT-013: Post-load NaN scan
- F-CKPT-014: Shape-config validation
- F-CKPT-018: APR round-trip bit-identical
Add dogfood example exercising all 10+ contract checks in a single lifecycle: create → full read → filtered read → adapter-only → round-trip verification.
Refs paimlGH-334 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s (Refs GGUF-SERVE-001) apr serve produced garbage output (character-level fallback) when serving GGUF-imported .apr files because it only looked for a sibling tokenizer.json file. GGUF imports embed the tokenizer in APR metadata instead. Now tries model.load_embedded_bpe_tokenizer() first, falls back to sibling tokenizer.json, then character-level as last resort. Also fixes all clippy warnings in modified modules (data, diagnose, train, serve, finetune) and adds missing oversample parameter to finetune tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ining Implements plan/apply pattern for tokenizer training:
- `apr tokenize plan` validates corpus, estimates time, outputs plan (text/JSON)
- `apr tokenize apply` trains BPE/WordPiece/Unigram tokenizer, writes vocab.json + merges.txt
Supports all three aprender tokenizer algorithms (bpe, wordpiece, unigram).
Refs ALB-001 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…aimlGH-205) The paimlGH-205 F16 passthrough unconditionally returned raw bytes before the quantization dispatch could run, making --quantize a no-op for any SafeTensors model with F16/BF16 tensors (i.e., all of them). Now the passthrough only fires when quantize is None or Fp16. For Int8/Int4/Q4K, the raw F16 bytes are dequantized to F32 on the fly and fed into dispatch_quantize (extracted helper to avoid duplicated match logic). Also extends the F16 test factory to support attention, norm, MLP, and bias tensors for complete Qwen2 architecture coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --task code for code completion benchmarks (JSONL with prompt/test/canonical_solution). Reports pass@1 rate with per-problem breakdown. Add --task plan for dry-run validation (model + data exist). Phase 1: validates benchmark structure, checks canonical solutions. Phase 2 (requires inference feature): full model inference + test exec. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
) dispatch_quantize called quantize_q4_k (flat 1D) instead of quantize_q4_k_matrix (row-aware with 256-element boundary padding). When row width is not a multiple of 256, flat quantization produces blocks that span row boundaries, corrupting weights at inference. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
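To see why flat blocking corrupts weights, here is a small arithmetic sketch (illustrative only, not the quantizer itself): Q4_K blocks hold 256 elements, so row-aware quantization pads each row up to a block boundary, whereas flat 1D blocking lets a block straddle two rows and mix their scales.

```rust
// Compare block counts: flat blocking vs. row-aware blocking.
fn q4k_block_count(rows: usize, cols: usize) -> (usize, usize) {
    const QK_K: usize = 256;
    let flat = (rows * cols).div_ceil(QK_K);    // blocks may span row boundaries
    let row_aware = rows * cols.div_ceil(QK_K); // each row padded to 256
    (flat, row_aware)
}
// e.g. rows=4, cols=300: flat = 5 blocks (rows interleave inside blocks),
// row-aware = 8 blocks (one padded boundary per row, no cross-row mixing).
```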
Every apr action command now exposes plan mode for dry-run validation:
- merge --plan: validates inputs, strategy, weights without executing
- export --plan: validates model file, format, output without exporting
- publish --plan: alias for existing --dry-run
- Pre-dispatch contract validation skipped in plan mode
Refs ALB-023 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fs ALB-028) New `apr pipeline` subcommand for ML pipeline orchestration:
- `plan`: validate manifest and show execution DAG with estimates
- `apply`: converge resources via forjar engine
- `status`: show converged/pending/failed state
- `validate`: check manifest without connecting to machines
Shells out to forjar binary, keeping sovereign stack tools decoupled. Follows train/tokenize plan/apply subcommand pattern.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dispatches to entrenar's InstructPipeline when --task instruct is used. Supports model sizes tiny/0.5B/7B/9B, JSONL corpus loading, and both human-readable and JSON output modes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend apr train to support --task pretrain in addition to classify:
- plan: loads YAML config via entrenar::config::load_config/validate_config, shows model architecture, training params, and validation verdict
- apply: calls entrenar::config::train_from_yaml for full training loop with TransformerTrainer + CausalLMLoss
New CLI parameters: --config <yaml>, --task pretrain|classify. Existing classify behavior unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…GH-87) APR GPU inference used AprV2ModelCuda (generic transformer, 0.5 tok/s). Now uses MappedAprModel → OwnedQuantizedModel::from_apr() → OwnedQuantizedModelCuda (fused Q4K/Q6K GEMV, 224 tok/s on RTX 4090).
Both benchmark and serve paths updated:
- benchmark.rs: uses generate_gpu_resident() with QuantizedGenerateConfig
- handlers_include_01.rs: uses AppState + create_router (same as GGUF)
- handlers.rs: simplified GPU dispatch caller
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-011)
Add --config <yaml> and --stage <precompute|train> to apr distill.
Config-driven mode:
--plan: validates YAML config, shows teacher/student/training params,
displays two-stage workflow instructions
--stage precompute: loads teacher via RosettaStone, inspects tensors,
writes manifest.json for stage 2
--stage train: reads precompute manifest, validates, sets up KD training
Local DistillYamlConfig types match entrenar's schema (teacher, student,
distillation, training, dataset, output sections). Uses serde_yaml_ng
for YAML parsing.
Teacher model changed from required positional to optional (config mode
doesn't need positional teacher path). Existing file-based mode preserved.
(Refs ALB-011)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…B-007) Refs ALB-007 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…H-375, paimlGH-90) Two fixes:
1. paimlGH-375: GGUF import of Q4_0/Q5_0/Q8_0 files now falls back to the dequant→requant path instead of failing. The raw path rejects these legacy GGML types; the fallback loads as f32 and optionally re-quantizes to Q4K via --quantize.
2. paimlGH-90: Brick benchmarks without run() implementations now report their analytical budget estimate instead of timing a no-op budget() call. Previously reported 0.02us (55M tok/s) for QKV — clearly synthetic. Now reports the real budget (6.0us) with a clear label.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…elog (Refs paimlGH-88)
- README: Add GPU single-decode row (240 tok/s APR Q4K via paimlGH-88)
- CHANGELOG: Add paimlGH-375 (Q4_0 import fallback) and paimlGH-90 (honest bricks)
- Book: Update apr bench docs with --fast GPU example and brick disclaimers
- Book: Update performance table with measured Qwen 1.5B numbers
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closed-form memory projection from architecture parameters, backed by training-memory-kernel-v1.yaml contract. Shows parameter count and breakdown: weights, gradients, optimizer (AdamW m+v), activations (upper bound), CUDA overhead. Refs paiml/albor#39 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thread --quantize-nf4 through CLI → dispatch → finetune → TrainingConfig. Enables NF4 4-bit weight quantization for ~8x VRAM savings on frozen layers. Also patch trueno-gpu to local source for NF4 kernel support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
paiml#335) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l#376) Replace hardcoded model_size match table in run_instruct() with resolve_transformer_config() that reads architecture from .apr file metadata header via read_apr_architecture(). Only reads 64-byte header + metadata JSON (~4 KB), not the full model file. Falls back to --model-size string matching only when no .apr file is provided, and errors on unrecognized sizes instead of silently degrading to TransformerConfig::tiny().
Before: Model: 64h x 2L (vocab 1000) — garbage toy model
After: Model: 4096h x 36L (vocab 151936) — correct Qwen3-8B
Five-whys root cause: run_instruct() was written independently of the LoRA path and never consulted the .apr file's architecture metadata.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
paiml#498) with_length_bounds(0, max) set min_length=0 which could cause division by zero in score_length() if the code path were reached. Add assert!(min > 0) validation in the builder. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
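A sketch of the guard described above (the builder name comes from the commit message; the surrounding struct is omitted, so this free function only illustrates the validation):

```rust
// min_length = 0 would later divide by zero in score_length(),
// so the builder now rejects it eagerly instead of at scoring time.
fn with_length_bounds(min: usize, max: usize) -> (usize, usize) {
    assert!(min > 0, "min_length must be positive (score_length divides by it)");
    assert!(min <= max, "min_length must not exceed max_length");
    (min, max)
}
```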
The parity gate (GPU/CPU logits cosine similarity check) has known false positives on CUDA 13.1 driver, causing profile_gpu_generation to fail silently and fall back to CPU (28 tok/s instead of 159 tok/s). The serve command already had SKIP_PARITY_GATE=1 but profile did not. Also log the actual error instead of silently swallowing it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ix) (Refs paiml#140) realizar 0.8.3 adds no-op release_cpu_pages/advise_sequential stubs on non-Unix platforms, fixing the nightly Windows build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lan (Refs PMAT-068, PMAT-069, PMAT-071)
1. prune --analyze: read actual param_count from APR metadata instead of wrong file_size*2 estimate (showed 14.2B for 1.8B model)
2. data audit --json: use observed class count from data instead of hardcoded default_value=5 for num_classes
3. merge --plan: make --output optional in plan mode via required_unless_present="plan"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fs PMAT-072) Previously, --plan early-returned before checking --batch, so `apr quantize model.apr --batch int8,fp16,int4 --plan` silently ignored the batch flag and only showed a single-scheme plan. Restructured control flow: batch check now precedes plan_only check, and added run_batch_plan() that iterates all schemes with size estimates, reduction ratios, and peak memory for each. Fixes paiml#502 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… (Refs PMAT-072, PMAT-073)
convert: `apr convert` with neither --quantize nor --compress now returns a validation error instead of silently copying the file.
distill: `apr distill --config text.yaml --plan` now shows a plan with prompt count, estimated samples/tokens, and stage info instead of silently ignoring the --plan flag.
Fixes paiml#503 Fixes paiml#504 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ag (Refs PMAT-074, PMAT-075)
tokenize: `--format yaml` now returns an error instead of silently outputting JSON (serde_yaml is not in apr-cli deps).
pipeline: `--parallel N` was passing `-p N` to forjar, but forjar's `-p` is `--param KEY=VALUE`, not parallelism. Now returns an error since forjar has no parallelism flag.
Fixes paiml#505 Fixes paiml#506 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ncluded in JSON (Refs PMAT-076) --strict was silently ignored for all formats. Now for Rosetta path, strict mode fails validation if NaN, Inf, or all-zero tensors exist. --quality --json for GGUF/SafeTensors now includes physics constraint details in JSON output (was only shown in text mode). Fixes paiml#507 Fixes paiml#508 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…078) Both `apr encrypt` and `apr decrypt` now check if the output file exists and refuse to overwrite without `--force`, matching the behavior of convert, quantize, and other commands (F-CONV-064). Fixes paiml#509 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
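A minimal sketch of the shared overwrite guard (error type and message here are illustrative, not the exact apr-cli code):

```rust
use std::path::Path;

// F-CONV-064: refuse to clobber an existing output unless --force is set.
fn check_output(path: &Path, force: bool) -> Result<(), String> {
    if path.exists() && !force {
        return Err(format!(
            "output file {} exists; pass --force to overwrite",
            path.display()
        ));
    }
    Ok(())
}
```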
…ual files
paimlGH-510: explain_error_code(), explain_tensor(), explain_file() now emit structured JSON when --json is set, matching explain_kernel() behavior.
paimlGH-511: generate_model_card() uses actual uploaded files to build the Available Formats table instead of hardcoding model.apr/model.safetensors. to_huggingface_extended() accepts file_names parameter for dynamic output.
(Refs PMAT-079, PMAT-080) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ts unimplemented
paimlGH-512: bench --fast was silently ignored (always uses fast path). Now emits deprecation warning instead of silently dropping the flag.
paimlGH-513: prune --calibration was displayed in config table but never passed to execute_pruning(). Now rejects with explicit error since calibration-aware pruning is not implemented.
(Refs PMAT-081, PMAT-082) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
paimlGH-514: validate_merge_weights() println calls now guarded by json_output flag, preventing weight debug output from corrupting JSON stream.
paimlGH-515: canary create_canary() now validates --input file exists instead of silently ignoring it via underscore prefix.
(Refs PMAT-083, PMAT-084) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
paimlGH-516: run --language and --task now emit deprecation warning instead of silently ignoring. Whisper integration not yet wired to inference engine.
paimlGH-517: profile --compare-hf, --energy, --callgraph, --fail-on-naive now emit warnings instead of silently ignoring via underscore prefix.
paimlGH-518: tune --freeze-base and --train-data now emit warnings instead of silently ignoring via underscore prefix.
(Refs PMAT-085, PMAT-086, PMAT-087) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- paimlGH-519: slice JSON output now includes actual tensor shape instead of hardcoded [count]. Added original_shape param to output_slice_result() and updated all 3 callers (GGUF, APR, SafeTensors).
- paimlGH-520: chat command now warns when --trace-steps, --trace-verbose, or --profile flags are passed (not yet implemented for chat mode).
- paimlGH-521: train watch now warns that --heartbeat-timeout monitoring is not yet implemented (value was displayed but never used for hang detection).
(Refs PMAT-088, PMAT-089, PMAT-090) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n distributed flags
- paimlGH-522: oracle command now warns when --compliance or --tensors flags are used with HuggingFace URIs (these require local file access).
- paimlGH-523: finetune command now warns when --bind, --coordinator, or --expect-workers are provided without --role (distributed training requires unreleased entrenar APIs).
(Refs PMAT-091, PMAT-092) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- paimlGH-524: check command now warns that --no-gpu is not yet implemented for GGUF validation (was silently ignored).
- paimlGH-525: eval compare now uses --data as second model fallback when model_b not provided positionally (was silently ignored despite error message telling users to use --data).
- paimlGH-526: cbtop TUI now warns that --attach is not yet implemented (was silently ignored).
(Refs PMAT-093, PMAT-094, PMAT-095) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
paimlGH-527: collect_ppl_benchmark_pairs() now checks the explicit --data path first before falling back to scanning the model directory. Previously the data_path parameter was silently ignored (underscore-prefixed). (Refs PMAT-096) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- run.rs: --force now skips HF and URL cache checks (paimlGH-528)
- layer.rs: --verbose and layers data warn when not yet used (paimlGH-529)
- flow.rs: reader data warns when not yet used in encoder/decoder flow (paimlGH-530)
(Refs PMAT-097, PMAT-098, PMAT-099) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…aram name
- validate.rs: --strict warns when not yet implemented in APR path (paimlGH-531)
- gpu_isolation_result.rs: renamed _config to config, used in both cfg branches (paimlGH-532)
(Refs PMAT-100, PMAT-101) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Renamed _num_layers to num_layers and added warning that per-layer normalization is not yet implemented. (paimlGH-533) (Refs PMAT-102) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-117)
dispatch.rs/dispatch_analysis.rs: --gpu was pattern-matched as _ and discarded — now overrides --no-gpu when both specified.
handler_apr_cpu_completion.rs: channel send errors and generation failures were silently swallowed via let _ — clients could hang indefinitely. Now logged to stderr.
kernel.rs: GPU warmup/prefill failures silently discarded while timing metrics still recorded. Now logged per-pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Load model once, process all prompts from JSONL input.
Eliminates per-invocation CUDA JIT overhead (~80s on sm_121).
Usage: apr run model.apr --batch-jsonl prompts.jsonl --max-tokens 512
Input: {"prompt": "...", "task_id": "..."}
Output: {"text": "...", "tokens_generated": N, "tok_per_sec": X}
(Refs PMAT-037)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
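A sketch of the batch loop this describes, under stated assumptions: `Model` and `run_prompt` stand in for the real inference API, field names match the JSONL schema above, and serde_json is a dependency. Not the actual apr-cli code:

```rust
use std::io::BufRead;

struct Model; // placeholder for the once-loaded model
impl Model {
    fn run_prompt(&self, _prompt: &str) -> (String, u32, f64) {
        (String::new(), 0, 0.0) // hypothetical inference call
    }
}

// Load the model once, then decode and run each JSONL line, so the
// CUDA JIT cost is paid a single time rather than per invocation.
fn run_batch(model: &Model, input: impl BufRead) -> serde_json::Result<()> {
    for line in input.lines().map_while(Result::ok) {
        let req: serde_json::Value = serde_json::from_str(&line)?;
        let prompt = req["prompt"].as_str().unwrap_or_default();
        let (text, tokens, tps) = model.run_prompt(prompt);
        println!("{}", serde_json::json!({
            "text": text,
            "tokens_generated": tokens,
            "tok_per_sec": tps,
        }));
    }
    Ok(())
}
```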
…MAT-037) Passes temperature and top_k through to BatchInferenceConfig instead of hardcoding 0.0/1. Enables non-greedy batch eval (pass@k with k>1). Defaults unchanged: temperature=0.0, top_k=1 (greedy). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…1 contracts Add pv-compatible proof infrastructure to all kernel contracts that were missing it. pv lint now passes 3/3 gates on all 10 parseable contracts.
Contracts fixed (7 missing all three sections):
- softmax-kernel-v1: 3 obligations (normalization, bounds, monotonicity)
- element-wise-ops-v1: 8 obligations (commutativity, relu, sigmoid, tanh, involution)
- normalization-kernel-v1: 2 obligations (unit variance, SIMD parity)
- rope-kernel-v1: 2 obligations (magnitude preservation, SIMD parity)
- transpose-kernel-v1: 2 obligations (involution, shape reversal)
- matvec-kernel-v1: 2 obligations (SIMD parity, norm bound)
- apr-cpu-q4k-routing-v1: 3 obligations (ordering, invariance, perf bound)
Contracts fixed (4 missing kani_harnesses only):
- compile-v1, data-quality-v1, hpo-v1, inspection-v1
Closes paiml#536 Refs PMAT-122 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create provable-contracts binding config mapping all 55 equations across 16 parseable contracts to their implementing functions in aprender.
Binding status:
- 29 implemented (function exists and matches contract semantics)
- 17 partial (function exists at different path or partial coverage)
- 9 not_implemented (declared function does not exist yet)
Key gaps (not_implemented):
- compile-v1: compile_model does not exist (CLI has run/generate_cargo_project)
- data-quality-v1: split_dataset, oversample, validate_jsonl do not exist
- hpo-v1: asha::should_promote does not exist (entrenar has TPE/Hyperband)
- training-pipeline-v1: detect_stale, expected_improvement do not exist
Closes paiml#535 Refs PMAT-123 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…f_fails (Refs PMAT-149) Reduced pv lint errors. Added kani_harnesses + qa_gate to training-pipeline-v1. Added if_fails to tensor-layout-v1. Fixed special-tokens equations stub, inference subset→invariant. Remaining: ~20 errors from indentation/type issues in batch edits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
This PR fixes the performance issue reported in #388, where Tensor::transpose was significantly slower than ndarray for large matrices.
The previous implementation performed a physical copy of the data on every transpose, O(n²) work for an n×n matrix. This patch replaces it with a lazy transpose that swaps tensor metadata instead of copying data.
Changes
Implementation Details
The new transpose implementation modifies only tensor metadata, avoiding any copy of the underlying data; a sketch of the approach is shown below.
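As a minimal illustration (field and method names are assumed, not the actual aprender `Tensor` layout; a 2-D tensor with explicit strides is used for brevity), a stride-based tensor can transpose by swapping its shape and stride entries:

```rust
struct Tensor {
    data: std::sync::Arc<Vec<f32>>, // shared buffer, never copied on transpose
    shape: [usize; 2],
    strides: [usize; 2],
}

impl Tensor {
    // O(1): clone the Arc (refcount bump) and swap the metadata.
    fn transpose(&self) -> Tensor {
        Tensor {
            data: std::sync::Arc::clone(&self.data),
            shape: [self.shape[1], self.shape[0]],
            strides: [self.strides[1], self.strides[0]],
        }
    }

    // Element (i, j) is located through the strides, so the transposed
    // view reads the same buffer without rearranging it.
    fn get(&self, i: usize, j: usize) -> f32 {
        self.data[i * self.strides[0] + j * self.strides[1]]
    }
}
```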
Physical transposition occurs only when necessary, for example when calling BLAS kernels via Tensor::contiguous().
This design is similar to the lazy transpose approach used by ndarray and other tensor frameworks.
Performance
Benchmark results using aprender-bench-compute
Transpose is now constant time (O(1)), independent of tensor size.
Correctness
Conclusion
This PR resolves the transpose performance regression by introducing a lazy metadata-based transpose implementation, bringing performance in line with modern tensor libraries.