[QDP] feat: add credit card fraud benchmark + amplitude encoding optimizations #1106
rich7420 wants to merge 4 commits into apache:main
Conversation
It seems a little too big. Sorry about that.
viiccwen
left a comment
Thx for contributing! 🙌
I'll look deeper tomorrow, and I think we should add tests to cover the new loader APIs, especially since the new behavior crosses the Python, PyO3, Rust, and CUDA boundaries.
qdp/qdp-python/qumat_qdp/loader.py (outdated)

```python
elif kind == "numpy":
    for qt in raw_iter:
        yield _torch.from_dlpack(qt).cpu().numpy()
```
`as_torch()` validates that torch is installed, but `as_numpy()` does not, and `_wrap_iterator()` calls `_torch.from_dlpack(...)` for the "numpy" path.
Does that mean `as_numpy()` can succeed at configuration time and then fail during iteration with an unclear runtime error if PyTorch is not installed? 🤔
Oh, nice catch! You're right.
nice
Please resolve the conflicts.
Please take a look and test; no hurry, whenever you have time.
ryankert01
left a comment
The PennyLane baseline uses the CPU for training, whereas the QDP pipeline has dual CPU/GPU paths. We can simplify both to always use the GPU for training.
No problem!
Please resolve the conflicts.
Can you also give some more context in the PR description about why this is needed?
400Ping
left a comment
The array-loader optimization claim does not match the implementation in this PR. `create_array_loader()` says batching uses slices without a per-batch `to_vec()`, but `PipelineIterator::take_batch_from_source()` still clones each in-memory batch with `data[start..end].to_vec()`.

`next_batch()` already handles `InMemory` via the zero-copy `&data[start..end]` slice. Removed the dead-code `.to_vec()` clone path that contradicted the documented optimization claim. Addresses the review comment from 400Ping.
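The difference between the removed clone path and the zero-copy path can be sketched in isolation (simplified types; the real `PipelineIterator` internals differ):

```rust
// Mirrors the removed `.to_vec()` path: allocates and copies each batch.
fn take_batch_owned(data: &[f64], start: usize, end: usize) -> Vec<f64> {
    data[start..end].to_vec()
}

// Mirrors the zero-copy path `next_batch()` uses: borrows from the buffer.
fn take_batch_slice(data: &[f64], start: usize, end: usize) -> &[f64] {
    &data[start..end]
}

fn main() {
    let data: Vec<f64> = (0..8).map(|i| i as f64).collect();
    // Both yield the same values, but only the first allocates per batch.
    assert_eq!(take_batch_owned(&data, 2, 5), take_batch_slice(&data, 2, 5));
    assert_eq!(take_batch_slice(&data, 2, 5), &[2.0, 3.0, 4.0][..]);
}
```

The borrowed form avoids one heap allocation and one memcpy per batch, which matters at the Credit Card Fraud scale (hundreds of thousands of rows iterated per epoch).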
Great catch! You are right that the code in `take_batch_from_source()` was misleading.
Changes
- `encoding_benchmarks/qdp_pipeline/creditcardfraud_amplitude.py`: 5-qubit amplitude VQC on Credit Card Fraud data, aligned with the PennyLane baseline (same circuit, loss, optimizer). Closes the QDP vs. baseline training-time gap from ~22% slower to <1%.
- `encoding_benchmarks/pennylane_baseline/creditcardfraud_amplitude.py`: PennyLane reference implementation with AUPRC/F1 metrics for imbalanced data.
- `QuantumDataLoader` API: added `source_array(X)` (in-memory, no temp file), `as_torch(device)`, and `as_numpy()` for ergonomic batch output formats.
- `PipelineIterator`: added a `new_from_array()` constructor; `InMemory` `next_batch` now passes the `&data[start..end]` slice directly (no per-batch `to_vec()`).
- `amplitude.rs`: moved D2H norm validation to after the encode kernel + `device.synchronize()`, eliminating a mid-pipeline GPU→CPU roundtrip in `encode_batch`.
- `requires_grad=False` on all data arrays to prevent `AdamOptimizer` from computing unnecessary gradients through state vectors; `AmplitudeEmbedding(normalize=False)` in place of `StatePrep`; `.real` extraction after `torch.from_dlpack()` to convert the complex128 DLPack output to float64.

Motivation
The existing QDP benchmark suite only covers the Iris dataset (100 samples, 2 qubits), which is too small to surface real-world data-loading and encoding bottlenecks. Credit Card Fraud (284,807 transactions, 5 qubits) is a standard imbalanced-classification benchmark from Kaggle/OpenML that stresses the full QDP pipeline — batch iteration, GPU encoding, and training — at realistic scale.
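One of the changes above, the `.real` extraction after `torch.from_dlpack()`, boils down to a dtype conversion that can be illustrated with a NumPy-only sketch (torch interop omitted to keep it self-contained):

```python
import numpy as np

# An amplitude-encoded state vector comes back as complex128 (e.g. via
# DLPack); for real-valued amplitude encodings the imaginary parts are
# (near) zero, so `.real` yields a float64 view without copying data.
state = np.full(4, 0.5 + 0.0j)  # uniform 2-qubit state, complex128
features = state.real

assert state.dtype == np.complex128
assert features.dtype == np.float64
```

Feeding the real part into the classical loss keeps the optimizer in float64 and avoids PyTorch's complex-autograd path entirely.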
Adding this benchmark serves these purposes: `source_array(X)`, `as_torch()`, and `as_numpy()` are exercised end-to-end by the new benchmark and tests, catching integration issues across the Python → PyO3 → Rust → CUDA boundary.
Checklist