refactor(high_performance_ds): Use Parallel to process DataFrame sorting in parallel by bincrack · Pull Request #2244 · microsoft/qlib

bincrack · 2026-06-02T10:14:34Z

Description

NumpyQuote processes all data in a single thread, which takes a long time on very large datasets. This PR builds and sorts the per-instrument data in parallel instead.

The Fix

Original code (qlib/backtest/high_performance_ds.py):

quote_dict = {}
for stock_id, stock_val in quote_df.groupby(level="instrument", group_keys=False):
    quote_dict[stock_id] = idd.MultiData(stock_val.droplevel(level="instrument"))
    quote_dict[stock_id].sort_index()  # To support more flexible slicing, we must sort data first
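
The comment in the loop above says the data must be sorted to support slicing. As a minimal sketch of why, assuming range lookups are done by binary search over the index (an assumption for illustration, not a description of idd.MultiData internals), the standard-library version below only returns correct windows once the rows are sorted. The helper name slice_sorted and the sample rows are hypothetical.

```python
from bisect import bisect_left, bisect_right


def slice_sorted(index, values, start, end):
    """Return the values whose label lies in [start, end].

    Binary search is only valid when `index` is sorted ascending.
    """
    lo = bisect_left(index, start)   # first position with label >= start
    hi = bisect_right(index, end)    # first position with label > end
    return values[lo:hi]


# Hypothetical per-instrument rows, deliberately out of order.
rows = [("2020-01-03", 3.0), ("2020-01-01", 1.0), ("2020-01-02", 2.0)]

# Sort once up front so every later range lookup can use binary search.
rows.sort(key=lambda r: r[0])
index = [label for label, _ in rows]
values = [value for _, value in rows]

window = slice_sorted(index, values, "2020-01-02", "2020-01-03")
```

The start and end labels do not need to be present in the index; the lookup still returns the rows that fall inside the range.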

Fixed code:

from joblib import delayed
from ..config import C
from ..utils.paral import ParallelExt

def sort_index(stock_df: pd.DataFrame):
    quote = idd.MultiData(stock_df.droplevel(level="instrument"))
    quote.sort_index()
    return quote

In NumpyQuote.__init__:

workers = max(min(C.get_kernels(freq), len(quote_df)), 1)
inst_l = []
task_l = []
for stock_id, stock_val in quote_df.groupby(level="instrument"):
    inst_l.append(stock_id)
    task_l.append(
        delayed(sort_index)(
            stock_val
        )
    )
quote_dict = dict(
    zip(
        inst_l,
        ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l),
    )
)
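
The pattern in the fixed code is: collect keys and tasks in the same order, run the tasks in a pool, then zip the keys back onto the ordered results. The sketch below illustrates that pattern with the standard library only, so it runs without qlib, pandas, or joblib. It swaps ParallelExt for concurrent.futures.ThreadPoolExecutor and uses plain tuples instead of DataFrames; the names clamp_workers, sort_rows, and build_quote_dict are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def clamp_workers(available: int, n_tasks: int) -> int:
    """Never use more workers than tasks, and never fewer than one."""
    return max(min(available, n_tasks), 1)


def sort_rows(rows):
    """Stand-in for the PR's sort_index: sort one instrument's rows by timestamp."""
    return sorted(rows, key=itemgetter(0))


def build_quote_dict(records, available_workers=4):
    """records: iterable of (instrument, timestamp, value) tuples."""
    # itertools.groupby only groups adjacent items, so order by instrument first.
    ordered = sorted(records, key=itemgetter(0))
    inst_l, task_l = [], []
    for inst, group in groupby(ordered, key=itemgetter(0)):
        inst_l.append(inst)
        task_l.append([(ts, val) for _, ts, val in group])

    workers = clamp_workers(available_workers, len(task_l))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so zip() pairs correctly.
        results = list(pool.map(sort_rows, task_l))
    return dict(zip(inst_l, results))


records = [
    ("B", "2020-01-02", 2.0),
    ("A", "2020-01-02", 20.0),
    ("B", "2020-01-01", 1.0),
    ("A", "2020-01-01", 10.0),
]
quote_dict = build_quote_dict(records)
```

Two differences from the PR are worth noting. Threads in CPython will not speed up a CPU-bound sort the way a process-based joblib backend can; the thread pool is used here only to keep the sketch dependency-free. Also, the sketch clamps the worker count by the number of groups, whereas len(quote_df) in the PR is the DataFrame's row count.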

Motivation and Context

How Has This Been Tested?

Pass the test by running pytest qlib/tests/test_all_pipeline.py from the parent directory of qlib.
If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

Pipeline test:
Your own tests:

Types of changes

[ Y ] Fix bugs
[ ] Add new feature
[ ] Update documentation


bincrack · 2026-06-02T11:38:54Z

@microsoft-github-policy-service agree

@bincrack please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree

(when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Contributor License Agreement

refactor(high_performance_ds): Use Parallel to process DataFrame sorting in parallel

6bf4e74

bincrack wants to merge 1 commit into microsoft:main from bincrack:refactor/high_performance_ds_parallel
