refactor(high_performance_ds): Use Parallel to process DataFrame sort…#2244

Open
bincrack wants to merge 1 commit into microsoft:main from bincrack:refactor/high_performance_ds_parallel

bincrack commented Jun 2, 2026

…ing in parallel

Description

NumpyQuote processes all data in a single thread, which takes a long time on very large datasets. This PR sorts each instrument's data in parallel instead.

The Fix

Original code (qlib/backtest/high_performance_ds.py):

quote_dict = {}
for stock_id, stock_val in quote_df.groupby(level="instrument", group_keys=False):
    quote_dict[stock_id] = idd.MultiData(stock_val.droplevel(level="instrument"))
    quote_dict[stock_id].sort_index()  # To support more flexible slicing, we must sort data first

Fixed code:

from joblib import delayed
from ..config import C
from ..utils.paral import ParallelExt

def sort_index(stock_df: pd.DataFrame):
    quote = idd.MultiData(stock_df.droplevel(level="instrument"))
    quote.sort_index()
    return quote

In NumpyQuote.__init__:

workers = max(min(C.get_kernels(freq), len(quote_df)), 1)
inst_l = []
task_l = []
for stock_id, stock_val in quote_df.groupby(level="instrument"):
    inst_l.append(stock_id)
    task_l.append(
        delayed(sort_index)(
            stock_val
        )
    )
quote_dict = dict(
    zip(
        inst_l,
        ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l),
    )
)
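The group, dispatch, and zip pattern above can be checked against the serial loop with a small standalone sketch. It makes several assumptions that are not part of this PR: plain pandas DataFrames stand in for idd.MultiData, joblib.Parallel with the threading backend stands in for ParallelExt and the configured backend, and the helper names (sort_group, build_serial, build_parallel) and sample quotes are hypothetical.

```python
import pandas as pd
from joblib import Parallel, delayed


def sort_group(stock_df: pd.DataFrame) -> pd.DataFrame:
    """Drop the instrument level and sort the remaining index (one task per instrument)."""
    return stock_df.droplevel(level="instrument").sort_index()


def build_serial(quote_df: pd.DataFrame) -> dict:
    """Single-threaded baseline, mirroring the original loop."""
    return {
        stock_id: sort_group(stock_val)
        for stock_id, stock_val in quote_df.groupby(level="instrument")
    }


def build_parallel(quote_df: pd.DataFrame, n_jobs: int = 2) -> dict:
    """Same result, but each group's sort is submitted as a joblib task."""
    inst_l, task_l = [], []
    for stock_id, stock_val in quote_df.groupby(level="instrument"):
        inst_l.append(stock_id)
        task_l.append(delayed(sort_group)(stock_val))
    # joblib returns results in submission order, so zip keeps keys and values aligned.
    results = Parallel(n_jobs=n_jobs, backend="threading")(task_l)
    return dict(zip(inst_l, results))


# Hypothetical sample data: two instruments with out-of-order timestamps.
index = pd.MultiIndex.from_tuples(
    [
        ("SH600000", pd.Timestamp("2020-01-03")),
        ("SZ000001", pd.Timestamp("2020-01-02")),
        ("SH600000", pd.Timestamp("2020-01-02")),
        ("SZ000001", pd.Timestamp("2020-01-03")),
    ],
    names=["instrument", "datetime"],
)
quote_df = pd.DataFrame({"close": [10.2, 20.1, 10.0, 20.4]}, index=index)

serial = build_serial(quote_df)
parallel = build_parallel(quote_df)
```

Comparing serial and parallel key by key (for example with DataFrame.equals) is one way to confirm that the refactor changes only how the work is scheduled, not the resulting quote_dict.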

Motivation and Context

How Has This Been Tested?

  • Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

  1. Pipeline test:
  2. Your own tests:

Types of changes

  • [ Y ] Fix bugs
  • Add new feature
  • Update documentation


bincrack commented Jun 2, 2026

@microsoft-github-policy-service agree

@bincrack please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree
