[Feature] Enhance Qlib Observability: Structured Logging, Metrics, and Workflow Tracing
Enhance Qlib's observability infrastructure to support structured logging, performance metrics collection, and workflow tracing.
Currently, Qlib uses basic Python logging via `get_module_logger`, plus `TimeInspector` for timing. This proposal aims to build upon the existing infrastructure to provide comprehensive observability capabilities that help users monitor, debug, and optimize their quantitative research workflows.
Motivation
1. Application Scenarios
Data Pipeline Debugging
When processing large datasets through the `DatasetH → DataHandlerLP → Processor` chain, it's difficult to identify bottlenecks:
- Cache hit/miss rates (`ExpressionCache`, `DatasetCache`) are not exposed
- Memory usage during `D.features()` calls is invisible
Model Training Monitoring
`TrainerR` / `TrainerRM` lack visibility into per-epoch resource consumption:
- No metrics for comparing training efficiency across different models in `qlib/contrib/model/`
- `DelayTrainer` execution timeline is hard to trace
Backtest Performance Analysis
- `Exchange` order execution timing is not captured
- `Executor` decision-making latency is invisible
- Nested executor scenarios (`NestedExecutor`) are especially hard to debug
Online Serving Observability
- `OnlineManager` model update cycles lack monitoring
- Rolling training (`qlib/contrib/rolling/`) progress tracking is limited
2. Related Works
- OpenTelemetry Python — vendor-neutral APIs and SDK for traces, metrics, and logs
3. Important Information
Current infrastructure to build upon
- `qlib/log.py`: `QlibLogger`, `TimeInspector`, `get_module_logger`
- `qlib/workflow/recorder.py`: `log_metrics()` for experiment metrics
- `qlib/config.py`: `logging_config` for log configuration
Proposed Solution
Phase 1: Enhanced Logging (Low effort, High value)
Extend `qlib/log.py` to support structured logging, so that log records carry machine-readable key/value fields alongside the message.
Example usage:

```python
from qlib.log import get_module_logger

logger = get_module_logger("data.handler", structured=True)
logger.info("Dataset loaded", extra={
    "dataset_size": len(dataset),
    "features_count": 158,
    "time_range": "2020-01-01 to 2023-12-31",
    "cache_hit": True,
})
```
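To make the proposal concrete, here is a minimal sketch of how the JSON output could be produced with the standard library alone. `StructuredFormatter` is a hypothetical name, not an existing Qlib class; it folds any `extra={...}` fields into a single JSON line:

```python
import json
import logging

class StructuredFormatter(logging.Formatter):
    """Render each log record as single-line JSON, folding `extra` fields in (sketch)."""

    # Attributes present on every LogRecord; anything beyond these came from `extra`.
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge the user-supplied structured fields.
        payload.update({k: v for k, v in vars(record).items() if k not in self._STANDARD})
        return json.dumps(payload, default=str)

logger = logging.getLogger("qlib.data.handler")
handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Dataset loaded", extra={"features_count": 158, "cache_hit": True})
```

`get_module_logger(..., structured=True)` could simply attach such a formatter, keeping the default plain-text path untouched.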
Configuration via `qlib.init()`:

```python
qlib.init(
    provider_uri="~/.qlib/qlib_data/cn_data",
    logging_config={
        "structured": True,
        "format": "json",  # or "console"
    },
)
```
Phase 2: Performance Metrics Collection
Add optional metrics collection to key components.
Example (data layer)
# In qlib/data/data.py
class LocalDatasetProvider:
def dataset(self, ...):
with MetricsCollector.timer("data.dataset.load_time"):
# existing logic
MetricsCollector.gauge("data.dataset.memory_mb", get_memory_usage())
MetricsCollector.counter("data.dataset.cache_hits", cache_hit_count)
Expose metrics via:
- Prometheus-compatible endpoint (optional)
- `R.log_metrics()` integration for experiment correlation
- Console summary at workflow end
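`MetricsCollector` does not exist in Qlib today; the sketch below shows one possible shape for it, assuming a process-local registry with the three primitives used above (`timer`, `gauge`, `counter`) plus a `summary()` backing the console export:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class MetricsCollector:
    """Process-local metrics registry (hypothetical sketch, not an existing Qlib API)."""

    _timers = defaultdict(list)   # name -> list of observed durations (seconds)
    _gauges = {}                  # name -> last observed value
    _counters = defaultdict(int)  # name -> running total

    @classmethod
    def timer(cls, name):
        @contextmanager
        def _span():
            start = time.perf_counter()
            try:
                yield
            finally:
                cls._timers[name].append(time.perf_counter() - start)
        return _span()

    @classmethod
    def gauge(cls, name, value):
        cls._gauges[name] = value

    @classmethod
    def counter(cls, name, inc=1):
        cls._counters[name] += inc

    @classmethod
    def summary(cls):
        """Aggregate view for the end-of-workflow console summary."""
        return {
            "timers": {k: sum(v) for k, v in cls._timers.items()},
            "gauges": dict(cls._gauges),
            "counters": dict(cls._counters),
        }
```

A Prometheus or MLflow exporter would then just be an alternative consumer of `summary()` (or of the raw series), keeping instrumentation call sites unchanged.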
Phase 3: Workflow Tracing (Optional)
Add context propagation for complex workflows.
Automatic span creation for key operations:

```python
with R.start(experiment_name="test"):
    # trace_id automatically propagated
    dataset = init_instance_by_config(task["dataset"])  # span: dataset.init
    model.fit(dataset)                                  # span: model.fit
    backtest(...)                                       # span: backtest.execute
```
Configuration Design
```yaml
# workflow_config.yaml
qlib_init:
    provider_uri: "~/.qlib/qlib_data/cn_data"
    observability:
        enabled: true
        structured_logging: true
        metrics:
            enabled: true
            export: "prometheus"  # or "console", "mlflow"
        tracing:
            enabled: false  # opt-in for advanced users
```
Alternatives
- Keep current approach: use `TimeInspector.logt()` manually — lacks structured data and aggregation
- External APM tools: Requires significant integration effort and may not understand Qlib-specific semantics
- MLflow-only: Already integrated but focused on experiment tracking, not system observability
Additional Notes
Backward Compatibility
- All features opt-in via configuration
- Default behavior unchanged
- Zero overhead when disabled
Implementation Priority
- Structured logging in `qlib/log.py` (1–2 PRs)
- Key metrics in `qlib/data/` and `qlib/model/trainer.py` (2–3 PRs)
- Backtest metrics in `qlib/backtest/` (1–2 PRs)
- Tracing (future, based on community feedback)
Affected Modules
- `qlib/log.py` — Core changes
- `qlib/config.py` — New configuration options
- `qlib/data/data.py`, `qlib/data/cache.py` — Data layer metrics
- `qlib/model/trainer.py` — Training metrics
- `qlib/backtest/exchange.py`, `qlib/backtest/executor.py` — Backtest metrics
Are you willing to submit a PR?