"Scale Intelligence, Accelerate Insights"
Building on 2025's achievements in vector search and indexing capabilities, Apache Doris continues to deepen its AI support in 2026. This roadmap focuses on advancing AI & Hybrid Search capabilities while enhancing query performance, storage efficiency, and data lake integration.
AI & Hybrid Search Innovation:
- Scale vector index to support 10 billion vectors per table with disk-based ANN
- Enhance full-text search with query expressions, scoring, and multi-index support
- Extend hybrid search to Iceberg for unified analytics
Core Enhancements:
- Query engine optimization for complex data types and ETL processing
- Storage improvements for ultra-large tablets and compute-storage separation
- Data lake integration with Iceberg V3 and Paimon support
Roadmap 2025
Roadmap 2024
Roadmap 2023
Roadmap 2022
AI & Hybrid Search
Vector Index
- Implement index-only scan for vector index
- Implement disk-based ANN (Approximate Nearest Neighbor) for vector index
- Optimize compaction policy for vector index
- Enhance vector index capability to support 10 billion vectors in a single table
- Introduce vector index support for Iceberg tables
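The disk-based ANN item above can be illustrated with a minimal IVF-style sketch (all names here are hypothetical, not Doris internals): vectors are grouped into clusters whose centroids stay in memory, and a query scans only the posting lists of the nearest clusters, so the bulk of the vectors can live on disk.

```python
import math

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class IVFIndex:
    """Toy inverted-file (IVF) ANN index: centroids in memory, posting lists on 'disk'."""
    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = {i: [] for i in range(len(centroids))}  # cluster id -> [(row_id, vector)]

    def add(self, row_id, vec):
        # Assign each vector to the posting list of its nearest centroid.
        cid = min(range(len(self.centroids)), key=lambda i: l2(vec, self.centroids[i]))
        self.lists[cid].append((row_id, vec))

    def search(self, query, k=1, nprobe=1):
        # Probe only the nprobe nearest clusters instead of scanning everything.
        order = sorted(range(len(self.centroids)), key=lambda i: l2(query, self.centroids[i]))
        candidates = [item for cid in order[:nprobe] for item in self.lists[cid]]
        return sorted(candidates, key=lambda rv: l2(query, rv[1]))[:k]
```

Trading recall for I/O via `nprobe` is what makes the disk-resident case practical: only a small fraction of posting lists is read per query.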
Full-Text Search
- Introduce more query expressions: query string and Boolean query
- Implement scoring functionality in the text index
- Introduce multi-index support for a single column
- Add text index support for Iceberg tables
- Integrate scoring with global lazy materialization
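The Boolean-query and scoring items above can be sketched with a toy inverted index (hypothetical code, not the Doris text index): postings map each term to per-document term frequencies, `AND`/`OR` combine posting sets, and a score ranks the surviving documents.

```python
from collections import defaultdict

class TextIndex:
    """Toy inverted index with AND/OR Boolean queries and term-frequency scoring."""
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term_frequency}

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def query(self, terms, mode="OR"):
        doc_sets = [set(self.postings[t]) for t in terms]
        docs = set.intersection(*doc_sets) if mode == "AND" else set.union(*doc_sets)
        # Score = summed term frequency; real engines use BM25 or similar.
        scored = {d: sum(self.postings[t].get(d, 0) for t in terms) for d in docs}
        return sorted(scored.items(), key=lambda kv: -kv[1])
```

A production index would replace the raw term-frequency score with a normalized model such as BM25, but the shape of the problem is the same.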
Query Engine
Performance
- Optimize column pruning for complex data types (struct, array, map)
- Optimize expression execution for cases such as `CASE WHEN` and non-constant `LIKE`
- Enhance partition pruning capability
- Optimize broadcast join performance
- Implement query condition cache functionality
- Enhance zonemap evaluation to support expressions
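Zone maps keep per-block min/max statistics so blocks can be skipped without being read; extending evaluation to expressions means a predicate like `x + 10 > 100` can also be bounded from the block's min/max when the expression is monotonic. A minimal sketch (hypothetical helper, monotonically increasing expressions only):

```python
def can_skip_block(zmin, zmax, expr, threshold):
    """Skip a block for predicate expr(x) > threshold: for a monotonically
    increasing expr, its maximum over the block is expr(zmax)."""
    return expr(zmax) <= threshold

# Blocks described by their (min, max) column stats; predicate: x + 10 > 100
blocks = [(0, 50), (40, 95), (90, 200)]
kept = [b for b in blocks if not can_skip_block(*b, expr=lambda x: x + 10, threshold=100)]
```

Here the first block is pruned without a read because even its maximum value cannot satisfy the predicate.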
ETL/Incremental Processing
- Enhance spill-to-disk capability to support TPC-DS 10TB workload using 16GB memory
- Implement `MERGE INTO` statement
- Implement binlog and incremental materialized view functionality
- Implement global query buffer management to reduce per-query memory usage and make memory consumption more predictable
- Implement progress bar for long-running queries
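`MERGE INTO` upserts a source into a target: rows that match on the key are updated, the rest are inserted. A minimal sketch of those semantics in plain Python (illustration only, not Doris syntax):

```python
def merge_into(target, source):
    """Emulate MERGE INTO over {key: row_dict} tables:
    WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT."""
    merged = dict(target)
    for key, row in source.items():
        merged[key] = {**merged.get(key, {}), **row}  # update existing row or insert new one
    return merged
```

The real statement additionally supports `WHEN MATCHED THEN DELETE` and per-branch conditions; this sketch covers only the core update-or-insert path.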
New Features
- Implement UNNEST functionality
- Implement recursive CTE (Common Table Expression)
- Implement ASOF join functionality
- Introduce Python UDF (User-Defined Function) support
- Introduce nested variant data type support
- Enhance function compatibility with Snowflake
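An ASOF join matches each left row with the most recent right row whose timestamp is not later than the left's, a common pattern when aligning events such as trades with quotes. A minimal sketch using binary search (hypothetical helper, not Doris syntax):

```python
import bisect

def asof_join(left, right):
    """left: [(ts, row)]; right: [(ts, row)] sorted by ts.
    Each left row joins the latest right row with right_ts <= left_ts."""
    right_ts = [ts for ts, _ in right]
    out = []
    for ts, row in left:
        i = bisect.bisect_right(right_ts, ts) - 1  # rightmost right_ts <= ts
        out.append((ts, row, right[i][1] if i >= 0 else None))
    return out
```

Left rows earlier than every right row get no match (`None`), mirroring the NULL a SQL ASOF join would produce.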
New Data Types
- Introduce timestamp with timezone data type
- Introduce binary data type
Enhancement
- Unify predicate and expression framework between external tables and internal tables
- Implement short-circuit expression evaluation
- Unify local exchange and global exchange, and move local exchange to FE planner
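Short-circuit evaluation of a conjunction applies each predicate only to rows that survived the earlier ones, so expensive expressions run on a shrinking selection. A minimal columnar sketch (hypothetical, with call counting to make the effect visible):

```python
def evaluate_and(predicates, rows):
    """Evaluate AND-ed predicates with short-circuiting over a selection vector;
    calls[i] records how many rows predicate i actually evaluated."""
    calls = [0] * len(predicates)
    selected = list(range(len(rows)))
    for i, pred in enumerate(predicates):
        survivors = []
        for r in selected:
            calls[i] += 1
            if pred(rows[r]):
                survivors.append(r)
        selected = survivors  # later predicates see only surviving rows
    return selected, calls
```

Ordering cheap, selective predicates first maximizes the saving, which is why this interacts with the unified predicate framework above.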
Data Storage
Storage Format
- Optimize compression ratio for string data
- Enhance storage format to support 10k columns in a single file
- Optimize column metadata management for random access
- Optimize nullable column read performance
- Optimize storage for sparse columns in variant data type
- Implement partial update functionality for variant sub-fields
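Sparse sub-columns of a variant type are mostly null, so storing only (row id, value) pairs is far cheaper than a dense column. A minimal sketch of such an encoding (hypothetical, illustration only):

```python
def encode_sparse(values, null=None):
    """Store only non-null entries as (row_id, value) pairs plus the total row count."""
    return len(values), [(i, v) for i, v in enumerate(values) if v is not null]

def decode_sparse(length, pairs, null=None):
    """Rebuild the dense column by filling nulls and placing stored values."""
    out = [null] * length
    for i, v in pairs:
        out[i] = v
    return out
```

Random access stays cheap because the pair list is ordered by row id and can be binary-searched.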
Data Management
- Enhance tablet management to support ultra-large tablets (100GB+)
- Optimize MOW (Merge-On-Write) import performance for large tablets
File Cache
- Implement table-level cross-compute group synchronized preheating
- Implement partition time-based TTL (Time-To-Live) support
- Enhance SQL query capability for more granular and reliable cache usage statistics
- Optimize diskless/slow disk scenarios to prevent local disk from becoming a file cache throughput bottleneck
- Implement cache blacklist/whitelist policy for fine-grained cache management
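Partition time-based TTL means cached data expires a configured lifetime after it is loaded. A minimal sketch with an injectable clock (hypothetical names, not the Doris file cache):

```python
class TTLCache:
    """Toy TTL cache: entries expire ttl time units after insertion.
    now_fn is injected so the sketch is deterministic and testable."""
    def __init__(self, ttl, now_fn):
        self.ttl = ttl
        self.now = now_fn
        self.store = {}  # key -> (value, inserted_at)

    def put(self, key, value):
        self.store[key] = (value, self.now())

    def get(self, key):
        item = self.store.get(key)
        if item is None:
            return None
        value, inserted_at = item
        if self.now() - inserted_at >= self.ttl:
            del self.store[key]  # lazy eviction on access
            return None
        return value
```

A real file cache would also evict proactively and track per-partition TTLs; lazy eviction keeps the sketch short.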
Compute-Storage Separation
- Implement ultra-fast elastic balance scheduling
- Enhance read-write separation: bind compaction to write compute groups
- Implement distributed cache support for sharing cache across multiple compute groups
- Enhance persistent metadata caching to reduce dependency on metadata service and improve performance
Data Import
- Optimize memory management for large imports with many active tablets, which can otherwise produce many small files: implement memtable disk spill
- Optimize memory control for scenarios with very large single-row single-column data
- Introduce support for more data import sources, such as AWS Kinesis
Data Lakes
Lake Format Performance
- Implement Parquet format Page Cache capability
- Enable Data Cache by default
- Enhance metadata parsing, planning, and caching for ultra-large scale Iceberg and Paimon
- Implement Condition Cache for Iceberg and Paimon
Materialized View
- Implement snapshot-level incremental refresh for materialized views based on Iceberg and Paimon
- Implement materialized view construction based on Paimon and Iceberg
Data Interoperability
- Implement comprehensive Iceberg V3 support
- Implement Iceberg data sorting functionality
- Implement Iceberg Data Rewrite functionality
- Implement Iceberg Delete/Update functionality
- Implement Iceberg/Parquet Variant data type support
- Implement Paimon data write
- Implement native reader for Paimon MOR (Merge-On-Read) tables
- Implement Fluss integration
- Implement Paimon Vector and Blob data type support
- Implement standardized Arrow Flight Data Catalog
Metadata Interoperability
- Implement unified permission management for Iceberg REST Catalog
- Implement integration with third-party authentication and authorization systems
- Implement Open Metadata API
Security
- Enhance object storage support for IAM role-based authentication from more cloud vendors
Others
- Refactor all third-party builds to use CMake
- Implement hermetic build support