Execution Engine Configuration
DataStore can execute operations using different backends. This guide explains how to configure and optimize engine selection.
Available Engines
| Engine | Description | Best For |
|---|---|---|
| `auto` | Automatically selects the best engine per operation | General use (default) |
| `chdb` | Forces all operations through ClickHouse SQL | Large datasets, aggregations |
| `pandas` | Forces all operations through pandas | Compatibility testing, pandas-specific features |
Setting the Engine
Global Configuration
Checking Current Engine
Auto Mode
In auto mode (default), DataStore selects the optimal engine for each operation:
Operations Executed in chDB
- SQL-compatible filtering (`filter()`, `where()`)
- Column selection (`select()`)
- Sorting (`sort()`, `orderby()`)
- Grouping and aggregation (`groupby().agg()`)
- Joins (`join()`, `merge()`)
- Distinct (`distinct()`, `drop_duplicates()`)
- Limiting (`limit()`, `head()`, `tail()`)
Operations Executed in pandas
- Custom apply functions (`apply(custom_func)`)
- Complex pivot tables with custom aggregations
- Operations not expressible in SQL
- When input is already a pandas DataFrame
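The two lists above amount to a routing rule, which can be sketched in plain Python. This is only an illustration under stated assumptions: `pick_engine`, `SQL_FRIENDLY_OPS`, and the `groupby_agg` label are hypothetical names, and the order of the checks is a guess, not DataStore's actual implementation.

```python
# Hypothetical sketch of auto-mode routing; not DataStore's real logic.

# Operation names taken from the chDB list above.
# "groupby_agg" is a made-up label standing in for groupby().agg().
SQL_FRIENDLY_OPS = {
    "filter", "where", "select", "sort", "orderby", "groupby_agg",
    "join", "merge", "distinct", "drop_duplicates", "limit", "head", "tail",
}


def pick_engine(op_name, uses_custom_function=False, input_is_pandas=False):
    """Return 'chdb' or 'pandas' for one operation (illustrative rules only)."""
    # Custom Python callables cannot be expressed in SQL.
    if uses_custom_function:
        return "pandas"
    # Data that is already a pandas DataFrame stays in pandas.
    if input_is_pandas:
        return "pandas"
    # SQL-expressible operations go to chDB; everything else falls back to pandas.
    return "chdb" if op_name in SQL_FRIENDLY_OPS else "pandas"


# Route each step of a small pipeline.
for op in ["filter", "groupby_agg", "apply", "head"]:
    print(op, "->", pick_engine(op, uses_custom_function=(op == "apply")))
```

In this sketch the decision is made per operation, so a single pipeline can mix both engines, which matches the per-operation selection described for auto mode.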
Example
chDB Mode
Force all operations through ClickHouse SQL:
When to Use
- Processing large datasets (millions of rows)
- Heavy aggregation workloads
- When you want maximum SQL optimization
- Consistent behavior across all operations
Performance Characteristics
| Operation Type | Performance |
|---|---|
| GroupBy/Aggregation | Excellent (up to about 20x faster than pandas) |
| Complex Filtering | Excellent |
| Sorting | Very Good |
| Simple Single Filters | Good (slight overhead) |
Limitations
- Custom Python functions may not be supported
- Some pandas-specific features require conversion
pandas Mode
Force all operations through pandas:
When to Use
- Compatibility testing with pandas
- Using pandas-specific features
- Debugging pandas-related issues
- When data is already in pandas format
Performance Characteristics
| Operation Type | Performance |
|---|---|
| Simple Single Operations | Good |
| Custom Functions | Excellent |
| Complex Aggregations | Slower than chDB |
| Large Datasets | Memory intensive |
Cross-DataStore Engine
Configure the engine for operations that combine columns from different DataStores:
Example
Engine Selection Logic
Auto Mode Decision Tree
Function-Level Override
Some functions can have their engine explicitly configured:
See Function Config for details.
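One way to picture a function-level override is as a lookup that takes priority over the global engine. The sketch below assumes that precedence; `resolve_engine`, the `overrides` mapping, and the function names in it are hypothetical and do not reflect DataStore's configuration API.

```python
# Hypothetical sketch: per-function engine overrides layered on a global default.
VALID_ENGINES = {"auto", "chdb", "pandas"}


def resolve_engine(function_name, overrides, global_engine="auto"):
    """Return the engine for function_name, preferring an explicit override."""
    # Assumption: an explicit per-function setting wins over the global engine.
    engine = overrides.get(function_name, global_engine)
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    return engine


# Illustrative function names only.
overrides = {"length": "chdb", "custom_score": "pandas"}
print(resolve_engine("length", overrides))  # explicit override
print(resolve_engine("upper", overrides))   # falls back to the global engine
```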
Performance Comparison
Benchmark results on 10M rows:
| Operation | pandas (ms) | chdb (ms) | Speedup |
|---|---|---|---|
| GroupBy count | 347 | 17 | 19.93x |
| Combined ops | 1,535 | 234 | 6.56x |
| Complex pipeline | 2,047 | 380 | 5.39x |
| Filter+Sort+Head | 1,537 | 350 | 4.40x |
| GroupBy agg | 406 | 141 | 2.88x |
| Single filter | 276 | 526 | 0.52x |
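The Speedup column reads as pandas time divided by chdb time, so values below 1 mean pandas was faster. The sketch below recomputes that ratio from the rounded millisecond figures in the table; the helper names are illustrative. Ratios recomputed this way can differ slightly from the published column (the first row comes out near 20.4 rather than 19.93), presumably because the published figures were derived from unrounded timings.

```python
# Timings in milliseconds, copied from the benchmark table above:
# operation -> (pandas_ms, chdb_ms)
BENCHMARKS = {
    "GroupBy count": (347, 17),
    "Combined ops": (1535, 234),
    "Complex pipeline": (2047, 380),
    "Filter+Sort+Head": (1537, 350),
    "GroupBy agg": (406, 141),
    "Single filter": (276, 526),
}


def speedup(pandas_ms, chdb_ms):
    """Speedup of chdb over pandas; below 1 means pandas was faster."""
    return pandas_ms / chdb_ms


def faster_engine(pandas_ms, chdb_ms):
    """Name the engine with the lower time (ties go to pandas)."""
    return "chdb" if chdb_ms < pandas_ms else "pandas"


for name, (p_ms, c_ms) in BENCHMARKS.items():
    print(f"{name}: {speedup(p_ms, c_ms):.2f}x ({faster_engine(p_ms, c_ms)} faster)")
```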
Key insights:
- chDB excels at aggregations and complex pipelines
- pandas is faster for simple single operations (about 1.9x on the single-filter benchmark)
- Use `auto` mode to get the best of both