DataStore Execution Model
Understanding DataStore's lazy evaluation model is key to using it effectively and achieving optimal performance.
Lazy Evaluation
DataStore uses lazy evaluation - operations are not executed immediately but are recorded and compiled into optimized SQL queries. Execution happens only when results are actually needed.
Example: Lazy vs Eager
from chdb import datastore as pd
ds = pd.read_csv("sales.csv")
# These operations are NOT executed yet
result = (ds
    .filter(ds['amount'] > 1000)     # Recorded, not executed
    .select('region', 'amount')      # Recorded, not executed
    .groupby('region')               # Recorded, not executed
    .agg({'amount': 'sum'})          # Recorded, not executed
    .sort('sum', ascending=False)    # Recorded, not executed
)
# Still no execution - just building the query plan
print(result.to_sql())
# SELECT region, SUM(amount) AS sum
# FROM file('sales.csv', 'CSVWithNames')
# WHERE amount > 1000
# GROUP BY region
# ORDER BY sum DESC
# NOW execution happens
df = result.to_df() # <-- Triggers execution
Benefits of Lazy Evaluation
- Query Optimization: Multiple operations compile to a single optimized SQL query
- Filter Pushdown: Filters are applied at the data source level
- Column Pruning: Only needed columns are read (see the sketch after this list)
- Deferred Decisions: Execution engine can be chosen at runtime
- Plan Inspection: You can view/debug the query before executing
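For example, combining a filter with a narrow column selection produces a single query that reads only the required columns and applies the predicate at the source. A minimal sketch (file and column names are illustrative; the exact SQL text may differ):
ds = pd.read_csv("sales.csv")
query = ds.filter(ds['amount'] > 1000).select('region', 'amount')

print(query.to_sql())
# SELECT region, amount
# FROM file('sales.csv', 'CSVWithNames')
# WHERE amount > 1000
# Only two columns are read (column pruning), the predicate is applied
# at the source (filter pushdown), and nothing has executed yet.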
Execution Triggers
Execution is triggered automatically when you need actual values:
Automatic Triggers
| Trigger | Example | Description |
|---|---|---|
| print() / repr() | print(ds) | Display results |
| len() | len(ds) | Get row count |
| .columns | ds.columns | Get column names |
| .dtypes | ds.dtypes | Get column types |
| .shape | ds.shape | Get dimensions |
| .index | ds.index | Get row index |
| .values | ds.values | Get NumPy array |
| Iteration | for row in ds | Iterate over rows |
| to_df() | ds.to_df() | Convert to pandas |
| to_pandas() | ds.to_pandas() | Alias for to_df() |
| to_dict() | ds.to_dict() | Convert to dict |
| to_numpy() | ds.to_numpy() | Convert to array |
| .equals() | ds.equals(other) | Compare DataStores |
Examples:
# All these trigger execution
print(ds) # Display
len(ds) # 1000
ds.columns # Index(['name', 'age', 'city'])
ds.shape # (1000, 3)
list(ds) # List of values
ds.to_df() # pandas DataFrame
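Because these triggers fire implicitly, a property access in the middle of a pipeline causes an extra round trip. A minimal sketch of the difference (column names are illustrative):
# Accidental trigger: .shape executes the intermediate query
step = ds.filter(ds['age'] > 25)
print(step.shape)                                          # query #1 runs here
df = step.groupby('city').agg({'age': 'mean'}).to_df()     # query #2 (new, different query)

# Better: build the full pipeline, then trigger once
df = (ds
    .filter(ds['age'] > 25)
    .groupby('city')
    .agg({'age': 'mean'})
    .to_df())                                              # single query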
Operations That Stay Lazy
| Operation | Returns | Description |
|---|---|---|
| filter() | DataStore | Adds WHERE clause |
| select() | DataStore | Adds column selection |
| sort() | DataStore | Adds ORDER BY |
| groupby() | LazyGroupBy | Prepares GROUP BY |
| join() | DataStore | Adds JOIN |
| ds['col'] | ColumnExpr | Column reference |
| ds[['col1', 'col2']] | DataStore | Column selection |
Examples:
# These do NOT trigger execution - they stay lazy
result = ds.filter(ds['age'] > 25) # Returns DataStore
result = ds.select('name', 'age') # Returns DataStore
result = ds['name'] # Returns ColumnExpr
result = ds.groupby('city') # Returns LazyGroupBy
Three-Phase Execution
DataStore operations follow a three-phase execution model:
Phase 1: SQL Query Building (Lazy)
Operations that can be expressed in SQL are accumulated:
result = (ds
    .filter(ds['status'] == 'active')   # WHERE
    .select('user_id', 'amount')        # SELECT
    .groupby('user_id')                 # GROUP BY
    .agg({'amount': 'sum'})             # SUM()
    .sort('sum', ascending=False)       # ORDER BY
    .limit(10)                          # LIMIT
)
# All compiled into one SQL query
Phase 2: Execution Point
When a trigger occurs, the accumulated SQL is executed:
# Execution triggered here
df = result.to_df()
# The single optimized SQL query runs now
Phase 3: DataFrame Operations (if any)
If you chain pandas-only operations after execution:
# Mixed operations
result = (ds
    .filter(ds['amount'] > 100)   # Phase 1: SQL
    .to_df()                      # Phase 2: Execute
    .pivot_table(...)             # Phase 3: pandas
)
Viewing Execution Plans
Use explain() to see what will be executed:
ds = pd.read_csv("sales.csv")
query = (ds
    .filter(ds['amount'] > 1000)
    .groupby('region')
    .agg({'amount': ['sum', 'mean']})
)
# View execution plan
query.explain()
Output:
Pipeline:
1. Source: file('sales.csv', 'CSVWithNames')
2. Filter: amount > 1000
3. GroupBy: region
4. Aggregate: sum(amount), avg(amount)
Generated SQL:
SELECT region, SUM(amount) AS sum, AVG(amount) AS mean
FROM file('sales.csv', 'CSVWithNames')
WHERE amount > 1000
GROUP BY region
Use verbose=True for more details:
query.explain(verbose=True)
See Debugging: explain() for complete documentation.
Caching
DataStore caches execution results to avoid redundant queries.
How Caching Works
ds = pd.read_csv("data.csv")
result = ds.filter(ds['age'] > 25)
# First access - executes query
print(result.shape) # Executes and caches
# Second access - uses cache
print(result.columns) # Uses cached result
# Third access - uses cache
df = result.to_df() # Uses cached result
Cache Invalidation
Cache is invalidated when operations modify the DataStore:
result = ds.filter(ds['age'] > 25)
print(result.shape) # Executes, caches
# New operation invalidates cache
result2 = result.filter(result['city'] == 'NYC')
print(result2.shape) # Re-executes (different query)
Manual Cache Control
# Clear cache
ds.clear_cache()
# Disable caching
from chdb.datastore.config import config
config.set_cache_enabled(False)
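A short usage sketch of the two controls above, assuming clear_cache() simply drops any cached result so that the next trigger re-runs the query:
result = ds.filter(ds['age'] > 25)
print(result.shape)     # executes and caches
result.clear_cache()    # drop the cached result
print(result.shape)     # the same query runs again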
Mixing SQL and Pandas Operations
DataStore intelligently handles operations that mix SQL and pandas:
SQL-Compatible Operations
These compile to SQL:
- filter(), where()
- select()
- groupby(), agg()
- sort(), orderby()
- limit(), offset()
- join(), union()
- distinct()
- Column operations (math, comparison, string methods)
Pandas-Only Operations
These trigger execution and use pandas:
- apply() with custom functions (see the sketch after this list)
- pivot_table() with complex aggregations
- stack(), unstack()
- Operations on executed DataFrames
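For instance, a row-wise apply() cannot be translated to SQL, so the pending query is executed first and the function then runs in pandas. A minimal sketch, assuming apply() mirrors the pandas signature (the adjustment itself is illustrative):
result = ds.filter(ds['amount'] > 100)                      # still lazy (SQL)
adjusted = result.apply(lambda row: row['amount'] * 1.1,    # triggers execution,
                        axis=1)                             # then runs in pandas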
Hybrid Pipelines
# SQL phase
result = (ds
    .filter(ds['amount'] > 100)   # SQL
    .groupby('category')          # SQL
    .agg({'amount': 'sum'})       # SQL
)
# Execution + pandas phase
result = (result
    .to_df()            # Execute SQL
    .pivot_table(...)   # pandas operation
)
Execution Engine Selection
DataStore can execute operations using different engines:
Auto Mode (Default)
from chdb.datastore.config import config
config.set_execution_engine('auto') # Default
# Automatically selects best engine per operation
Force chDB Engine
config.set_execution_engine('chdb')
# All operations use ClickHouse SQL
Force pandas Engine
config.set_execution_engine('pandas')
# All operations use pandas
See Configuration: Execution Engine for details.
Good: Filter Early
# Good: Filter in SQL, then aggregate
result = (ds
    .filter(ds['date'] >= '2024-01-01')   # Reduces data early
    .groupby('category')
    .agg({'amount': 'sum'})
)
Bad: Filter Late
# Bad: Aggregate all, then filter
result = (ds
    .groupby('category')
    .agg({'amount': 'sum'})
    .to_df()
    .query('sum > 1000')   # Pandas filter after aggregation
)
Good: Select Columns Early
# Good: Select columns in SQL
result = (ds
    .select('user_id', 'amount', 'date')
    .filter(ds['date'] >= '2024-01-01')
    .groupby('user_id')
    .agg({'amount': 'sum'})
)
Good: Let SQL Do the Work
# Good: Complex aggregation in SQL
result = (ds
    .groupby('category')
    .agg({
        'amount': ['sum', 'mean', 'count'],
        'quantity': 'sum'
    })
    .sort('sum', ascending=False)
    .limit(10)
)
# One SQL query does everything
# Bad: Multiple separate queries
sums = ds.groupby('category')['amount'].sum().to_df()
means = ds.groupby('category')['amount'].mean().to_df()
# Two queries instead of one
Best Practices Summary
- Chain operations before executing - Build the full query, then trigger once
- Filter early - Reduce data at the source
- Select only needed columns - Column pruning improves performance
- Use explain() to understand execution - Debug before running
- Let SQL handle aggregations - ClickHouse is optimized for this
- Be aware of execution triggers - Avoid accidental early execution
- Use caching wisely - Understand when cache is invalidated
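Putting these together, a typical pipeline selects and filters early, inspects the plan, and triggers execution exactly once (file and column names are illustrative):
ds = pd.read_csv("sales.csv")

pipeline = (ds
    .select('region', 'amount', 'date')     # column pruning
    .filter(ds['date'] >= '2024-01-01')     # filter pushdown
    .groupby('region')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .limit(10))

pipeline.explain()       # inspect the plan before running
df = pipeline.to_df()    # single execution trigger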