DataStore: Pandas-Compatible API with SQL Optimization
DataStore is chDB's pandas-compatible API that combines the familiar pandas DataFrame interface with the power of SQL query optimization. Write pandas-style code, get ClickHouse performance.
Key Features
- Pandas Compatibility: 209 pandas DataFrame methods, 56 .str methods, 42+ .dt methods
- SQL Optimization: Operations automatically compile to optimized SQL queries
- Lazy Evaluation: Operations are deferred until results are needed
- 630+ API Methods: Comprehensive API surface for data manipulation
- ClickHouse Extensions: Additional accessors (.arr, .json, .url, .ip, .geo) not available in pandas
Architecture
DataStore uses lazy evaluation with dual-engine execution:
- Lazy Operation Chain: Operations are recorded, not executed immediately
- Smart Engine Selection: QueryPlanner routes each segment to the optimal engine (chDB for SQL, pandas for complex ops)
- Intermediate Caching: Results are cached at each step for fast iterative exploration
See Execution Model for details.
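To make the lazy operation chain concrete, here is a minimal sketch in plain Python. It is not DataStore's implementation: `LazyQuery` and its methods are hypothetical names invented for illustration, and the sketch collapses every chain into a single SELECT, ignoring engine selection, caching, and chains where operation order changes the result (for example, `head` before `filter`). It only shows that each call records an operation and that SQL is produced once, when asked for.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LazyQuery:
    """Toy lazy chain (hypothetical): calls record operations, nothing executes."""

    table: str
    filters: Tuple[str, ...] = ()
    order_by: Optional[str] = None
    limit: Optional[int] = None

    def filter(self, condition: str) -> "LazyQuery":
        # Record the condition; return a new object instead of touching data.
        return replace(self, filters=self.filters + (condition,))

    def sort(self, column: str, ascending: bool = True) -> "LazyQuery":
        direction = "ASC" if ascending else "DESC"
        return replace(self, order_by=f"{column} {direction}")

    def head(self, n: int) -> "LazyQuery":
        return replace(self, limit=n)

    def to_sql(self) -> str:
        # Compile the recorded operations into one SQL statement.
        sql = f"SELECT * FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.filters)
        if self.order_by:
            sql += f" ORDER BY {self.order_by}"
        if self.limit is not None:
            sql += f" LIMIT {self.limit}"
        return sql


query = LazyQuery("sales").filter("amount > 1000").sort("amount", ascending=False).head(10)
print(query.to_sql())
```

Building `query` touches no data; only the final `to_sql()` call does any work, which is what lets a planner see the whole chain before deciding how to run it.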
One-Line Migration from Pandas
Swap the import and the rest of your existing pandas code works unchanged, but now runs on the ClickHouse engine.
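A hedged sketch of what the migration might look like. The import path `chdb.datastore` is an assumption that this page does not confirm, so check the Quickstart for the exact form; `load_frame_module` and `big_sales_by_region` are hypothetical helpers. The fallback to pandas is there so the same pipeline can run against either module while you migrate.

```python
def load_frame_module():
    """Return the DataStore module if chDB is installed, otherwise pandas."""
    try:
        # The one changed line: this replaces `import pandas as pd`.
        # NOTE: the import path is an assumption, not confirmed by this page.
        from chdb import datastore as pd
    except ImportError:
        import pandas as pd
    return pd


def big_sales_by_region(df):
    """Sum `amount` per `region` for rows above 1000, in plain pandas syntax."""
    large = df[df["amount"] > 1000]
    return large.groupby("region")["amount"].sum()
```

With either module, `pd = load_frame_module()` followed by `big_sales_by_region(pd.read_csv("sales.csv"))` would run the same pipeline, assuming the module exposes `read_csv` as pandas does; `sales.csv` with `region` and `amount` columns is a placeholder.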
Performance Comparison
DataStore runs about 3x to 20x faster than pandas in the benchmark below, with the largest gains on GroupBy count and complex pipelines:
| Operation | Pandas | DataStore | Speedup |
|---|---|---|---|
| GroupBy count | 347ms | 17ms | 19.93x |
| Complex pipeline | 2,047ms | 380ms | 5.39x |
| Filter+Sort+Head | 1,537ms | 350ms | 4.40x |
| GroupBy agg | 406ms | 141ms | 2.88x |
Benchmarked on 10M rows. See the benchmark script and Performance Guide for details.
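For reproducing numbers like these on your own data, a small timing harness is enough. The sketch below is not the benchmark script referenced above; `median_ms` and `speedup` are hypothetical helpers that use only the standard library. Note that recomputing ratios from the rounded timings in the table does not always match the Speedup column exactly (347 / 17 is about 20.4, not 19.93), so the published ratios were presumably computed from unrounded measurements.

```python
import statistics
import time


def median_ms(fn, repeats=5):
    """Run fn() `repeats` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Stand-in workloads; swap in the pandas and DataStore versions of one pipeline.
slow = median_ms(lambda: sorted(range(200_000), key=lambda x: -x))
fast = median_ms(lambda: sorted(range(200_000), reverse=True))
print(f"baseline {slow:.1f} ms, candidate {fast:.1f} ms, speedup {speedup(slow, fast):.2f}x")
```

Because DataStore evaluates lazily, the timed function has to force the result (for example by converting or printing it); otherwise only the construction of the operation chain is measured.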
When to Use DataStore
Use DataStore when:
- Working with large datasets (millions of rows)
- Performing aggregations and groupby operations
- Querying data from files, databases, or cloud storage
- Building complex data pipelines
- Keeping the pandas API while getting better performance
Use raw SQL API when:
- You prefer writing SQL directly
- You need fine-grained control over query execution
- You work with ClickHouse-specific features not exposed in the pandas API
Feature Comparison
| Feature | Pandas | Polars | DuckDB | DataStore |
|---|---|---|---|---|
| Pandas API compatible | - | Partial | No | Full |
| Lazy evaluation | No | Yes | Yes | Yes |
| SQL query support | No | Yes | Yes | Yes |
| ClickHouse functions | No | No | No | Yes |
| String/DateTime accessors | Yes | Yes | No | Yes + extras |
| Array/JSON/URL/IP/Geo | No | Partial | No | Yes |
| Direct file queries | No | Yes | Yes | Yes |
| Cloud storage support | No | Limited | Yes | Yes |
API Statistics
| Category | Count | Coverage |
|---|---|---|
| DataFrame methods | 209 | 100% of pandas |
| Series.str accessor | 56 | 100% of pandas |
| Series.dt accessor | 42+ | 100%+ (includes ClickHouse extras) |
| Series.arr accessor | 37 | ClickHouse-specific |
| Series.json accessor | 13 | ClickHouse-specific |
| Series.url accessor | 15 | ClickHouse-specific |
| Series.ip accessor | 9 | ClickHouse-specific |
| Series.geo accessor | 14 | ClickHouse-specific |
| Total API methods | 630+ | - |
Documentation Navigation
Getting Started
- Quickstart - Installation and basic usage
- Migration from Pandas - Step-by-step migration guide
API Reference
- Factory Methods - Creating DataStore from various sources
- Query Building - SQL-style query operations
- Pandas Compatibility - All 209 pandas-compatible methods
- Accessors - String, DateTime, Array, JSON, URL, IP, Geo accessors
- Aggregation - Aggregate and window functions
- I/O Operations - Reading and writing data
Advanced Topics
- Execution Model - Lazy evaluation and caching
- Class Reference - Complete API reference
Configuration & Debugging
- Configuration - All configuration options
- Debugging - Explain, profiling, and logging
Pandas User Guides
- Pandas Cookbook - Common patterns
- Key Differences - Important differences from pandas
- Performance Guide - Optimization tips
- SQL for Pandas Users - Understanding the SQL behind pandas operations
Quick Example
Next Steps
- New to DataStore? Start with the Quickstart Guide
- Coming from pandas? Read the Migration Guide
- Want to learn more? Explore the API Reference