Skip to main content
Skip to main content

DataStore Class Reference

This reference documents the core classes in the DataStore API.

DataStore

The main DataFrame-like class for data manipulation.

from chdb.datastore import DataStore

Constructor

DataStore(data=None, columns=None, index=None, dtype=None, copy=None)

Parameters:

ParameterTypeDescription
datadict/list/DataFrame/DataStoreInput data
columnslistColumn names
indexIndexRow index
dtypedictColumn data types
copyboolCopy data

Examples:

# From dictionary
ds = DataStore({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# From pandas DataFrame
import pandas as pd
ds = DataStore(pd.DataFrame({'a': [1, 2, 3]}))

# Empty DataStore
ds = DataStore()

Properties

PropertyTypeDescription
columnsIndexColumn names
dtypesSeriesColumn data types
shapetuple(rows, columns)
sizeintTotal elements
ndimintNumber of dimensions (2)
emptyboolIs DataFrame empty
valuesndarrayUnderlying data as NumPy array
indexIndexRow index
TDataStoreTranspose
axeslistList of axes

Factory Methods

MethodDescription
uri(uri)Universal factory from URI
from_file(path, ...)Create from file
from_df(df)Create from pandas DataFrame
from_s3(url, ...)Create from S3
from_gcs(url, ...)Create from Google Cloud Storage
from_azure(url, ...)Create from Azure Blob
from_mysql(...)Create from MySQL
from_postgresql(...)Create from PostgreSQL
from_clickhouse(...)Create from ClickHouse
from_mongodb(...)Create from MongoDB
from_sqlite(...)Create from SQLite
from_iceberg(path)Create from Iceberg table
from_delta(path)Create from Delta Lake
from_numbers(n)Create with sequential numbers
from_random(rows, cols)Create with random data
run_sql(query)Create from SQL query

See Factory Methods for details.

Query Methods

MethodReturnsDescription
select(*cols)DataStoreSelect columns
filter(condition)DataStoreFilter rows
where(condition)DataStoreAlias for filter
sort(*cols, ascending=True)DataStoreSort rows
orderby(*cols)DataStoreAlias for sort
limit(n)DataStoreLimit rows
offset(n)DataStoreSkip rows
distinct(subset=None)DataStoreRemove duplicates
groupby(*cols)LazyGroupByGroup rows
having(condition)DataStoreFilter groups
join(right, ...)DataStoreJoin DataStores
union(other, all=False)DataStoreCombine DataStores
when(cond, val)CaseWhenCASE WHEN

See Query Building for details.

Pandas-Compatible Methods

See Pandas Compatibility for the complete list of 209 methods.

Indexing: head(), tail(), sample(), loc, iloc, at, iat, query(), isin(), where(), mask(), get(), xs(), pop()

Aggregation: sum(), mean(), std(), var(), min(), max(), median(), count(), nunique(), quantile(), describe(), corr(), cov(), skew(), kurt()

Manipulation: drop(), drop_duplicates(), dropna(), fillna(), replace(), rename(), assign(), astype(), copy()

Sorting: sort_values(), sort_index(), nlargest(), nsmallest(), rank()

Reshaping: pivot(), pivot_table(), melt(), stack(), unstack(), transpose(), explode(), squeeze()

Combining: merge(), join(), concat(), append(), combine(), update(), compare()

Apply/Transform: apply(), applymap(), map(), agg(), transform(), pipe(), groupby()

Time Series: rolling(), expanding(), ewm(), shift(), diff(), pct_change(), resample()

I/O Methods

MethodDescription
to_csv(path, ...)Export to CSV
to_parquet(path, ...)Export to Parquet
to_json(path, ...)Export to JSON
to_excel(path, ...)Export to Excel
to_df()Convert to pandas DataFrame
to_pandas()Alias for to_df
to_arrow()Convert to Arrow Table
to_dict(orient)Convert to dictionary
to_records()Convert to records
to_numpy()Convert to NumPy array
to_sql()Generate SQL string
to_string()String representation
to_markdown()Markdown table
to_html()HTML table

See I/O Operations for details.

Debugging Methods

MethodDescription
explain(verbose=False)Show execution plan
clear_cache()Clear cached results

See Debugging for details.

Magic Methods

MethodDescription
__getitem__(key)ds['col'], ds[['a', 'b']], ds[condition]
__setitem__(key, value)ds['col'] = value
__delitem__(key)del ds['col']
__len__()len(ds)
__iter__()for col in ds
__contains__(key)'col' in ds
__repr__()repr(ds)
__str__()str(ds)
__eq__(other)ds == other
__ne__(other)ds != other
__lt__(other)ds < other
__le__(other)ds <= other
__gt__(other)ds > other
__ge__(other)ds >= other
__add__(other)ds + other
__sub__(other)ds - other
__mul__(other)ds * other
__truediv__(other)ds / other
__floordiv__(other)ds // other
__mod__(other)ds % other
__pow__(other)ds ** other
__and__(other)ds & other
__or__(other)`ds
__invert__()~ds
__neg__()-ds
__pos__()+ds
__abs__()abs(ds)

ColumnExpr

Represents a column expression for lazy evaluation. Returned when accessing a column.

# ColumnExpr is returned automatically
col = ds['name']  # Returns ColumnExpr

Properties

PropertyTypeDescription
namestrColumn name
dtypedtypeData type

Accessors

AccessorDescriptionMethods
.strString operations56 methods
.dtDateTime operations42+ methods
.arrArray operations37 methods
.jsonJSON parsing13 methods
.urlURL parsing15 methods
.ipIP address operations9 methods
.geoGeo/distance operations14 methods

See Accessors for complete documentation.

Arithmetic Operations

ds['total'] = ds['price'] * ds['quantity']
ds['profit'] = ds['revenue'] - ds['cost']
ds['ratio'] = ds['a'] / ds['b']
ds['squared'] = ds['value'] ** 2
ds['remainder'] = ds['value'] % 10

Comparison Operations

ds[ds['age'] > 25]           # Greater than
ds[ds['age'] >= 25]          # Greater or equal
ds[ds['age'] < 25]           # Less than
ds[ds['age'] <= 25]          # Less or equal
ds[ds['name'] == 'Alice']    # Equal
ds[ds['name'] != 'Bob']      # Not equal

Logical Operations

ds[(ds['age'] > 25) & (ds['city'] == 'NYC')]    # AND
ds[(ds['age'] > 25) | (ds['city'] == 'NYC')]    # OR
ds[~(ds['status'] == 'inactive')]               # NOT

Methods

MethodDescription
as_(alias)Set alias name
cast(dtype)Cast to type
astype(dtype)Alias for cast
isnull()Is NULL
notnull()Is not NULL
isna()Alias for isnull
notna()Alias for notnull
isin(values)In list of values
between(low, high)Between two values
fillna(value)Fill NULLs
replace(to_replace, value)Replace values
clip(lower, upper)Clip values
abs()Absolute value
round(decimals)Round values
floor()Floor
ceil()Ceiling
apply(func)Apply function
map(mapper)Map values

Aggregation Methods

MethodDescription
sum()Sum
mean()Mean
avg()Alias for mean
min()Minimum
max()Maximum
count()Count non-null
nunique()Unique count
std()Standard deviation
var()Variance
median()Median
quantile(q)Quantile
first()First value
last()Last value
any()Any true
all()All true

LazyGroupBy

Represents a grouped DataStore for aggregation operations.

# LazyGroupBy is returned automatically
grouped = ds.groupby('category')  # Returns LazyGroupBy

Methods

MethodReturnsDescription
agg(spec)DataStoreAggregate
aggregate(spec)DataStoreAlias for agg
sum()DataStoreSum per group
mean()DataStoreMean per group
count()DataStoreCount per group
min()DataStoreMin per group
max()DataStoreMax per group
std()DataStoreStd dev per group
var()DataStoreVariance per group
median()DataStoreMedian per group
nunique()DataStoreUnique count per group
first()DataStoreFirst value per group
last()DataStoreLast value per group
nth(n)DataStoreNth value per group
head(n)DataStoreFirst n per group
tail(n)DataStoreLast n per group
apply(func)DataStoreApply function per group
transform(func)DataStoreTransform per group
filter(func)DataStoreFilter groups

Column Selection

# Select column after groupby
grouped['amount'].sum()     # Returns DataStore
grouped[['a', 'b']].sum()   # Returns DataStore

Aggregation Specifications

# Single aggregation
grouped.agg({'amount': 'sum'})

# Multiple aggregations per column
grouped.agg({'amount': ['sum', 'mean', 'count']})

# Named aggregations
grouped.agg(
    total=('amount', 'sum'),
    average=('amount', 'mean'),
    count=('id', 'count')
)

LazySeries

Represents a lazy Series (single column).

Properties

PropertyTypeDescription
namestrSeries name
dtypedtypeData type

Methods

Inherits most methods from ColumnExpr. Key methods:

MethodDescription
value_counts()Value frequencies
unique()Unique values
nunique()Count unique
mode()Mode value
to_list()Convert to list
to_numpy()Convert to array
to_frame()Convert to DataStore

F (Functions)

Namespace for ClickHouse functions.

from chdb.datastore import F, Field

# Aggregations
F.sum(Field('amount'))
F.avg(Field('price'))
F.count(Field('id'))
F.quantile(Field('value'), 0.95)

# Conditional
F.sum_if(Field('amount'), Field('status') == 'completed')
F.count_if(Field('active'))

# Window
F.row_number().over(order_by='date')
F.lag('price', 1).over(partition_by='product', order_by='date')

See Aggregation for details.

Field

Reference to a column by name.

from chdb.datastore import Field

# Create field reference
amount = Field('amount')
price = Field('price')

# Use in expressions
F.sum(Field('amount'))
F.avg(Field('price'))

CaseWhen

Builder for CASE WHEN expressions.

# Create case-when expression
result = (ds
    .when(ds['score'] >= 90, 'A')
    .when(ds['score'] >= 80, 'B')
    .when(ds['score'] >= 70, 'C')
    .otherwise('F')
)

# Assign to column
ds['grade'] = result

Window

Window specification for window functions.

from chdb.datastore import F

# Create window
window = F.window(
    partition_by='category',
    order_by='date',
    rows_between=(-7, 0)
)

# Use with aggregation
ds['rolling_avg'] = F.avg('price').over(window)