DataStore Factory Methods
DataStore provides over 20 factory methods to create instances from various data sources including local files, databases, cloud storage, and data lakes.
Universal URI Interface
The uri() method is the recommended universal entry point; it auto-detects the source type from the URI.
URI Syntax Reference
| Source Type | URI Format | Example |
|---|---|---|
| Local file | path/to/file | data.csv, /abs/path/data.parquet |
| S3 | s3://bucket/path | s3://mybucket/data.parquet?nosign=true |
| GCS | gs://bucket/path | gs://mybucket/data.csv |
| Azure | az://container/path | az://mycontainer/data.parquet |
| HTTP/HTTPS | https://url | https://example.com/data.csv |
| MySQL | mysql://user:pass@host:port/db/table | mysql://root:pass@localhost:3306/mydb/users |
| PostgreSQL | postgresql://user:pass@host:port/db/table | postgresql://postgres:pass@localhost:5432/mydb/users |
| SQLite | sqlite:///path?table=name | sqlite:///data.db?table=users |
| ClickHouse | clickhouse://host:port/db/table | clickhouse://localhost:9000/default/hits |
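As a rough illustration of how scheme-based detection could work, the sketch below classifies a URI with Python's standard library only. The function name `classify_uri` and the `SCHEMES` mapping are hypothetical and are not part of DataStore; the actual logic inside `uri()` may differ.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical mapping from URI scheme to the source types in the table above.
SCHEMES = {
    "s3": "S3",
    "gs": "GCS",
    "az": "Azure",
    "http": "HTTP/HTTPS",
    "https": "HTTP/HTTPS",
    "mysql": "MySQL",
    "postgresql": "PostgreSQL",
    "sqlite": "SQLite",
    "clickhouse": "ClickHouse",
}


def classify_uri(uri: str) -> tuple[str, dict[str, str]]:
    """Return (source type, query options) for a URI.

    Anything without a recognized scheme is treated as a local file path
    in this sketch.
    """
    parts = urlsplit(uri)
    source = SCHEMES.get(parts.scheme, "Local file")
    # Keep the first value of each query parameter, e.g. ?nosign=true or ?table=users.
    options = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return source, options


print(classify_uri("s3://mybucket/data.parquet?nosign=true"))
print(classify_uri("sqlite:///data.db?table=users"))
print(classify_uri("data.csv"))
```

Query parameters such as `nosign` and `table` are returned separately so that a caller could pass them on as source-specific options.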
File Sources
from_file
Create DataStore from a local or remote file with automatic format detection.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | required | File path (local or URL) |
| format | str | None | File format (auto-detected if None) |
| compression | str | None | Compression type (auto-detected if None) |
Supported formats: CSV, TSV, Parquet, JSON, JSONLines, ORC, Avro, Arrow
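To make the idea of auto-detection concrete, here is a minimal sketch that guesses the format and compression from a file name. The `detect` helper and both extension tables are assumptions made for illustration; DataStore's real detection rules, including which extensions and compression codecs it recognizes, may differ.

```python
from pathlib import PurePosixPath

# Hypothetical extension tables; the format names mirror the supported list above.
FORMATS = {
    ".csv": "CSV",
    ".tsv": "TSV",
    ".parquet": "Parquet",
    ".json": "JSON",
    ".jsonl": "JSONLines",
    ".orc": "ORC",
    ".avro": "Avro",
    ".arrow": "Arrow",
}
COMPRESSIONS = {".gz": "gzip", ".zst": "zstd", ".bz2": "bz2", ".xz": "xz"}


def detect(path: str):
    """Guess (format, compression) from the file name's suffixes; None when unknown."""
    suffixes = [s.lower() for s in PurePosixPath(path).suffixes]
    compression = None
    # A trailing compression suffix (e.g. ".gz") is peeled off first.
    if suffixes and suffixes[-1] in COMPRESSIONS:
        compression = COMPRESSIONS[suffixes.pop()]
    fmt = FORMATS.get(suffixes[-1]) if suffixes else None
    return fmt, compression


print(detect("data.csv.gz"))
print(detect("logs/events.jsonl"))
```

When the guess would be wrong or the file has no extension, passing `format` and `compression` explicitly avoids relying on detection at all.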
Examples:
Pandas-Compatible Read Functions
Cloud Storage
from_s3
Create DataStore from Amazon S3.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | S3 URL (s3://bucket/path) |
| access_key_id | str | None | AWS access key ID |
| secret_access_key | str | None | AWS secret access key |
| format | str | None | File format (auto-detected) |
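The `s3://bucket/path` form of the `url` parameter breaks down into a bucket name and an object key. The helper below is a hypothetical, standard-library-only sketch for validating such a URL before handing it to `from_s3`; it is not part of DataStore.

```python
from urllib.parse import urlsplit


def split_s3_url(url: str) -> tuple[str, str]:
    """Split s3://bucket/path into (bucket, key); raise ValueError otherwise."""
    parts = urlsplit(url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    # The query string (for example ?nosign=true) is not part of the object key.
    return parts.netloc, parts.path.lstrip("/")


print(split_s3_url("s3://mybucket/data.parquet?nosign=true"))
```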
Examples:
from_gcs
Create DataStore from Google Cloud Storage.
Examples:
from_azure
Create DataStore from Azure Blob Storage.
Examples:
from_hdfs
Create DataStore from HDFS.
Examples:
from_url
Create DataStore from HTTP/HTTPS URL.
Examples:
Databases
from_mysql
Create DataStore from MySQL database.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| host | str | required | MySQL host |
| database | str | required | Database name |
| table | str | required | Table name |
| user | str | required | Username |
| password | str | required | Password |
| port | int | 3306 | Port number |
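These parameters correspond to the components of the `mysql://user:pass@host:port/db/table` format from the URI syntax reference. The sketch below assembles such a URI from the individual values; `mysql_uri` is a hypothetical helper, and percent-encoding the credentials assumes that the URI parser decodes them, which the source does not confirm.

```python
from urllib.parse import quote


def mysql_uri(host: str, database: str, table: str,
              user: str, password: str, port: int = 3306) -> str:
    """Assemble mysql://user:pass@host:port/db/table from its parts."""
    # Percent-encode credentials so characters like "@" or "/" cannot break the URI.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"mysql://{credentials}@{host}:{port}/{database}/{table}"


print(mysql_uri("localhost", "mydb", "users", "root", "pass"))
```

With the default port, the call above reproduces the MySQL example from the URI syntax reference.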
Examples:
from_postgresql
Create DataStore from PostgreSQL database.
Examples:
from_clickhouse
Create DataStore from ClickHouse server.
Examples:
from_mongodb
Create DataStore from MongoDB.
Examples:
from_sqlite
Create DataStore from SQLite database.
Examples:
Data Lakes
from_iceberg
Create DataStore from Apache Iceberg table.
Examples:
from_delta
Create DataStore from Delta Lake table.
Examples:
from_hudi
Create DataStore from Apache Hudi table.
Examples:
In-Memory Sources
from_df / from_dataframe
Create DataStore from pandas DataFrame.
Examples:
DataFrame Constructor
Create DataStore using pandas-like constructor.
Special Sources
from_numbers
Create DataStore with sequential numbers (useful for testing).
Examples:
from_random
Create DataStore with random data.
Examples:
run_sql
Create DataStore from raw SQL query.
Examples:
Summary Table
| Method | Source Type | Example |
|---|---|---|
| uri() | Universal | DataStore.uri("s3://bucket/data.parquet") |
| from_file() | Local/Remote files | DataStore.from_file("data.csv") |
| read_csv() | CSV files | pd.read_csv("data.csv") |
| read_parquet() | Parquet files | pd.read_parquet("data.parquet") |
| from_s3() | Amazon S3 | DataStore.from_s3("s3://bucket/path") |
| from_gcs() | Google Cloud Storage | DataStore.from_gcs("gs://bucket/path") |
| from_azure() | Azure Blob | DataStore.from_azure("az://container/path") |
| from_hdfs() | HDFS | DataStore.from_hdfs("hdfs://host/path") |
| from_url() | HTTP/HTTPS | DataStore.from_url("https://example.com/data.csv") |
| from_mysql() | MySQL | DataStore.from_mysql(host, db, table, user, password) |
| from_postgresql() | PostgreSQL | DataStore.from_postgresql(host, db, table, user, password) |
| from_clickhouse() | ClickHouse | DataStore.from_clickhouse(host, db, table) |
| from_mongodb() | MongoDB | DataStore.from_mongodb(uri, db, collection) |
| from_sqlite() | SQLite | DataStore.from_sqlite("data.db", table) |
| from_iceberg() | Apache Iceberg | DataStore.from_iceberg("/path/to/table") |
| from_delta() | Delta Lake | DataStore.from_delta("/path/to/table") |
| from_hudi() | Apache Hudi | DataStore.from_hudi("/path/to/table") |
| from_df() | pandas DataFrame | DataStore.from_df(pandas_df) |
| DataFrame() | Dictionary/DataFrame | pd.DataFrame({'a': [1, 2, 3]}) |
| from_numbers() | Sequential numbers | DataStore.from_numbers(1000000) |
| from_random() | Random data | DataStore.from_random(rows=1000, columns=5) |
| run_sql() | Raw SQL | DataStore.run_sql("SELECT * FROM ...") |