DataStore Class Reference - ClickHouse Documentation

This reference covers the core classes of the DataStore API.

DataStore

The main class for data processing, similar to a pandas DataFrame.

from chdb.datastore import DataStore

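Constructor

DataStore(data=None, columns=None, index=None, dtype=None, copy=None)

Parameters:

Parameter	Type	Description
`data`	dict/list/DataFrame/DataStore	Input data
`columns`	list	Column names
`index`	Index	Row index
`dtype`	dict	Column data types
`copy`	bool	Whether to copy the input data

Examples:

# Create from a dictionary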
ds = DataStore({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Create from a pandas DataFrame
import pandas as pd
ds = DataStore(pd.DataFrame({'a': [1, 2, 3]}))

# Empty DataStore
ds = DataStore()

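Properties

Property	Type	Description
`columns`	Index	Column names
`dtypes`	Series	Data types of the columns
`shape`	tuple	(number of rows, number of columns)
`size`	int	Total number of elements
`ndim`	int	Number of dimensions (2)
`empty`	bool	Whether the DataStore is empty
`values`	ndarray	NumPy array representation of the underlying data
`index`	Index	Row index
`T`	DataStore	Transpose
`axes`	list	List of axes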

Factory methods

Method	Description
`uri(uri)`	Generic URI factory method
`from_file(path, ...)`	Create from a file
`from_df(df)`	Create from a pandas DataFrame
`from_s3(url, ...)`	Create from S3
`from_gcs(url, ...)`	Create from Google Cloud Storage
`from_azure(url, ...)`	Create from Azure Blob
`from_mysql(...)`	Create from MySQL
`from_postgresql(...)`	Create from PostgreSQL
`from_clickhouse(...)`	Create from ClickHouse
`from_mongodb(...)`	Create from MongoDB
`from_sqlite(...)`	Create from SQLite
`from_iceberg(path)`	Create from an Iceberg table
`from_delta(path)`	Create from Delta Lake
`from_numbers(n)`	Create from a sequence of consecutive numbers
`from_random(rows, cols)`	Create from random data
`run_sql(query)`	Create from a SQL query

See Factory Methods for details.

Query methods

Method	Returns	Description
`select(*cols)`	DataStore	Select columns
`filter(condition)`	DataStore	Filter rows
`where(condition)`	DataStore	Alias for `filter`
`sort(*cols, ascending=True)`	DataStore	Sort rows
`orderby(*cols)`	DataStore	Alias for `sort`
`limit(n)`	DataStore	Limit the number of rows returned
`offset(n)`	DataStore	Skip the given number of rows
`distinct(subset=None)`	DataStore	Remove duplicates
`groupby(*cols)`	LazyGroupBy	Group rows
`having(condition)`	DataStore	Filter grouped results
`join(right, ...)`	DataStore	Join DataStores
`union(other, all=False)`	DataStore	Combine DataStores
`when(cond, val)`	CaseWhen	CASE WHEN

See Query Building for details.
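
Most query methods return a new DataStore, which is what allows calls to be chained. The following is a rough, standard-library-only sketch of that builder pattern. `MiniQuery`, its fields, and the SQL text it emits are hypothetical illustrations; they are not chdb's implementation, and the SQL that DataStore actually generates may differ.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class MiniQuery:
    """Hypothetical immutable query builder (not part of chdb)."""
    table: str
    columns: Tuple[str, ...] = ("*",)
    conditions: Tuple[str, ...] = ()
    order: Tuple[str, ...] = ()
    row_limit: Optional[int] = None
    row_offset: Optional[int] = None

    def select(self, *cols: str) -> "MiniQuery":
        return replace(self, columns=cols)

    def filter(self, condition: str) -> "MiniQuery":
        # Repeated filters are combined with AND.
        return replace(self, conditions=self.conditions + (condition,))

    def sort(self, *cols: str, ascending: bool = True) -> "MiniQuery":
        direction = "ASC" if ascending else "DESC"
        return replace(self, order=tuple(f"{c} {direction}" for c in cols))

    def limit(self, n: int) -> "MiniQuery":
        return replace(self, row_limit=n)

    def offset(self, n: int) -> "MiniQuery":
        return replace(self, row_offset=n)

    def to_sql(self) -> str:
        parts = [f"SELECT {', '.join(self.columns)}", f"FROM {self.table}"]
        if self.conditions:
            parts.append("WHERE " + " AND ".join(f"({c})" for c in self.conditions))
        if self.order:
            parts.append("ORDER BY " + ", ".join(self.order))
        if self.row_limit is not None:
            parts.append(f"LIMIT {self.row_limit}")
        if self.row_offset is not None:
            parts.append(f"OFFSET {self.row_offset}")
        return " ".join(parts)


base = MiniQuery("users")
query = (base
    .select("name", "age")
    .filter("age > 25")
    .sort("age", ascending=False)
    .limit(10)
    .offset(20))
sql = query.to_sql()
print(sql)
```

Because every step returns a new object, `base` is left unchanged and can be reused as the starting point for other queries.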

Pandas-compatible methods

See Pandas Compatibility for the full list of 209 methods.

Indexing: head(), tail(), sample(), loc, iloc, at, iat, query(), isin(), where(), mask(), get(), xs(), pop()
Aggregation: sum(), mean(), std(), var(), min(), max(), median(), count(), nunique(), quantile(), describe(), corr(), cov(), skew(), kurt()
Manipulation: drop(), drop_duplicates(), dropna(), fillna(), replace(), rename(), assign(), astype(), copy()
Sorting: sort_values(), sort_index(), nlargest(), nsmallest(), rank()
Reshaping: pivot(), pivot_table(), melt(), stack(), unstack(), transpose(), explode(), squeeze()
Combining: merge(), join(), concat(), append(), combine(), update(), compare()
Apply/transform: apply(), applymap(), map(), agg(), transform(), pipe(), groupby()
Time series: rolling(), expanding(), ewm(), shift(), diff(), pct_change(), resample()

I/O methods

Method	Description
`to_csv(path, ...)`	Export to CSV
`to_parquet(path, ...)`	Export to Parquet
`to_json(path, ...)`	Export to JSON
`to_excel(path, ...)`	Export to Excel
`to_df()`	Convert to a pandas DataFrame
`to_pandas()`	Alias for `to_df`
`to_arrow()`	Convert to an Arrow table
`to_dict(orient)`	Convert to a dictionary
`to_records()`	Convert to records
`to_numpy()`	Convert to a NumPy array
`to_sql()`	Generate the SQL string
`to_string()`	String representation
`to_markdown()`	Markdown table
`to_html()`	HTML table

See I/O Operations for details.

Debugging methods

Method	Description
`explain(verbose=False)`	Show the execution plan
`clear_cache()`	Clear cached results

See Debugging for details.

Magic methods

Method	Description
`__getitem__(key)`	`ds['col']`, `ds[['a', 'b']]`, `ds[condition]`
`__setitem__(key, value)`	`ds['col'] = value`
`__delitem__(key)`	`del ds['col']`
`__len__()`	`len(ds)`
`__iter__()`	`for col in ds`
`__contains__(key)`	`'col' in ds`
`__repr__()`	`repr(ds)`
`__str__()`	`str(ds)`
`__eq__(other)`	`ds == other`
`__ne__(other)`	`ds != other`
`__lt__(other)`	`ds < other`
`__le__(other)`	`ds <= other`
`__gt__(other)`	`ds > other`
`__ge__(other)`	`ds >= other`
`__add__(other)`	`ds + other`
`__sub__(other)`	`ds - other`
`__mul__(other)`	`ds * other`
`__truediv__(other)`	`ds / other`
`__floordiv__(other)`	`ds // other`
`__mod__(other)`	`ds % other`
`__pow__(other)`	`ds ** other`
`__and__(other)`	`ds & other`
`__or__(other)`	`ds | other`
`__invert__()`	`~ds`
`__neg__()`	`-ds`
`__pos__()`	`+ds`
`__abs__()`	`abs(ds)`

ColumnExpr

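Represents a column expression that is evaluated lazily. Accessing a column returns a ColumnExpr.

# A ColumnExpr is returned automatically
col = ds['name']  # Returns a ColumnExpr

Properties

Property	Type	Description
`name`	str	Column name
`dtype`	dtype	Data type

Accessors

Accessor	Description	Methods
`.str`	String operations	56 methods
`.dt`	DateTime operations	42+ methods
`.arr`	Array operations	37 methods
`.json`	JSON parsing	13 methods
`.url`	URL parsing	15 methods
`.ip`	IP address operations	9 methods
`.geo`	Geospatial/distance operations	14 methods

See Accessors for the full documentation.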

Arithmetic operations

ds['total'] = ds['price'] * ds['quantity']
ds['profit'] = ds['revenue'] - ds['cost']
ds['ratio'] = ds['a'] / ds['b']
ds['squared'] = ds['value'] ** 2
ds['remainder'] = ds['value'] % 10

Comparison operations

ds[ds['age'] > 25]           # Greater than
ds[ds['age'] >= 25]          # Greater than or equal to
ds[ds['age'] < 25]           # Less than
ds[ds['age'] <= 25]          # Less than or equal to
ds[ds['name'] == 'Alice']    # Equal to
ds[ds['name'] != 'Bob']      # Not equal to

Logical operations

ds[(ds['age'] > 25) & (ds['city'] == 'NYC')]    # AND
ds[(ds['age'] > 25) | (ds['city'] == 'NYC')]    # OR
ds[~(ds['status'] == 'inactive')]               # NOT
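
The parentheses around each comparison matter: in Python, `&` and `|` bind more tightly than `>` or `==`, so an unparenthesized condition would be parsed differently. One way such operators can build a lazy expression instead of computing a result is through operator overloading. The sketch below uses a hypothetical `Expr` class that only records SQL-like text; it is not chdb's ColumnExpr, and the SQL produced by the real library may look different.

```python
class Expr:
    """Hypothetical lazy expression node that records SQL text (not chdb's ColumnExpr)."""

    def __init__(self, sql: str):
        self.sql = sql

    @staticmethod
    def _literal(value) -> str:
        # Render another expression, a quoted string, or a plain number.
        if isinstance(value, Expr):
            return value.sql
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    def _binary(self, op: str, other) -> "Expr":
        return Expr(f"({self.sql} {op} {self._literal(other)})")

    def __gt__(self, other):
        return self._binary(">", other)

    def __eq__(self, other):
        return self._binary("=", other)

    def __and__(self, other):
        return self._binary("AND", other)

    def __or__(self, other):
        return self._binary("OR", other)

    def __invert__(self):
        return Expr(f"(NOT {self.sql})")


age, city, status = Expr("age"), Expr("city"), Expr("status")

both = (age > 25) & (city == "NYC")
either = (age > 25) | (city == "NYC")
negated = ~(status == "inactive")

print(both.sql)
print(either.sql)
print(negated.sql)
```

Nothing is evaluated while the expression is built; each operator returns a new node, and the text is only read at the end.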

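Methods

Method	Description
`as_(alias)`	Set an alias
`cast(dtype)`	Cast to the given type
`astype(dtype)`	Alias for `cast`
`isnull()`	Whether the value is NULL
`notnull()`	Whether the value is not NULL
`isna()`	Alias for `isnull`
`notna()`	Alias for `notnull`
`isin(values)`	Whether the value is in a list of values
`between(low, high)`	Whether the value lies between two values
`fillna(value)`	Fill NULL values
`replace(to_replace, value)`	Replace values
`clip(lower, upper)`	Clip values to a range
`abs()`	Absolute value
`round(decimals)`	Round to the given number of decimals
`floor()`	Round down
`ceil()`	Round up
`apply(func)`	Apply a function
`map(mapper)`	Map values

Aggregate methods

Method	Description
`sum()`	Sum
`mean()`	Mean
`avg()`	Alias for `mean()`
`min()`	Minimum
`max()`	Maximum
`count()`	Count of non-NULL values
`nunique()`	Count of distinct values
`std()`	Standard deviation
`var()`	Variance
`median()`	Median
`quantile(q)`	Quantile
`first()`	First value
`last()`	Last value
`any()`	Whether any value is true
`all()`	Whether all values are true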
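
The table states that `count()` counts non-NULL values. The plain-Python sketch below illustrates that idea using `None` as a stand-in for NULL. That the other aggregates also skip NULLs is an assumption made here to mirror common SQL and pandas behavior; the source does not state it for DataStore.

```python
from statistics import mean, median

# None stands in for NULL in this illustration.
values = [10, None, 30, 10, None, 50]
non_null = [v for v in values if v is not None]

count = len(non_null)          # non-NULL values only, not len(values)
nunique = len(set(non_null))   # distinct non-NULL values
total = sum(non_null)
average = mean(non_null)
middle = median(non_null)

print(count, nunique, total, average, middle)
```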

LazyGroupBy

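Represents a grouped DataStore for aggregation operations.

# A LazyGroupBy is returned automatically
grouped = ds.groupby('category')  # Returns a LazyGroupBy

Methods

Method	Returns	Description
`agg(spec)`	DataStore	Aggregate
`aggregate(spec)`	DataStore	Alias for `agg`
`sum()`	DataStore	Sum per group
`mean()`	DataStore	Mean per group
`count()`	DataStore	Count per group
`min()`	DataStore	Minimum per group
`max()`	DataStore	Maximum per group
`std()`	DataStore	Standard deviation per group
`var()`	DataStore	Variance per group
`median()`	DataStore	Median per group
`nunique()`	DataStore	Number of unique values per group
`first()`	DataStore	First value per group
`last()`	DataStore	Last value per group
`nth(n)`	DataStore	The n-th value per group
`head(n)`	DataStore	First n values per group
`tail(n)`	DataStore	Last n values per group
`apply(func)`	DataStore	Apply a function to each group
`transform(func)`	DataStore	Transform per group
`filter(func)`	DataStore	Filter groups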

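Column selection

# Select columns after groupby
grouped['amount'].sum()     # Returns a DataStore
grouped[['a', 'b']].sum()   # Returns a DataStore

Aggregation specifications

# Single aggregation
grouped.agg({'amount': 'sum'})

# Multiple aggregations per column
grouped.agg({'amount': ['sum', 'mean', 'count']})

# Named aggregation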
grouped.agg(
    total=('amount', 'sum'),
    average=('amount', 'mean'),
    count=('id', 'count')
)
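
To make the named-aggregation form concrete, the sketch below computes the same three outputs (`total`, `average`, `count`) per group using only the standard library. The sample rows and the `spec` dictionary are made up for illustration; this shows the semantics of the result, not how DataStore executes the aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data.
rows = [
    {"id": 1, "category": "books", "amount": 10.0},
    {"id": 2, "category": "books", "amount": 30.0},
    {"id": 3, "category": "toys", "amount": 5.0},
]

# Output name -> (source column, reducer), mirroring total=('amount', 'sum').
spec = {
    "total": ("amount", sum),
    "average": ("amount", mean),
    "count": ("id", len),
}

# Step 1: split the rows into groups by the grouping key.
groups = defaultdict(list)
for row in rows:
    groups[row["category"]].append(row)

# Step 2: apply every named reducer to its source column within each group.
result = {
    key: {name: func([r[col] for r in members]) for name, (col, func) in spec.items()}
    for key, members in groups.items()
}
print(result)
```

Each keyword in the named form becomes one output column, which avoids the nested column names that the list form (`['sum', 'mean', 'count']`) tends to produce in pandas.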

LazySeries

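Represents a lazy Series (a single column).

Properties

Property	Type	Description
`name`	str	Series name
`dtype`	dtype	Data type

Methods

Inherits most methods from ColumnExpr. Key methods include:

Method	Description
`value_counts()`	Value frequencies
`unique()`	Distinct values
`nunique()`	Number of unique values
`mode()`	Mode
`to_list()`	Convert to a list
`to_numpy()`	Convert to an array
`to_frame()`	Convert to a DataStore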

F (functions)

Namespace for ClickHouse functions.

from chdb.datastore import F, Field

# Aggregation
F.sum(Field('amount'))
F.avg(Field('price'))
F.count(Field('id'))
F.quantile(Field('value'), 0.95)

# Conditional
F.sum_if(Field('amount'), Field('status') == 'completed')
F.count_if(Field('active'))

# Window
F.row_number().over(order_by='date')
F.lag('price', 1).over(partition_by='product', order_by='date')

See Aggregation for details.
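
The conditional aggregates above presumably correspond to ClickHouse's `sumIf` and `countIf`, which aggregate only the rows that satisfy a condition. A plain-Python sketch of that semantics follows; the `orders` data and the helper functions are hypothetical and only mimic the behavior, they are not part of chdb.

```python
# Hypothetical sample rows.
orders = [
    {"amount": 100, "status": "completed", "active": True},
    {"amount": 40, "status": "pending", "active": False},
    {"amount": 60, "status": "completed", "active": True},
]


def sum_if(rows, column, predicate):
    """Sum `column` over the rows for which `predicate` is true."""
    return sum(row[column] for row in rows if predicate(row))


def count_if(rows, predicate):
    """Count the rows for which `predicate` is true."""
    return sum(1 for row in rows if predicate(row))


completed_total = sum_if(orders, "amount", lambda r: r["status"] == "completed")
active_count = count_if(orders, lambda r: r["active"])
print(completed_total, active_count)
```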

Field

References a column by name.

from chdb.datastore import F, Field

# Create field references
amount = Field('amount')
price = Field('price')

# Use in expressions
F.sum(Field('amount'))
F.avg(Field('price'))

CaseWhen

Builder for CASE WHEN expressions.

# Build a CASE WHEN expression
result = (ds
    .when(ds['score'] >= 90, 'A')
    .when(ds['score'] >= 80, 'B')
    .when(ds['score'] >= 70, 'C')
    .otherwise('F')
)

# Assign to a column
ds['grade'] = result
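
In SQL, a CASE WHEN expression returns the result of the first branch whose condition holds, so the order of the `.when(...)` calls matters. The sketch below reproduces that first-match rule in plain Python; `case_when` is a hypothetical helper written for this illustration, not a chdb function.

```python
def case_when(value, branches, default):
    """Return the result of the first branch whose predicate holds, else default."""
    for predicate, result in branches:
        if predicate(value):
            return result
    return default


# Same thresholds as the example above, checked from highest to lowest.
branches = [
    (lambda s: s >= 90, "A"),
    (lambda s: s >= 80, "B"),
    (lambda s: s >= 70, "C"),
]

grades = [case_when(score, branches, "F") for score in (95, 85, 72, 40)]
print(grades)

# With the branches reversed, the loosest condition matches first.
misordered = case_when(95, list(reversed(branches)), "F")
print(misordered)
```

Listing the strictest threshold first is what keeps a score of 95 from being graded as "C".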

Window

Window specification for window functions.

from chdb.datastore import F

# Create a window
window = F.window(
    partition_by='category',
    order_by='date',
    rows_between=(-7, 0)
)

# Use with an aggregate function
ds['rolling_avg'] = F.avg('price').over(window)
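
A frame such as `rows_between=(-7, 0)` most plausibly means "from 7 rows before the current row up to the current row", as in SQL's `ROWS BETWEEN 7 PRECEDING AND CURRENT ROW`. That reading is an assumption, since the source does not spell it out. Under that assumption, the plain-Python sketch below computes a rolling average over such a frame; `rolling_avg` is a hypothetical helper, it uses a smaller frame of `(-2, 0)` to keep the numbers easy to follow, and it leaves out partitioning.

```python
def rolling_avg(values, start, end):
    """Average over rows i+start .. i+end (inclusive), clipped to the list bounds."""
    out = []
    for i in range(len(values)):
        lo = max(0, i + start)
        hi = min(len(values) - 1, i + end)
        frame = values[lo:hi + 1]
        out.append(sum(frame) / len(frame))
    return out


# Rows are assumed to be already ordered (the role of order_by).
prices = [10, 20, 30, 40]
result = rolling_avg(prices, -2, 0)
print(result)
```

The first rows have fewer than three values in their frame, so their averages are taken over whatever rows are available.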
