DataStore Debugging - ClickHouse Documentation

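DataStore provides a set of debugging tools that help you understand and optimize your data pipelines.

Debugging Tools Overview

Tool	Purpose	When to use
`explain()`	View the execution plan	To see the SQL that will run
Profiler	Measure performance	To find slow operations
Logging	View execution details	To troubleshoot unexpected behavior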

Quick Decision Matrix

Need	Tool	Command
View the execution plan	`explain()`	`ds.explain()`
Measure performance	Profiler	`config.enable_profiling()`
Debug SQL queries	Logging	`config.enable_debug()`
All of the above	Combined	See below

Quick Setup

Enable All Debugging

from chdb import datastore as pd
from chdb.datastore.config import config, get_profiler

# Enable all debugging features
config.enable_debug()        # verbose logging
config.enable_profiling()    # performance tracking

ds = pd.read_csv("data.csv")
result = ds.filter(ds['age'] > 25).groupby('city').agg({'salary': 'mean'})

# View the execution plan
result.explain()

# Get the profiler report
profiler = get_profiler()
profiler.report()

explain() Method

Preview the execution plan before running a query.

Query

ds = pd.read_csv("data.csv")

query = (ds
    .filter(ds['amount'] > 1000)
    .groupby('region')
    .agg({'amount': ['sum', 'mean']})
)

# View the execution plan
query.explain()

Response

Pipeline:
  Source: file('data.csv', 'CSVWithNames')
  Filter: amount > 1000
  GroupBy: region
  Aggregate: sum(amount), avg(amount)

Generated SQL:
SELECT region, SUM(amount) AS sum, AVG(amount) AS mean
FROM file('data.csv', 'CSVWithNames')
WHERE amount > 1000
GROUP BY region

See the explain() documentation for details.
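
To make the idea of inspecting a plan before execution more concrete, here is a minimal, self-contained sketch of a lazy query builder. `LazyQuery` and all of its methods are hypothetical names invented for this illustration. It does not show how DataStore is implemented; it only sketches the general technique of recording steps and rendering the plan and SQL on demand, without touching any data.

```python
from dataclasses import dataclass, field


@dataclass
class LazyQuery:
    """Hypothetical lazy query: records steps, executes nothing."""
    source: str
    where: list = field(default_factory=list)
    group_by: list = field(default_factory=list)
    aggregates: list = field(default_factory=list)  # (sql_func, column, alias)

    def filter(self, condition: str) -> "LazyQuery":
        self.where.append(condition)
        return self

    def groupby(self, column: str) -> "LazyQuery":
        self.group_by.append(column)
        return self

    def agg(self, func: str, column: str, alias: str) -> "LazyQuery":
        self.aggregates.append((func, column, alias))
        return self

    def to_sql(self) -> str:
        # Build the SELECT list from grouping columns plus aggregates.
        select = self.group_by + [f"{fn}({col}) AS {alias}" for fn, col, alias in self.aggregates]
        lines = ["SELECT " + (", ".join(select) or "*"), "FROM " + self.source]
        if self.where:
            lines.append("WHERE " + " AND ".join(self.where))
        if self.group_by:
            lines.append("GROUP BY " + ", ".join(self.group_by))
        return "\n".join(lines)

    def explain(self) -> str:
        # Render the recorded steps, then the SQL they translate to.
        plan = ["Pipeline:", "  Source: " + self.source]
        plan += ["  Filter: " + cond for cond in self.where]
        plan += ["  GroupBy: " + col for col in self.group_by]
        if self.aggregates:
            calls = ", ".join(f"{fn.lower()}({col})" for fn, col, _ in self.aggregates)
            plan.append("  Aggregate: " + calls)
        return "\n".join(plan) + "\n\nGenerated SQL:\n" + self.to_sql()


query = (
    LazyQuery("file('data.csv', 'CSVWithNames')")
    .filter("amount > 1000")
    .groupby("region")
    .agg("SUM", "amount", "sum")
    .agg("AVG", "amount", "mean")
)
plan_text = query.explain()
print(plan_text)
```

Because nothing runs until a result is requested, a builder like this can show the plan at no cost, which is why checking the plan first is cheap even for large inputs.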

Profiling

Measure the execution time of each operation.

Query

from chdb.datastore.config import config, get_profiler

# Enable profiling
config.enable_profiling()

# Run operations
ds = pd.read_csv("large_data.csv")
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('category')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
    .to_df()
)

# View the report
profiler = get_profiler()
profiler.report(min_duration_ms=0.1)

Response

Performance Report
==================
Step                          Duration    Calls
----                          --------    -----
read_csv                      1.234s      1
filter                        0.002s      1
groupby                       0.001s      1
agg                           0.089s      1
sort                          0.045s      1
head                          0.001s      1
to_df (SQL execution)         0.567s      1
----                          --------    -----
Total                         1.939s      7

See the profiling guide for details.
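
For readers who want to see what per-step timing amounts to, the following is a minimal sketch of a step profiler built on the standard library. `StepProfiler`, its `step()` context manager, and the exact meaning given to `min_duration_ms` here are assumptions made for illustration; the real profiler may collect and filter timings differently.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepProfiler:
    """Hypothetical profiler: accumulates wall-clock time per named step."""

    def __init__(self):
        self.totals = defaultdict(float)  # step name -> total seconds
        self.calls = defaultdict(int)     # step name -> number of calls

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the timing even if the step raised an exception.
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

    def report(self, min_duration_ms=0.0):
        """Return (name, seconds, calls) rows at or above the threshold, slowest first."""
        rows = [
            (name, secs, self.calls[name])
            for name, secs in self.totals.items()
            if secs * 1000 >= min_duration_ms
        ]
        return sorted(rows, key=lambda row: row[1], reverse=True)


profiler = StepProfiler()
with profiler.step("sleep_step"):
    time.sleep(0.02)
with profiler.step("noop_step"):
    pass

for name, secs, calls in profiler.report(min_duration_ms=5):
    print(f"{name:<20}{secs:>8.3f}s{calls:>6}")
```

A duration threshold keeps the report focused: steps that only build up a lazy query usually take microseconds, and hiding them makes the expensive steps easier to spot.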

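Logging

View detailed execution logs.

from chdb.datastore.config import config

# Enable debug logging
config.enable_debug()

# Run operations - the logs will show:
# - Generated SQL queries
# - Execution engine used
# - Cache hits/misses
# - Timing information

Example log output: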

DEBUG - DataStore: Creating from file 'data.csv'
DEBUG - Query: SELECT region, SUM(amount) FROM ... WHERE amount > 1000 GROUP BY region
DEBUG - Engine: Using chdb for aggregation
DEBUG - Execution time: 0.089s
DEBUG - Cache: Storing result (key: abc123)

See the logging configuration for details.
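
If the library emits these messages through Python's standard `logging` module (an assumption this page does not confirm), the DEBUG-level lines above behave like any other log records and can be captured or filtered by level. The sketch below uses a hypothetical logger name, `demo.datastore`, purely to show the mechanics.

```python
import io
import logging

# Hypothetical logger name, chosen for this demo only.
logger = logging.getLogger("demo.datastore")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep demo output out of the root logger

# Capture formatted records in memory instead of writing to stderr.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
logger.addHandler(handler)

logger.debug("Query: %s", "SELECT 1")
logger.info("Execution finished")

# Raising the level to WARNING silences DEBUG and INFO messages.
logger.setLevel(logging.WARNING)
logger.debug("This line is filtered out")

captured = buffer.getvalue().splitlines()
print(captured)
```

Capturing into a buffer like this can be handy in tests, where you may want to assert that a particular query was logged.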

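Common Debugging Scenarios

1. Query Doesn't Produce Expected Results

from chdb.datastore.config import config

# Step 1: View the execution plan
query = ds.filter(ds['age'] > 25).groupby('city').sum()
query.explain(verbose=True)

# Step 2: Enable logging to see the SQL
config.enable_debug()

# Step 3: Run the query and check the logs
result = query.to_df()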

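2. Query Runs Slowly

from chdb.datastore.config import config, get_profiler

# Step 1: Enable profiling
config.enable_profiling()

# Step 2: Run the query (process_data() stands in for your own pipeline)
result = process_data()

# Step 3: View the profiler report
profiler = get_profiler()
profiler.report()

# Step 4: Identify the slow operations and optimize them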
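
As a small worked example of step 4, the sketch below ranks steps by their share of total time. The numbers are copied from the sample profiler report earlier on this page; the dictionary layout is just a convenient assumption for the demo, not the profiler's actual data structure.

```python
# Timings (seconds) copied from the sample profiler report on this page.
timings = {
    "read_csv": 1.234,
    "filter": 0.002,
    "groupby": 0.001,
    "agg": 0.089,
    "sort": 0.045,
    "head": 0.001,
    "to_df (SQL execution)": 0.567,
}

total = sum(timings.values())
ranked = sorted(timings.items(), key=lambda item: item[1], reverse=True)

# Print the three slowest steps with their share of the total.
for step, seconds in ranked[:3]:
    print(f"{step:<24}{seconds:>7.3f}s{seconds / total:>8.1%}")
```

In that sample, `read_csv` and `to_df` together account for roughly 93% of the total, so those two steps are where optimization effort would pay off first.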

3. Understanding Engine Selection

# Enable verbose logging
config.enable_debug()

# Run operations
result = ds.filter(ds['x'] > 10).apply(custom_func)

# The logs show which engine each operation used:
# DEBUG - filter: Using chdb engine
# DEBUG - apply: Using pandas engine (custom function)
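
When a pipeline has many steps, it can help to summarize the engine choices rather than read the log line by line. The sketch below parses lines in the format shown in the comments above; it assumes that format is stable, which this page does not guarantee, and the regular expression is an illustration rather than a supported interface.

```python
import re

# Sample lines in the format shown above (format assumed, not guaranteed).
log_lines = [
    "DEBUG - filter: Using chdb engine",
    "DEBUG - apply: Using pandas engine (custom function)",
]

pattern = re.compile(r"^DEBUG - (?P<op>\w+): Using (?P<engine>\w+) engine")

# Map each operation name to the engine reported for it.
engines = {}
for line in log_lines:
    match = pattern.match(line)
    if match:
        engines[match["op"]] = match["engine"]

print(engines)
```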

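4. Debugging Cache Issues

# Enable debugging to see cache operations
config.enable_debug()

# First run
result1 = ds.filter(ds['x'] > 10).to_df()
# Log: cache miss, executing query

# Second run (should use the cache)
result2 = ds.filter(ds['x'] > 10).to_df()
# Log: cache hit, returning cached result

# If caching does not behave as expected, check:
# - Are the operations exactly the same?
# - Is caching enabled? Check config.cache_enabled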
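
The question of whether two operations are "exactly the same" is easier to reason about with a toy model. The sketch below assumes a cache keyed by a hash of the generated SQL text; `cache_key` and `run` are hypothetical helpers, and the real cache may derive its keys differently.

```python
import hashlib


def cache_key(sql: str) -> str:
    """Hypothetical key: SHA-256 of the SQL with whitespace collapsed."""
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


cache = {}


def run(sql: str):
    """Return ("hit" or "miss", result) for the given SQL text."""
    key = cache_key(sql)
    if key in cache:
        return "hit", cache[key]
    cache[key] = f"result of: {sql}"  # stand-in for real execution
    return "miss", cache[key]


first = run("SELECT * FROM t WHERE x > 10")
second = run("SELECT * FROM t WHERE x > 10")
third = run("SELECT * FROM t WHERE x > 11")  # different constant, different key
print(first[0], second[0], third[0])
```

Under this model, even a changed constant in a filter produces a new key, so a query that looks almost identical still misses the cache.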

Best Practices

1. Debug in Development, Not Production

import logging

# Development
config.enable_debug()
config.enable_profiling()

# Production
config.set_log_level(logging.WARNING)
config.set_profiling_enabled(False)
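
One way to avoid editing code between environments is to derive both settings from an environment variable. The sketch below is an assumption about how you might wire this up: `APP_ENV` and `debug_settings` are hypothetical names, not part of DataStore.

```python
import logging
import os


def debug_settings(env=None):
    """Return (log_level, profiling_enabled) for a hypothetical APP_ENV value."""
    if env is None:
        env = os.environ.get("APP_ENV", "production")
    if env == "development":
        return logging.DEBUG, True
    # Anything other than "development" gets the quiet production defaults.
    return logging.WARNING, False


level, profiling = debug_settings("development")
print(logging.getLevelName(level), profiling)
```

The returned pair could then be passed to `config.set_log_level(...)` and `config.set_profiling_enabled(...)` as shown above. Defaulting to the production settings means a missing variable never turns on verbose output by accident.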

2. Use explain() Before Running Large Queries

# Build the query
query = ds.filter(...).groupby(...).agg(...)

# Check the execution plan first
query.explain()

# If the plan looks right, execute it
result = query.to_df()

3. Profile Before Optimizing

# Don't guess where the bottleneck is; measure it
config.enable_profiling()
result = your_pipeline()
get_profiler().report()

4. Check SQL When Results Are Wrong

# View the generated SQL
print(query.to_sql())

# Compare it with the SQL you expected
# Run the SQL directly in ClickHouse to verify
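
Comparing SQL by eye gets error-prone once queries span several lines. The following sketch uses the standard library's `difflib` to diff an expected query against a generated one after normalizing whitespace. Both SQL strings are made-up examples, and `normalize` is a hypothetical helper.

```python
import difflib


def normalize(sql: str) -> list:
    """Collapse whitespace per line and drop empty lines; case is left alone."""
    return [" ".join(line.split()) for line in sql.strip().splitlines() if line.strip()]


# Made-up example: the generated query has the wrong filter constant.
generated = """
SELECT region, SUM(amount) AS sum
FROM file('data.csv', 'CSVWithNames')
WHERE amount > 100
GROUP BY region
"""
expected = """
SELECT region,   SUM(amount) AS sum
FROM file('data.csv', 'CSVWithNames')
WHERE amount > 1000
GROUP BY region
"""

diff = list(difflib.unified_diff(
    normalize(expected), normalize(generated),
    fromfile="expected", tofile="generated", lineterm="",
))
print("\n".join(diff))
```

Normalizing whitespace first keeps purely cosmetic differences, such as the extra spaces in the `SELECT` line, out of the diff, so only the changed `WHERE` clause is reported.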

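Debugging Tools Summary

Tool	Command	Output
View execution plan	`ds.explain()`	Execution steps + SQL
Verbose execution plan	`ds.explain(verbose=True)`	+ metadata
View SQL	`ds.to_sql()`	SQL query string
Enable debugging	`config.enable_debug()`	Verbose logs
Enable profiling	`config.enable_profiling()`	Timing data
Profiler report	`get_profiler().report()`	Performance summary
Reset profiler	`get_profiler().reset()`	Clears timing data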

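Next Steps

explain() Method - detailed documentation on execution plans
Profiling Guide - performance profiling
Logging Configuration - log levels and output format settings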
