DataStore Profiling - ClickHouse Documentation

The DataStore profiler measures execution time and helps you identify performance bottlenecks.

Quick start

from chdb import datastore as pd
from chdb.datastore.config import config, get_profiler

# Enable profiling
config.enable_profiling()

# Run operations
ds = pd.read_csv("large_data.csv")
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('category')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
    .to_df()
)

# View the report
profiler = get_profiler()
print(profiler.report())

Enabling profiling

from chdb.datastore.config import config

# Enable profiling
config.enable_profiling()

# Disable profiling
config.disable_profiling()

# Check whether profiling is enabled
print(config.profiling_enabled)  # True or False

Profiler API

Getting the profiler

from chdb.datastore.config import get_profiler

profiler = get_profiler()

report()

Displays the performance report.

profiler.report(min_duration_ms=0.1)

Parameters:

Parameter	Type	Default	Description
`min_duration_ms`	float	`0.1`	Only show steps whose duration is at least this value, in milliseconds

Example output:

======================================================================
EXECUTION PROFILE
======================================================================
   45.79ms (100.0%) Total Execution
     23.25ms ( 50.8%) Query Planning [ops_count=2]
     22.29ms ( 48.7%) SQL Segment 1 [ops=2]
       20.48ms ( 91.9%) SQL Execution
        1.74ms (  7.8%) Result to DataFrame
----------------------------------------------------------------------
      TOTAL:    45.79ms
======================================================================

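The report shows:

Duration of each step, in milliseconds
Percentage of the parent step's time (or of the total time for top-level steps)
Hierarchical nesting of operations
Metadata for each step (e.g., ops_count, ops)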
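
The percentages in the example output are consistent with top-level steps being reported relative to the total and nested steps relative to their parent step. The short check below recomputes them from the sample durations; the helper name `pct` is hypothetical and only used for illustration.

```python
def pct(part_ms: float, whole_ms: float) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part_ms / whole_ms, 1)


# Durations copied from the example report.
total = 45.79
segment = 22.29  # SQL Segment 1

planning_share = pct(23.25, total)   # top-level step: share of total time
segment_share = pct(segment, total)  # top-level step: share of total time
sql_share = pct(20.48, segment)      # nested step: share of SQL Segment 1
to_df_share = pct(1.74, segment)     # nested step: share of SQL Segment 1

print(planning_share, segment_share, sql_share, to_df_share)
```

Reading a nested percentage as a share of the whole run would overstate it: SQL Execution at 91.9% of its segment is roughly 45% of the total here.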

step()

Manually times a block of code.

with profiler.step("custom_operation"):
    # Your code here
    expensive_operation()
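
To make the behavior of a timing context manager concrete, here is a minimal stand-in written from scratch. It is not chdb's implementation: the class `MiniProfiler` and its attributes are hypothetical, and the dotted key format simply mirrors the names shown in the summary() example output.

```python
import time
from contextlib import contextmanager


class MiniProfiler:
    """Hypothetical stand-in for illustration; not chdb's profiler."""

    def __init__(self):
        self.durations_ms = {}  # dotted step name -> elapsed milliseconds
        self._stack = []        # names of the steps that are currently open

    @contextmanager
    def step(self, name):
        self._stack.append(name)
        key = ".".join(self._stack)
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the timed block raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.durations_ms[key] = self.durations_ms.get(key, 0.0) + elapsed_ms
            self._stack.pop()


mini = MiniProfiler()
with mini.step("outer"):
    with mini.step("inner"):
        time.sleep(0.01)  # placeholder for real work

for name, ms in mini.durations_ms.items():
    print(f"{name}: {ms:.2f}ms")
```

Because the nested step runs inside the outer one, the outer duration always includes the inner duration, which is the same nesting the report displays.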

clear()

Clears all profiling data.

profiler.clear()

summary()

Returns a dictionary that maps step names to durations in milliseconds.

summary = profiler.summary()
for name, duration in summary.items():
    print(f"{name}: {duration:.2f}ms")

Example output:

Total Execution: 45.79ms
Total Execution.Cache Check: 0.00ms
Total Execution.Query Planning: 23.25ms
Total Execution.SQL Segment 1: 22.29ms
Total Execution.SQL Segment 1.SQL Execution: 20.48ms
Total Execution.SQL Segment 1.Result to DataFrame: 1.74ms
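
Since the mapping is a plain dictionary, it can be post-processed with ordinary Python. The sketch below assumes that nesting is encoded with dots in the key, as in the example output, and uses a hard-coded copy of those values instead of a live profiler. The helper `leaf_steps` is hypothetical.

```python
# Values copied from the example output; a real run would use profiler.summary().
summary = {
    "Total Execution": 45.79,
    "Total Execution.Cache Check": 0.00,
    "Total Execution.Query Planning": 23.25,
    "Total Execution.SQL Segment 1": 22.29,
    "Total Execution.SQL Segment 1.SQL Execution": 20.48,
    "Total Execution.SQL Segment 1.Result to DataFrame": 1.74,
}


def leaf_steps(summary):
    """Keep only steps that have no nested child step in the mapping."""
    names = list(summary)
    return {
        name: ms
        for name, ms in summary.items()
        if not any(other.startswith(name + ".") for other in names)
    }


leaves = leaf_steps(summary)
slowest = max(leaves, key=leaves.get)
print(f"Slowest leaf step: {slowest} ({leaves[slowest]:.2f}ms)")
```

Ranking only leaf steps avoids the trivial answer that the outermost step is always the slowest.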

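Understanding the report

Step names

Step name	Description
`Total Execution`	Overall execution time
`Query Planning`	Time spent planning the query
`SQL Segment N`	Execution of SQL segment N
`SQL Execution`	Actual SQL query execution
`Result to DataFrame`	Converting the result to pandas
`Cache Check`	Checking the query cache
`Cache Write`	Writing the result to the cache

Duration

Planning phase (Query Planning): usually fast
Execution phase (SQL Execution): where the actual work is done
Transfer phase (Result to DataFrame): converts the data to pandas

Identifying bottlenecks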

======================================================================
EXECUTION PROFILE
======================================================================
  200.50ms (100.0%) Total Execution
    10.25ms (  5.1%) Query Planning [ops_count=4]
   190.00ms ( 94.8%) SQL Segment 1 [ops=4]
     185.00ms ( 97.4%) SQL Execution    <- main bottleneck
       5.00ms (  2.6%) Result to DataFrame
----------------------------------------------------------------------
      TOTAL:   200.50ms
======================================================================
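
One way to locate a bottleneck programmatically is to compute each step's own time, meaning its duration minus the durations of its direct children. The sketch below applies that idea to the numbers in this report. It assumes the dotted key format from the summary() example output; the dictionary is hard-coded and the helper `self_times` is hypothetical.

```python
# Durations from the report above, keyed in the dotted summary() style.
summary = {
    "Total Execution": 200.50,
    "Total Execution.Query Planning": 10.25,
    "Total Execution.SQL Segment 1": 190.00,
    "Total Execution.SQL Segment 1.SQL Execution": 185.00,
    "Total Execution.SQL Segment 1.Result to DataFrame": 5.00,
}


def self_times(summary):
    """Subtract the duration of each direct child from its parent step."""
    result = dict(summary)
    for name, ms in summary.items():
        parent, _, _ = name.rpartition(".")
        if parent in result:
            result[parent] -= ms
    return result


own = self_times(summary)
bottleneck = max(own, key=own.get)
print(f"Bottleneck: {bottleneck} ({own[bottleneck]:.2f}ms of own time)")
```

Own time also exposes work that no nested step accounts for: here only about 0.25 ms of Total Execution falls outside planning and the SQL segment.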

Profiling patterns

Profile a single query

config.enable_profiling()
profiler = get_profiler()
profiler.clear()  # Reset previous data

# Run the query
result = ds.filter(...).groupby(...).agg(...).to_df()

# View the profile for this query
print(profiler.report())

Profile multiple queries

config.enable_profiling()
profiler = get_profiler()
profiler.clear()

# Query 1
with profiler.step("Query 1"):
    result1 = query1.to_df()

# Query 2
with profiler.step("Query 2"):
    result2 = query2.to_df()

print(profiler.report())

Compare approaches

profiler = get_profiler()

# Approach 1: filter, then groupby
profiler.clear()
with profiler.step("filter_then_groupby"):
    result1 = ds.filter(ds['x'] > 10).groupby('y').sum().to_df()
summary1 = profiler.summary()
time1 = summary1.get('filter_then_groupby', 0)

# Approach 2: groupby, then filter
profiler.clear()
with profiler.step("groupby_then_filter"):
    result2 = ds.groupby('y').sum().filter(ds['x'] > 10).to_df()
summary2 = profiler.summary()
time2 = summary2.get('groupby_then_filter', 0)

print(f"Approach 1: {time1:.2f}ms")
print(f"Approach 2: {time2:.2f}ms")
print(f"Winner: {'Approach 1' if time1 < time2 else 'Approach 2'}")
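
A single run per approach can be noisy, so a comparison is more reliable when each approach runs several times and the median is used. The sketch below uses only the standard library rather than the profiler. `median_ms` is a hypothetical helper, and the two workloads are placeholders standing in for calls such as `lambda: query.to_df()`. The report lists a Cache Check step, so it is worth checking whether repeated runs of the same query are served from a cache before trusting the numbers.

```python
import statistics
import time


def median_ms(fn, repeats=5):
    """Run fn several times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Placeholder workloads that compute the same result in two ways.
def approach_1():
    return sum(x * x for x in range(50_000))


def approach_2():
    return sum(map(lambda x: x * x, range(50_000)))


t1 = median_ms(approach_1)
t2 = median_ms(approach_2)
print(f"Approach 1: {t1:.2f}ms (median of 5)")
print(f"Approach 2: {t2:.2f}ms (median of 5)")
```

The median is less sensitive than the mean to a single slow run caused by a cold file cache or background activity.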

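Optimization tips

1. Check SQL execution time

If SQL Execution is the bottleneck:

Add more filters to reduce the amount of data
Use Parquet instead of CSV
Check that appropriate indexes are in place (for database sources)

2. Check I/O time

If read_csv or read_parquet is the bottleneck:

Use Parquet (a columnar, compressed format)
Read only the columns you need
Filter at the source where possible

3. Check data transfer

If to_df is slow:

The result set may be too large
Add more filters or set a limit
Use head() to preview

4. Compare engines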

from chdb.datastore.config import config, get_profiler

config.enable_profiling()
profiler = get_profiler()

# Profile with chdb (`query` is a DataStore pipeline built earlier)
config.use_chdb()
profiler.clear()
result_chdb = query.to_df()
time_chdb = profiler.total_duration_ms

# Profile with pandas
config.use_pandas()
profiler.clear()
result_pandas = query.to_df()
time_pandas = profiler.total_duration_ms

print(f"chdb: {time_chdb:.2f}ms")
print(f"pandas: {time_pandas:.2f}ms")

Best practices

1. Profile before optimizing

# Measure, don't guess!
config.enable_profiling()
result = your_query.to_df()
print(get_profiler().report())

2. Clear between tests

profiler.clear()  # Reset previous data
# Run the test
print(profiler.report())

3. Use min_duration_ms for focus

# Show only steps that take at least 100 ms
profiler.report(min_duration_ms=100)

4. Profile representative data

# Profile with realistic data sizes
# Small test data may not reveal the real bottlenecks

5. Disable in production

# Development
config.enable_profiling()

# Production
config.set_profiling_enabled(False)  # Avoid overhead

Example: complete profiling session

from chdb import datastore as pd
from chdb.datastore.config import config, get_profiler

# Setup
config.enable_profiling()
config.enable_debug()  # Also show what is happening
profiler = get_profiler()

# Load data
profiler.clear()
print("=== Loading Data ===")
ds = pd.read_csv("sales_2024.csv")  # 10 million rows
print(profiler.report())

# Query 1: simple filter
profiler.clear()
print("\n=== Query 1: Simple Filter ===")
result1 = ds.filter(ds['amount'] > 1000).to_df()
print(profiler.report())

# Query 2: complex aggregation
profiler.clear()
print("\n=== Query 2: Complex Aggregation ===")
result2 = (ds
    .filter(ds['amount'] > 100)
    .groupby('region', 'category')
    .agg({
        'amount': ['sum', 'mean', 'count'],
        'quantity': 'sum'
    })
    .sort('sum', ascending=False)
    .head(20)
    .to_df()
)
print(profiler.report())

# Summary
print("\n=== Summary ===")
print(f"Query 1: {len(result1)} rows")
print(f"Query 2: {len(result2)} rows")
