Amazon customer reviews - ClickHouse Documentation

This dataset contains over 150 million customer reviews of Amazon products. The data is stored as snappy-compressed Parquet files in AWS S3, totaling 49GB compressed. Let's walk through the steps to insert it into ClickHouse.

The queries below were executed on a Production instance of ClickHouse Cloud. For details, see "Playground specifications".

Loading the dataset

You can query the data in place, without inserting it into ClickHouse. To see what the data looks like, start by fetching a few rows:

SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet')
LIMIT 3

The rows look like this:

Row 1:
──────
review_date:       16462
marketplace:       US
customer_id:       25444946 -- 25.44 million
review_id:         R146L9MMZYG0WA
product_id:        B00NV85102
product_parent:    908181913 -- 908.18 million
product_title:     XIKEZAN iPhone 6 Plus 5.5 inch Waterproof Case, Shockproof Dirtproof Snowproof Full Body Skin Case Protective Cover with Hand Strap & Headphone Adapter & Kickstand
product_category:  Wireless
star_rating:       4
helpful_votes:     0
total_votes:       0
vine:              false
verified_purchase: true
review_headline:   case is sturdy and protects as I want
review_body:       I won't count on the waterproof part (I took off the rubber seals at the bottom because the got on my nerves). But the case is sturdy and protects as I want.

Row 2:
──────
review_date:       16462
marketplace:       US
customer_id:       1974568 -- 1.97 million
review_id:         R2LXDXT293LG1T
product_id:        B00OTFZ23M
product_parent:    951208259 -- 951.21 million
product_title:     Season.C Chicago Bulls Marilyn Monroe No.1 Hard Back Case Cover for Samsung Galaxy S5 i9600
product_category:  Wireless
star_rating:       1
helpful_votes:     0
total_votes:       0
vine:              false
verified_purchase: true
review_headline:   One Star
review_body:       Cant use the case because its big for the phone. Waist of money!

Row 3:
──────
review_date:       16462
marketplace:       US
customer_id:       24803564 -- 24.80 million
review_id:         R7K9U5OEIRJWR
product_id:        B00LB8C4U4
product_parent:    524588109 -- 524.59 million
product_title:     iPhone 5s Case, BUDDIBOX [Shield] Slim Dual Layer Protective Case with Kickstand for Apple iPhone 5 and 5s
product_category:  Wireless
star_rating:       4
helpful_votes:     0
total_votes:       0
vine:              false
verified_purchase: true
review_headline:   but overall this case is pretty sturdy and provides good protection for the phone
review_body:       The front piece was a little difficult to secure to the phone at first, but overall this case is pretty sturdy and provides good protection for the phone, which is what I need. I would buy this case again.
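
Note that review_date appears above as a raw integer (16462) rather than a calendar date. The sketch below assumes the value is a count of days since 1970-01-01, which is consistent with the Date column used for it in the table defined next; under that assumption toDate turns it into a readable date (16462 would be 2015-01-27). The alias review_day is only illustrative.

```sql
-- Assumption: review_date in the Parquet file holds days since the Unix epoch.
-- toDate() with a small integer argument interprets it as a day number.
SELECT
    review_date,
    toDate(review_date) AS review_day
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet')
LIMIT 3;
```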

To store this data in ClickHouse, define a new MergeTree table named amazon_reviews in a database named amazon:

CREATE DATABASE amazon;

CREATE TABLE amazon.amazon_reviews
(
    `review_date` Date,
    `marketplace` LowCardinality(String),
    `customer_id` UInt64,
    `review_id` String,
    `product_id` String,
    `product_parent` UInt64,
    `product_title` String,
    `product_category` LowCardinality(String),
    `star_rating` UInt8,
    `helpful_votes` UInt32,
    `total_votes` UInt32,
    `vine` Bool,
    `verified_purchase` Bool,
    `review_headline` String,
    `review_body` String,
    PROJECTION helpful_votes
    (
        SELECT *
        ORDER BY helpful_votes
    )
)
ENGINE = MergeTree
ORDER BY (review_date, product_category);

The following INSERT command uses the s3Cluster table function, which processes multiple S3 files in parallel using all the nodes of your cluster. It also uses a wildcard to insert every file whose URL matches the pattern https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet:

INSERT INTO amazon.amazon_reviews SELECT *
FROM s3Cluster('default', 
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet')

In ClickHouse Cloud, the cluster is named default. If your cluster has a different name, replace default with it. If you do not have a cluster, use the s3 table function instead of s3Cluster.
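
For a single-node setup, the same load can be sketched with the s3 table function mentioned above. This is a minimal variant of the INSERT shown earlier, not a separately benchmarked command; without a cluster the files are read by one server, so expect it to take longer.

```sql
-- Same wildcard pattern as the s3Cluster version, but read by a single server.
INSERT INTO amazon.amazon_reviews
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet');
```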

This query does not take long, averaging about 300,000 rows per second. Within about 5 minutes you should see that all the rows have been inserted.
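
One way to check that the load has finished is to count the rows in the table. This is a minimal sketch; based on the figures quoted on this page, the result should be a little over 150 million.

```sql
-- formatReadableQuantity renders the count with a suffix such as "million".
SELECT formatReadableQuantity(count()) AS total_rows
FROM amazon.amazon_reviews;
```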

Let's see how much space the data uses.
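
The query for this step is not reproduced here. A minimal sketch, assuming the table lives in the amazon database as defined above, reads the sizes of the active data parts from the system.parts table:

```sql
-- Sum the on-disk (compressed) and logical (uncompressed) sizes
-- across all active parts of the table.
SELECT
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    sum(rows) AS rows,
    count() AS part_count
FROM system.parts
WHERE active
  AND database = 'amazon'
  AND table = 'amazon_reviews';
```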

The original data was about 70 GB, but compressed in ClickHouse it takes up about 30 GB.

Example queries

Let's run some queries. The first one finds the 10 reviews in the dataset with the most "helpful" votes.

This query uses the helpful_votes projection defined on the table, which keeps the rows sorted by helpful_votes, to improve performance.
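
The query itself is not shown in this section. A minimal sketch of the top-10 lookup, using only columns from the table definition above:

```sql
-- Sorting by helpful_votes lets ClickHouse pick the projection
-- that is already ordered by that column.
SELECT
    product_title,
    review_headline
FROM amazon.amazon_reviews
ORDER BY helpful_votes DESC
LIMIT 10;
```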

The next query finds the 10 products on Amazon with the most reviews.
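
A minimal sketch of that query, assuming product_id identifies a product and any(product_title) is an acceptable way to pick a display title for it; the alias review_count is illustrative:

```sql
-- Count reviews per product and keep the 10 largest groups.
SELECT
    any(product_title) AS title,
    count() AS review_count
FROM amazon.amazon_reviews
GROUP BY product_id
ORDER BY review_count DESC
LIMIT 10;
```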

The following computes the average review rating per month for each product (an actual Amazon job interview question).
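
One way to write it, sketched under the assumption that "per month" means calendar months of review_date; the LIMIT and the aliases are only there to keep the example output short and readable:

```sql
-- Bucket reviews by calendar month, then average the star rating
-- within each (month, product) group.
SELECT
    toStartOfMonth(review_date) AS month,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars
FROM amazon.amazon_reviews
GROUP BY
    month,
    product_id
ORDER BY
    month DESC,
    product_id ASC
LIMIT 20;
```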

The next query totals the votes per product category. It runs fast because product_category is part of the primary key.
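
A minimal sketch of the per-category aggregation, using the total_votes column from the table definition; the alias votes is illustrative:

```sql
-- Sum total_votes within each product category, largest first.
SELECT
    product_category,
    sum(total_votes) AS votes
FROM amazon.amazon_reviews
GROUP BY product_category
ORDER BY votes DESC;
```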

Let's find the products with the most reviews that contain the word "awful". This is a big task: more than 151 million strings have to be scanned for a single word.


SELECT
    product_id,
    any(product_title),
    avg(star_rating),
    count() AS count
FROM amazon.amazon_reviews
WHERE position(review_body, 'awful') > 0
GROUP BY product_id
ORDER BY count DESC
LIMIT 50;

Notice the query time for such a large amount of data. The results are also a fun read!

Now run the same query again, changing only the search term to awesome:


SELECT 
    product_id,
    any(product_title),
    avg(star_rating),
    count() AS count
FROM amazon.amazon_reviews
WHERE position(review_body, 'awesome') > 0
GROUP BY product_id
ORDER BY count DESC
LIMIT 50;
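
Both searches use position, which matches substrings case-sensitively, so a review that says "Awesome" or "AWFUL" is not counted. A minimal variant, assuming you want to ignore case, swaps in positionCaseInsensitive; the aliases title and avg_stars are illustrative. Like position, it still matches substrings rather than whole words.

```sql
-- Case-insensitive substring match; everything else mirrors the query above.
SELECT
    product_id,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars,
    count() AS count
FROM amazon.amazon_reviews
WHERE positionCaseInsensitive(review_body, 'awesome') > 0
GROUP BY product_id
ORDER BY count DESC
LIMIT 50;
```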

