跳转到主要内容
该数据集包含超过 1.5 亿条 亚马逊 商品的用户评论。数据存储在 AWS S3 中采用 snappy 压缩的 Parquet 文件里,压缩后总大小为 49 GB。下面我们来逐步了解如何将其插入 ClickHouse。
以下查询是在 ClickHouse Cloud 的 Production 实例上执行的。更多信息请参见 “Playground 规范”

加载数据集

  1. 无需将数据插入 ClickHouse,我们就可以直接在原处查询。先取几行数据,看看它们的样子:
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet')
LIMIT 3
这些行如下:
Row 1:
──────
review_date:       16462
marketplace:       US
customer_id:       25444946 -- 2544万
review_id:         R146L9MMZYG0WA
product_id:        B00NV85102
product_parent:    908181913 -- 9.0818亿
product_title:     XIKEZAN iPhone 6 Plus 5.5 inch Waterproof Case, Shockproof Dirtproof Snowproof Full Body Skin Case Protective Cover with Hand Strap & Headphone Adapter & Kickstand
product_category:  Wireless
star_rating:       4
helpful_votes:     0
total_votes:       0
vine:              false
verified_purchase: true
review_headline:   case is sturdy and protects as I want
review_body:       I won't count on the waterproof part (I took off the rubber seals at the bottom because the got on my nerves). But the case is sturdy and protects as I want.

Row 2:
──────
review_date:       16462
marketplace:       US
customer_id:       1974568 -- 197万
review_id:         R2LXDXT293LG1T
product_id:        B00OTFZ23M
product_parent:    951208259 -- 9.5121亿
product_title:     Season.C Chicago Bulls Marilyn Monroe No.1 Hard Back Case Cover for Samsung Galaxy S5 i9600
product_category:  Wireless
star_rating:       1
helpful_votes:     0
total_votes:       0
vine:              false
verified_purchase: true
review_headline:   One Star
review_body:       Cant use the case because its big for the phone. Waist of money!

Row 3:
──────
review_date:       16462
marketplace:       US
customer_id:       24803564 -- 2480万
review_id:         R7K9U5OEIRJWR
product_id:        B00LB8C4U4
product_parent:    524588109 -- 5.2459亿
product_title:     iPhone 5s Case, BUDDIBOX [Shield] Slim Dual Layer Protective Case with Kickstand for Apple iPhone 5 and 5s
product_category:  Wireless
star_rating:       4
helpful_votes:     0
total_votes:       0
vine:              false
verified_purchase: true
review_headline:   but overall this case is pretty sturdy and provides good protection for the phone
review_body:       The front piece was a little difficult to secure to the phone at first, but overall this case is pretty sturdy and provides good protection for the phone, which is what I need. I would buy this case again.
  1. 定义一个名为 amazon_reviews 的新 MergeTree 表,用于在 ClickHouse 中存储这些数据:
CREATE DATABASE amazon

CREATE TABLE amazon.amazon_reviews
(
    `review_date` Date,
    `marketplace` LowCardinality(String),
    `customer_id` UInt64,
    `review_id` String,
    `product_id` String,
    `product_parent` UInt64,
    `product_title` String,
    `product_category` LowCardinality(String),
    `star_rating` UInt8,
    `helpful_votes` UInt32,
    `total_votes` UInt32,
    `vine` Bool,
    `verified_purchase` Bool,
    `review_headline` String,
    `review_body` String,
    PROJECTION helpful_votes
    (
        SELECT *
        ORDER BY helpful_votes
    )
)
ENGINE = MergeTree
ORDER BY (review_date, product_category)
  1. 以下 INSERT 命令使用了 s3Cluster table function,可利用集群中的所有节点并行处理多个 S3 文件。我们还使用了通配符,以插入所有以 https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet 开头的文件:
INSERT INTO amazon.amazon_reviews SELECT *
FROM s3Cluster('default', 
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet')
在 ClickHouse Cloud 中,集群名称为 default。请将 default 改为你的集群名称……如果你没有集群,也可以使用 s3 table function (而不是 s3Cluster) 。
  1. 该查询耗时不长,平均每秒约处理 300,000 行。大约 5 分钟后,你应该就能看到所有行都已插入:
  1. 我们来看看数据占用了多少空间:
原始数据约为 70G,但在 ClickHouse 中压缩后仅占用约 30G。

示例查询

  1. 我们来运行一些查询。以下是数据集中最有帮助的 10 条评论:
该查询使用了 projection 来提升性能。
  1. 以下是 亚马逊 上评论数量最多的 10 个产品:
  1. 以下是每个产品每月的平均评论评分 (这其实是一道 亚马逊 面试题!) :
  1. 以下是各产品类别的总票数。该查询很快,因为 product_category 在主键中:
  1. 我们来找出评论中 “awful” 一词出现频率最高的产品。这是一项很重的任务——需要解析超过 1.51 亿个字符串来查找这一个词:
runnable
SELECT
    product_id,
    any(product_title),
    avg(star_rating),
    count() AS count
FROM amazon.amazon_reviews
WHERE position(review_body, 'awful') > 0
GROUP BY product_id
ORDER BY count DESC
LIMIT 50;
请留意在如此庞大的数据量下的查询时间。结果读起来也很有意思!
  1. 我们可以再次运行相同的查询,不过这次是在评论中搜索 awesome
runnable
SELECT 
    product_id,
    any(product_title),
    avg(star_rating),
    count() AS count
FROM amazon.amazon_reviews
WHERE position(review_body, 'awesome') > 0
GROUP BY product_id
ORDER BY count DESC
LIMIT 50;
最后修改于 2026年6月10日