ClickHouse 现已提供原生 JSON 列类型,专为半结构化和动态数据而设计。需要说明的是,这是一种列类型,不是数据格式——你可以将 JSON 作为字符串插入 ClickHouse,也可以通过 JSONEachRow 等受支持的格式插入,但这并不表示你使用的就是 JSON 列类型。只有在数据结构是动态变化的情况下,才应该使用 JSON 类型;如果只是恰好存储了 JSON,则不应使用它。
JSON 类型适用于对结构动态或不可预测的 JSON 对象中的特定字段进行查询、过滤和聚合。它通过将 JSON 对象拆分为独立的子列来实现这一点。与 Map 或将字符串解析后再处理等替代方案相比,这种方式能显著减少读取的数据量,并加快针对所选字段的查询。
不过,这也伴随着一些重要的权衡:
INSERT 更慢 - 将 JSON 拆分为子列、执行类型推断以及管理灵活的存储结构,会使插入速度比将 JSON 存储为简单的 String 列更慢。
- 读取整个对象时更慢 - 如果你需要获取完整的 JSON 文档 (而不是特定字段) ,
JSON 类型会比从 String 列中读取更慢。当你不进行字段级查询时,从独立子列重建对象所带来的额外开销并无收益。
- 存储开销 - 与将 JSON 存储为单个字符串值相比,维护独立子列会增加额外的结构开销。
- 你的数据具有动态或不可预测的结构,不同文档中的键各不相同
- 字段类型或 schema 会随时间变化,或因记录而异
- 你需要对 JSON 对象中特定路径上的数据进行查询、过滤或聚合,但其结构无法预先确定
- 你的使用场景涉及半结构化数据,例如日志、事件或用户生成内容,且其 schema 不一致
在以下情况下使用 String 列 (或结构化类型) :
- 你的数据结构已知且一致——在这种情况下,应改用普通列、
Tuple、Array、Dynamic 或 Variant 类型
JSON 文档被视为不透明 blob,只会被完整存储和取回,而不会进行字段级分析
- 你不需要在数据库中对单个
JSON 字段进行查询或过滤
JSON 只是传输/存储格式,不会在 ClickHouse 内部进行分析
如果 JSON 是一种不会在数据库内部分析的不透明文档,只是存储后再取回,那么它应存储为 String 字段。只有当你需要对动态 JSON 结构中的特定字段高效执行查询、过滤或聚合时,JSON 类型的优势才会真正体现出来。你也可以混合使用这两种方式——对可预测的顶层字段使用标准列,对载荷中的动态部分使用 JSON 列。
JSON 类型通过将路径展平为子列,实现了高效的列式存储。但灵活性越高,也越需要谨慎使用。要想高效使用它:
类型提示类型提示不仅能避免不必要的类型推断——还可以彻底消除存储和处理过程中的间接层。带有类型提示的 JSON 路径会始终像传统列一样存储,无需依赖 判别列 或在查询时进行动态分辨率。这意味着,只要类型提示定义得当,嵌套 JSON 字段就能获得与一开始就建模为顶层字段时相同的性能和效率。因此,对于那些整体上较为一致、但仍希望保留 JSON 灵活性的数据集,类型提示提供了一种便捷方式,让你无需重构 schema 或摄取管道,也能保持性能。
如需更多指导,请参阅 ClickHouse JSON 文档,或阅读我们的博文 ClickHouse 的一种强大全新 JSON 数据类型。
请看下面的 JSON 样本,表示 Python PyPI dataset 中的一行数据:
{
"date": "2022-11-15",
"country_code": "ES",
"project": "clickhouse-connect",
"type": "bdist_wheel",
"installer": "pip",
"python_minor": "3.9",
"system": "Linux",
"version": "0.3.0"
}
假设这个 schema 是静态的,并且类型可以明确定义。即使数据采用 NDJSON 格式 (每行一个 JSON 对象) ,对于这样的 schema,也没必要使用 JSON 类型。只需用经典类型来定义 schema。
CREATE TABLE pypi (
`date` Date,
`country_code` String,
`project` String,
`type` String,
`installer` String,
`python_minor` String,
`system` String,
`version` String
)
ENGINE = MergeTree
ORDER BY (project, date)
并插入 JSON 行:
INSERT INTO pypi FORMAT JSONEachRow
{"date":"2022-11-15","country_code":"ES","project":"clickhouse-connect","type":"bdist_wheel","installer":"pip","python_minor":"3.9","system":"Linux","version":"0.3.0"}
来看这个包含 250 万篇学术论文的 arXiv 数据集。该数据集以 NDJSON 格式提供,其中每一行都代表一篇已发表的学术论文。下面展示一个示例行:
{
"id": "2101.11408",
"submitter": "Daniel Lemire",
"authors": "Daniel Lemire",
"title": "Number Parsing at a Gigabyte per Second",
"comments": "Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/",
"journal-ref": "Software: Practice and Experience 51 (8), 2021",
"doi": "10.1002/spe.2984",
"report-no": null,
"categories": "cs.DS cs.MS",
"license": "http://creativecommons.org/licenses/by/4.0/",
"abstract": "With disks and networks providing gigabytes per second ....\n",
"versions": [
{
"created": "Mon, 11 Jan 2021 20:31:27 GMT",
"version": "v1"
},
{
"created": "Sat, 30 Jan 2021 23:57:29 GMT",
"version": "v2"
}
],
"update_date": "2022-11-07",
"authors_parsed": [
[
"Lemire",
"Daniel",
""
]
]
}
尽管这里的 JSON 很复杂,包含嵌套结构,但其结构是可预测的。字段的数量和类型都不会变化。虽然在这个示例中我们可以使用 JSON 类型,但也可以直接用 Tuples 和 Nested 类型显式定义其结构:
CREATE TABLE arxiv
(
`id` String,
`submitter` String,
`authors` String,
`title` String,
`comments` String,
`journal-ref` String,
`doi` String,
`report-no` String,
`categories` String,
`license` String,
`abstract` String,
`versions` Array(Tuple(created String, version String)),
`update_date` Date,
`authors_parsed` Array(Array(String))
)
ENGINE = MergeTree
ORDER BY update_date
同样,我们也可以以 JSON 格式插入数据:
INSERT INTO arxiv FORMAT JSONEachRow
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]]}
假设又新增了一个名为 tags 的列。如果它只是一个字符串列表,我们可以将其建模为 Array(String);但这里假设你可以添加包含混合类型的任意标签结构 (注意,score 既可以是字符串,也可以是整数) 。修改后的 JSON 文档如下:
{
"id": "2101.11408",
"submitter": "Daniel Lemire",
"authors": "Daniel Lemire",
"title": "Number Parsing at a Gigabyte per Second",
"comments": "Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/",
"journal-ref": "Software: Practice and Experience 51 (8), 2021",
"doi": "10.1002/spe.2984",
"report-no": null,
"categories": "cs.DS cs.MS",
"license": "http://creativecommons.org/licenses/by/4.0/",
"abstract": "With disks and networks providing gigabytes per second ....\n",
"versions": [
{
"created": "Mon, 11 Jan 2021 20:31:27 GMT",
"version": "v1"
},
{
"created": "Sat, 30 Jan 2021 23:57:29 GMT",
"version": "v2"
}
],
"update_date": "2022-11-07",
"authors_parsed": [
[
"Lemire",
"Daniel",
""
]
],
"tags": {
"tag_1": {
"name": "ClickHouse user",
"score": "A+",
"comment": "A good read, applicable to ClickHouse"
},
"28_03_2025": {
"name": "professor X",
"score": 10,
"comment": "Didn't learn much",
"updates": [
{
"name": "professor X",
"comment": "Wolverine found more interesting"
}
]
}
}
}
在这种情况下,我们可以将 arXiv 文档建模为全部采用 JSON,或者只添加一个 JSON tags 列。下面给出这两种示例:
CREATE TABLE arxiv
(
`doc` JSON(update_date Date)
)
ENGINE = MergeTree
ORDER BY doc.update_date
我们在 JSON 定义中为 update_date 列提供了类型提示,因为会在排序键/主键中用到它。这有助于 ClickHouse 知道该列不会为 NULL,并确保它知道应使用哪个 update_date 子列 (每种类型都可能有多个,否则这里会存在歧义) 。
我们可以向该表插入数据,并使用 JSONAllPathsWithTypes 函数和 PrettyJSONEachRow 输出格式查看随后推断出的 schema:
INSERT INTO arxiv FORMAT JSONAsObject
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]],"tags":{"tag_1":{"name":"ClickHouse user","score":"A+","comment":"A good read, applicable to ClickHouse"},"28_03_2025":{"name":"professor X","score":10,"comment":"Didn't learn much","updates":[{"name":"professor X","comment":"Wolverine found more interesting"}]}}}
SELECT JSONAllPathsWithTypes(doc)
FROM arxiv
FORMAT PrettyJSONEachRow
{
"JSONAllPathsWithTypes(doc)": {
"abstract": "String",
"authors": "String",
"authors_parsed": "Array(Array(Nullable(String)))",
"categories": "String",
"comments": "String",
"doi": "String",
"id": "String",
"journal-ref": "String",
"license": "String",
"submitter": "String",
"tags.28_03_2025.comment": "String",
"tags.28_03_2025.name": "String",
"tags.28_03_2025.score": "Int64",
"tags.28_03_2025.updates": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))",
"tags.tag_1.comment": "String",
"tags.tag_1.name": "String",
"tags.tag_1.score": "String",
"title": "String",
"update_date": "Date",
"versions": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))"
}
}
1 row in set. Elapsed: 0.003 sec.
或者,我们也可以沿用前面的 schema,并使用一个 JSON tags 列来对此建模。通常更推荐这种方式,因为它能将 ClickHouse 所需的推断降到最低:
CREATE TABLE arxiv
(
`id` String,
`submitter` String,
`authors` String,
`title` String,
`comments` String,
`journal-ref` String,
`doi` String,
`report-no` String,
`categories` String,
`license` String,
`abstract` String,
`versions` Array(Tuple(created String, version String)),
`update_date` Date,
`authors_parsed` Array(Array(String)),
`tags` JSON()
)
ENGINE = MergeTree
ORDER BY update_date
INSERT INTO arxiv FORMAT JSONEachRow
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]],"tags":{"tag_1":{"name":"ClickHouse user","score":"A+","comment":"A good read, applicable to ClickHouse"},"28_03_2025":{"name":"professor X","score":10,"comment":"Didn't learn much","updates":[{"name":"professor X","comment":"Wolverine found more interesting"}]}}}
我们现在可以推断出子列 tags 的类型。
SELECT JSONAllPathsWithTypes(tags)
FROM arxiv
FORMAT PrettyJSONEachRow
{
"JSONAllPathsWithTypes(tags)": {
"28_03_2025.comment": "String",
"28_03_2025.name": "String",
"28_03_2025.score": "Int64",
"28_03_2025.updates": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))",
"tag_1.comment": "String",
"tag_1.name": "String",
"tag_1.score": "String"
}
}
1 row in set. Elapsed: 0.002 sec.