다른 JSON 포맷 처리 - ClickHouse Documentation

앞선 JSON 데이터 로드 예시에서는 JSONEachRow (NDJSON)를 사용한다고 가정합니다. 이 포맷은 각 JSON 줄의 키를 컬럼으로 읽어들입니다. 예시는 다음과 같습니다:

SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/json/*.json.gz', JSONEachRow)
LIMIT 5

┌───────date─┬─country_code─┬─project────────────┬─type────────┬─installer────┬─python_minor─┬─system─┬─version─┐
│ 2022-11-15 │ CN           │ clickhouse-connect │ bdist_wheel │ bandersnatch │              │        │ 0.2.8   │
│ 2022-11-15 │ CN           │ clickhouse-connect │ bdist_wheel │ bandersnatch │              │        │ 0.2.8   │
│ 2022-11-15 │ CN           │ clickhouse-connect │ bdist_wheel │ bandersnatch │              │        │ 0.2.8   │
│ 2022-11-15 │ CN           │ clickhouse-connect │ bdist_wheel │ bandersnatch │              │        │ 0.2.8   │
│ 2022-11-15 │ CN           │ clickhouse-connect │ bdist_wheel │ bandersnatch │              │        │ 0.2.8   │
└────────────┴──────────────┴────────────────────┴─────────────┴──────────────┴──────────────┴────────┴─────────┘

5 rows in set. Elapsed: 0.449 sec.

이는 일반적으로 JSON에서 가장 널리 사용되는 포맷이지만, 다른 포맷을 접하거나 JSON을 단일 객체로 읽어야 하는 경우도 있습니다. 아래에서는 JSON을 다른 일반적인 포맷으로 읽고 로드하는 예시를 제공합니다.

객체로 JSON 읽기

앞선 예시에서는 JSONEachRow가 줄바꿈으로 구분된 JSON을 읽는 방식을 보여주었습니다. 각 줄은 별도의 객체로 읽혀 테이블의 한 행에 매핑되고, 각 키는 컬럼에 매핑됩니다. 이 방식은 각 컬럼의 타입이 하나로 고정된, 예측 가능한 JSON에 적합합니다. 반면 JSONAsObject는 각 줄을 하나의 JSON 객체로 처리해 JSON 타입의 단일 컬럼에 저장하므로, 중첩된 JSON payload나 키가 동적이고 하나 이상의 타입을 가질 수 있는 경우에 더 적합합니다. 행 단위 삽입에는 JSONEachRow를 사용하고, 유연하거나 동적인 JSON 데이터를 저장할 때는 JSONAsObject를 사용하십시오. 위 예시와 비교해 보면, 다음 쿼리는 동일한 데이터를 각 줄마다 하나의 JSON 객체로 읽습니다:

SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/json/*.json.gz', JSONAsObject)
LIMIT 5

┌─json─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ {"country_code":"CN","date":"2022-11-15","installer":"bandersnatch","project":"clickhouse-connect","python_minor":"","system":"","type":"bdist_wheel","version":"0.2.8"} │
│ {"country_code":"CN","date":"2022-11-15","installer":"bandersnatch","project":"clickhouse-connect","python_minor":"","system":"","type":"bdist_wheel","version":"0.2.8"} │
│ {"country_code":"CN","date":"2022-11-15","installer":"bandersnatch","project":"clickhouse-connect","python_minor":"","system":"","type":"bdist_wheel","version":"0.2.8"} │
│ {"country_code":"CN","date":"2022-11-15","installer":"bandersnatch","project":"clickhouse-connect","python_minor":"","system":"","type":"bdist_wheel","version":"0.2.8"} │
│ {"country_code":"CN","date":"2022-11-15","installer":"bandersnatch","project":"clickhouse-connect","python_minor":"","system":"","type":"bdist_wheel","version":"0.2.8"} │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.338 sec.

JSONAsObject는 단일 JSON 객체 컬럼을 사용해 테이블에 행을 삽입할 때 유용합니다. 예:

CREATE TABLE pypi
(
    `json` JSON
)
ENGINE = MergeTree
ORDER BY tuple();

INSERT INTO pypi SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/pypi/json/*.json.gz', JSONAsObject)
LIMIT 5;

SELECT *
FROM pypi
LIMIT 2;

┌─json─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ {"country_code":"CN","date":"2022-11-15","installer":"bandersnatch","project":"clickhouse-connect","python_minor":"","system":"","type":"bdist_wheel","version":"0.2.8"} │
│ {"country_code":"CN","date":"2022-11-15","installer":"bandersnatch","project":"clickhouse-connect","python_minor":"","system":"","type":"bdist_wheel","version":"0.2.8"} │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

2 rows in set. Elapsed: 0.003 sec.

JSONAsObject 포맷은 객체 구조가 일관되지 않은 경우에도 줄바꿈으로 구분된 JSON을 읽는 데 유용할 수 있습니다. 예를 들어, 어떤 키의 타입이 행마다 달라질 수 있습니다(어떤 경우에는 문자열이지만, 다른 경우에는 객체일 수 있음). 이런 경우 ClickHouse는 JSONEachRow로 안정적인 스키마를 추론할 수 없으며, JSONAsObject를 사용하면 엄격한 타입 검증 없이 데이터를 수집하고 각 JSON 행 전체를 하나의 컬럼에 저장할 수 있습니다. 예를 들어, 다음 예시에서 JSONEachRow가 어떻게 실패하는지 확인해 보십시오:

SELECT count()
FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0001.json.gz', 'JSONEachRow')

Elapsed: 1.198 sec.

Received exception from server (version 24.12.1):
Code: 636. DB::Exception: Received from sql-clickhouse.clickhouse.com:9440. DB::Exception: The table structure cannot be extracted from a JSONEachRow format file. Error:
Code: 117. DB::Exception: JSON objects have ambiguous data: in some objects path 'record.subject' has type 'String' and in some - 'Tuple(`$type` String, cid String, uri String)'. You can enable setting input_format_json_use_string_type_for_ambiguous_paths_in_named_tuples_inference_from_objects to use String type for path 'record.subject'. (INCORRECT_DATA) (version 24.12.1.18239 (official build))
To increase the maximum number of rows/bytes to read for structure determination, use setting input_format_max_rows_to_read_for_schema_inference/input_format_max_bytes_to_read_for_schema_inference.
You can specify the structure manually: (in file/uri bluesky/file_0001.json.gz). (CANNOT_EXTRACT_TABLE_STRUCTURE)

반대로, JSON 타입은 동일한 서브컬럼에 대해 여러 타입을 지원하므로 이 경우 JSONAsObject를 사용할 수 있습니다.

SELECT count()
FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0001.json.gz', 'JSONAsObject')

┌─count()─┐
│ 1000000 │
└─────────┘

1 row in set. Elapsed: 0.480 sec. Processed 1.00 million rows, 256.00 B (2.08 million rows/s., 533.76 B/s.)

JSON 객체 배열

가장 널리 사용되는 JSON 데이터 형식 중 하나는 이 예시와 같이 JSON 배열에 JSON 객체 목록이 들어 있는 형태입니다:

> cat list.json
[
  {
    "path": "Akiba_Hebrew_Academy",
    "month": "2017-08-01",
    "hits": 241
  },
  {
    "path": "Aegithina_tiphia",
    "month": "2018-02-01",
    "hits": 34
  },
  ...
]

이러한 데이터용 테이블을 만들어 보겠습니다:

CREATE TABLE sometable
(
    `path` String,
    `month` Date,
    `hits` UInt32
)
ENGINE = MergeTree
ORDER BY tuple(month, path)

JSON 객체 목록을 가져오려면 JSONEachRow 포맷을 사용할 수 있습니다(list.json 파일에서 데이터를 삽입):

INSERT INTO sometable
FROM INFILE 'list.json'
FORMAT JSONEachRow

로컬 파일의 데이터를 불러오기 위해 FROM INFILE 절을 사용했으며, 가져오기가 성공한 것을 확인할 수 있습니다:

SELECT *
FROM sometable

┌─path──────────────────────┬──────month─┬─hits─┐
│ 1971-72_Utah_Stars_season │ 2016-10-01 │    1 │
│ Akiba_Hebrew_Academy      │ 2017-08-01 │  241 │
│ Aegithina_tiphia          │ 2018-02-01 │   34 │
└───────────────────────────┴────────────┴──────┘

JSON 객체의 키

경우에 따라 JSON 객체 목록은 배열 요소가 아니라 객체 속성으로 인코딩될 수 있습니다(objects.json을 예시로 참조하십시오):

cat objects.json

{
  "a": {
    "path":"April_25,_2017",
    "month":"2018-01-01",
    "hits":2
  },
  "b": {
    "path":"Akahori_Station",
    "month":"2016-06-01",
    "hits":11
  },
  ...
}

ClickHouse는 JSONObjectEachRow 포맷을 사용하여 이러한 종류의 데이터로부터 데이터를 로드할 수 있습니다.

INSERT INTO sometable FROM INFILE 'objects.json' FORMAT JSONObjectEachRow;
SELECT * FROM sometable;

┌─path────────────┬──────month─┬─hits─┐
│ Abducens_palsy  │ 2016-05-01 │   28 │
│ Akahori_Station │ 2016-06-01 │   11 │
│ April_25,_2017  │ 2018-01-01 │    2 │
└─────────────────┴────────────┴──────┘

상위 객체 키 값 지정

상위 객체 키의 값도 테이블에 저장하려는 경우를 가정해 보겠습니다. 이 경우, 키 값을 저장할 컬럼 이름을 지정하려면 다음 옵션을 사용할 수 있습니다:

SET format_json_object_each_row_column_for_object_name = 'id'

이제 file() 함수를 사용해 원본 JSON 파일에서 어떤 데이터가 로드되는지 확인할 수 있습니다:

SELECT * FROM file('objects.json', JSONObjectEachRow)

┌─id─┬─path────────────┬──────month─┬─hits─┐
│ a  │ April_25,_2017  │ 2018-01-01 │    2 │
│ b  │ Akahori_Station │ 2016-06-01 │   11 │
│ c  │ Abducens_palsy  │ 2016-05-01 │   28 │
└────┴─────────────────┴────────────┴──────┘

id 컬럼이 키 값으로 올바르게 채워진 것을 확인하세요.

JSON 배열

공간을 절약하기 위해 JSON 파일을 객체 대신 배열로 인코딩하는 경우가 있습니다. 이 경우 JSON 배열 목록을 사용합니다:

cat arrays.json

["Akiba_Hebrew_Academy", "2017-08-01", 241],
["Aegithina_tiphia", "2018-02-01", 34],
["1971-72_Utah_Stars_season", "2016-10-01", 1]

이 경우 ClickHouse는 이 데이터를 로드한 뒤, 배열의 순서에 따라 각 값을 해당 컬럼에 매핑합니다. 이를 위해 JSONCompactEachRow 포맷을 사용합니다:

SELECT * FROM sometable

┌─c1────────────────────────┬─────────c2─┬──c3─┐
│ Akiba_Hebrew_Academy      │ 2017-08-01 │ 241 │
│ Aegithina_tiphia          │ 2018-02-01 │  34 │
│ 1971-72_Utah_Stars_season │ 2016-10-01 │   1 │
└───────────────────────────┴────────────┴─────┘

JSON 배열에서 개별 컬럼 가져오기

경우에 따라 데이터는 행 단위가 아니라 컬럼 단위로 인코딩될 수 있습니다. 이 경우 상위 JSON 객체는 값을 담은 컬럼을 포함합니다. 다음 파일을 확인하십시오:

cat columns.json

{
  "path": ["2007_Copa_America", "Car_dealerships_in_the_USA", "Dihydromyricetin_reductase"],
  "month": ["2016-07-01", "2015-07-01", "2015-07-01"],
  "hits": [178, 11, 1]
}

ClickHouse는 다음과 같은 형식의 데이터를 파싱하기 위해 JSONColumns 포맷을 사용합니다:

SELECT * FROM file('columns.json', JSONColumns)

┌─path───────────────────────┬──────month─┬─hits─┐
│ 2007_Copa_America          │ 2016-07-01 │  178 │
│ Car_dealerships_in_the_USA │ 2015-07-01 │   11 │
│ Dihydromyricetin_reductase │ 2015-07-01 │    1 │
└────────────────────────────┴────────────┴──────┘

객체 대신 컬럼 배열을 다룰 때 사용할 수 있는 더 간결한 JSONCompactColumns 포맷도 지원됩니다:

SELECT * FROM file('columns-array.json', JSONCompactColumns)

┌─c1──────────────┬─────────c2─┬─c3─┐
│ Heidenrod       │ 2017-01-01 │ 10 │
│ Arthur_Henrique │ 2016-11-01 │ 12 │
│ Alan_Ebnother   │ 2015-11-01 │ 66 │
└─────────────────┴────────────┴────┘

파싱하지 않고 JSON 객체 저장하기

JSON 객체를 파싱하지 않고 하나의 String(또는 JSON) 컬럼에 저장하는 편이 나은 경우도 있습니다. 이는 구조가 서로 다른 JSON 객체 목록을 다룰 때 유용합니다. 예를 들어 이 파일에는 상위 목록 안에 서로 다른 JSON 객체가 여러 개 들어 있습니다:

cat custom.json

[
  {"name": "Joe", "age": 99, "type": "person"},
  {"url": "/my.post.MD", "hits": 1263, "type": "post"},
  {"message": "Warning on disk usage", "type": "log"}
]

원본 JSON 객체를 다음 테이블(table)에 저장합니다:

CREATE TABLE events
(
    `data` String
)
ENGINE = MergeTree
ORDER BY ()

이제 JSON 객체를 파싱하지 않고 그대로 유지하려면 JSONAsString 포맷을 사용해 파일의 데이터를 이 테이블로 로드할 수 있습니다:

INSERT INTO events (data)
FROM INFILE 'custom.json'
FORMAT JSONAsString

또한 저장된 객체를 쿼리하는 데 JSON functions을 사용할 수 있습니다:

SELECT
    JSONExtractString(data, 'type') AS type,
    data
FROM events

┌─type───┬─data─────────────────────────────────────────────────┐
│ person │ {"name": "Joe", "age": 99, "type": "person"}         │
│ post   │ {"url": "/my.post.MD", "hits": 1263, "type": "post"} │
│ log    │ {"message": "Warning on disk usage", "type": "log"}  │
└────────┴──────────────────────────────────────────────────────┘

JSONAsString은 JSON 객체가 줄마다 하나씩 있는 형식의 파일(일반적으로 JSONEachRow 포맷과 함께 사용됨)에서 문제없이 작동합니다.

중첩된 객체용 스키마

중첩된 JSON 객체를 다룰 때는 명시적인 스키마를 추가로 정의하고, 복합 타입(Array, JSON 또는 Tuple)을 사용해 데이터를 로드할 수 있습니다:

SELECT *
FROM file('list-nested.json', JSONEachRow, 'page Tuple(path String, title String, owner_id UInt16), month Date, hits UInt32')
LIMIT 1

┌─page───────────────────────────────────────────────┬──────month─┬─hits─┐
│ ('Akiba_Hebrew_Academy','Akiba Hebrew Academy',12) │ 2017-08-01 │  241 │
└────────────────────────────────────────────────────┴────────────┴──────┘

중첩된 JSON 객체에 접근하기

다음 설정 옵션을 활성화하면 중첩된 JSON 키를 참조할 수 있습니다:

SET input_format_import_nested_json = 1

이를 통해 점 표기법을 사용해 중첩된 JSON 객체 키를 참조할 수 있습니다(작동하려면 해당 키를 백틱(`)으로 감싸야 한다는 점을 기억하십시오):

SELECT *
FROM file('list-nested.json', JSONEachRow, '`page.owner_id` UInt32, `page.title` String, month Date, hits UInt32')
LIMIT 1

┌─page.owner_id─┬─page.title───────────┬──────month─┬─hits─┐
│            12 │ Akiba Hebrew Academy │ 2017-08-01 │  241 │
└───────────────┴──────────────────────┴────────────┴──────┘

이렇게 하면 중첩된 JSON 객체를 펼치거나, 일부 중첩 값을 별도의 컬럼으로 저장할 수 있습니다.

알 수 없는 컬럼 건너뛰기

기본적으로 ClickHouse는 JSON 데이터를 가져올 때 알 수 없는 컬럼을 무시합니다. 이제 month 컬럼이 없는 테이블로 원본 파일을 가져와 보겠습니다:

CREATE TABLE shorttable
(
    `path` String,
    `hits` UInt32
)
ENGINE = MergeTree
ORDER BY path

3개의 컬럼으로 이루어진 원본 JSON 데이터를 이 테이블에 계속 삽입할 수 있습니다:

INSERT INTO shorttable FROM INFILE 'list.json' FORMAT JSONEachRow;
SELECT * FROM shorttable

┌─path──────────────────────┬─hits─┐
│ 1971-72_Utah_Stars_season │    1 │
│ Aegithina_tiphia          │   34 │
│ Akiba_Hebrew_Academy      │  241 │
└───────────────────────────┴──────┘

ClickHouse는 가져오는 과정에서 알 수 없는 컬럼을 무시합니다. 이 동작은 input_format_skip_unknown_fields 설정 옵션을 사용하여 비활성화할 수 있습니다:

SET input_format_skip_unknown_fields = 0;
INSERT INTO shorttable FROM INFILE 'list.json' FORMAT JSONEachRow;

Ok.
Exception on client:
Code: 117. DB::Exception: Unknown field found while parsing JSONEachRow format: month: (in file/uri /data/clickhouse/user_files/list.json): (at row 1)

ClickHouse는 JSON과 테이블의 컬럼 구조가 일치하지 않는 경우 예외를 발생시킵니다.

BSON

ClickHouse는 BSON으로 인코딩된 파일로 데이터를 내보내고, 이러한 파일에서 데이터를 가져올 수 있습니다. 이 포맷은 일부 DBMS에서 사용되며, 예를 들어 MongoDB 데이터베이스에서 사용됩니다. BSON 데이터를 가져오려면 BSONEachRow 포맷을 사용합니다. 이 BSON 파일에서 데이터를 가져와 보겠습니다:

SELECT * FROM file('data.bson', BSONEachRow)

┌─path──────────────────────┬─month─┬─hits─┐
│ Bob_Dolman                │ 17106 │  245 │
│ 1-krona                   │ 17167 │    4 │
│ Ahmadabad-e_Kalij-e_Sofla │ 17167 │    3 │
└───────────────────────────┴───────┴──────┘

같은 포맷으로 BSON 파일로도 내보낼 수 있습니다:

SELECT *
FROM sometable
INTO OUTFILE 'out.bson'
FORMAT BSONEachRow

이후 데이터가 out.bson 파일로 내보내집니다.

​객체로 JSON 읽기

​JSON 객체 배열

​JSON 객체의 키

​상위 객체 키 값 지정

​JSON 배열

​JSON 배열에서 개별 컬럼 가져오기

​파싱하지 않고 JSON 객체 저장하기

​중첩된 객체용 스키마

​중첩된 JSON 객체에 접근하기

​알 수 없는 컬럼 건너뛰기

​BSON