hdfs Table Function - ClickHouse Documentation

Creates a table from files in HDFS. This table function is similar to the url and file table functions.

Syntax

hdfs(URI, format, structure)

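Parameters

Parameter	Description
`URI`	The relative URI to the file in HDFS. In read-only mode, the file path supports the following globs: `*`, `?`, `{abc,def}`, and `{N..M}`, where `N` and `M` are numbers and `'abc'` and `'def'` are strings.
`format`	The format of the file.
`structure`	Structure of the table. Format: `'column1_name column1_type, column2_name column2_type, ...'`.

Returned value

A table with the specified structure for reading data from or writing data to the specified file.

Example

Table from hdfs://hdfs1:9000/test and selection of the first two rows from it: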

SELECT *
FROM hdfs('hdfs://hdfs1:9000/test', 'TSV', 'column1 UInt32, column2 UInt32, column3 UInt32')
LIMIT 2

┌─column1─┬─column2─┬─column3─┐
│       1 │       2 │       3 │
│       3 │       2 │       1 │
└─────────┴─────────┴─────────┘
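Because the returned table can also be written to, a write might look like the sketch below. The target path `hdfs://hdfs1:9000/test_out` is a hypothetical name chosen for illustration, and the sketch assumes that no file exists at that path yet and that the server is allowed to write to this HDFS location.

```sql
-- Hypothetical target path; assumed not to exist before the insert.
INSERT INTO TABLE FUNCTION
    hdfs('hdfs://hdfs1:9000/test_out', 'TSV', 'column1 UInt32, column2 UInt32, column3 UInt32')
VALUES (1, 2, 3), (3, 2, 1);

-- Read the rows back with the same format and structure.
SELECT *
FROM hdfs('hdfs://hdfs1:9000/test_out', 'TSV', 'column1 UInt32, column2 UInt32, column3 UInt32');
```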

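Globs in path

Paths may use globs. Files must match the whole path pattern, not only a suffix or prefix.

* - Matches any number of characters except /, including the empty string.
** - Matches all files inside a folder recursively.
? - Matches any single character.
{some_string,another_string,yet_another_one} - Substitutes any one of the strings 'some_string', 'another_string', 'yet_another_one'. The strings can contain the / symbol.
{N..M} - Matches any number >= N and <= M.

Constructions with {} are similar to those in the remote and file table functions.

Example

Suppose that we have several files with the following URIs on HDFS: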

'hdfs://hdfs1:9000/some_dir/some_file_1'
'hdfs://hdfs1:9000/some_dir/some_file_2'
'hdfs://hdfs1:9000/some_dir/some_file_3'
'hdfs://hdfs1:9000/another_dir/some_file_1'
'hdfs://hdfs1:9000/another_dir/some_file_2'
'hdfs://hdfs1:9000/another_dir/some_file_3'

Query the number of rows in these files:

SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/{some,another}_dir/some_file_{1..3}', 'TSV', 'name String, value UInt32')

SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/{some,another}_dir/*', 'TSV', 'name String, value UInt32')
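The remaining globs, `?` and `**`, can be sketched against the same six files. This is an illustration only: the patterns are looser than `{1..3}`, so they would also pick up any other matching files that happen to exist in those locations.

```sql
-- '?' matches exactly one character, so this covers some_file_1 .. some_file_3
-- (and any other single-character suffix, such as some_file_a, if present).
SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/some_dir/some_file_?', 'TSV', 'name String, value UInt32');

-- '**' descends recursively, so this covers every file under some_dir,
-- including files in nested subfolders.
SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/some_dir/**', 'TSV', 'name String, value UInt32');
```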

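If your list of files contains number ranges with leading zeros, use the brace construction for each digit separately, or use ?.

Example

Query the data from files named file000, file001, ..., file999: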

SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/big_dir/file{0..9}{0..9}{0..9}', 'CSV', 'name String, value UInt32')
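The `?` alternative mentioned above could be written as in the sketch below. Note the difference in strictness: `?` matches any single character, not only digits, so this pattern would also match a name such as `fileabc` if such a file exists in `big_dir`.

```sql
-- Three '?' placeholders stand in for the three digits of file000 .. file999.
-- Unlike {0..9}{0..9}{0..9}, this also matches non-digit characters.
SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/big_dir/file???', 'CSV', 'name String, value UInt32');
```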

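Virtual Columns

_path - Path to the file. Type: LowCardinality(String).
_file - Name of the file. Type: LowCardinality(String).
_size - Size of the file in bytes. Type: Nullable(UInt64). If the size is unknown, the value is NULL.
_time - Last modified time of the file. Type: Nullable(DateTime). If the time is unknown, the value is NULL.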
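As a rough illustration, the virtual columns can be selected by name alongside the data columns. The sketch below reuses the `{some,another}_dir` files from the glob example and reports, per file, its path, name, size, modification time, and row count.

```sql
-- Virtual columns are not part of the declared structure;
-- they are requested explicitly by name.
SELECT
    _path,
    _file,
    _size,
    _time,
    count() AS row_count
FROM hdfs('hdfs://hdfs1:9000/{some,another}_dir/*', 'TSV', 'name String, value UInt32')
GROUP BY _path, _file, _size, _time
ORDER BY _path;
```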

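use_hive_partitioning setting

When use_hive_partitioning is set to 1, ClickHouse detects Hive-style partitioning in the path (/name=value/) and allows partition columns to be used as virtual columns in the query. These virtual columns have the same names as in the partitioned path.

Example

Use virtual columns created with Hive-style partitioning: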

SELECT * FROM hdfs('hdfs://hdfs1:9000/data/path/date=*/country=*/code=*/*.parquet') WHERE date > '2020-01-01' AND country = 'Netherlands' AND code = 42;
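The query above relies on the setting being enabled. A minimal sketch of switching it on for a single query, assuming the same partitioned layout, is shown below; the aggregation by `country` is an illustrative choice and is not taken from the original example.

```sql
-- Enable Hive-style partition detection for this query only.
-- 'country' comes from the /country=value/ path segment, not from the Parquet files.
SELECT
    country,
    count() AS row_count
FROM hdfs('hdfs://hdfs1:9000/data/path/date=*/country=*/code=*/*.parquet')
GROUP BY country
SETTINGS use_hive_partitioning = 1;
```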

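Storage Settings

hdfs_truncate_on_insert - Allows truncating the file before inserting into it. Disabled by default.
hdfs_create_new_file_on_insert - Allows creating a new file on each insert if the format has a suffix. Disabled by default.
hdfs_skip_empty_files - Allows skipping empty files while reading. Disabled by default.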
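These settings might be applied as in the sketch below. The target path `hdfs://hdfs1:9000/test_out` is a hypothetical name used only for illustration, and the sketch assumes the server has write access to it.

```sql
-- Overwrite the target file instead of keeping its previous contents.
SET hdfs_truncate_on_insert = 1;

INSERT INTO TABLE FUNCTION
    hdfs('hdfs://hdfs1:9000/test_out', 'TSV', 'name String, value UInt32')
VALUES ('one', 1), ('two', 2);

-- Ignore zero-length files matched by the glob while reading.
SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/some_dir/*', 'TSV', 'name String, value UInt32')
SETTINGS hdfs_skip_empty_files = 1;
```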
