Reading Data on HDFS with Presto + Velox

Table of Contents

1 How Velox resolves the HDFS NN endpoint
2 Reading from an HA HDFS cluster
- 2.1 Configuration on the Java Coordinator
- 2.2 Configuration on the C++ Worker

Velox currently supports HDFS, S3, and the local file system. The HDFS and S3 modules must be enabled explicitly at build time. For example, to test HDFS support, build Prestissimo with PRESTO_ENABLE_HDFS=ON:

PRESTO_ENABLE_HDFS=ON make release

How Velox resolves the HDFS NN endpoint

The core code is as follows:

HdfsServiceEndpoint HdfsFileSystem::getServiceEndpoint(
    const std::string_view filePath,
    const Config* config) {
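  // The identity (host[:port] or a nameservice ID) is the text between the
  // scheme and the first '/' that follows it.
  auto endOfIdentityInfo = filePath.find('/', kScheme.size());
  std::string hdfsIdentity{
      filePath.data(), kScheme.size(), endOfIdentityInfo - kScheme.size()};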
  if (hdfsIdentity.empty()) {
    // Fall back to get a fixed endpoint from config.
    auto hdfsHost = config->get("hive.hdfs.host");
    VELOX_CHECK(
        hdfsHost.hasValue(),
        "hdfsHost is empty, configuration missing for hdfs host");
    auto hdfsPort = config->get("hive.hdfs.port");
    VELOX_CHECK(
        hdfsPort.hasValue(),
        "hdfsPort is empty, configuration missing for hdfs port");
    return HdfsServiceEndpoint{*hdfsHost, *hdfsPort};
  }

  auto hostAndPortSeparator = hdfsIdentity.find(':', 0);
  // In HDFS HA mode, the hdfsIdentity is a nameservice ID with no port.
  if (hostAndPortSeparator == std::string::npos) {
    return HdfsServiceEndpoint{hdfsIdentity, ""};
  }
  std::string host{hdfsIdentity.data(), 0, hostAndPortSeparator};
  std::string port{
      hdfsIdentity.data(),
      hostAndPortSeparator + 1,
      hdfsIdentity.size() - hostAndPortSeparator - 1};
  return HdfsServiceEndpoint{host, port};
}

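The flow is as follows:
Velox checks whether the file path passed in, filePath, contains HDFS identity information, that is, anything between hdfs:// and the next /.

If it does not (for example, the path is hdfs:///user/hive/warehouse), Velox reads the values of hive.hdfs.host and hive.hdfs.port from hive.properties, and fails if either one is missing;
If it does, there are two cases:
- If the identity has no port, it is treated as the nameservice ID of an HA-enabled HDFS cluster. For example, for the path hdfs://iteblog-hdfs/tmp/presto-iteblog/fddbb65f-2b66-42db-97f8-ceb07b502b1e, Velox extracts iteblog-hdfs as the HDFS identity and leaves the port empty;
- If the identity includes a port, for example hdfs://iteblog-hdfs.iteblog.com:9000/tmp/presto-iteblog/fddbb65f-2b66-42db-97f8-ceb07b502b1e, then iteblog-hdfs.iteblog.com is the HDFS host and 9000 is the HDFS port.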
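
The three cases above can be tried out without building Velox. The following is a minimal standalone sketch of the same decision logic, not the Velox implementation: Endpoint, FallbackConfig, and parseEndpoint are hypothetical names introduced here for illustration, and FallbackConfig simply stands in for the hive.hdfs.host and hive.hdfs.port settings.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for Velox's HdfsServiceEndpoint.
struct Endpoint {
  std::string host;
  std::string port;
};

// Hypothetical stand-in for the hive.hdfs.host / hive.hdfs.port settings.
struct FallbackConfig {
  std::optional<std::string> host;
  std::optional<std::string> port;
};

Endpoint parseEndpoint(std::string_view filePath, const FallbackConfig& config) {
  constexpr std::string_view kScheme = "hdfs://";
  if (filePath.substr(0, kScheme.size()) != kScheme) {
    throw std::invalid_argument("path must start with hdfs://");
  }
  assert(filePath.size() >= kScheme.size());

  // The identity is everything between the scheme and the next '/'.
  const auto end = filePath.find('/', kScheme.size());
  const auto length = (end == std::string_view::npos)
      ? std::string_view::npos
      : end - kScheme.size();
  const std::string_view identity = filePath.substr(kScheme.size(), length);

  if (identity.empty()) {
    // Case 1: hdfs:///path, fall back to the configured host and port.
    if (!config.host || !config.port) {
      throw std::runtime_error("missing hive.hdfs.host or hive.hdfs.port");
    }
    return Endpoint{*config.host, *config.port};
  }

  const auto separator = identity.find(':');
  if (separator == std::string_view::npos) {
    // Case 2: no port, so treat the identity as an HA nameservice ID.
    return Endpoint{std::string(identity), ""};
  }

  // Case 3: explicit host:port.
  return Endpoint{
      std::string(identity.substr(0, separator)),
      std::string(identity.substr(separator + 1))};
}
```

With this sketch, parseEndpoint("hdfs://iteblog-hdfs/tmp/x", config) yields the host iteblog-hdfs with an empty port, while parseEndpoint("hdfs://iteblog-hdfs.iteblog.com:9000/tmp/x", config) yields the host iteblog-hdfs.iteblog.com and the port 9000.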

Reading from an HA HDFS cluster

Production HDFS clusters usually run in HA mode. To read data from such a cluster, copy the cluster's hdfs-site.xml to both the Presto Java Coordinator and the Presto C++ Worker nodes, for example to /opt/presto/etc/hdfs-site.xml, and then configure the Coordinator and the Worker nodes separately.

Configuration on the Java Coordinator

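The Coordinator needs to read HDFS information when it computes a table's splits, so it must be configured as well. Add the following to hive.properties, with the path pointing to wherever hdfs-site.xml was copied on the Coordinator: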

hive.config.resources=/opt/presto-server/etc/hdfs-site.xml

Then start the Coordinator node.

Configuration on the C++ Worker

The Presto C++ Worker only needs the environment variable LIBHDFS3_CONF=/opt/presto/etc/hdfs-site.xml to be set. After setting it, restart the C++ Worker node.
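
A common pitfall is that the variable is set in an interactive shell but not in the environment the Worker process actually starts in. The following is a small sketch of a pre-flight check, assuming only that the Worker process sees LIBHDFS3_CONF as described above. The function readableLibhdfs3Conf is a hypothetical helper written for this article and is not part of Presto, Velox, or libhdfs3.

```cpp
#include <cstdlib>
#include <fstream>
#include <optional>
#include <string>

// Hypothetical helper: returns the path stored in LIBHDFS3_CONF when the
// variable is set, non-empty, and the file it names can be opened for reading.
std::optional<std::string> readableLibhdfs3Conf() {
  const char* value = std::getenv("LIBHDFS3_CONF");
  if (value == nullptr || *value == '\0') {
    return std::nullopt;  // Variable is unset or empty.
  }
  std::ifstream file(value);
  if (!file.is_open()) {
    return std::nullopt;  // Path does not exist or is not readable.
  }
  return std::string(value);
}
```

If the helper returns an empty optional when run under the same user and environment as the Worker, the variable or the file path is the first thing to check before looking at HDFS itself.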

At this point, data on HDFS can be queried normally.


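Unless otherwise stated, all articles on this blog are original.
Copyright of original articles belongs to 过往记忆大数据 (过往记忆). Reproduction without permission is prohibited.
Link to this article: [Reading Data on HDFS with Presto + Velox] (https://www.iteblog.com/archives/10203.html)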