Query Parquet data through Vertica (Vertica Hadoop Integration) - Databases

TopAnswers Databases

Meta

Databases

TeX

Code Golf

APL

C++

.net

db<>fiddle

Java

*nix

PHP

PowerShell

Python

Rust

टेक्-मराठी

Typst

Web Client Dev

Web Server Dev

Query Parquet data through Vertica (Vertica Hadoop Integration)

vertica add tag

Jesse (imported from SE)

So I have a Hadoop cluster with three nodes.  Vertica is co-located on cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.

Right now what I did is using HDFS Connector, basically create an external table in Vertica, then link it to HDFS:

    CREATE EXTERNAL TABLE tableName (columns)
    AS COPY FROM "hdfs://hostname/...../data" PARQUET;

Since the data size is big. This method will not achieve good performance.

I have done some research,  [Vertica Hadoop Integration](https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/HadoopIntegrationGuide.htm%3FTocPath%3DIntegrating%2520with%2520Apache%2520Hadoop%7C_____0 "https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/IntroductionToHadoopIntegration.htm%3FTocPath%3DIntegrating%2520with%2520Apache%2520Hadoop%7C_____1")

I have tried HCatalog but there's some configuration error on my Hadoop so that's not working.

My use case is to not change data format on HDFS(Parquet), while query it using Vertica. Any ideas on how to do that?

EDIT: The only reason Vertica got slow performance is because it cant use Parquet's partitions. With higher version Vertica(8+), it can utlize hive's metadata now. So no HCatalog needed.

Top Answer

Monica (imported from SE)

Terminology note: you're not using the HDFS Connector.  Which is good, as it's deprecated as of 8.0.1.  You're using the direct interface described in [Reading Hadoop Native File Formats](https://www.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/NativeFormats/ReadingNativeFormats.htm), with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme).  That's all good so far.  (You could also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)

Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you [should be getting the benefits of node locality automatically](https://www.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/Architecture/ConfiguringRackAwareness.htm) -- Vertica will use the nodes that have the data locally when planning queries.

As of Vertica version 8.0.1, you can improve query performance by partitioning and sorting the data so Vertica can use [predicate pushdown](https://www.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/NativeFormats/QueryPerformance.htm), and also by compressing the Parquet files.  (Partioning, sorting, and compressing your data to improve performance is not specific to Vertica, so these might be worth considering anyway if you also use other tools to query your data.)  The documentation of these techniques was [improved in 8.0.x](https://www.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/NativeFormats/QueryPerformance.htm) (link is to 8.1 but this was in 8.0.x too).

[Additional partitioning support](https://www.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/NativeFormats/UsingPartitions.htm) was added in 8.0.1.  It looks like you're using at least 8.0; I can't tell if you're using 8.0.1.  If you are, you can create the external table to only pay attention to the partitions you care about with something like:

    CREATE EXTERNAL TABLE t (id int, name varchar(50), 
                            created date, region varchar(50))
    AS COPY FROM 'hdfs:///path/*/*/*' 
    PARQUET(hive_partition_cols='created,region');
    
A query against this table such as the following takes advantage of the partitioning to limit what data is read:

    SELECT * FROM t WHERE region='northeast';

1 Answer