About the Apache Parquet Merge Tool

Over time, an Incorta table configured for incremental loads can generate hundreds of small Apache Parquet files. The Apache Parquet Merge Tool is an interactive command-line tool that merges multiple Parquet table increment files into a single table increment file that contains the merged segments. Because the merged files are typically about 1 GB in size, reading Parquet files from Shared Storage becomes faster.
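To judge whether a table has accumulated enough small files to be worth merging, a file count is often sufficient. The following is a minimal sketch, not part of the tool: the function name count_parquet_files is hypothetical, and it assumes that increment files carry a .parquet extension and live somewhere under the directory you pass in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the Parquet files under a directory.
# Assumption: increment files end in ".parquet".
count_parquet_files() {
  local dir="$1"
  # find lists regular files recursively; wc -l counts them;
  # tr strips the padding that some wc implementations add.
  find "$dir" -type f -name '*.parquet' | wc -l | tr -d ' '
}

# Example (replace the argument with a real table directory):
#   count_parquet_files /path/to/table/directory
```

Comparing the printed count with the minimum increment count used by the tool (100 by default) gives a rough idea of whether a merge would take place.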

There are three options for running the Parquet Merge Tool:

  • For a specific Tenant, merge all schema table increments
  • For a specific Tenant, merge all table increments for one or more Schemas
  • For a specific Tenant and Schema, merge one or more table increments

The tool backs up the original files into a backup directory and merges the table increments into the original directory.

To access the Parquet Merge Tool, do the following:

  • In the terminal for the Incorta Loader Node host, navigate to the default installation path as the incorta user:
    cd /home/incorta/IncortaAnalytics/IncortaNode/parquetMergeTool/

How to Use the Apache Parquet Merge Tool

For Linux operating systems, use the merge.sh file. For Windows operating systems, use the merge.bat file.

Apache Parquet Merge Tool Input Parameters

The interactive shell script has the following parameters:

Parameter              Description
-path or --p           Specify the path to the Tenant.
-tenant or --t         Specify the Tenant name.
-schema or --s         Optional. Specify the Schema name(s).
-table or --tab        Optional. Specify the Table name(s).
-increments or --i     Optional. Specify the minimum number of table increments required for a merge. The default is 100. If a table has fewer increments than this minimum, the tool does not merge them.
-help or --h           Display the tool help information.
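As an illustration of how the optional -increments parameter combines with the required ones, the sketch below only prints the command it would run rather than executing it. The helper name build_merge_cmd is hypothetical, and the sketch assumes that -increments takes its value as the next argument, in the same way that -path and -tenant do.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print a merge.sh command line for a tenant.
# Assumption: -increments is followed by its value, like -path and -tenant.
build_merge_cmd() {
  local tenants_path="$1"
  local tenant="$2"
  local min_increments="${3:-100}"   # 100 is the documented default
  printf './merge.sh -path %s -tenant %s -increments %s\n' \
    "$tenants_path" "$tenant" "$min_increments"
}

# Print a command that lowers the threshold to 50 increments.
build_merge_cmd /home/incorta/IncortaAnalytics/Tenants TenantName 50
```

Printing the command first makes it possible to review the arguments before starting the interactive tool.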

For a specific Tenant, to merge all schema table increments, execute the following:

./merge.sh -path /home/incorta/IncortaAnalytics/Tenants -tenant TenantName

For a specific Tenant, to merge all table increments for one or more Schemas, execute the following:

./merge.sh -path /home/incorta/IncortaAnalytics/ -tenant TenantName -schema SchemaName_1 SchemaName_2

For a specific Tenant and Schema, to merge one or more table increments, execute the following:

./merge.sh -path /home/incorta/IncortaAnalytics/ -tenant TenantName -schema SchemaName_1 -table TableName

When the merge is complete, you will see the following message:

Merge is Done ! Do you want to delete the backup source files ? (press y to continue ,otherwise exit)

View the Apache Parquet Merge Tool Log Files

The Parquet Merge Tool generates a log file in the following default installation directory:

/home/incorta/IncortaAnalytics/parquetMergeTool/work/

The file contains the Source Segment, Target Segment, and Source Offset. Use this log file to determine the results of the merge activities.
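To review the results of the most recent run, it can help to locate the newest file in the log directory. The following is a minimal sketch under stated assumptions: the function name latest_log is hypothetical, the log file name and format are not specified here, and the default directory is the one given above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the most recently modified file
# in the log directory (defaults to the documented work directory).
latest_log() {
  local log_dir="${1:-/home/incorta/IncortaAnalytics/parquetMergeTool/work}"
  # ls -t sorts newest first; -p marks directories with a trailing slash
  # so that grep can drop them; head keeps only the newest file.
  ls -tp "$log_dir" | grep -v '/$' | head -n 1
}

# Example: show the last 20 lines of the newest log file.
#   log_dir=/home/incorta/IncortaAnalytics/parquetMergeTool/work
#   tail -n 20 "$log_dir/$(latest_log "$log_dir")"
```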

© Incorta, Inc. All Rights Reserved.