About Microsoft Azure Data Lake Storage Gen2

Microsoft Azure Data Lake Storage Gen2 (Azure Gen2) is a data lake service that stores structured and unstructured data in its raw format. Azure Gen2 is designed for enterprise-scale data storage and big data analytics processing. The service combines the capabilities of two earlier storage services, Azure Blob storage and Azure Data Lake Storage Gen1, so the same data can be accessed as if it were in directory storage or in blob storage. This combination provides features such as file system semantics, file-level security, tiered storage, and disaster recovery capabilities.

About the Microsoft Azure Gen2 connector

With the Azure Gen2 connector, you can create a data source that connects to Azure Gen2 data lake storage. The Azure Gen2 connector supports the following file extensions:

  • .csv
  • .tsv
  • .tab
  • .txt
  • .xlsx
  • .parquet
  • .orc

The Azure Gen2 connector supports the following Incorta-specific functionality:

Feature Supported
Encryption at Ingest
Incremental Loading
Wildcard Union
Performance Optimization
Webhook Callbacks
Remote Tables

The Azure Gen2 connector requires one of the following authentication configurations:

  • Storage Account Key
  • Service Principal

Steps to Connect Azure Gen2 and Incorta

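To connect Azure Gen2 and Incorta, follow these high-level steps:

  • Create an external data source
  • Create a schema with the Schema Wizard, or create a schema with the Schema Designer
  • View the schema diagram with the Schema Diagram Viewer
  • Load the schema
  • Explore the schema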

Create an external data source

Here are the steps to create an external data source with the Azure Gen2 connector:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Data.
  • In the Action bar, select + New → Add Data Source.
  • In the Choose a Data Source dialog, in Data lake, select Data lake - Azure Gen2 data source.
  • In the New Data Source dialog, specify the applicable connector properties.
  • To test, select Test Connection.
  • Select Ok to save your changes.

Azure Gen2 connector Authentication Type options

The Authentication Type determines the connection properties for connecting to your Azure Gen2 data source. When you create an external data source, the following authentication types are available from a drop down list:

Type | Description
Storage Account Key | Select for authentication using a generated storage access key. A storage access key, also called an account key, can be assigned so that it has access only to the specified storage.
Service Principal | Select for service principal authentication through role-based access control. Use this type to create application-specific access.

Azure Gen2 connector properties for Storage Account Key Authentication

Here are the properties for the Azure Gen2 connector when using Storage Account Key Authentication:

Property | Control | Description
Data Source Name | text box | Enter the name of the data source.
Account Key | text box | Enter the 512-bit authorization key.
Directory | text box | Enter the URI of the Azure Gen2 data source. Use the abfs:// scheme identifier when not using TLS. Use the abfss:// scheme identifier to connect over TLS.
Note

The URI syntax in the Directory connection property is dependent on whether you are connecting to a default file system.
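
The two connection properties above can be sanity-checked before you enter them. The following is a minimal sketch, not part of the connector: the helper names `key_is_512_bit` and `directory_uses_tls` are hypothetical, and the sketch assumes the Account Key is supplied as Base64 text, which is how Azure presents storage account keys. The sample URI uses the `<file system>@<account>.dfs.core.windows.net` form from the Azure documentation with placeholder names; whether your Directory value needs the file system part depends on the note above.

```python
import base64
import binascii
from urllib.parse import urlsplit


def key_is_512_bit(account_key: str) -> bool:
    """Return True if the Base64 text decodes to exactly 64 bytes (512 bits)."""
    try:
        raw = base64.b64decode(account_key, validate=True)
    except (binascii.Error, ValueError):
        return False  # not valid Base64 at all
    return len(raw) == 64


def directory_uses_tls(directory: str) -> bool:
    """Return True for abfss:// (TLS) and False for abfs:// (no TLS)."""
    scheme = urlsplit(directory).scheme.lower()
    if scheme not in ("abfs", "abfss"):
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return scheme == "abfss"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample_key = base64.b64encode(bytes(64)).decode("ascii")
    sample_dir = "abfss://mycontainer@myaccount.dfs.core.windows.net/sales"
    print(key_is_512_bit(sample_key), directory_uses_tls(sample_dir))
```

A check like this only catches formatting mistakes. Test Connection remains the way to confirm that the key and URI actually work.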

Azure Gen2 connector properties for Service Principal Authentication

Here are the properties for the Azure Gen2 connector when using Service Principal Authentication:

Property | Control | Description
Data Source Name | text box | Enter the name of the data source.
Client ID | text box | Enter the client ID, also known as an application ID, which is created when registering an application.
Tenant ID | text box | Enter the Tenant ID, also known as a directory ID, which identifies the tenant to use for authentication.
Client Secret Key | text box | Enter the client secret key. The client uses this key to prove its identity during authentication.
Directory | text box | Enter the URI of the Azure Gen2 data source. Use the abfs:// scheme identifier when not using TLS. Use the abfss:// scheme identifier to connect over TLS.
Note

The URI syntax in the Directory connection property is dependent on whether you are connecting to a default file system.
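
To show how the Client ID, Tenant ID, and Client Secret Key fit together, here is a minimal sketch of the token request used by the Microsoft identity platform client credentials flow. This document does not state how the connector authenticates internally, so treat the flow as an assumption; the function name `build_token_request` is hypothetical. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) an OAuth 2.0 client credentials token request."""
    # The Tenant ID selects the directory that issues the token.
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,          # application (client) ID
        "client_secret": client_secret,  # proves the application's identity
        "scope": "https://storage.azure.com/.default",  # Azure Storage resource
    }
    return Request(url, data=urlencode(form).encode("ascii"), method="POST")


if __name__ == "__main__":
    # Placeholder values for illustration only.
    request = build_token_request("my-tenant-id", "my-client-id", "my-secret")
    print(request.get_method(), request.full_url)
```

The token returned by such a request is only useful if the service principal has been granted a role on the storage account, which is the role-based access control mentioned above.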

Create a schema with the Schema Wizard

Here are the steps to create an Azure Gen2 schema with the Schema Wizard:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Action bar, select + New → Schema Wizard.
  • In (1) Choose a Source, specify the following:

    • For Enter a name, enter the schema name.
    • For Select a Datasource, select the Azure Gen2 external data source.
    • Optionally, create a description.
  • In the Schema Wizard footer, select Next.
  • In (2) Manage Tables, in the Data panel, navigate the directory tree as necessary to select your file.
Note

When navigating the data source in the Data panel, select the files that match the scope of the table you want to create. Selecting a directory at too high or too low a level may result in a failure to retrieve data or in an incorrect scope of data for a table.

  • In the Schema Wizard footer, select Next.
  • In (3) Finalize, in the Schema Wizard footer, select Create Schema.

Create a schema with the Schema Designer

Here are the steps to create an Azure Gen2 schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Action bar, select + New → Create Schema.
  • In Name, specify the schema name, and select Save.
  • In Start adding tables to your schema, select Data Lake.
  • In the Data Source dialog, specify the applicable data source properties.
  • Select Add.
  • In the Table Editor, in the Table Summary section, enter the table name.
  • To save your changes, select Done in the Action bar.

Azure Gen2 data source properties

In the Data Source dialog, you can specify either a single file or a directory. Enable the Wildcard Union property to indicate that the data source is a directory. The data source properties below are grouped into common properties and properties specific to each file type.

Common data source properties for all file and directory types

Here are some of the common properties for all file and directory types:

Note

The following common properties are also the complete set of available properties for an ORC (.orc) file or directory.

Property | Control | Description
Type | drop down list | Default is File System.
Data Source | drop down list | Select the Azure Gen2 external data source.
Remote | toggle | Enable this option to access file data remotely, which means no data is loaded to Incorta. See the Summary of Data Access Methods table for details on how this property and the Performance Optimized property affect data accessibility.
File Type | drop down list | Select the Text (.csv, .tsv, .tab, .txt), Excel (.xlsx), Parquet (.parquet), or ORC (.orc) file type.
Incremental | toggle | Enable this property to support incremental loading.
Update File | text box / button | With Incremental enabled, enter the relative file path of the text file to update from. When adding to an existing table, the Select button opens the Add File dialog. The Add File dialog shows the files from your Azure Gen2 data source. Select a single file and select Add.
Timestamp format in file name | drop down list | With Incremental enabled, select the timestamp format in the file name.
Incremental Extract Using | drop down list | With Incremental and Wildcard Union enabled, select the extraction method.
Wildcard Union | toggle | Enable this property to get data from a directory.
Directory Path | text box | With Wildcard Union enabled, enter the directory path relative to the root directory specified in the data source.
Apply Include Pattern On | drop down list | With Wildcard Union enabled, select whether the Include pattern applies to the file name or to the file relative path.
Include | text box | With Wildcard Union enabled, enter a keyword with a wildcard * symbol to include specific named files within the folder.
Include Sub-Directories | toggle | With Wildcard Union enabled, enable this property to include files from sub-folders.
Include Filename as a Column | toggle | With Wildcard Union enabled, enable this property to add the filename of the file as a column. You will then need to specify a column name.
Filename column | text box | With Include Filename as a Column enabled, enter a column name for the filename, such as source_file_name.
Callback | toggle | Enable this property to show the Callback URL field.
Callback URL | text box | This property appears when the Callback toggle is enabled. Specify the URL.
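
To illustrate how the Wildcard Union properties interact, the following sketch selects files from a list of paths using a `*` pattern. It is an approximation, not the connector's implementation: it uses Python's `fnmatchcase` for the wildcard, assumes that "file relative path" means the path relative to the Directory Path, and the function name `select_files` and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_files(paths, directory_path, include, apply_on="name", include_subdirs=False):
    """Return the paths under directory_path whose name or relative path matches include."""
    base = PurePosixPath(directory_path)
    selected = []
    for raw in paths:
        path = PurePosixPath(raw)
        try:
            relative = path.relative_to(base)
        except ValueError:
            continue  # file is outside the Directory Path
        # Without Include Sub-Directories, keep only direct children.
        if not include_subdirs and len(relative.parts) != 1:
            continue
        # Apply Include Pattern On: file name or file relative path.
        target = path.name if apply_on == "name" else relative.as_posix()
        if fnmatchcase(target, include):
            selected.append(raw)
    return selected


if __name__ == "__main__":
    files = [
        "SALES/Q1_sales.csv",
        "SALES/Q2_sales.csv",
        "SALES/notes.txt",
        "SALES/2023/Q3_sales.csv",
        "HR/staff_sales.csv",
    ]
    print(select_files(files, "SALES", "*_sales.csv"))
    print(select_files(files, "SALES", "2023/*", apply_on="path", include_subdirs=True))
```

Note that `fnmatchcase` lets `*` match across `/` characters, which may differ from how the connector treats path separators.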

Summary of Data Access Methods Based on Remote and Performance Optimized Settings

Table Properties | Data Source Properties | Parquet | DDM | Memory | SQLi | MV/Notebooks | Analytics
Performance Optimized = Off | Remote = On | No | No | No | Yes | Yes | No
Performance Optimized = Off | Remote = Off | Yes | Yes | No | Yes | Yes | No, unless populated via MV/Notebook
Performance Optimized = On | Remote = Off | Yes | Yes | Yes | Yes | Yes | Yes
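
The summary table can also be read as a lookup keyed on the two settings. The sketch below copies the three rows verbatim into a dictionary; the names `DATA_ACCESS` and `data_access` are hypothetical. The table does not list the combination of Performance Optimized = On with Remote = On, so the sketch rejects it rather than guessing.

```python
# Keys are (performance_optimized, remote); values copy the table rows as-is.
DATA_ACCESS = {
    (False, True): {
        "Parquet": "No", "DDM": "No", "Memory": "No",
        "SQLi": "Yes", "MV/Notebooks": "Yes", "Analytics": "No",
    },
    (False, False): {
        "Parquet": "Yes", "DDM": "Yes", "Memory": "No",
        "SQLi": "Yes", "MV/Notebooks": "Yes",
        "Analytics": "No, unless populated via MV/Notebook",
    },
    (True, False): {
        "Parquet": "Yes", "DDM": "Yes", "Memory": "Yes",
        "SQLi": "Yes", "MV/Notebooks": "Yes", "Analytics": "Yes",
    },
}


def data_access(performance_optimized: bool, remote: bool) -> dict:
    """Return the access methods for a settings combination listed in the table."""
    key = (bool(performance_optimized), bool(remote))
    if key not in DATA_ACCESS:
        raise ValueError("combination is not listed in the summary table")
    return DATA_ACCESS[key]


if __name__ == "__main__":
    print(data_access(performance_optimized=False, remote=True))
```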

Text file or text file directory properties

Here are some of the properties specifically related to selecting a Text (.csv, .tsv, .tab, .txt) file or text file directory:

Property | Control | Description
Has Header? | toggle | Select if the first row contains column header values.
Rows to skip | numerical input | Enter the number of rows in a file to skip. The default is 0.
File Path | text box | Enter the file path relative to the root directory specified in the data source. Example: SALES/Q1.csv
Date Format | drop down list | Select the format for date values in the file.
Timestamp Format | drop down list | Select the format for timestamp values in the file.
Character Set | drop down list | Select the character set of the Text file.
Separator | drop down list | Select the character that separates values in a row.
Other | text box | This property is available when the Separator is set to Other. Enter one or more characters to specify the column separator or delimiter between values in a row.
Enable Chunking | toggle | Enable this property for large file sizes.
Chunk Size (MB) | text box | Enter a value in megabytes (MB) to specify the chunk size.
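
As a rough illustration of what Has Header?, Rows to skip, and Separator mean for a text file, the sketch below parses a small sample with Python's `csv` module. It is not the Loader Service's parser: the function name `read_text_rows` is hypothetical, the sketch assumes skipped rows come before the header row, and `csv.reader` accepts only a single-character separator, whereas the Other property allows more than one character.

```python
import csv
import io


def read_text_rows(text, has_header=True, rows_to_skip=0, separator=","):
    """Return (header, rows) after skipping leading rows and splitting on separator."""
    lines = io.StringIO(text)
    # Rows to skip: discard this many lines before parsing anything.
    for _ in range(rows_to_skip):
        next(lines, None)
    reader = csv.reader(lines, delimiter=separator)
    # Has Header?: treat the first remaining row as column names.
    header = next(reader, None) if has_header else None
    return header, list(reader)


if __name__ == "__main__":
    sample = "exported 2024\nid|amount\n1|10\n2|20\n"
    header, rows = read_text_rows(sample, has_header=True, rows_to_skip=1, separator="|")
    print(header, rows)
```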

Excel file or excel file directory properties

Here are the specific properties for an Excel Workbook (.xlsx) file or Excel directory:

Property | Control | Description
Worksheet | drop down list | Select a given worksheet for the Excel Workbook.
Update file | text box / button | With Incremental enabled, enter the relative file path of the desired update file. When adding to an existing table, the Select button opens the Add File dialog. The Add File dialog shows the files from your Azure Gen2 data source. Select a single file and select Add.
Update Worksheet | text box | With Incremental enabled and Wildcard Union disabled, select the desired worksheet in the update file.
Important

This release has limited support for Union Files for Excel Workbook (.xlsx) files. The Loader Service only loads Worksheets with the same name as defined in the table data source properties. For this reason, each Excel Workbook file in the selected folder must have a common Worksheet tab name. You must select this common Worksheet name in the drop down list.
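
Because every workbook in the folder must contain the common Worksheet name, it can help to check the files before loading. The sketch below is one possible pre-check using only the Python standard library; it is not an Incorta tool, and the names `worksheet_names` and `workbooks_missing_sheet` are hypothetical. It assumes standard (transitional) .xlsx files, where worksheet names are listed in `xl/workbook.xml` under the SpreadsheetML main namespace.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# SpreadsheetML main namespace used by standard .xlsx workbooks.
_MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def worksheet_names(xlsx_bytes: bytes) -> list:
    """Return the worksheet tab names stored in an .xlsx file's workbook part."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter(f"{_MAIN_NS}sheet")]


def workbooks_missing_sheet(workbooks: dict, worksheet: str) -> list:
    """Return the file names whose workbook has no tab called worksheet.

    workbooks maps a file name to the raw bytes of that .xlsx file.
    """
    return [
        name
        for name, data in workbooks.items()
        if worksheet not in worksheet_names(data)
    ]
```

For example, after downloading the folder's files locally, calling `workbooks_missing_sheet` with the file contents and the Worksheet name selected in the drop down list returns the files whose data would not be loaded.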

Parquet file or parquet directory properties

Here are the properties specific to a parquet (.parquet) file or parquet directory:

Property | Control | Description
Read data as partitions | toggle | Enable this property to have data read as parquet partitions.
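
This document does not describe the directory layout that Read data as partitions expects. A common convention for partitioned parquet data is Hive-style `key=value` directory names, and the sketch below assumes that convention purely for illustration. The function name `partition_values` and the sample path are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(relative_path: str) -> dict:
    """Extract key=value pairs from the directory names of a parquet file path.

    Assumes Hive-style partition directories; other layouts return an empty dict.
    """
    values = {}
    # Inspect every directory component, but not the file name itself.
    for part in PurePosixPath(relative_path).parts[:-1]:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


if __name__ == "__main__":
    print(partition_values("sales/year=2023/month=01/part-0000.parquet"))
```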

View the schema diagram with the Schema Diagram Viewer

Here are the steps to view the schema diagram using the Schema Diagram Viewer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the list of schemas, select the Azure Gen2 schema.
  • In the Schema Designer, in the Action bar, select Diagram.

Load the schema

Here are the steps to perform a Full Load of the Azure Gen2 schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.

  • In the Navigation bar, select Schema.
  • In the list of schemas, select the Azure Gen2 schema.
  • In the Schema Designer, in the Action bar, select Load → Load Now → Full.
  • To review the load status, in Last Load Status, select the date.

Explore the schema

With the full load of the Azure Gen2 schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.

To open the Analyzer from the schema, follow these steps:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Schema Manager, in the List view, select the Azure Gen2 schema.
  • In the Schema Designer, in the Action bar, select Explore Data.

© Incorta, Inc. All Rights Reserved.