Release Notes 4.6

Incorta 4.6 introduces several key improvements to the Cluster Management Console, Incorta Loader Service, and the Incorta Analytics Service. This release also includes the incorta_ml machine learning library for PySpark that you can use with or without the Notebook Add-on and the Incorta Labs Notebook integration. In addition, the 4.6 release includes other Incorta Labs offerings such as Enable Custom Themes, and the Inspector Tool Scheduler.

Release Highlights

There are several major features in this release including:

Improved Data Lake and Cloud Data Sources

This release offers improved and expanded connectivity to new data lake data sources including cloud data lakes and cloud applications, along with the ability to specify folders of data files and load them incrementally using a lexicographic (timestamp) naming convention.

Notebook Integration for Materialized Views

Previously, when editing code for a materialized view in a given schema, you had to use a basic editor. In this release, you can enable an Incorta Labs feature for Notebook Integration that allows you to write code (Spark SQL or PySpark) for materialized views using a Notebook interface.

Each paragraph in a notebook contains a code section and a result section. You can easily execute your code and view the resulting output within the paragraph. In addition, a notebook can consist of one or more paragraphs for sequential code execution. This new feature allows you to iteratively code and explore your data before saving code for export to the materialized view.

Incorta ML

For your PySpark materialized views, you can now rapidly apply machine learning to your schemas for predictive analytics, time series forecasting, and anomaly detection using the incorta_ml library.

Additional Improvements and Enhancements

  • Simplified Cluster and Tenant Administration in the Cluster Management Console (CMC)
  • New Incorta Labs features including Custom Themes, Notebook Integration, Inspector Tool, and Dark Mode theme
  • Incorta Analytics and Loader Service user interface enhancements, data source enhancements, and performance enhancements

Cluster Management Console (CMC)

The following new configurations are available in the Cluster Management Console (CMC):

To sign in to the CMC, visit your CMC host at one of the following:

  • http://<Public_IP>:6060/cmc
  • http://<Public_DNS>:6060/cmc
  • http://<Private_IP>:6060/cmc
  • http://<Private_DNS>:6060/cmc

The default port for the CMC is 6060. Sign in to the CMC using your administrator username and password.

Pause all scheduled jobs

For the selected Incorta Cluster, you can now enable pausing scheduled jobs in both the default tenant and specific tenant configurations for loading data. Scheduled jobs include schema loads, dashboards, and data alerts.

Enable this setting to pause active scheduled schema loads, dashboards, and data alerts. This is helpful when importing or exporting an existing tenant. You can resume active scheduled jobs by disabling this option or manually starting them in the Incorta scheduler.

Here are the steps to enable this option as default tenant configuration:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Data Loading.
  • Enable the Pause Scheduled Jobs setting.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given tenant, select Configure.
  • In the left pane, select Data Loading.
  • Enable the Pause Scheduled Jobs setting.
  • Select Save.

Default Materialized View Application settings

For the selected Cluster, you can now set Materialized Views default values for Apache Spark Integrations:

  • Materialized view application cores
  • Materialized view application memory
  • Materialized view application executors

The Spark Integrations settings are global to all tenants in a cluster configuration.

Materialized view application cores

The number of CPU cores reserved for use by a materialized view application. The default value is 1. The allocated cores for all running Spark applications cannot exceed the dedicated cores for the cluster.

Materialized view application memory

The maximum memory, in gigabytes, that a materialized view application can use. The default is 1 GB. The memory for all Spark applications combined cannot exceed the cluster memory (in gigabytes).

Materialized view application executors

Maximum number of executors that can be spawned by a single materialized view application. Each executor allocates a share of the cores defined in sql.spark.mv.cores and consumes part of the memory defined in sql.spark.mv.memory. Because the cores and memory assigned per executor are equal across executors, the number of executors should evenly divide both sql.spark.mv.cores and sql.spark.mv.memory. For example, when you configure an application with cores=4, memory=8, and executors=2, Spark spawns 2 executors, each consuming 2 cores and 4 GB from the cluster.
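
As an arithmetic illustration only (the variable names below are hypothetical, not Incorta configuration keys), the per-executor allocation is a simple division of the application totals:

# Illustration only: how the executor count divides the application totals
mv_cores = 4          # value of sql.spark.mv.cores
mv_memory_gb = 8      # value of sql.spark.mv.memory, in gigabytes
mv_executors = 2      # value of sql.spark.mv.executors
cores_per_executor = mv_cores // mv_executors           # 2 cores per executor
memory_per_executor_gb = mv_memory_gb // mv_executors   # 4 GB per executor
print(cores_per_executor, memory_per_executor_gb)       # 2 4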

Here is how you can modify these settings and their default values:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Server Configurations.
  • In the left pane, select Spark Integration.
  • Set the value for a given Materialized view application setting:

    • Materialized view application cores
    • Materialized view application memory
    • Materialized view application executors
  • Select Save.

Support for Amazon Simple Email Service (Amazon SES)

Some SMTP email solutions such as Amazon SES require username/password pairs. In this 4.6 release, you can now configure the SMTP host to use a Sender’s Username Authentication.

To enable this option as default tenant configuration in the CMC, follow these steps:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Email.
  • In the right pane, toggle Sender’s Username Auth to enabled.
  • In System Email Username, enter the SMTP username.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given tenant, select Configure.
  • In the left pane, select Email.
  • In the right pane, toggle Sender’s Username Auth to enabled.
  • In System Email Username, enter the SMTP username.
  • Select Save.

Enable the ability to add a Single Sign-On (SSO) user from Incorta Analytics

You must first configure the Incorta Node hosting the Incorta Analytics service to support Single Sign-On (SSO). Please see the Secure Login Access document for SSO configuration.

For the selected Cluster, you can now enable Single Sign-On as the authentication type for user authentication. When the authentication type is set to SSO, users that belong to the SuperRole or the User Manager role in the Incorta Analytics service can add new users and set the Profile Authentication Type to SSO. Once set to SSO, a user password is no longer required. It is possible to switch the authentication method for a given user back to Incorta authentication.

Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Security.
  • In the right pane, for Authentication Type, select SSO from the dropdown.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given tenant, select Configure.
  • In the left pane, select Security.
  • In the right pane, for Authentication Type, select SSO from the dropdown.
  • Select Save.

Warmup Mode for Most Used Dashboard Columns

Warmup Mode affects how the Incorta Analytics service loads schema data into memory when starting up. In this release, there is a new option for Warmup Mode: Most Used Dashboard Columns.

When selected, the Most Used Dashboard Columns value makes available a secondary setting: Maximum (%) of Memory Intended for Warmup. The default value is 20%. The minimum value is 0% and the maximum value is 75%.

Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Advanced.
  • In the right pane, in Warmup, select Most Used Dashboard Columns from the dropdown.
  • In Maximum (%) of Memory Intended for Warmup, enter a value between 0 and 75.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given Tenant, select Configure.
  • In the left pane, select Advanced.
  • In the right pane, in Warmup, select Most Used Dashboard Columns from the dropdown.
  • In Maximum (%) of Memory Intended for Warmup, enter a value between 0 and 75.
  • Select Save.

Incorta Labs

Incorta Labs are experimental features and functionality that Incorta supports for non-production use. Some experimental features may become part of a future Incorta release, while others may be deprecated. Incorta Support will investigate issues with Incorta Labs features.

You can enable various Incorta Labs features in the Cluster Management Console. An Incorta Labs feature may require additional configurations and may require restarting Incorta Services and Incorta Add-ons.

In this release, there are several significant new features in Incorta Labs:

To enable or configure an Incorta Labs feature, you must sign in to the Cluster Management Console (CMC). To sign in to the CMC, visit your CMC host at one of the following:

  • http://<Public_IP>:6060/cmc
  • http://<Public_DNS>:6060/cmc
  • http://<Private_IP>:6060/cmc
  • http://<Private_DNS>:6060/cmc

The default port for the CMC is 6060. Sign in to the CMC using your administrator username and password.

Enable Custom Themes

Enable this feature to let individual users set their default appearance to a Dark Theme (Dark Mode). The Dark Theme is not applicable to all user interfaces in the Analytics Service; exceptions include the Analyzer, Schema Designer, Table Editor, and Join Editor.

Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Enable Custom Themes to enabled.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given Tenant, select Configure.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Enable Custom Themes to enabled.
  • Select Save.

To learn more about enabling custom themes, see User Interface Configurations.

Dark Theme for Users

With Custom Themes enabled, a user in the Analytics Service can toggle the Dark Theme on or off. Here are the steps to toggle the Dark Theme in the Analytics Service for a user:

  • To open the Profile Menu, in the Navigation bar, select Profile.
  • In the Profile Menu, select the User.
  • In the Edit User drawer, select the Appearance tab.
  • Enable or disable the Dark Theme toggle.

Enable Insight View As Menu

The Enable Insight View As Menu Incorta Labs feature allows a dashboard consumer to view a dashboard insight chart visualization as a table or an aggregated table. Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Enable Insight View As Menu to enabled.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given Tenant, select Configure.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Enable Insight View As Menu to enabled.
  • Select Save.

Once enabled, for a chart visualization insight on a given dashboard, in the Actions menu, select the More Options (kebab) icon to open the More Options menu. In the More Options menu, select the View As Table option, and then select either Regular or Aggregated. To return to the insight visualization, in the Actions menu, select the Return (Rollback) icon.

Inspector Tool Scheduler

In this release, in Incorta Labs, you can enable the Inspector Tool to run on a schedule as a default tenant configuration or for a specific tenant configuration. To view and explore the results of the scheduled job in Incorta Analytics, you can also download and import the related Inspector Tool dashboards, schema, and business schema.

About the Inspector Tool

The Incorta Inspector Tool checks the lineage references of Incorta metadata objects including tables, schemas, business schemas, business schema views, dashboards, and session variables. The Inspector tool also checks for inconsistencies and validation errors such as:

  • An invalid join due to mismatched data types or unsupported data types
  • A join with a missing table
  • A join with a missing column
  • A join on a parent table column that is not a key column
  • A join on a child table with multiple parent tables
  • A join using an invalid formula column
  • A join on a formula column that references columns in two or more schemas
  • A cyclical join between two or more tables, such as A > B > C > D > A, that can be resolved with a table alias
  • Multiple join paths between two tables
  • A table enabled for incremental loads but without incremental logic specified
  • A table enabled for incremental loads with incremental logic specified but no key column specified
  • A table with a runtime security filter that references a missing session variable or a session variable with a missing definition
  • An alias table with no existing base reference table
  • An alias table out of sync with the existing base reference table
  • A formula that refers to a column that does not exist
  • A formula that references columns in two or more schemas
  • A business schema view that references a column in a table that does not exist
  • A session variable that references another session variable that does not exist
  • A dashboard that references a missing session variable
  • A dashboard that references a missing table or business schema view column

Download the Inspector Tool Dashboards, Schema, and Business Schema

For a given tenant with the Inspector Tool Scheduler enabled, you need to first download the Inspector Tool Dashboards, Schema, and Business Schema. Here are the steps to download the Inspector Tool Dashboards, Schema, and Business Schema:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, in the description of the Inspector Tool Scheduler, select the download link.
  • In the Box folder, select the following files to download:

    • dashboards.zip
    • business_schema.zip
    • schema.zip

After successfully downloading the zip files, you must import the schema, business schema, and dashboards into a given tenant.

Import the Inspector Tool Schema

Here are the steps to import the Inspector Tool schema for a given tenant:

  • In the Navigation bar, select Schema.
  • In the Action bar, select + New.
  • In the Add New Menu, select Import Schema.
  • Drag and drop the schema.zip file to the Import Schema dialog.
  • In the Import Results dialog, verify the schema name, InspectorMetadata, and select Close.

The InspectorMetadata schema contains the following tables:

  • BUSINESS_SCHEMA_VIEWS
  • JOINS_DETAILS
  • LINEAGE_REPORT
  • MV_REFERENCED_TABLES
  • SCHEMA_TABLES
  • VALIDATION

Import the Inspector Tool Business Schema

Here are the steps to import the Inspector Tool business schema for a given tenant:

  • In the Navigation bar, select Business Schema.
  • In the Action bar, select + New.
  • In the Add New Menu, select Import Business Schema.
  • Drag and drop the business_schema.zip file to the Import Business Schema dialog.
  • In the Import Results dialog, verify the schema name, incortaInspector, and select Close.

The incortaInspector business schema contains the following folders and views:

  • TenantHierarchy (folder)

    • DashboardLineage
    • Schemas
    • Joins
    • BusinessSchemas
    • MVs
  • Validation (view)

Import the Inspector Tool Dashboards

Here are the steps to import the Inspector Tool dashboards for a given tenant:

  • In the Navigation bar, select Content.
  • In the Action bar, select + New.
  • In the Add New Menu, select Import Folder/Dashboard.
  • Drag and drop the dashboards.zip file to the Import Folder/Dashboard dialog.

In the InspectorTool folder, there are several Inspector Tool dashboards:

  • 0- Run status
  • 1- Validation UseCases
  • 2- Unused Entities
  • 3- Schemas Details
  • 4- Dashboards Lineage Summary
  • 5- Tables Used in Business Views
  • 6- Tables Used In Materialized Views

Enable the Inspector Tool Scheduler

Having successfully imported the Inspector Tool Dashboards, Schema, and Business Schema, you can now enable the Inspector Tool Scheduler in the Cluster Management Console as either a default tenant configuration or a specific tenant configuration.

Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Inspector Tool Scheduler to enabled.
  • Specify the schedule.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given tenant, select Configure.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Inspector Tool Scheduler to enabled.
  • Specify the schedule.
  • Select Save.

Notebook Add-on

The Notebook Integration Incorta Labs feature requires that Apache Spark is running and properly configured for the Incorta Cluster instance.

A notebook is an interactive environment for creating a materialized view in a given schema. As an interactive notebook environment, you can execute individual paragraphs, view a table of query results, and visualize results as a bar chart, pie chart, area chart, line chart, or a scatter chart.

In Incorta 4.6, a notebook-defined materialized view supports two interoperable languages, SQL and Python. This means that one paragraph can be in SQL and another in Python.

Apache Spark executes all materialized views. Apache Spark natively runs Spark SQL queries using columnar data stored as Apache Parquet files in Shared Storage (Staging).

Notebook Integration

Before using a Notebook to create a materialized view in a schema, you must first integrate the Notebook into an Incorta Cluster. Notebook Integration requires the completion of several key tasks in the CMC:

  • Create Notebook Add-on service
  • Set the Notebook Integration properties in Server Configurations
  • Enable the Notebook Integration
  • Start the Notebook service

There are several requirements for implementing the Incorta Labs Notebook Integration:

  • Supported Linux Operating System
  • Apache Spark 2.4.3 must already be configured for the Incorta Cluster and must be running
  • The Incorta Node hosting the Notebook Add-on requires Python 2.7, Python 3.6, or Python 3.7. Python 3.8 is not yet supported.
  • On the Incorta Node hosting the notebook, the default port 5500 must be open or the configured port must be open.

Add-ons

An Incorta Cluster can only have a single Notebook Add-on. You can install the Notebook Add-on during a new installation or after an installation.

There are two types of cluster installations: Single Host, which is a Standalone instance using the Typical installation method, and Multi-host, which requires a Custom installation. Both cluster topologies are applicable to Incorta Notebooks.

During a Single Host (Typical) Installation

Here are the steps to configure and install a Notebook Add-on during a Single Host (typical) Installation:

  • In the Configuration Wizard, for Add-ons, specify the Notebook port value (the default is 5500).
  • To continue to the Configuration Review, select Next.
  • Select Create.

After a Single Host (Typical) Installation

Here are the steps to configure and install a Notebook Add-on after a Single Host (Typical) or Multi-host (Custom) installation:

  • In the Navigation bar, select Nodes.
  • In the nodes list, select the localNode.
  • In the canvas, select the Add-ons tab.
  • To create a Notebook, in the Add-ons header, select + (Add).
  • In the Create a new notebook dialog, enter the Port number. The default value is 5500.
  • Select Save.

Multi-host (Custom) Installation

Here are the steps to configure and install a Notebook Add-on for a custom installation:

  • In the Navigation bar, select Nodes.
  • In the nodes list, select an Incorta Node.
  • In the canvas, select the Add-ons tab.
  • To create a Notebook, in the Add-ons header, select + (Add).
  • In the Create a new notebook dialog, enter the Notebook Name and the Port number. The default value is 5500.
  • Select Save.

Notebook Integration settings

The Notebook Integration settings are global to all tenants in a cluster configuration. Before starting the Notebook, you must:

  • Set the Notebook Integration properties in Server Configurations
  • Enable the Notebook Integration

For the selected Cluster, you can set the default values for the Notebook integration:

  • Notebook Max Cores
  • Notebook Max Memory

Notebook Max Cores

Maximum number of cores to use for all notebook executors.

Notebook Max Memory

Maximum amount of memory to use for all notebook executors, in the same format as JVM memory strings with a size unit suffix (“k”, “m”, “g”, or “t”), for example 512m or 2g.

Here is how you can modify these settings and their default values:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Server Configurations.
  • In the left pane, select Notebook Integration.
  • Set the value for a given Notebook setting:

    • Notebook Max Cores
    • Notebook Max Memory
  • Select Save.

Enable the Notebook Integration

After Notebook Integration properties are set, you must enable the Incorta Labs Notebook feature. Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Notebook Integration to enabled.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given Tenant, select Configure.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Notebook Integration to enabled.
  • Select Save.

Start, Stop, and Restart Notebook

Here are the steps to start, stop, and restart a Notebook:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Add-ons.
  • In the nodes list, select the Notebook name.
  • In Notebook details, select Restart, Stop, or Start.

Editing the Notebook Port

After changing the Notebook port, you must restart the Notebook. Here are the steps to change the port:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Add-ons.
  • In the nodes list, select the Notebook name.
  • In Notebook details, in the title, select Edit.
  • Change the Port value.
  • Select Update.

Post Notebook Integration

Once configured and running, you can create and edit a materialized view in a given schema with a Notebook.

Incorta Analytics and Loader Service

The 4.6 release introduces several key improvements to the Incorta Analytics and Loader Services such as:

Optionally Persist a SQLi Result

In Apache Spark, submitted jobs often persist a dataframe to preserve data transformations, calculations, or aggregations for future tasks in the job.

sql.spark.persist.level

In this 4.6 release, the SQLi interface determines whether an executed task will persist the dataframe and, if so, controls how to persist it. There are three valid values for the sql.spark.persist.level property:

  • never: Indicates that the dataframe will never persist. Use this setting value for diagnosis and troubleshooting.
  • always: Indicates that the dataframe will always persist. This is the default value.
  • query: Indicates that the SQLi interface will check the Apache Spark query plan. If the query plan contains only simple task stages, such as accessing a single table or applying a simple filter, the dataframe will not persist. However, if the Spark query plan is complex and has, for example, numerous shuffles and broadcasts, the dataframe will persist.

Data Sources

In the 4.6 release, the user interface for adding a new data source is new. In addition, new data source choices exist, including choices for:

  • Oracle Cloud Applications
  • Google BigQuery
  • Salesforce v2

Oracle Cloud Applications

The Oracle Cloud Applications Connector extracts data from Oracle WebCenter Content (WCC) that the Oracle Business Intelligence Cloud Connector Console compresses in comma-separated values (CSV) file format.

Here are the steps to add an Oracle Cloud Applications data source in the Analytics Service:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Application, select Oracle Cloud Applications.
  • In the New Data Source dialog, specify the:

    • Data Source Name
    • Username
    • Password
    • Oracle Cloud Applications URL
    • Root Query Text
    • Data Type Discovery Policy
    • File Name Pattern
    • File Criteria - Last Modified Timestamp
  • To test, select Test Connection.
  • Select Ok to save your changes.

The Data Type Discovery Policy defines the Metadata Definition files. These files must first be uploaded to Incorta data files and must have a *.csv extension.

The File Criteria - Last Modified Timestamp property acts as a time filter for all the results concerning this data source. For example, >= ‘2019-05-31 15:30’ will return all the files created or modified after this date.

Google BigQuery Data Source

To analyze data housed in Google Storage, first create a BigQuery Data Source. Before implementing a BigQuery data source, you must first download and configure the BigQuery driver for Incorta. The driver is in a JAR file. The BigQuery JAR file must exist in both the CMC and the Incorta Services installation path.

Here are the steps to create a Google BigQuery data source in the Analytics Service:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, select BigQuery.
  • In the New Data Source dialog, specify the:

    • Data Source Name
    • Username
    • Password
    • Project ID
    • Path of the JSON key file downloaded from Google Cloud service accounts
  • To test, select Test Connection.
  • Select Ok to save your changes.

There are no changes for creating a schema using a BigQuery data source from previous versions of Incorta.

Salesforce v2 Data Source

To analyze data in Salesforce, first create a Salesforce (v2) data source. The Salesforce v2 data source connector uses a REST API interface and overcomes limitations of the Salesforce data source connector, version 1, which employs a SOAP API interface.

Before implementing a Salesforce data source, you must first download and configure the Salesforce v2 driver for Incorta. The driver is in a JAR file. The Salesforce v2 JAR file must exist in both the CMC and the Incorta Services installation path.

Here are the steps to create a Salesforce v2 data source in the Analytics Service:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Other, select Salesforce (v2).
  • In the New Data Source dialog, specify the:

    • Data Source Name
    • Username for Salesforce
    • Password for Salesforce
    • Token for Salesforce Authentication
    • Optionally specify a Proxy:

      • Proxy Host
      • Proxy Port
      • Proxy Username
      • Proxy Password
  • To test, select Test Connection.
  • Select Ok to save your changes.

There are no changes for creating a schema using a Salesforce v2 data source from previous versions of Incorta.

Azure Data Lake Storage (ADLS) Gen2 Authentication Support for Service Principal authorization

For an Azure Data Lake Storage (ADLS) Gen2 data source, you can now specify Service Principal as an Authentication Type.

An Azure Active Directory service principal is an identity for an application that needs to access or modify resources using Role-Based Access Control (RBAC).

To learn more about creating an Azure Active Directory Service Principal in your Azure Portal, visit How to Create a Service Principal.

Your Azure Portal contains the required details for this configuration:

  • Client ID
  • Client Secret Key
  • Tenant ID

To configure Service Principal authentication for authorization, follow these steps:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Data Lake, select Data Lake - Azure Gen2.
  • In the New Data Source dialog, specify the following:

    • Data Source Name
    • Authentication Type = Service Principal
    • Client ID
    • Client Secret Key
    • Tenant ID
    • Directory
  • Select Ok to save your changes.

Data Folders

In the 4.6 release, Incorta Analytics users can now create, upload, share, and delete Data Folders. A Data Folder can contain one or more Data Files and Data Folders. Here are the steps to create a Data Folder:

  • In the Navigation bar, select Data.
  • In the Action bar, select + New, then select Create Folder.
  • In the Add Folder dialog, enter the Folder name.

The new Data Folder appears in the Local Data Files tab.

Upload a Folder of Data Files

In the 4.6 release, you can upload a folder from your local machine that contains Data Files into Incorta. Incorta uploads the entire folder hierarchy of subfolders and files in original form. Incorta only uploads supported data file types: CSV, TSV, TXT, XLS, and XLSX. Incorta ignores any empty folders as well as duplicate files unless the Overwrite option is enabled. Here are the steps to upload a Data Folder:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Data Files, select Upload Data Folder.
  • In the Upload Data Folder dialog, in Upload Options, optionally select Overwrite existing file.
  • In the Upload Data Folder dialog, drag and drop a Folder.
  • Select Upload.

Share a top level Data Folder

You can only share a top level Data Folder. All child items — folders and data files — inherit the same shared access rights.

Delete a Data Folder

You can only delete a folder for which you have Edit permissions. Deleting a folder also deletes its child data folders and data files.

Upload Multiple Files

In the 4.6 release, Incorta Analytics users can upload one or more Data Files to Local Data Files. Incorta ignores duplicate files unless the Overwrite option is enabled. Here are the steps to upload multiple files:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Data Files, select Upload Data File.
  • In the Upload Data File dialog, in Upload Options, optionally select Overwrite existing file.
  • In the Upload Data File dialog, drag and drop one or more files.
  • Select Upload.

Create a Data Source with incomplete or invalid details

In this release, you can now create or edit a data source with incomplete or invalid details. An Error dialog simply reports that some fields are invalid. To save the invalid settings, select Save anyway.

Support legacy MS Excel file formats as a File System Data Source

Incorta now supports more Microsoft Excel file formats:

  • Excel Workbook (*.xlsx)
  • Excel 97-2003 Workbook (*.xls)
  • Microsoft Excel 5.0/95 Workbook (*.xls)

Schema

In the Incorta 4.6 release, there are several enhancements for Schema optimization and configuration such as:

  • Directory Selection for Data Lake folders in the Schema Wizard
  • Incremental extracts for a Data Lake table that uses timestamp file naming
  • Preview of data in the Table Editor
  • Post Extraction Callback with Webhooks

Schema Wizard Supports Directory Selection

In this release, the Schema Wizard supports the selection of a folder directory. The requirement is that the folder directory exists as either a Local Data Files folder or a Data Lake data source folder. The Schema Wizard automatically configures the table data source to use a Directory, including all Subdirectories Files and Union Files. Follow these steps to select a folder directory for a Local Data Files folder using the Schema Wizard:

  • In the Navigation bar, select the Schema tab.
  • In the Action bar, select + New.
  • In the Add New Menu, select Schema Wizard.
  • In the Add Schema Wizard, in (1) Choose a Source, enter a unique Schema Name.
  • In Select a Datasource, select LocalFiles.
  • Optionally enter a Schema description.
  • Select Next.
  • In (2) Manage Tables, in the Selection Panel, select the folder directory.
  • Verify the columns.
  • Select Next.
  • In (3) Finalize, leave the “Create joins between selected tables if foreign key relationships are detected” checkbox checked.
  • Select Finish.

Incremental Extracts Using a Timestamp in File Names for a Data Lake table

In this release, for a table in a schema that uses a Data Lake data source, you can adopt an incremental data loading strategy that relies on a timestamp in the file name itself. The requirement is that all files in the Data Lake source, such as an S3 bucket, employ a consistent naming convention that includes a timestamp format for lexicographic comparison. The file name can be the timestamp itself or the timestamp as the file name suffix. The supported timestamp formats are:

  • yyyy-MM-dd
  • dd.MM.yyyy
  • dd-MMM-yy
  • dd-MMM-yyyy
  • yyyy-MM-dd HH.mm.ss
  • Unix Epoch (seconds)
  • Unix Epoch (milliseconds)

Here are examples of several Data Lake files in an AWS S3 bucket that support incremental loads scheduled for every 30 minutes, with a naming convention using the “yyyy-MM-dd HH.mm.ss” timestamp format:

  • transactions_2020-01-21 09.28.01.csv
  • transactions_2020-01-21 10.08.10.csv
  • transactions_2020-01-21 10.28.15.csv

Incorta will ignore files with a non-conforming timestamp file name.
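
Because the “yyyy-MM-dd HH.mm.ss” format orders chronologically when compared as text, a lexicographic comparison is enough to find files newer than the last extracted file. Here is a minimal sketch of that comparison (the high-water mark file name is hypothetical, and this illustrates the naming convention, not Incorta's internal implementation):

# Illustration only: lexicographic order of timestamped file names matches time order
files = [
    "transactions_2020-01-21 09.28.01.csv",
    "transactions_2020-01-21 10.08.10.csv",
    "transactions_2020-01-21 10.28.15.csv",
]
last_extracted = "transactions_2020-01-21 10.00.00.csv"     # hypothetical high-water mark
new_files = sorted(f for f in files if f > last_extracted)  # plain string comparison
print(new_files)   # the two files named after 10:00 on 2020-01-21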

To enable and configure an incremental extract for a Data Lake table using a timestamp file name, follow these steps:

  • For a given schema, in the Schema Designer, select the Data Lake table, and open the schema Table Editor.
  • In the Table Editor, in the Summary section, select the Table Data Source.
  • In the Data Source dialog, toggle Incremental to enabled.
  • In Incremental Extract Using, select Timestamp in File Name.
  • In Timestamp format in file name, select a timestamp format.
  • In Directory Path, select the directory path relative to the root of the Data Lake data source path.
  • Select Add.

Preview Data in the Table Editor

In this release, you can now preview table data from the Columns section of the Table Editor. The Preview dialog shows sample data from the table along with column level statistics including MINVALUE, MAXVALUE, and #NULLS.

Here are the steps to preview data:

  • For a given schema, in the Schema Designer, select a table to open the Schema Table Editor.
  • In the Table Editor, in the Columns section, select Preview data.
  • In the Preview dialog, review the columns statistics and sample rows.
  • To close the Preview dialog, select X.

Preview Data also supports previewing data for Data Lake data source tables that are configured as Remote tables.

Post Extraction Callback with Webhooks

In this release, for a given table in a schema, you can specify a post extraction callback to an external service, application, or endpoint using Webhooks. Incorta supplies the following extraction parameters for the Webhook callback:

  • extractionDuration
  • rejectedRows
  • loadType
  • extractedRows
  • state
  • schemaName
  • extractStart
  • tableName
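
As an illustration of how an external endpoint might receive these parameters, here is a minimal sketch of a receiver. The endpoint below is hypothetical and assumes the parameters arrive as form values, as the webhook.site verification steps later in this section suggest:

from flask import Flask, request

app = Flask(__name__)

@app.route("/incorta-callback", methods=["POST"])
def incorta_callback():
    # Hypothetical endpoint: log a few of the extraction parameters posted by Incorta
    params = request.form.to_dict()
    print(params.get("schemaName"), params.get("tableName"), params.get("state"))
    return "", 200

app.run(port=8080)  # the Callback URL would then point at this host and port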

Here are the steps to enable and configure a Callback for a table data source:

  • For a given schema, in the Schema Designer, select a specific Table, and open the Schema Table Editor.
  • In the Table Editor, in the Summary section, select a Table Data Source.
  • In the Data Source dialog, enable Callback.
  • With Callback enabled, enter the Callback URL.
  • Select Save.

You can verify a post extraction callback with these steps:

  • In a separate browser tab or window, visit http://webhook.site.
  • Select Copy to clipboard.
  • For the table data source, in the Data Source dialog, enable the Callback and paste the copied webhook URL into the Callback URL textbox.
  • In the Data Source dialog, select Save.
  • In Schema Designer, for the given table, in More Options, select Load Table or Load Staging.
  • Return to the webhook URL in the browser to view the result in Form Values.

Oracle Database and MySQL Database Data Source support for Chunking by Timestamp

Incorta chunking is a table data source configuration that allows for parallel data extraction. The parallel execution significantly helps extract rows from very large tables.

In this release, for both Oracle and MySQL data source tables, you can now select By Timestamp as the Chunking Method.

When selected, you must specify the following:

  • Order Column, which must be a Date or Timestamp data type and determines how to order the table before the chunking extraction
  • Chunk Period, which determines the boundary of the chunks as either Daily, Weekly, Monthly, or Annually.

In addition to the Order Column and Chunk Period, there are two optional properties:

  • Upper Bound, which serves as the end date or timestamp for the Order Column; the value must be in the format “yyyy-MM-dd HH:mm:ss.SSS”
  • Lower Bound, which serves as the start date or timestamp for the Order Column; the value must be in the format “yyyy-MM-dd HH:mm:ss.SSS”
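
As a conceptual illustration only (this is not Incorta's implementation), a Monthly Chunk Period between a Lower Bound and an Upper Bound partitions the Order Column range into month-sized chunks that can be extracted in parallel:

from datetime import datetime

# Illustration only: month boundaries between a lower and an upper bound
lower = datetime(2019, 1, 1)    # Lower Bound "2019-01-01 00:00:00.000"
upper = datetime(2019, 4, 1)    # Upper Bound "2019-04-01 00:00:00.000"
boundaries = []
year, month = lower.year, lower.month
while datetime(year, month, 1) <= upper:
    boundaries.append(datetime(year, month, 1))
    year, month = (year + 1, 1) if month == 12 else (year, month + 1)
# Each consecutive pair of boundaries is one chunk: January, February, and March 2019 here
chunks = list(zip(boundaries[:-1], boundaries[1:]))
print(len(chunks))   # 3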

Chunking Performance Enhancement

Chunking threads now honor the connection pool size and will not exceed it. If the number of chunks exceeds the connection pool size, the Incorta Loader Service queues the chunks. In addition, the timeout for chunk fetching is unlimited.

Schema Isolation Protection

In this release, while a schema is loading data or is scheduled to load data, Incorta prevents schema deletes, imports, or updates.

Materialized Views with Notebooks

Notebook Integration is an Incorta Labs feature and requires additional configuration in the CMC. Please refer to the Notebook Add-on section in these release notes for more details about Notebook integration and configuration.

The Incorta Labs Notebook integration supports the following actions:

  • Create a Materialized View using a Notebook
  • Edit an existing Materialized View using a Notebook

Create a Materialized View using a Notebook

The selected Language is the language that the Notebook will export to the materialized view. To create a materialized view using a Notebook, follow these steps:

  • For a given schema, in the Actions bar, select + New.
  • In the Add New menu, select Materialized View.
  • In the Data Source dialog, in Language select SQL or Python.
  • In Script, select Edit in Notebook.
  • In the Edit Notebook (selected language) dialog, in at least one Paragraph, in the paragraph Code Section, enter the execution code.
  • Optionally run the paragraph or the notebook.
  • To exit the Edit Notebook dialog, select Done.
  • In the Data Source dialog, optionally select Add Property to configure specific Apache Spark properties for the materialized view.
  • Select Save.

Edit a Materialized View using a Notebook

To edit a materialized view using a Notebook, follow these steps:

  • For a given schema, in the Schema Designer, select the materialized view table, and open the Schema Table Editor.
  • In the Table Editor, in the Summary section, select the Table Data Source.
  • In the Data Source dialog, in Script, select the textbox or open dialog icon.
  • In the Edit Notebook dialog, in at least one Paragraph, in the paragraph Code Section, edit the execution code.
  • Optionally run the paragraph or the notebook.
  • To exit the Edit Notebook dialog, select Done.
  • In the Data Source dialog, optionally select Add Property to configure specific Apache Spark properties for the materialized view.
  • Select Save.

Using Notebooks for Materialized Views

By design, a notebook is an interactive environment that allows you to explore, manipulate, and transform data iteratively and interactively. A notebook consists of one or more paragraphs. A paragraph consists of a code section and a result section. In the code section, you can use a language specific editor to write either PySpark or SQL code. You can execute code in the code section using paragraph commands. When there are executed results, you can view the output in the result section of the paragraph.

When editing a notebook, you can run a specific paragraph or all notebook paragraphs in the notebook.

The Notebook Add-on service runs as an application in Apache Spark. The Notebook Add-on service creates a notebook application in Apache Spark that manages the paragraph execution request. When running more than one paragraph, the Notebook Add-on service application processes each paragraph sequentially: when the first paragraph completes, the second is started.

Code Execution Language for a Notebook

When creating a materialized view, in the Data Source dialog, you must select a Language. The choices are SQL or Python. SQL represents the execution of SQL using the Spark SQL library. Python represents the execution of PySpark, which is the Python API for Spark.

Edit Notebook dialog

The Edit Notebook dialog title specifies the notebook language for export to the materialized view. The Edit Notebook dialog contains the notebook layout. The notebook layout consists of a toolbar and one or more interactive paragraphs.

Save the Notebook

To save your changes to a notebook, in the Edit Notebook dialog, select Done. When you save your changes in the Edit Notebook dialog, two events occur:

  • Incorta saves all the changes to the interactive notebook as a notebook file internally
  • From the notebook, Incorta exports only the specified Language paragraphs to the materialized view.

Anatomy of a Notebook

A notebook consists of the Notebook Toolbar and one or more paragraphs. You can use the Toolbar and related keyboard shortcuts to interact with the notebook environment.

Notebook Toolbar

In the notebook toolbar, you can:

  • Run all paragraphs
  • Show/Hide this code
  • Show/Hide this output
  • Clear output
  • Search code
  • List keyboard shortcuts

Run all paragraphs

The Notebook Add-on service runs as an application in Apache Spark. When you select Run all paragraphs, the notebook submits each paragraph in sequential order to the Notebook Add-on service. The Notebook Add-on service creates a notebook application in Apache Spark that manages the execution requests. Each paragraph is processed sequentially: when the first paragraph completes, the second is started.

Show/Hide this code

Use the Show/Hide this code notebook toolbar command to toggle the visibility of all paragraph code sections in the notebook.

Show/Hide this output

Use the Show/Hide this output notebook toolbar command to toggle the visibility of all paragraph result sections in the notebook.

Clear output

Use the Clear output notebook toolbar command to clear all paragraph result sections in the notebook. A dialog will confirm the command request.

Search code

Use the Search code toolbar command to open the Find and Replace dialog. The scope of Search is for paragraph code sections only. You can navigate between search term occurrences, replace a single search term occurrence with a new term, or replace all search term occurrences with a new term.

List keyboard shortcuts

Use the List keyboard shortcuts toolbar command to open the Keyboard shortcuts dialog. The Keyboard shortcuts dialog contains the following:

Notebook Keyboard Shortcuts

  • Run paragraph: Shift + Enter
  • Run all above/below paragraphs: Ctrl + Shift + Enter
  • Cancel: Ctrl + Option + C
  • Move cursor Up: Ctrl + P
  • Move cursor Down: Ctrl + N
  • Remove paragraph: Ctrl + Option + D
  • Insert new paragraph above: Ctrl + Option + A
  • Insert new paragraph below: Ctrl + Option + B
  • Insert copy of paragraph below: Ctrl + Shift + C
  • Move paragraph Up: Ctrl + Option + K
  • Move paragraph Down: Ctrl + Option + J
  • Enable/Disable run paragraph: Ctrl + Option + R
  • Toggle output: Ctrl + Option + O
  • Toggle editor: Ctrl + Option + E
  • Toggle line number: Ctrl + Option + M
  • Toggle title: Ctrl + Option + T
  • Clear output: Ctrl + Option + L
  • Link this paragraph: Ctrl + Option + W
  • Reduce paragraph width: Ctrl + Shift + -
  • Increase paragraph width: Ctrl + Shift + +

Editor Keyboard Shortcuts

  • Auto-completion: Ctrl + .
  • Cut the line: Ctrl + K
  • Paste the line: Ctrl + Y
  • Search inside the code: Ctrl + S
  • Move cursor to the beginning: Ctrl + A
  • Move cursor to the end: Ctrl + E
  • Find in code: Ctrl + Option + F1

Anatomy of a Paragraph

By design, a notebook encourages you to explore data iteratively and interactively using the construct of a paragraph. A paragraph consists of:

  • Paragraph Status
  • Paragraph Commands
  • Paragraph Title
  • Code Section
  • Result Section

Paragraph Status

When executing the code section, you can view the various statuses of a given paragraph. The statuses are:

  • Ready
  • Pending
  • Running
  • Error
  • Finished

Paragraph Commands

Paragraph commands include the following:

  • Run this paragraph (Shift + Enter)
  • Show/Hide editor (Ctrl + Option + E)
  • Show/Hide output (Ctrl + Option + O)
  • Configure the paragraph

Run this Paragraph

The Notebook Add-on service runs as an application in Apache Spark. When you select Run this paragraph, the notebook submits the paragraph code to the Notebook Add-on service which in turn manages the code execution in Apache Spark.

Show/Hide editor

Use the Show/Hide editor paragraph command to toggle the visibility of the paragraph code section.

Show/Hide output

Use the Show/Hide output paragraph command to toggle the visibility of the paragraph result section.

Configure the Paragraph

You can use the Configuration Menu or, in some cases, a keystroke combination to configure the paragraph. Here are the options and related keystrokes for configuring a paragraph:

  • Width
  • Font size
  • Move down (Ctrl + Option + J)
  • Insert new (Ctrl + Shift + B)
  • Run all below (Ctrl + Shift + Enter)
  • Clone paragraph (Ctrl + Option + C)
  • Show title (Ctrl + Option + T)
  • Show line numbers (Ctrl + Option + M)
  • Disable run (Ctrl + Option + R)
  • Link this paragraph (Ctrl + Option + W)
  • Clear output (Ctrl + Option + L)
  • Remove (Ctrl + Option + D)

Paragraph Title

Although not required, a paragraph can have a descriptive title. Use the Toggle title keystroke (Ctrl + Option + T) to show or hide the paragraph title.

Code Section

In a Code Section, you can:

  • Specify the language of execution, such as SQL (%sql) or Python (%pyspark)
  • Enter the code for execution using the editor

Although a notebook may contain both SQL and PySpark (Python for Spark) paragraphs, the Notebook exports to the materialized view only the code that matches the Data Source dialog language selection.
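
For instance, a minimal PySpark paragraph might look like the following. This is a sketch only: the schema and table names are hypothetical, and it assumes the spark session is available in the paragraph.

%pyspark
# Hypothetical exploration step before exporting the code to the materialized view
df = spark.sql("SELECT order_id, amount FROM SALES.ORDERS WHERE amount > 0")
df.count()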

Code Editor

The Code Editor contains context-aware code completion. As you type, code completion will offer suggestions in a menu. As an example, for PySpark code, the code completion menu will offer choices for Python keywords, local variable references, and Incorta schema related objects.

Result Section

The result section contains the output from the execution of the code section. You can view the output of an executed paragraph. In the Action bar for the paragraph output you can view the output as a:

  • Table
  • Bar Chart
  • Pie Chart
  • Area Chart
  • Line Chart
  • Scatter Chart

The Result Section truncates output by default to 102400 bytes. Each viewing option has specific interactions and configurable settings.

Table Settings

In the output toolbar, select settings to view the Table Options:

  • useFilter
  • showPagination

useFilter

When selected, you can interactively filter rows by specifying a column filter in the filter textbox of the column header.

showPagination

When selected, you can interactively page through rows using the footer pagination control. The control allows you to go to the first page, navigate to the next page, navigate back a page, or navigate to the end pages. The footer pagination control also allows you to specify the number of items (rows) per page: 25, 50, 100, 250, or 1000.

For a table, in the Column Header, you can open the More Options menu to:

  • Sort Ascending
  • Sort Descending
  • Hide Column
  • View the selected data Type
  • Group Table by the first column

In addition, you can also select columns in a Table to view.

Bar Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • keys
  • groups
  • values

Keys

Specify one or more column fields that define uniqueness.

Groups

Specify one or more column fields for grouping. For example, you can group by Year. You can toggle the values in the legend.

Values

Specify one or more column fields for aggregation. Select a field to change the aggregation type (sum, count, avg, min, max) in the menu.

For a Bar Chart, additional settings include xAxis selections, optionally Grouped or Stacked bars, and legend toggle interactions. The xAxis selections are:

  • Default
  • Rotate
  • Hide

Pie Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • keys
  • groups
  • values

Keys

Specify one or more column fields that define uniqueness.

Groups

Specify one or more column fields for grouping. For example, you can group by Year. You can toggle the values in the legend.

Values

Specify one or more column fields for aggregation. Select a field to change the aggregation type (sum, count, avg, min, max) in the menu.

For a Pie Chart, additional settings include legend toggle interactions.

Area Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • keys
  • groups
  • values

Keys

Specify one or more column fields that define uniqueness.

Groups

Specify one or more column fields for grouping. For example, you can group by Year. You can toggle the values in the legend.

Values

Specify one or more column fields for aggregation. Select a field to change the aggregation type (sum, count, avg, min, max) in the menu.

For an Area Chart, additional settings include xAxis selections, optionally Stacked, Stream, Expand area definitions, and legend toggle interactions. The xAxis selections are:

  • Default
  • Rotate
  • Hide

Line Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • keys
  • groups
  • values

Keys

Specify one or more column fields that define uniqueness.

Groups

Specify one or more column fields for grouping. For example, you can group by Year. You can toggle the values in the legend.

Values

Specify one or more column fields for aggregation. Select a field to change the aggregation type (sum, count, avg, min, max) in the menu.

For a Line Chart, additional settings include force Y to 0, zoom, Date format, xAxis selections, and legend toggle interactions. The xAxis selections are:

  • Default
  • Rotate
  • Hide

Zoom allows you to select a specific range of values on the xAxis.

Scatter Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • xAxis
  • yAxis
  • group
  • size

xAxis

Specify a column field for the xAxis.

yAxis

Specify a column field for the yAxis.

group

Specify a column field for grouping. For example, you can group by Year. You can toggle the values in the legend.

size

Specify a column field that determines the point size.

For a Scatter Chart, additional settings include legend toggle interactions.

Incorta functions for PySpark Code Sections

For a PySpark code section, you can use the following Incorta helper functions:

  • incorta.describe(DataFrame expr)
  • incorta.head(DataFrame expr, integer number_rows_optional)
  • incorta.printSchema(DataFrame expr)
  • incorta.show(DataFrame expr)
  • incorta.show_plotly(Figure expr, height=100, width=100, **kwargs)

incorta.describe(DataFrame expr)

The function returns statistical details about the specified DataFrame such as:

  • Count of records
  • Mean
  • Standard Deviation
  • Min
  • Max

incorta.head(DataFrame expr, integer number_rows_optional)

For the specified DataFrame, the function returns a table of field headers and one row by default. An optional parameter specifies the number of rows to return (0 or more).

incorta.printSchema(DataFrame expr)

For the specified DataFrame, the function returns the DataFrame schema in tabular form, with the following details for each column:

  • Name
  • Type
  • Nullable

incorta.show(DataFrame expr)

The function outputs the results of the specified DataFrame in the results section of the paragraph.
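
For example, in a notebook paragraph you might apply these helper functions to a DataFrame as follows. This is a minimal sketch; df is assumed to be a Spark DataFrame created in an earlier paragraph.

# df is a hypothetical Spark DataFrame created in an earlier paragraph
incorta.describe(df)       # count, mean, standard deviation, min, and max for each column
incorta.head(df, 5)        # field headers plus the first 5 rows
incorta.printSchema(df)    # name, type, and nullable flag for each column
incorta.show(df)           # renders the DataFrame in the results section of the paragraph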

incorta.show_plotly(Figure expr, height=double, width=double, **kwargs)

This function requires the installation of plotly.py using pip. The function takes a plotly figure or plot dict (dictionary) and displays the graph as output in the results section. The arguments are:

  • plot_dic: A plotly plot dict or figure

The optional keyword arguments are:

  • height: height in pixels of the plot
  • width: width in pixels of the plot
  • kwargs: any additional kwargs
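
For example, the following sketch builds a simple plotly figure from hypothetical values and displays it with incorta.show_plotly. It assumes plotly has already been installed with pip.

import plotly.graph_objs as go

# hypothetical figure built for illustration only
fig = go.Figure(data=[go.Bar(x=['A', 'B', 'C'], y=[10, 20, 15])])

# display the figure in the results section of the paragraph
incorta.show_plotly(fig, height=400, width=600)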

incorta_ml, Incorta Machine Learning libraries for PySpark

In the Incorta 4.6 release, incorta_ml is a new Python machine learning library for PySpark from Incorta.

Python Requirements

The incorta_ml library supports Python 2.7, Python 3.5, Python 3.6, and Python 3.7. Pandas officially supports these versions of Python.

DEPRECATION

Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. A future version of pip will drop support for Python 2.7.

Without Notebook Integration

It is possible to use incorta_ml without Notebook Integration. Some of the Python libraries require the installation of Linux packages.

Linux Packages

Install the following packages on your Incorta host (an example command follows this list):

  • gcc
  • gcc-c++
  • python-pip
  • python-devel
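
For example, on a RHEL or CentOS host (an assumption; use the package manager for your distribution), you can install the packages as follows:

sudo yum install gcc gcc-c++ python-pip python-devel
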
Python 2.7

If using Python 2.7, install the following libraries using pip install:

  • numpy
  • plotly
  • pandas
  • lime
  • fbprophet
  • statsmodels
  • pyramid-arima
Python 3.5, Python 3.6, and Python 3.7

If using Python 3.5, Python 3.6, or Python 3.7, first upgrade pip:

sudo pip install --upgrade pip

After upgrading pip, install the following libraries using pip install (an example command follows this list):

  • numpy
  • plotly
  • pandas
  • lime
  • fbprophet
  • statsmodels
  • pmdarima
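
For example, the libraries can be installed with a single pip command (adjust for your environment):

sudo pip install numpy plotly pandas lime fbprophet statsmodels pmdarima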

Available libraries in incorta_ml

The incorta_ml library offers the following:

  • Feature Selection
  • Features Preparation
  • Model Building
  • Model Evaluation

Feature Selection

For a given dataframe, use the select_features function to identify potentially significant features.

Signature
from incorta_ml import select_features
output_df = select_features(input_df, model_name, label_column_name, is_training)
params

input_df: a Spark dataframe that contains feature columns and a label column.

model_name: a handle that identifies the process. It is recommended to use the same name as the prediction model that this feature selection is for.

label_column_name: the name of the label column, such as input_df.column; it must be a two-part qualified name.

is_training: specify True to select features from input_df; specify False if you have already run the feature selection and only want to apply the same selection to another dataframe.

returns

output_df: the dataframe that contains only the selected features and the label.
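
For example, the following sketch selects features from a training dataframe and then applies the same selection to a testing dataframe. The dataframe and model names are illustrative and follow the regression example later in this section.

from incorta_ml import select_features

# identify significant features from the training dataframe
train_selected_df = select_features(training_df, model_name='wine', label_column_name='label', is_training=True)

# apply the same feature selection to the testing dataframe
test_selected_df = select_features(prediction_df, model_name='wine', label_column_name='label', is_training=False)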

Features Preparation

With the prepare_features function, you can preprocess features for prediction, such as converting a non-numeric column with one-hot encoding.

Signature
from incorta_ml import prepare_features
output_df = prepare_features(input_df, model_name, label_column_name, is_training)
params

input_df: a Spark dataframe that contains feature columns and a label column.

model_name: a handle that identifies the process. It is recommended to use the same name as the prediction model that this preprocessing is built for.

label_column_name: the name of the label column as a two-part qualified name, such as one of input_df.columns.

is_training: specify True to build the transformation in Spark’s Directed Acyclic Graph (DAG), that is, when preparing the training data for the first time; specify False to apply the same transformation to a testing or development data set.

returns

output_df: a dataframe that contains numeric features, transformed features, and labels.
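
For example, the following sketch builds the transformations on a training dataframe and then applies them to a testing dataframe. The dataframe and model names are illustrative.

from incorta_ml import prepare_features

# build the transformations (for example, one-hot encoding) from the training dataframe
train_prepared_df = prepare_features(training_df, model_name='wine', label_column_name='label', is_training=True)

# apply the same transformations to the testing dataframe
test_prepared_df = prepare_features(prediction_df, model_name='wine', label_column_name='label', is_training=False)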

Model Building and prediction

With the build_model function, you can build the model, save (persist) it, and return the training dataframe with a prediction column.

Supported Algorithms

Here are the supported Apache Spark ML algorithms:

  • LogisticRegression
  • DecisionTreeClassifier
  • RandomForestClassifier
  • GBTClassifier
  • MultilayerPerceptronClassifier
  • LinearSVC
  • NaiveBayes
  • LinearRegression
  • GeneralizedLinearRegression
  • DecisionTreeRegressor
  • RandomForestRegressor
  • GBTRegressor
  • IsotonicRegression

Note

Call predict after build_model with the same model_name.

Signature
from incorta_ml import build_model
output_df = build_model(input_df, model_name, algorithm_name, label_column_name, params, mode="classification")
output_df = predict(input_df, model_name)
params

input_df: a Spark dataframe that contains feature columns and a label column. Note that all features must be numeric.

model_name: a handle that identifies the model.

algorithm_name: the algorithm name, which can be any of the supported Apache Spark algorithms listed above, or auto. In auto mode, multiple algorithms are selected as candidates and the best of them is chosen; there is no need to specify params (pass None or {}).

label_column_name: the name of the label column as a two-part qualified name, such as one of input_df.columns.

params: a dictionary that contains the model parameters as required by Apache Spark, such as {'regParam': 0.001}.

mode: a string that defines the mode type, either classification or regression.

returns

output_df: a dataframe that contains the numeric features, transformed features, labels, and the prediction column.
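
For example, the following sketch uses auto mode, in which there is no need to specify params. The dataframe and model names are illustrative and follow the regression example below.

from incorta_ml import build_model, predict

# let auto mode choose among the candidate algorithms
output_df = build_model(training_df, model_name='wine_auto', algorithm_name='auto',
                        label_column_name='label', params=None, mode='regression')

# predict on the testing dataframe using the same model name
dftest_pred = predict(prediction_df, model_name='wine_auto')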

Model Evaluation

Use the evaluate function to evaluate the model for a given dataframe.

Signature
from incorta_ml import evaluate
output_df=evaluate(input_df, model_name)
params

input_df: a Spark dataframe that contains feature columns and a label column. The dataframe schema must match the schema of the training dataframe.

model_name: the name of the model to evaluate.

returns

output_df: a dataframe that contains two columns ‘metric_name’ and ‘value’ where each row represents a metric and an associated numeric value.

Incorta ML Regression Example

from incorta_ml import *  # imports all the functions
full_data = read("/path/to/winequality-white.csv")  # reads the data

# split the data into training and testing set.
split_df = full_data.randomSplit([0.7,0.3],1)
training_df = split_df[0]
prediction_df = split_df[1]

# build, train and save the model.
build_model(training_df, model_name='wine', algorithm_name='RandomForestRegressor',
                   label_column_name='label', params={"numTrees":10}, mode='regression')
# predict the testing data.
dftest_pred=predict(prediction_df, model_name='wine')
# evaluate the model on testing data
eval_df = evaluate(prediction_df, model_name='wine')

# show the evaluation metrics
eval_df.show()

Incorta ML Classification Example

from incorta_ml import * # imports all the functions
full_data = read("/path/to/iris.csv", "csv") # read the data

# split the data into training and testing set.
split_df = full_data.randomSplit([0.7,0.3],1)
training_df = split_df[0]
prediction_df = split_df[1]

# build, train and save the model.
build_model(training_df, model_name='iris', algorithm_name='RandomForestClassifier',
                   label_column_name='label', params={"numTrees":10}, mode='classification')

# predict the testing data.
dftest_pred=predict(prediction_df, model_name='iris')

# evaluate the model on testing data
eval_df = evaluate(prediction_df, model_name='iris')

# show the evaluation metrics
eval_df.show()

Analyzer

In Incorta 4.6, there are several enhancements to the Analyzer including:

Insight Descriptions

When editing a given dashboard insight with the Analyzer, you can now optionally specify a description. To create an Insight Description, follow these steps:

  • For an existing dashboard, for a given insight, in the Actions menu, select Edit (pen), or select More Options (kebab) and then Edit.
  • In the Analyzer, select Click to Edit Insight Description, and enter a description.
  • In the Action bar, select Done to save.

To view an Insight Description in a Dashboard, follow these steps:

  • For an existing dashboard, for a given insight, in the insight title, select the Information Icon.
  • In the Tooltip, view the insight description.

Insight Titles Support Variable References

In this release, you can now reference one or more variables in a dashboard insight title. A referenced variable in an insight title returns the original, initialized value. Supported variable types are:

  • Presentation Variables
  • Session Variables
  • System Variables

To reference a variable in an Insight Title, use the $$ syntax. Here are some examples:

  • To reference the user system variable, enter $$user.
  • To reference the current date system variable, enter $$currentDate.
  • To reference a presentation variable named pvGroupBy, enter $$pvGroupBy.
  • To reference an internal session variable named ivarIsUserInGroupAdmin, enter $$ivarIsUserInGroupAdmin.

Additional Performance Enhancements

In the 4.6 release, there are additional performance enhancements and changes:

Aggregate and Group By Query Performance Improvements

In this release, there are two enhancements that improve Aggregate and Group By query performance.

The first enhancement addresses pagination rendering for insight visualizations. Only the rows required for rendering are processed, instead of all rows in an insight visualization; the default number of rows is 1,000. This feature is enabled by default. To disable it, in the engine.properties file, set engine.paginate_aggregated_queries to false.

In certain cases, pagination of rows is not applicable, such as when an insight is exported as a CSV or MS Excel file, rendered as a Pivot Table insight, or queried from another application over the SQLi interface. In these cases, the Analytics Service now uses parallel sorting and parallel materialization to improve query performance. This feature is enabled by default. To disable it, in the engine.properties file, set engine.parallel_materialize_groups to false.
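
For example, to disable both optimizations, add the following entries to the engine.properties file (both settings are enabled by default):

engine.paginate_aggregated_queries=false
engine.parallel_materialize_groups=false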

Tomcat Server Upgrade

In this 4.6 release, Incorta now uses Apache Tomcat V7.0.96.

Previous versions of Incorta used Apache Tomcat V7.0.65.

New Incorta installations will use Apache Tomcat V7.0.96 by default.

When upgrading from a previous version to Incorta 4.6, the upgrade process will install the new Tomcat version, but will preserve the existing user configuration files.

© Incorta, Inc. All Rights Reserved.