Release Notes 4.6

Incorta 4.6 introduces several key improvements to the Cluster Management Console, Incorta Loader Service, and the Incorta Analytics Service. This release also includes the incorta_ml machine learning library for PySpark that you can use with or without the Notebook Add-on and the Incorta Labs Notebook integration. In addition, the 4.6 release includes other Incorta Labs offerings such as Enable Custom Themes, and the Inspector Tool Scheduler.

Release Highlights

There are several major features in this release including:

Improved Data Lake and Cloud Data Sources

This release offers improved and expanded connectivity to new data lake data sources including cloud data lakes and cloud applications, along with the ability to specify folders of data files and load them incrementally using a lexicographic (timestamp) naming convention.

Notebook Integration for Materialized Views

Previously, when editing code for a materialized view in a given schema, you had to use a basic editor. In this release, you can enable an Incorta Labs feature for Notebook Integration that allows you to write code (Spark SQL or PySpark) for materialized views using a Notebook interface.

Each paragraph in a notebook contains a code section and a result section. You can easily execute your code and view the resulting output within the paragraph. In addition, a notebook can consist of one or more paragraphs for sequential code execution. This new feature allows you to iteratively code and explore your data before saving code for export to the materialized view.

Incorta ML

For your PySpark materialized views, you can now rapidly apply machine learning to your schemas for predictive analytics, time series forecasting, and anomaly detection using the incorta_ml library.

Additional Improvements and Enhancements

  • Simplified Cluster and Tenant Administration in the Cluster Management Console (CMC)
  • New Incorta Labs features including Custom Themes, Notebook Integration, Inspector Tool, and Dark Mode theme
  • Incorta Analytics and Loader Service user interface enhancements, data source enhancements, and performance enhancements

Cluster Management Console (CMC)

The following new configurations are available in the Cluster Management Console (CMC):

To sign in to the CMC, visit your CMC host at one of the following:

  • http://<Public_IP>:6060/cmc
  • http://<Public_DNS>:6060/cmc
  • http://<Private_IP>:6060/cmc
  • http://<Private_DNS>:6060/cmc

The default port for the CMC is 6060. Sign in to the CMC using your administrator username and password.

Pause all scheduled jobs

For the selected Incorta Cluster, you can now enable pausing scheduled jobs in both the default tenant and specific tenant configurations for loading data. Scheduled jobs include schema loads, dashboards, and data alerts.

Enable this setting to pause active scheduled schema loads, dashboards, and data alerts. This is helpful when importing or exporting an existing tenant. You can resume active scheduled jobs by disabling this option or manually starting them in the Incorta scheduler.

Here are the steps to enable this option as default tenant configuration:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Data Loading.
  • Enable the Pause Scheduled Jobs setting.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given tenant, select Configure.
  • In the left pane, select Data Loading.
  • Enable the Pause Scheduled Jobs setting.
  • Select Save.

Default Materialized View Application settings

For the selected Cluster, you can now set Materialized Views default values for Apache Spark Integrations:

  • Materialized view application cores
  • Materialized view application memory
  • Materialized view application executors

The Spark Integrations settings are global to all tenants in a cluster configuration.

Materialized view application cores

The number of CPU cores reserved for use by a materialized view application. The default value is 1. The allocated cores for all running Spark applications cannot exceed the dedicated cores for the cluster.

Materialized view application memory

The maximum memory, in gigabytes, that a materialized view application can use. The default is 1 GB. The memory for all Spark applications combined cannot exceed the cluster memory (in gigabytes).

Materialized view application executors

Maximum number of executors that can be spawned by a single materialized view application. Each executor allocates a share of the cores defined in sql.spark.mv.cores and consumes part of the memory defined in sql.spark.mv.memory. Because the cores and memory assigned per executor are equal across executors, the number of executors should evenly divide both sql.spark.mv.cores and sql.spark.mv.memory. For example, when you configure an application with cores=4, memory=8, and executors=2, Spark spawns 2 executors, each consuming 2 cores and 4 GB from the cluster.
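
As an arithmetic illustration only (the variable names below are hypothetical, not Incorta configuration keys), the per-executor allocation is a simple division of the application totals:

# Illustration only: how the executor count divides the application totals
mv_cores = 4          # value of sql.spark.mv.cores
mv_memory_gb = 8      # value of sql.spark.mv.memory, in gigabytes
mv_executors = 2      # value of sql.spark.mv.executors
cores_per_executor = mv_cores // mv_executors           # 2 cores per executor
memory_per_executor_gb = mv_memory_gb // mv_executors   # 4 GB per executor
print(cores_per_executor, memory_per_executor_gb)       # 2 4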

Here is how you can modify these settings and their default values:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Server Configurations.
  • In the left pane, select Spark Integration.
  • Set the value for a given Materialized view application setting:

    • Materialized view application cores
    • Materialized view application memory
    • Materialized view application executors
  • Select Save.

Support for Amazon Simple Email Service (Amazon SES)

Some SMTP email solutions such as Amazon SES require username/password pairs. In this 4.6 release, you can now configure the SMTP host to use a Sender’s Username Authentication.

To enable this option as default tenant configuration in the CMC, follow these steps:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Email.
  • In the right pane, toggle Sender’s Username Auth to enabled.
  • In System Email Username, enter the SMTP username.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given tenant, select Configure.
  • In the left pane, select Email.
  • In the right pane, toggle Sender’s Username Auth to enabled.
  • In System Email Username, enter the SMTP username.
  • Select Save.

Enable the ability to add a Single Sign-On (SSO) user from Incorta Analytics

You must first configure the Incorta Node hosting the Incorta Analytics service to support Single Sign-On (SSO). Please see the Secure Login Access document for SSO configuration.

For the selected Cluster, you can now enable Single Sign-On as the authentication type for user authentication. When the authentication type is set to SSO, users that belong to the SuperRole or the User Manager role in the Incorta Analytics service can add new users and set the Profile Authentication Type to SSO. Once set to SSO, a user password is no longer required. It is possible to switch the authentication method for a given user back to Incorta authentication.

Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Security.
  • In the right pane, for Authentication Type, select SSO from the dropdown.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given tenant, select Configure.
  • In the left pane, select Security.
  • In the right pane, for Authentication Type, select SSO from the dropdown.
  • Select Save.

Warmup Mode for Most Used Dashboard Columns

Warmup Mode affects how the Incorta Analytics service loads schema data into memory when starting up. In this release, there is a new option for Warmup Mode: Most Used Dashboard Columns.

When selected, the Most Used Dashboard Columns value makes available a secondary setting: Maximum (%) of Memory Intended for Warmup. The default value is 20%. The minimum value is 0% and the maximum value is 75%.

Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Advanced.
  • In the right pane, in Warmup, select Most Used Dashboard Columns from the dropdown.
  • In Maximum (%) of Memory Intended for Warmup, enter a value between 0 and 75.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given Tenant, select Configure.
  • In the left pane, select Advanced.
  • In the right pane, in Warmup, select Most Used Dashboard Columns from the dropdown.
  • In Maximum (%) of Memory Intended for Warmup, enter a value between 0 and 75.
  • Select Save.

Incorta Labs

Incorta Labs are experimental features and functionality that Incorta supports for non-production use. Some experimental features may become part of a future Incorta release, while others may be deprecated. Incorta Support will investigate issues with Incorta Labs features.

You can enable various Incorta Labs features in the Cluster Management Console. An Incorta Labs feature may require additional configurations and may require restarting Incorta Services and Incorta Add-ons.

In this release, there are several significant new features in Incorta Labs:

To enable or configure an Incorta Labs feature, you must sign in to the Cluster Management Console (CMC). To sign in to the CMC, visit your CMC host at one of the following:

  • http://<Public_IP>:6060/cmc
  • http://<Public_DNS>:6060/cmc
  • http://<Private_IP>:6060/cmc
  • http://<Private_DNS>:6060/cmc

The default port for the CMC is 6060. Sign in to the CMC using your administrator username and password.

Enable Custom Themes

Enable this feature to let individual users set their default appearance to a Dark Theme (Dark Mode). The Dark Theme is not applicable to all user interfaces in the Analytics Service; exceptions include the Analyzer, Schema Designer, Table Editor, and Join Editor.

Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Enable Custom Themes to enabled.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given Tenant, select Configure.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Enable Custom Themes to enabled.
  • Select Save.

To learn more about enabling custom themes, see User Interface Configurations.

Dark Theme for Users

With Custom Themes enabled, a user in the Analytics Service can toggle the Dark Theme on or off. Here are the steps to toggle the Dark Theme in the Analytics Service for a user:

  • To open the Profile Menu, in the Navigation bar, select Profile.
  • In the Profile Menu, select the User.
  • In the Edit User drawer, select the Appearance tab.
  • Enable or disable the Dark Theme toggle.

Enable Insight View As Menu

The Enable Insight View As Menu Incorta Labs feature allows a dashboard consumer to view a dashboard insight chart visualization as a table or an aggregated table. Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Enable Insight View As Menu to enabled.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given Tenant, select Configure.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Enable Insight View As Menu to enabled.
  • Select Save.

Once enabled, for a chart visualization insight on a given dashboard, in the Actions menu, select the More Options (kebab) icon to open the More Options menu. In the More Options menu, select the View As Table option, and then select either Regular or Aggregated. To return to the insight visualization, in the Actions menu, select the Return (Rollback) icon.

Inspector Tool Scheduler

In this release, in Incorta Labs, you can enable the Inspector Tool to run on a schedule as a default tenant configuration or for a specific tenant configuration. To view and explore the results of the scheduled job in Incorta Analytics, you can also download and import the related Inspector Tool dashboards, schema, and business schema.

About the Inspector Tool

The Incorta Inspector Tool checks the lineage references of Incorta metadata objects including tables, schemas, business schemas, business schema views, dashboards, and session variables. The Inspector tool also checks for inconsistencies and validation errors such as:

  • An invalid join due to mismatched data types or unsupported data types
  • A join with a missing table
  • A join with a missing column
  • A join on a parent table column that is not a key column
  • A join on a child table with multiple parent tables
  • A join using an invalid formula column
  • A join on a formula column that references columns in two or more schemas
  • A cyclical join between two or more tables, such as A > B > C > D > A, that can be resolved with a table alias
  • Multiple join paths between two tables
  • A table enabled for incremental loads but without incremental logic specified
  • A table enabled for incremental loads with incremental logic specified but no key column specified
  • A table with a runtime security filter that references a missing session variable or a session variable with a missing definition
  • An alias table with no existing base reference table
  • An alias table out of sync with the existing base reference table
  • A formula that refers to a column that does not exist
  • A formula that references columns in two or more schemas
  • A business schema view that references a column in a table that does not exist
  • A session variable that references another session variable that does not exist
  • A dashboard that references a missing session variable
  • A dashboard that references a missing table or business schema view column

Download the Inspector Tool Dashboards, Schema, and Business Schema

For a given tenant with the Inspector Tool Scheduler enabled, you need to first download the Inspector Tool Dashboards, Schema, and Business Schema. Here are the steps to download the Inspector Tool Dashboards, Schema, and Business Schema:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, in the description of the Inspector Tool Scheduler, select the download link.
  • In the Box folder, select the following files to download:

    • dashboards.zip
    • business_schema.zip
    • schema.zip

After successfully downloading the zip files, you must import the schema, business schema, and dashboards into a given tenant.

Import the Inspector Tool Schema

Here are the steps to import the Inspector Tool schema for a given tenant:

  • In the Navigation bar, select Schema.
  • In the Action bar, select + New.
  • In the Add New Menu, select Import Schema.
  • Drag and drop the schema.zip file to the Import Schema dialog.
  • In the Import Results dialog, verify the schema name, InspectorMetadata, and select Close.

The InspectorMetadata schema contains the following tables:

  • BUSINESS_SCHEMA_VIEWS
  • JOINS_DETAILS
  • LINEAGE_REPORT
  • MV_REFERENCED_TABLES
  • SCHEMA_TABLES
  • VALIDATION

Import the Inspector Tool Business Schema

Here are the steps to import the Inspector Tool business schema for a given tenant:

  • In the Navigation bar, select Business Schema.
  • In the Action bar, select + New.
  • In the Add New Menu, select Import Business Schema.
  • Drag and drop the business_schema.zip file to the Import Business Schema dialog.
  • In the Import Results dialog, verify the schema name, incortaInspector, and select Close.

The incortaInspector business schema contains the following folders and views:

  • TenantHierarchy (folder)

    • DashboardLineage
    • Schemas
    • Joins
    • BusinessSchemas
    • MVs
  • Validation (view)

Import the Inspector Tool Dashboards

Here are the steps to import the Inspector Tool dashboards for a given tenant:

  • In the Navigation bar, select Content.
  • In the Action bar, select + New.
  • In the Add New Menu, select Import Folder/Dashboard.
  • Drag and drop the dashboards.zip file to the Import Folder/Dashboard dialog.

In the InspectorTool folder, there are several Inspector Tool dashboards:

  • 0- Run status
  • 1- Validation UseCases
  • 2- Unused Entities
  • 3- Schemas Details
  • 4- Dashboards Lineage Summary
  • 5- Tables Used in Business Views
  • 6- Tables Used In Materialized Views

Enable the Inspector Tool Scheduler

Having successfully imported the Inspector Tool Dashboards, Schema, and Business Schema, you can now enable the Inspector Tool Scheduler in the Cluster Management Console as either a default tenant configuration or a specific tenant configuration.

Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Inspector Tool Scheduler to enabled.
  • Specify the schedule.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given tenant, select Configure.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Inspector Tool Scheduler to enabled.
  • Specify the schedule.
  • Select Save.

Notebook Add-on

The Notebook Integration Incorta Labs feature requires that Apache Spark is running and properly configured for the Incorta Cluster instance.

A notebook is an interactive environment for creating a materialized view in a given schema. As an interactive notebook environment, you can execute individual paragraphs, view a table of query results, and visualize results as a bar chart, pie chart, area chart, line chart, or a scatter chart.

In Incorta 4.6, a notebook-defined materialized view supports two interoperable languages, SQL and Python. This means that one paragraph can be in SQL and another in Python.

Apache Spark executes all materialized views. Apache Spark natively runs Spark SQL queries using columnar data stored as Apache Parquet files in Shared Storage (Staging).

Notebook Integration

Before using a Notebook to create a materialized view in a schema, you must first integrate the Notebook into an Incorta Cluster. Notebook Integration requires the completion of several key tasks in the CMC:

  • Create Notebook Add-on service
  • Set the Notebook Integration properties in Server Configurations
  • Enable the Notebook Integration
  • Start the Notebook service

There are several requirements for implementing the Incorta Labs Notebook Integration:

  • Supported Linux Operating System
  • Apache Spark 2.4.3 must already be configured for the Incorta Cluster and must be running
  • The Incorta Node hosting the Notebook Add-on requires Python 2.7, Python 3.6, or Python 3.7. Python 3.8 is not yet supported.
  • On the Incorta Node hosting the notebook, the default port 5500 must be open or the configured port must be open.

Add-ons

An Incorta Cluster can only have a single Notebook Add-on. You can install the Notebook Add-on during a new installation or after an installation.

There are two types of cluster installations: Single Host, which is a Standalone instance using the Typical installation method, and Multi-host, which requires a Custom installation. Both cluster topologies are applicable to Incorta Notebooks.

During a Single Host (Typical) Installation

Here are the steps to configure and install a Notebook Add-on during a Single Host (typical) Installation:

  • In the Configuration Wizard, for Add-ons, specify the Notebook port value (the default is 5500).
  • To continue to the Configuration Review, select Next.
  • Select Create.

After a Single Host (Typical) Installation

Here are the steps to configure and install a Notebook Add-on after a Single Host (Typical) or Multi-host (Custom) installation:

  • In the Navigation bar, select Nodes.
  • In the nodes list, select the localNode.
  • In the canvas, select the Add-ons tab.
  • To create a Notebook, in the Add-ons header, select + (Add).
  • In the Create a new notebook dialog, enter the Port number. The default value is 5500.
  • Select Save.

Multi-host (Custom) Installation

Here are the steps to configure and install a Notebook Add-on for a custom installation:

  • In the Navigation bar, select Nodes.
  • In the nodes list, select an Incorta Node.
  • In the canvas, select the Add-ons tab.
  • To create a Notebook, in the Add-ons header, select + (Add).
  • In the Create a new notebook dialog, enter the Notebook Name and the Port number. The default value is 5500.
  • Select Save.

Notebook Integration settings

The Notebook Integration settings are global to all tenants in a cluster configuration. Before starting the Notebook, you must:

  • Set the Notebook Integration properties in Server Configurations
  • Enable the Notebook Integration

For the selected Cluster, you can set the default values for the Notebook integration:

  • Notebook Max Cores
  • Notebook Max Memory

Notebook Max Cores

Maximum number of cores to use for all notebook executors.

Notebook Max Memory

Maximum amount of memory to use for all notebook executors, in the same format as JVM memory strings with a size unit suffix (“k”, “m”, “g”, or “t”), for example 512m or 2g.

Here is how you can modify these settings and their default values:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Server Configurations.
  • In the left pane, select Notebook Integration.
  • Set the value for a given Notebook setting:

    • Notebook Max Cores
    • Notebook Max Memory
  • Select Save.

Enable the Notebook Integration

After Notebook Integration properties are set, you must enable the Incorta Labs Notebook feature. Here are the steps to enable this option as default tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Notebook Integration to enabled.
  • Select Save.

Here are the steps to enable this option for a specific tenant configuration in the CMC:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select the Tenants tab.
  • In the Tenant list, for the given Tenant, select Configure.
  • In the left pane, select Incorta Labs.
  • In the right pane, toggle Notebook Integration to enabled.
  • Select Save.

Start, Stop, and Restart Notebook

Here are the steps to start, stop, and restart a Notebook:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Add-ons.
  • In the nodes list, select the Notebook name.
  • In Notebook details, select Restart, Stop, or Start.

Editing the Notebook Port

After changing the Notebook port, you must restart the Notebook. Here are the steps to change the port:

  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Add-ons.
  • In the nodes list, select the Notebook name.
  • In Notebook details, in the title, select Edit.
  • Change the Port value.
  • Select Update.

Post Notebook Integration

Once configured and running, you can create and edit a materialized view in a given schema with a Notebook.

Incorta Analytics and Loader Service

The 4.6 release introduces several key improvements to the Incorta Analytics and Loader Services such as:

Optionally Persist a SQLi Result

In Apache Spark, submitted jobs often persist a dataframe to preserve data transformations, calculations, or aggregations for future tasks in the job.

sql.spark.persist.level

In this 4.6 release, the SQLi interface determines whether an executed task will persist the dataframe and, if so, controls how to persist it. There are three valid values for the sql.spark.persist.level property:

  • never: Indicates that the dataframe will never persist. Use this setting value for diagnosis and troubleshooting.
  • always: Indicates that the dataframe will always persist. This is the default value.
  • query: Indicates that the SQLi interface will check the Apache Spark query plan. If the query plan contains only simple task stages, such as accessing a single table or applying a simple filter, the dataframe will not persist. However, if the Spark query plan is complex and has, for example, numerous shuffles and broadcasts, the dataframe will persist.

Data Sources

In the 4.6 release, the user interface for adding a new data source is new. In addition, new data source choices exist, including choices for:

  • Oracle Cloud Applications
  • Google BigQuery
  • Salesforce v2

Oracle Cloud Applications

The Oracle Cloud Applications Connector extracts data from Oracle WebCenter Content (WCC) that the Oracle Business Intelligence Cloud Connector Console compresses in comma-separated values (CSV) file format.

Here are the steps to add an Oracle Cloud Applications data source in the Analytics Service:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Application, select Oracle Cloud Applications.
  • In the New Data Source dialog, specify the:

    • Data Source Name
    • Username
    • Password
    • Oracle Cloud Applications URL
    • Root Query Text
    • Data Type Discovery Policy
    • File Name Pattern
    • File Criteria - Last Modified Timestamp
  • To test, select Test Connection.
  • Select Ok to save your changes.

The Data Type Discovery Policy defines the Metadata Definition files. These files must first be uploaded to Incorta data files and must have a *.csv extension.

The File Criteria - Last Modified Timestamp property acts as a time filter for all the results concerning this data source. For example, >= ‘2019-05-31 15:30’ will return all the files created or modified after this date.

Google BigQuery Data Source

To analyze data housed in Google Storage, first create a BigQuery Data Source. Before implementing a BigQuery data source, you must first download and configure the BigQuery driver for Incorta. The driver is in a JAR file. The BigQuery JAR file must exist in both the CMC and the Incorta Services installation path.

Here are the steps to create a Google BigQuery data source in the Analytics Service:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, select BigQuery.
  • In the New Data Source dialog, specify the:

    • Data Source Name
    • Username
    • Password
    • Project ID
    • Path of the JSON key file downloaded from Google Cloud service accounts
  • To test, select Test Connection.
  • Select Ok to save your changes.

There are no changes for creating a schema using a BigQuery data source from previous versions of Incorta.

Salesforce v2 Data Source

To analyze data in Salesforce, first create a Salesforce (v2) data source. The Salesforce v2 data source connector uses a REST API interface and overcomes limitations of the Salesforce data source connector, version 1, which employs a SOAP API interface.

Before implementing a Salesforce data source, you must first download and configure the Salesforce v2 driver for Incorta. The driver is in a JAR file. The Salesforce v2 JAR file must exist in both the CMC and the Incorta Services installation path.

Here are the steps to create a Salesforce v2 data source in the Analytics Service:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Other, select Salesforce (v2).
  • In the New Data Source dialog, specify the:

    • Data Source Name
    • Username for Salesforce
    • Password for Salesforce
    • Token for Salesforce Authentication
    • Optionally specify a Proxy:

      • Proxy Host
      • Proxy Port
      • Proxy Username
      • Proxy Password
  • To test, select Test Connection.
  • Select Ok to save your changes.

There are no changes for creating a schema using a Salesforce v2 data source from previous versions of Incorta.

Azure Data Lake Storage (ADLS) Gen2 Authentication Support for Service Principal authorization

For an Azure Data Lake Storage (ADLS) Gen2 data source, you can now specify Service Principal as an Authentication Type.

An Azure Active Directory service principal is an identity for an application that needs to access or modify resources using Role-Based Access Control (RBAC).

To learn more about creating an Azure Active Directory Service Principal in your Azure Portal, visit How to Create a Service Principal.

Your Azure Portal contains the required details for this configuration:

  • Client ID
  • Client Secret Key
  • Tenant ID

To configure Service Principal authentication for authorization, follow these steps:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Data Lake, select Data Lake - Azure Gen2.
  • In the New Data Source dialog, specify the following:

    • Data Source Name
    • Authentication Type = Service Principal
    • Client ID
    • Client Secret Key
    • Tenant ID
    • Directory
  • Select Ok to save your changes.

Data Folders

In the 4.6 release, Incorta Analytics users can now create, upload, share, and delete Data Folders. A Data Folder can contain one or more Data Files and Data Folders. Here are the steps to create a Data Folder:

  • In the Navigation bar, select Data.
  • In the Action bar, select + New, then select Create Folder.
  • In the Add Folder dialog, enter the Folder name.

The new Data Folder appears in the Local Data Files tab.

Upload a Folder of Data Files

In the 4.6 release, you can upload a folder from your local machine that contains Data Files into Incorta. Incorta uploads the entire folder hierarchy of subfolders and files in original form. Incorta only uploads supported data file types: CSV, TSV, TXT, XLS, and XLSX. Incorta ignores any empty folders as well as duplicate files unless the Overwrite option is enabled. Here are the steps to upload a Data Folder:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Data Files, select Upload Data Folder.
  • In the Upload Data Folder dialog, in Upload Options, optionally select Overwrite existing file.
  • In the Upload Data Folder dialog, drag and drop a Folder.
  • Select Upload.

Share a top level Data Folder

You can only share a top level Data Folder. All child items — folders and data files — inherit the same shared access rights.

Delete a Data Folder

You can only delete a folder for which you have Edit permissions. Deleting a folder also deletes its child data folders and data files.

Upload Multiple Files

In the 4.6 release, Incorta Analytics users can upload one or more Data Files to Local Data Files. Incorta ignores duplicate files unless the Overwrite option is enabled. Here are the steps to upload multiple files:

  • In the Navigation bar, select Data.
  • In the Actions bar, select + New, then select Add Data.
  • In the Choose a Data Source dialog, in Data Files, select Upload Data File.
  • In the Upload Data File dialog, in Upload Options, optionally select Overwrite existing file.
  • In the Upload Data File dialog, drag and drop one or more files.
  • Select Upload.

Create a Data Source with incomplete or invalid details

In this release, you can now create or edit a data source with incomplete or invalid details. An Error dialog simply reports that some fields are invalid. To save the invalid settings, select Save anyway.

Support legacy MS Excel file formats as a File System Data Source

Incorta now supports more Microsoft Excel file formats:

  • Excel Workbook (*.xlsx)
  • Excel 97-2003 Workbook (*.xls)
  • Microsoft Excel 5.0/95 Workbook (*.xls)

Schema

In the Incorta 4.6 release, there are several enhancements for Schema optimization and configuration such as:

  • Directory Selection for Data Lake folders in the Schema Wizard
  • Incremental extracts for a Data Lake table that uses timestamp file naming
  • Preview of data in the Table Editor
  • Post Extraction Callback with Webhooks

Schema Wizard Supports Directory Selection

In this release, the Schema Wizard supports the selection of a folder directory. The requirement is that the folder directory exists as either a Local Data Files folder or a Data Lake data source folder. The Schema Wizard automatically configures the table data source to use a Directory, including all Subdirectories Files and Union Files. Follow these steps to select a folder directory for a Local Data Files folder using the Schema Wizard:

  • In the Navigation bar, select the Schema tab.
  • In the Action bar, select + New.
  • In the Add New Menu, select Schema Wizard.
  • In the Add Schema Wizard, in (1) Choose a Source, enter a unique Schema Name.
  • In Select a Datasource, select LocalFiles.
  • Optionally enter a Schema description.
  • Select Next.
  • In (2) Manage Tables, in the Selection Panel, select the folder directory.
  • Verify the columns.
  • Select Next.
  • In (3) Finalize, leave the “Create joins between selected tables if foreign key relationships are detected” checkbox checked.
  • Select Finish.

Incremental Extracts Using a Timestamp in File Names for a Data Lake table

In this release, for a table in a schema that uses a Data Lake data source, you can adopt an incremental data loading strategy that relies on a timestamp in the file name itself. The requirement is that all files in the Data Lake source, such as an S3 bucket, employ a consistent naming convention that includes a timestamp format for lexicographic comparison. The file name can be the timestamp itself or the timestamp as the file name suffix. The supported timestamp formats are:

  • yyyy-MM-dd
  • dd.MM.yyyy
  • dd-MMM-yy
  • dd-MMM-yyyy
  • yyyy-MM-dd HH.mm.ss
  • Unix Epoch (seconds)
  • Unix Epoch (milliseconds)

Here are examples of several Data Lake files in an AWS S3 bucket that support incremental loads scheduled for every 30 minutes, with a naming convention using the “yyyy-MM-dd HH.mm.ss” timestamp format:

  • transactions_2020-01-21 09.28.01.csv
  • transactions_2020-01-21 10.08.10.csv
  • transactions_2020-01-21 10.28.15.csv

Incorta will ignore files with a non-conforming timestamp file name.
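
Because the “yyyy-MM-dd HH.mm.ss” format orders chronologically when compared as text, a lexicographic comparison is enough to find files newer than the last extracted file. Here is a minimal sketch of that comparison (the high-water mark file name is hypothetical, and this illustrates the naming convention, not Incorta's internal implementation):

# Illustration only: lexicographic order of timestamped file names matches time order
files = [
    "transactions_2020-01-21 09.28.01.csv",
    "transactions_2020-01-21 10.08.10.csv",
    "transactions_2020-01-21 10.28.15.csv",
]
last_extracted = "transactions_2020-01-21 10.00.00.csv"     # hypothetical high-water mark
new_files = sorted(f for f in files if f > last_extracted)  # plain string comparison
print(new_files)   # the two files named after 10:00 on 2020-01-21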

To enable and configure an incremental extract for a Data Lake table using a timestamp file name, follow these steps:

  • For a given schema, in the Schema Designer, select the Data Lake table, and open the schema Table Editor.
  • In the Table Editor, in the Summary section, select the Table Data Source.
  • In the Data Source dialog, toggle Incremental to enabled.
  • In Incremental Extract Using, select Timestamp in File Name.
  • In Timestamp format in file name, select a timestamp format.
  • In Directory Path, select the directory path relative to the root of the Data Lake data source path.
  • Select Add.

Preview Data in the Table Editor

In this release, you can now preview table data from the Columns section of the Table Editor. The Preview dialog shows sample data from the table along with column level statistics including MINVALUE, MAXVALUE, and #NULLS.

Here are the steps to preview data:

  • For a given schema, in the Schema Designer, select a table to open the Schema Table Editor.
  • In the Table Editor, in the Columns section, select Preview data.
  • In the Preview dialog, review the columns statistics and sample rows.
  • To close the Preview dialog, select X.

Preview Data also supports previewing data for Data Lake data source tables that are configured as Remote tables.

Post Extraction Callback with Webhooks

In this release, for a given table in a schema, you can specify a post extraction callback to an external service, application, or endpoint using Webhooks. Incorta supplies the following extraction parameters for the Webhook callback:

  • extractionDuration
  • rejectedRows
  • loadType
  • extractedRows
  • state
  • schemaName
  • extractStart
  • tableName
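
As an illustration of how an external endpoint might receive these parameters, here is a minimal sketch of a receiver. The endpoint below is hypothetical and assumes the parameters arrive as form values, as the webhook.site verification steps later in this section suggest:

from flask import Flask, request

app = Flask(__name__)

@app.route("/incorta-callback", methods=["POST"])
def incorta_callback():
    # Hypothetical endpoint: log a few of the extraction parameters posted by Incorta
    params = request.form.to_dict()
    print(params.get("schemaName"), params.get("tableName"), params.get("state"))
    return "", 200

app.run(port=8080)  # the Callback URL would then point at this host and port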

Here are the steps to enable and configure a Callback for a table data source:

  • For a given schema, in the Schema Designer, select a specific Table, and open the Schema Table Editor.
  • In the Table Editor, in the Summary section, select a Table Data Source.
  • In the Data Source dialog, enable Callback.
  • With Callback enabled, enter the Callback URL.
  • Select Save.

You can verify a post extraction callback with these steps:

  • In a separate browser tab or window, visit http://webhook.site.
  • Select Copy to clipboard.
  • For the table data source, in the Data Source dialog, enable the Callback and paste the copied webhook URL into the Callback URL textbox.
  • In the Data Source dialog, select Save.
  • In Schema Designer, for the given table, in More Options, select Load Table or Load Staging.
  • Return to the webhook URL in the browser to view the result in Form Values.

Oracle Database and MySQL Database Data Source support for Chunking by Timestamp

Incorta chunking is a table data source configuration that allows for parallel data extraction. The parallel execution significantly helps extract rows from very large tables.

In this release, for both Oracle and MySQL data source tables, you can now select By Timestamp as the Chunking Method.

When selected, you must specify the following:

  • Order Column, which must be a Date or Timestamp data type and determines how to order the table before the chunking extraction
  • Chunk Period, which determines the boundary of the chunks as either Daily, Weekly, Monthly, or Annually.

In addition to the Order Column and Chunk Period, there are two optional properties:

  • Upper Bound, which serves as the end date or timestamp for the Order Column; the value must be in the format “yyyy-MM-dd HH:mm:ss.SSS”
  • Lower Bound, which serves as the start date or timestamp for the Order Column; the value must be in the format “yyyy-MM-dd HH:mm:ss.SSS”
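
As a conceptual illustration only (this is not Incorta's implementation), a Monthly Chunk Period between a Lower Bound and an Upper Bound partitions the Order Column range into month-sized chunks that can be extracted in parallel:

from datetime import datetime

# Illustration only: month boundaries between a lower and an upper bound
lower = datetime(2019, 1, 1)    # Lower Bound "2019-01-01 00:00:00.000"
upper = datetime(2019, 4, 1)    # Upper Bound "2019-04-01 00:00:00.000"
boundaries = []
year, month = lower.year, lower.month
while datetime(year, month, 1) <= upper:
    boundaries.append(datetime(year, month, 1))
    year, month = (year + 1, 1) if month == 12 else (year, month + 1)
# Each consecutive pair of boundaries is one chunk: January, February, and March 2019 here
chunks = list(zip(boundaries[:-1], boundaries[1:]))
print(len(chunks))   # 3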

Chunking Performance Enhancement

Chunking threads now honor the connection pool size and will not exceed it. If the number of chunks exceeds the connection pool size, the Incorta Loader Service queues the chunks. In addition, the timeout for chunk fetching is unlimited.

Schema Isolation Protection

In this release, while a schema is loading data or is scheduled to load data, Incorta prevents schema deletes, imports, or updates.

Materialized Views with Notebooks

Notebook Integration is an Incorta Labs feature and requires additional configuration in the CMC. Please refer to the Notebook Add-on section in these release notes for more details about Notebook integration and configuration.

The Incorta Labs Notebook integration supports the following actions:

  • Create a Materialized View using a Notebook
  • Edit an existing Materialized View using a Notebook

Create a Materialized View using a Notebook

The selected Language is the language that the Notebook will export to the materialized view. To create a materialized view using a Notebook, follow these steps:

  • For a given schema, in the Actions bar, select + New.
  • In the Add New menu, select Materialized View.
  • In the Data Source dialog, in Language select SQL or Python.
  • In Script, select Edit in Notebook.
  • In the Edit Notebook (selected language) dialog, in at least one Paragraph, in the paragraph Code Section, enter the execution code.
  • Optionally run the paragraph or the notebook.
  • To exit the Edit Notebook dialog, select Done.
  • In the Data Source dialog, optionally select Add Property to configure specific Apache Spark properties for the materialized view.
  • Select Save.

Edit a Materialized View using a Notebook

To edit a materialized view using a Notebook, follow these steps:

  • For a given schema, in the Schema Designer, select the materialized view table, and open the Schema Table Editor.
  • In the Table Editor, in the Summary section, select the Table Data Source.
  • In the Data Source dialog, in Script, select the textbox or open dialog icon.
  • In the Edit Notebook dialog, in at least one Paragraph, in the paragraph Code Section, edit the execution code.
  • Optionally run the paragraph or the notebook.
  • To exit the Edit Notebook dialog, select Done.
  • In the Data Source dialog, optionally select Add Property to configure specific Apache Spark properties for the materialized view.
  • Select Save.

Using Notebooks for Materialized Views

By design, a notebook is an interactive environment that allows you to explore, manipulate, and transform data iteratively and interactively. A notebook consists of one or more paragraphs. A paragraph consists of a code section and a result section. In the code section, you can use a language specific editor to write either PySpark or SQL code. You can execute code in the code section using paragraph commands. When there are executed results, you can view the output in the result section of the paragraph.

When editing a notebook, you can run a specific paragraph or all notebook paragraphs in the notebook.

The Notebook Add-on service runs as an application in Apache Spark. The Notebook Add-on service creates a notebook application in Apache Spark that manages the paragraph execution request. When running more than one paragraph, the Notebook Add-on service application processes each paragraph sequentially: when the first paragraph completes, the second is started.

Code Execution Language for a Notebook

When creating a materialized view, in the Data Source dialog, you must select a Language. The choices are SQL or Python. SQL represents the execution of SQL using the Spark SQL library. Python represents the execution of PySpark, which is the Python API for Spark.

Edit Notebook dialog

The Edit Notebook dialog title specifies the notebook language for export to the materialized view. The Edit Notebook dialog contains the notebook layout. The notebook layout consists of a toolbar and one or more interactive paragraphs.

Save the Notebook

To save your changes to a notebook, in the Edit Notebook dialog, select Done. When you save your changes in the Edit Notebook dialog, two events occur:

  • Incorta saves all the changes to the interactive notebook as a notebook file internally
  • From the notebook, Incorta exports only the specified Language paragraphs to the materialized view.

Anatomy of a Notebook

A notebook consists of the Notebook Toolbar and one or more paragraphs. You can use the Toolbar and related keyboard shortcuts to interact with the notebook environment.

Notebook Toolbar

In the notebook toolbar, you can:

  • Run all paragraphs
  • Show/Hide this code
  • Show/Hide this output
  • Clear output
  • Search code
  • List keyboard shortcuts

Run all paragraphs

The Notebook Add-on service runs as an application in Apache Spark. When you select Run all paragraphs, the notebook submits each paragraph in sequential order to the Notebook Add-on service. The Notebook Add-on service creates a notebook application in Apache Spark that manages the execution requests. Each paragraph is processed sequentially: when the first paragraph completes, the second is started.

Show/Hide this code

Use the Show/Hide this code notebook toolbar command to toggle the visibility of all paragraph code sections in the notebook.

Show/Hide this output

Use the Show/Hide this output notebook toolbar command to toggle the visibility of all paragraph result sections in the notebook.

Clear output

Use the Clear output notebook toolbar command to clear all paragraph result sections in the notebook. A dialog will confirm the command request.

Search code

Use the Search code toolbar command to open the Find and Replace dialog. The scope of Search is for paragraph code sections only. You can navigate between search term occurrences, replace a single search term occurrence with a new term, or replace all search term occurrences with a new term.

List keyboard shortcuts

Use the List keyboard shortcuts toolbar command to open the Keyboard shortcuts dialog. The Keyboard shortcuts dialog contains the following:

Notebook Keyboard Shortcuts

  • Run paragraph: Shift + Enter
  • Run all above/below paragraphs: Ctrl + Shift + Enter
  • Cancel: Ctrl + Option + C
  • Move cursor Up: Ctrl + P
  • Move cursor Down: Ctrl + N
  • Remove paragraph: Ctrl + Option + D
  • Insert new paragraph above: Ctrl + Option + A
  • Insert new paragraph below: Ctrl + Option + B
  • Insert copy of paragraph below: Ctrl + Shift + C
  • Move paragraph Up: Ctrl + Option + K
  • Move paragraph Down: Ctrl + Option + J
  • Enable/Disable run paragraph: Ctrl + Option + R
  • Toggle output: Ctrl + Option + O
  • Toggle editor: Ctrl + Option + E
  • Toggle line number: Ctrl + Option + M
  • Toggle title: Ctrl + Option + T
  • Clear output: Ctrl + Option + L
  • Link this paragraph: Ctrl + Option + W
  • Reduce paragraph width: Ctrl + Shift + -
  • Increase paragraph width: Ctrl + Shift + +

Editor Keyboard Shortcuts

  • Auto-completion: Ctrl + .
  • Cut the line: Ctrl + K
  • Paste the line: Ctrl + Y
  • Search inside the code: Ctrl + S
  • Move cursor to the beginning: Ctrl + A
  • Move cursor to the end: Ctrl + E
  • Find in code: Ctrl + Option + F1

Anatomy of a Paragraph

By design, a notebook encourages you to explore data iteratively and interactively using the construct of a paragraph. A paragraph consists of:

  • Paragraph Status
  • Paragraph Commands
  • Paragraph Title
  • Code Section
  • Result Section

Paragraph Status

When executing the code section, you can view the various statuses of a given paragraph. The statuses are:

  • Ready
  • Pending
  • Running
  • Error
  • Finished

Paragraph Commands

Paragraph commands include the following:

  • Run this paragraph (Shift + Enter)
  • Show/Hide editor (Ctrl + Option + E)
  • Show/Hide output (Ctrl + Option + O)
  • Configure the paragraph

Run this Paragraph

The Notebook Add-on service runs as an application in Apache Spark. When you select Run this paragraph, the notebook submits the paragraph code to the Notebook Add-on service which in turn manages the code execution in Apache Spark.

Show/Hide editor

Use the Show/Hide editor paragraph command to toggle the visibility of the paragraph code section.

Show/Hide output

Use the Show/Hide output paragraph command to toggle the visibility of the paragraph result section.

Configure the Paragraph

You can use the Configuration Menu or, in some cases, a keystroke combination to configure the paragraph. Here are the options and related keystrokes for configuring a paragraph:

  • Width
  • Font size
  • Move down (Ctrl + Option + J)
  • Insert new (Ctrl + Shift + B)
  • Run all below (Ctrl + Shift + Enter)
  • Clone paragraph (Ctrl + Option + C)
  • Show title (Ctrl + Option + T)
  • Show line numbers (Ctrl + Option + M)
  • Disable run (Ctrl + Option + R)
  • Link this paragraph (Ctrl + Option + W)
  • Clear output (Ctrl + Option + L)
  • Remove (Ctrl + Option + D)

Paragraph Title

Although not required, a paragraph can have a descriptive title. Use the Toggle title keystroke (Ctrl + Option + T) to show or hide the paragraph title.

Code Section

In a Code Section, you can:

  • Specify the language of execution, such as SQL (%sql) or Python (%pyspark)
  • Enter the code for execution using the editor

Although a notebook may contain both SQL and PySpark (Python for Spark) paragraphs, the Notebook exports to the materialized view only the code that matches the Data Source dialog language selection.
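
For instance, a minimal PySpark paragraph might look like the following. This is a sketch only: the schema and table names are hypothetical, and it assumes the spark session is available in the paragraph.

%pyspark
# Hypothetical exploration step before exporting the code to the materialized view
df = spark.sql("SELECT order_id, amount FROM SALES.ORDERS WHERE amount > 0")
df.count()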

Code Editor

The Code Editor contains context-aware code completion. As you type, code completion will offer suggestions in a menu. As an example, for PySpark code, the code completion menu will offer choices for Python keywords, local variable references, and Incorta schema related objects.

Result Section

The result section contains the output from the execution of the code section. You can view the output of an executed paragraph. In the Action bar for the paragraph output you can view the output as a:

  • Table
  • Bar Chart
  • Pie Chart
  • Area Chart
  • Line Chart
  • Scatter Chart

The Result Section truncates output by default to 102400 bytes. Each viewing option has specific interactions and configurable settings.

Table Settings

In the output toolbar, select settings to view the Table Options:

  • useFilter
  • showPagination

useFilter

When selected, you can interactively filter rows by specifying a column filter in the filter textbox of the column header.

showPagination

When selected, you can interactively page through rows using the footer pagination control. The control allows you to go to the first page, navigate to the next page, navigate back a page, or navigate to the end pages. The footer pagination control also allows you to specify the number of items (rows) per page: 25, 50, 100, 250, or 1000.

For a table, in the Column Header, you can open the More Options menu to:

  • Sort Ascending
  • Sort Descending
  • Hide Column
  • View the selected data Type
  • Group Table by the first column

In addition, you can also select columns in a Table to view.

Bar Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • keys
  • groups
  • values

Keys

Specify one or more column fields that define uniqueness.

Groups

Specify one or more column fields for grouping. For example, you can group by Year. You can toggle the values in the legend.

Values

Specify one or more column fields for aggregation. Select a field to change the aggregation type (sum, count, avg, min, max) in the menu.

For a Bar Chart, additional settings include xAxis selections, optionally Grouped or Stacked bars, and legend toggle interactions. The xAxis selections are:

  • Default
  • Rotate
  • Hide

Pie Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • keys
  • groups
  • values

Keys

Specify one or more column fields that define uniqueness.

Groups

Specify one or more column fields for grouping. For example, you can group by Year. You can toggle the values in the legend.

Values

Specify one or more column fields for aggregation. Select a field to change the aggregation type (sum, count, avg, min, max) in the menu.

For a Pie Chart, additional settings include legend toggle interactions.

Area Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • keys
  • groups
  • values

Keys

Specify one or more column fields that define uniqueness.

Groups

Specify one or more column fields for grouping. For example, you can group by Year. You can toggle the values in the legend.

Values

Specify one or more column fields for aggregation. Select a field to change the aggregation type (sum, count, avg, min, max) in the menu.

For an Area Chart, additional settings include xAxis selections, optionally Stacked, Stream, Expand area definitions, and legend toggle interactions. The xAxis selections are:

  • Default
  • Rotate
  • Hide

Line Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • keys
  • groups
  • values

Keys

Specify one or more column fields that define uniqueness.

Groups

Specify one or more column fields for grouping. For example, you can group by Year. You can toggle the values in the legend.

Values

Specify one or more column fields for aggregation. Select a field to change the aggregation type (sum, count, avg, min, max) in the menu.

For a Line Chart, additional settings include force Y to 0, zoom, Date format, xAxis selections, and legend toggle interactions. The xAxis selections are:

  • Default
  • Rotate
  • Hide

Zoom allows you to select a specific range of values on the xAxis.

Scatter Chart Settings

In the output toolbar, select settings to view the Available Fields from the result output. You can drag and drop column fields to the following:

  • xAxis
  • yAxis
  • group
  • size

xAxis

Specify a column field for the xAxis.

yAxis

Specify a column field for the yAxis.

group

Specify a column field for grouping. For example, you can group by Year. You can toggle the values in the legend.

size

Specify a column field that determines the point size.

For a Scatter Chart, additional settings include legend toggle interactions.

Incorta functions for PySpark Code Sections

For a PySpark code section, you can use the following Incorta helper functions:

  • incorta.describe(DataFrame expr)
  • incorta.head(DataFrame expr, integer number_rows_optional)
  • incorta.printSchema(DataFrame expr)
  • incorta.show(DataFrame expr)
  • incorta.show_plotly(Figure expr, height=100, width=100, **kwargs)

incorta.describe(DataFrame expr)

The function returns statistical details about the specified DataFrame such as:

  • Count of records
  • Mean
  • Standard Deviation
  • Min
  • Max

incorta.head(DataFrame expr, integer number_rows_optional)

For the specified DataFrame, the function returns a table of field headers and one row by default. An optional parameter specifies the number of rows to return (0 or more).

incorta.printSchema(DataFrame expr)

For the specified DataFrame, the function returns the DataFrame schema in tabular form, with the following details for each column:

  • Name
  • Type
  • Nullable

incorta.show(DataFrame expr)

The function outputs the results of the specified DataFrame in the results section of the paragraph.
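
For example, in a notebook paragraph you might apply these helper functions to a DataFrame as follows. This is a minimal sketch; df is assumed to be a Spark DataFrame created in an earlier paragraph.

# df is a hypothetical Spark DataFrame created in an earlier paragraph
incorta.describe(df)       # count, mean, standard deviation, min, and max for each column
incorta.head(df, 5)        # field headers plus the first 5 rows
incorta.printSchema(df)    # name, type, and nullable flag for each column
incorta.show(df)           # renders the DataFrame in the results section of the paragraph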

incorta.show_plotly(Figure expr, height=double, width=double, **kwargs)

This function requires the installation of plotly.py using pip. The function takes a plotly figure or plot dict (dictionary) and displays the graph as output in the results section. The arguments are:

  • plot_dic: A plotly plot dict or figure

The optional keyword arguments are:

  • height: height in pixels of the plot
  • width: width in pixels of the plot
  • kwargs: any additional kwargs
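
For example, the following sketch builds a simple plotly figure from hypothetical values and displays it with incorta.show_plotly. It assumes plotly has already been installed with pip.

import plotly.graph_objs as go

# hypothetical figure built for illustration only
fig = go.Figure(data=[go.Bar(x=['A', 'B', 'C'], y=[10, 20, 15])])

# display the figure in the results section of the paragraph
incorta.show_plotly(fig, height=400, width=600)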

incorta_ml, Incorta Machine Learning libraries for PySpark

In the Incorta 4.6 release, incorta_ml is a new Python machine learning library for PySpark from Incorta.

Python Requirements

The incorta_ml library supports Python 2.7, Python 3.5, Python 3.6, and Python 3.7. Pandas officially supports these versions of Python.

DEPRECATION

Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. A future version of pip will drop support for Python 2.7.

Without Notebook Integration

It is possible to use incorta_ml without Notebook Integration. Some of the Python libraries require the installation of Linux packages.

Linux Packages

Install the following packages on your Incorta host (an example command follows this list):

  • gcc
  • gcc-c++
  • python-pip
  • python-devel
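
For example, on a RHEL or CentOS host (an assumption; use the package manager for your distribution), you can install the packages as follows:

sudo yum install gcc gcc-c++ python-pip python-devel
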
Python 2.7

If using Python 2.7, install the following libraries using pip install:

  • numpy
  • plotly
  • pandas
  • lime
  • fbprophet
  • statsmodels
  • pyramid-arima
Python 3.5, Python 3.6, and Python 3.7

If using Python 3.5, Python 3.6, or Python 3.7, first upgrade pip:

sudo pip install --upgrade pip

After upgrading pip, install the following libraries using pip install (an example command follows this list):

  • numpy
  • plotly
  • pandas
  • lime
  • fbprophet
  • statsmodels
  • pmdarima
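
For example, the libraries can be installed with a single pip command (adjust for your environment):

sudo pip install numpy plotly pandas lime fbprophet statsmodels pmdarima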

Available libraries in incorta_ml

The incorta_ml library offers the following:

  • Feature Selection
  • Features Preparation
  • Model Building
  • Model Evaluation

Feature Selection

For a given dataframe, use the select_features function to identify potentially significant features.

Signature
from incorta_ml import select_features
output_df = select_features(input_df, model_name, label_column_name, is_training)
params

input_df: a Spark dataframe that contains feature columns and a label column.

model_name: a handle that identifies the process. It is recommended to use the same name as the prediction model that this feature selection is for.

label_column_name: the name of the label column, such as input_df.column; it must be a two-part qualified name.

is_training: specify True to select features from input_df; specify False if you have already run the feature selection and only want to apply the same selection to another dataframe.

returns

output_df: the dataframe that contains only the selected features and the label.
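
For example, the following sketch selects features from a training dataframe and then applies the same selection to a testing dataframe. The dataframe and model names are illustrative and follow the regression example later in this section.

from incorta_ml import select_features

# identify significant features from the training dataframe
train_selected_df = select_features(training_df, model_name='wine', label_column_name='label', is_training=True)

# apply the same feature selection to the testing dataframe
test_selected_df = select_features(prediction_df, model_name='wine', label_column_name='label', is_training=False)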

Features Preparation

With the prepare_features function, you can preprocess features for prediction, such as converting a non-numeric column with one-hot encoding.

Signature
from incorta_ml import prepare_features
output_df = prepare_features(input_df, model_name, label_column_name, is_training)
params

input_df: a Spark dataframe that contains feature columns and a label column.

model_name: a handle that identifies the process. It is recommended to use the same name as the prediction model that this preprocessing is built for.

label_column_name: the name of the label column as a two-part qualified name, such as one of input_df.columns.

is_training: specify True to build the transformation in Spark’s Directed Acyclic Graph (DAG), that is, when preparing the training data for the first time; specify False to apply the same transformation to a testing or development data set.

returns

output_df: a dataframe that contains numeric features, transformed features, and labels.
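
For example, the following sketch builds the transformations on a training dataframe and then applies them to a testing dataframe. The dataframe and model names are illustrative.

from incorta_ml import prepare_features

# build the transformations (for example, one-hot encoding) from the training dataframe
train_prepared_df = prepare_features(training_df, model_name='wine', label_column_name='label', is_training=True)

# apply the same transformations to the testing dataframe
test_prepared_df = prepare_features(prediction_df, model_name='wine', label_column_name='label', is_training=False)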

Model Building and prediction

With the build_model function, you can build the model, save (persist) it, and return the training dataframe with a prediction column.

Supported Algorithms

Here are the supported Apache Spark ML algorithms:

  • LogisticRegression
  • DecisionTreeClassifier
  • RandomForestClassifier
  • GBTClassifier
  • MultilayerPerceptronClassifier
  • LinearSVC
  • NaiveBayes
  • LinearRegression
  • GeneralizedLinearRegression
  • DecisionTreeRegressor
  • RandomForestRegressor
  • GBTRegressor
  • IsotonicRegression

Note

Call predict after build_model with the same model_name.

Signature
from incorta_ml import build_model
output_df = build_model(input_df, model_name, algorithm_name, label_column_name, params, mode="classification")
output_df = predict(input_df, model_name)
params

input_df: a Spark dataframe that contains feature columns and a label column. Note that all features must be numeric.

model_name: a handle that identifies the model.

algorithm_name: the algorithm name, which can be any of the supported Apache Spark algorithms listed above, or auto. In auto mode, multiple algorithms are selected as candidates and the best of them is chosen; there is no need to specify params (pass None or {}).

label_column_name: the name of the label column as a two-part qualified name, such as one of input_df.columns.

params: a dictionary that contains the model parameters as required by Apache Spark, such as {'regParam': 0.001}.

mode: a string that defines the mode type, either classification or regression.

returns

output_df: a dataframe that contains the numeric features, transformed features, labels, and the prediction column.
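
For example, the following sketch uses auto mode, in which there is no need to specify params. The dataframe and model names are illustrative and follow the regression example below.

from incorta_ml import build_model, predict

# let auto mode choose among the candidate algorithms
output_df = build_model(training_df, model_name='wine_auto', algorithm_name='auto',
                        label_column_name='label', params=None, mode='regression')

# predict on the testing dataframe using the same model name
dftest_pred = predict(prediction_df, model_name='wine_auto')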

Model Evaluation

Use the evaluate function to evaluate the model for a given dataframe.

Signature
from incorta_ml import evaluate
output_df=evaluate(input_df, model_name)
params

input_df: a Spark dataframe that contains feature columns and a label column. The dataframe schema must match the schema of the training dataframe.

model_name: the name of the model to evaluate.

returns

output_df: a dataframe that contains two columns ‘metric_name’ and ‘value’ where each row represents a metric and an associated numeric value.

Incorta ML Regression Example

from incorta_ml import *  # imports all the functions
full_data = read("/path/to/winequality-white.csv")  # reads the data

# split the data into training and testing set.
split_df = full_data.randomSplit([0.7,0.3],1)
training_df = split_df[0]
prediction_df = split_df[1]

# build, train and save the model.
build_model(training_df, model_name='wine', algorithm_name='RandomForestRegressor',
                   label_column_name='label', params={"numTrees":10}, mode='regression')
# predict the testing data.
dftest_pred=predict(prediction_df, model_name='wine')
# evaluate the model on testing data
eval_df = evaluate(prediction_df, model_name='wine')

# show the evaluation metrics
eval_df.show()

Incorta ML Classification Example

from incorta_ml import * # imports all the functions
full_data = read("/path/to/iris.csv", "csv") # read the data

# split the data into training and testing set.
split_df = full_data.randomSplit([0.7,0.3],1)
training_df = split_df[0]
prediction_df = split_df[1]

# build, train and save the model.
build_model(training_df, model_name='iris', algorithm_name='RandomForestClassifier',
                   label_column_name='label', params={"numTrees":10}, mode='classification')

# predict the testing data.
dftest_pred=predict(prediction_df, model_name='iris')

# evaluate the model on testing data
eval_df = evaluate(prediction_df, model_name='iris')

# show the evaluation metrics
eval_df.show()

Analyzer

In Incorta 4.6, there are several enhancements to the Analyzer including:

Insight Descriptions

When editing a given dashboard insight with the Analyzer, you can now optionally specify a description. To create an Insight Description, follow these steps:

  • For an existing dashboard, for a given insight, in the Actions menu, select Edit (pen), or select More Options (kebab) and then Edit.
  • In the Analyzer, select Click to Edit Insight Description, and enter a description.
  • In the Action bar, select Done to save.

To view an Insight Description in a Dashboard, follow these steps:

  • For an existing dashboard, for a given insight, in the insight title, select the Information Icon.
  • In the Tooltip, view the insight description.

Insight Titles Support Variable References

In this release, you can now reference one or more variables in a dashboard insight title. A referenced variable in an insight title returns the original, initialized value. Supported variable types are:

  • Presentation Variables
  • Session Variables
  • System Variables

To reference a variable in an Insight Title, use the $$ syntax. Here are some examples:

  • To reference the user system variable, enter $$user.
  • To reference the current date system variable, enter $$currentDate.
  • To reference a presentation variable named pvGroupBy, enter $$pvGroupBy.
  • To reference an internal session variable named ivarIsUserInGroupAdmin, enter $$ivarIsUserInGroupAdmin.

Additional Performance Enhancements

In the 4.6 release, there are additional performance enhancements and changes:

Aggregate and Group By Query Performance Improvements

In this release, there are two enhancements that improve Aggregate and Group By query performance.

The first enhancement addresses pagination rendering for insight visualizations. Only the rows required for rendering are processed, instead of all rows in an insight visualization; the default number of rows is 1,000. This feature is enabled by default. To disable it, in the engine.properties file, set engine.paginate_aggregated_queries to false.

In certain cases, pagination of rows is not applicable, such as when an insight is exported as a CSV or MS Excel file, rendered as a Pivot Table insight, or queried from another application over the SQLi interface. In these cases, the Analytics Service now uses parallel sorting and parallel materialization to improve query performance. This feature is enabled by default. To disable it, in the engine.properties file, set engine.parallel_materialize_groups to false.
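
For example, to disable both optimizations, add the following entries to the engine.properties file (both settings are enabled by default):

engine.paginate_aggregated_queries=false
engine.parallel_materialize_groups=false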

Tomcat Server Upgrade

In this 4.6 release, Incorta now uses Apache Tomcat V7.0.96.

Previous versions of Incorta used Apache Tomcat V7.0.65.

New Incorta installations will use Apache Tomcat V7.0.96 by default.

When upgrading from a previous version to Incorta 4.6, the upgrade process will install the new Tomcat version, but will preserve the existing user configuration files.

© Incorta, Inc. All Rights Reserved.