Concepts → Materialized View

About a Materialized View

A materialized view is an entity object in a physical schema that defines an Apache Spark job. The definition of the job consists of a script in a Spark compatible language. The job typically reads data for existing entity objects in a physical schema using the related Apache Parquet files in shared storage. During a load job of the parent physical schema or the materialized view, the related Spark job saves the resulting dataframe to shared storage in a Parquet file.

Important

In order to both successfully define the script for a materialized view and to load a materialized view, Apache Spark must be running and available in the Incorta cluster.

A materialized view depends on existing entity objects in a physical schema (and in certain cases, can include runtime business views). This dependency defines a materialized view as a type of derived table. As a derived table, a materialized view processes already extracted data and typically functions to transform and enrich data within a given physical schema.

Properties of a Materialized View

The properties of a materialized view are:

  • Name
  • Data Source(s)
  • Columns
  • Formula Columns
  • Runtime Security Filters
  • Parent Joins
  • Child Joins
  • Load Filter

Name

A materialized view name must adhere to the following name validation rules:

  • Must be unique to the physical schema
  • Must be between 1 and 128 characters in length
  • Must begin with an alpha character, lower or upper case, (_), or ($)
  • After the first alpha character, can contain zero or more alphanumeric characters in lower, upper, or mixed case
  • After the first alpha character, can contain zero or more underscore (_) or Dollar sign ($) characters
  • Besides underscore (_) and Dollar sign ($), cannot contain special characters, symbols, or spaces
Important

The name for a materialized view is immutable.

Data Source

The data source for a materialized view is Apache Spark. A materialized view only supports a single-source dataset. A schema developer can combine multiple dataframes with the script of a materialized view.

Type

The data source type should be set as Materialized View. Changing the data source type will change the type of table created.

Language

Following are the languages available for writing your script:

  • Spark SQL
  • Spark Python
  • Spark Scala
  • Spark R
  • Incorta PostgreSQL
Note

Only when using Incorta PostgreSQL are you able to reference one or more business schema views or aliases.

Incremental

A materialized view supports Incremental loads. You can enable incremental consumption of rows from the data source dialog. When enabled, the Incremental Script dialog will become available. A schema developer must define the materialized view script for the incremental load.

Warning

An incremental load script should differ from the normal script. An incremental load job necessitates background compaction of Parquet files. A normal script run as an incremental load job may generate an unnecessary workload and Parquet compaction, causing load performance issues.

Script

You write the script for a materialized view using two different editor options:

Edit query

The Edit Query window will provide syntax highlighting and formatting capabilities. Incorta will validate the script when completing the data source configuration.

Edit in notebook

The notebook editor allows you to iteratively build, validate, and test queries within the editor. You can enable the notebook editor from the Cluster Management Console (CMC). The notebook editor is an Incorta Labs feature in your tenant configurations.

Note

The notebook editor is not available when using Incorta PostgreSQL.

IncortaML

From the notebook editor, you can use IncortaML (Incorta Machine Learning) to add machine learning models to your materialized view. You can utilize IncortaML to train models using regression and classification modeling.

Properties

When you create a materialized view, you can define the Spark properties for the load job. A CMC Administrator sets the default properties for a materialized view in the tenant’s configuration.

Recommendation

You should only edit the Spark properties of the materialized view to manage specific memory constraints and with an understanding of the Spark configuration.

Following are configurable Spark properties:

Property Default Description Notes
spark.driver.memory 1g Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in MiB unless otherwise specified (e.g. 1g, 2g). This is the memory allocated for the Spark logical plan and any Spark actions that transfer the data to the driver.
spark.executor.memory 1g The amount of memory to use per executor process, in MiB unless otherwise specified.
spark.cores.max No default The maximum amount of CPU cores to request for the application from the cluster.
spark.executor.cores The default is 1 with Spark configured using YARN. The number of cores to use on each executor. Set equal to spark.cores.max.
spark.sql.shuffle.partitions 200 Configures the number of partitions to use when shuffling data for joins or aggregations. Initially set the shuffle partition equal to cores.max. As the data increases, you can increase the partitions.
spark.driver.maxResultSize 1g Limit of the total size of serialized results of all partitions for each Spark action in bytes. Set to at least 1M or 0 for unlimited. Spark will abort the job if the total size is above this limit. Having a high limit may cause out-of-memory errors in the driver (depending on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors.
spark.memory.fraction 0.6 Fraction of heap space used for execution and storage. The lower the value, the more frequently spills and cached data eviction can occur.
spark.serializer org.apache.spark.serializer. JavaSerializer Used for serializing objects that will be sent over the network or need to be cached in serialized form. Do not configure this parameter, if you are using RDDs/DataFrame over custom data types that do not have proper implementation for Kryo serialization (e.g. avoid using this parameter for machine learning Materialized Views if you have RDDs of custom data types “non-primitive”)

Columns

Once you create a script, the referenced columns will be visible in the Table Editor. See Concepts → Materialized View Column for information on the column properties.

Referenceable Columns

Following are the possible types of referenceable columns in a materialized view script:

Note

Spark Referenceable refers to the following languages used with Spark: Spark SQL, Spark Python, Spark Scala, and Spark R.

Table or View Type Column Type Spark Referenceable Incorta PostgreSQL Referenceable Note
Alias Data-backed column No Yes
Alias Formula column No Yes
Business Schema View Data-backed column No Yes
Business Schema View Formula column No Yes Session Variables are not referenceable
Incorta Analyzer Table Data-backed column No No
Incorta Analyzer Table Formula column No No
Incorta SQL Table Data-backed column No No The Incorta SQL Table is an Incorta Labs feature and is enabled in the CMC.
Incorta View Data-backed column No No
Incorta View Formula column No No
Materialized View Data-backed column Yes Yes
Materialized View Formula column No Yes
Physical Schema Table Data-backed column Yes Yes
Physical Schema Table Formula column No Yes

Formula Columns

You create a materialized view formula column using the Formula Builder. Materialized view formula columns have several restrictions related to the materialized view’s join configurations. For more information see Concepts → Materialized View Formula Column.

Runtime Security Filters

As a schema developer, you can apply one or more runtime security filters to restrict row access to a materialized view. Any dependent object such as a runtime business view, dashboard insight, dashboard filter, or internal session variable will automatically apply the runtime security filter. To learn more about a physical schema table runtime security filters, see Concepts → Runtime Security Filter.

Parent Joins

The materialized view is the parent table in a join relationship to another entity object in the same or different physical schema. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.

A parent table typically has one key column or composite key columns. In both cases, the key column functions as a primary key. Where there is a key column, the join condition typically contains an equals = join operator.

It is possible to create a join from a child to parent without using a defined key column in the parent table. A table without a key column allows for duplicate values. If the parent table contains duplicate non-key rows, the Loader Service randomly selects a parent table row based on the non-key join. To help simulate a key for the parent table, consider using one or more conditions in the join definition such as > greater than and < less than. Multiple join operators may significantly narrow join values from the parent table so as to create row uniqueness. A join condition with multiple join operators requires internal sorting which may require high calculation costs for the join path.

Recommendation

Whenever possible, create a key column or a composite key column for a parent table. Avoid making a join to a parent table that does not have a key column or composite key columns.

Warning

Avoid implicit casting or conversions for disparate data types in a join between a child and parent table. In other words, the child table column and the parent table column should be of the same data type.

Child Joins

The materialized view is the child table in a join relationship to another entity object in the physical schema or another physical schema within the tenant. A child table typically has a column that serves as a foreign key. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.

Join inheritance

Materialized views do not inherit join relationships from source entities and will need to have their own join relationships created. The data sources available to a materialized view formula column are directly related to the child joins of the materialized view.

Warning

Avoid implicit casting or conversions for disparate data types in a join between a child and parent table. In other words, the child table column and the parent table column should be of the same data type.

Load Filter

For a materialized view, you can specify a load filter in the Table Editor. A load filter reduces the number of rows that the Analytics Service loads into memory from shared storage, but does not affect how many rows the Loader Service extracts from the data source of the table.

Additional Considerations

Script referencing a schema entity

A script referencing a column in a physical schema or business schema entity must identify the schema and entity. As an example, a SQL query would identify the schema and entity as SCHEMA.ENTITY.

Script restrictions

Schema restrictions

When you create your materialized view using Incorta PostgreSQL, you are able to reference a business schema view or an alias. These options are not available when using the other provided spark languages.

Session variables

A materialized view cannot reference business schema entities that contain session variables. Attempting to do so will result in an error when attempting to validate the script.

Spark jobs concerning a materialized view

When you manipulate the Spark properties of a materialized view, you may need to see the spark job execution of the materialized view. Any Spark job changes should be made in coordination with your System Administrator.

You can view the Spark job processes at the following URL: http://<server address>:9091

When you validate your script in the query editor, the materialized view will display as one spark job.

While you are running the notebook editor in a materialized view, Spark will display the notebook editor as a running application. Once you finish using the notebook editor and you validate the materialized view script, you will see a second job run when the materialized view is validated.


© Incorta, Inc. All Rights Reserved.