Concepts → Physical Schema Table Column
A physical schema table represents the metadata about an entity of structured data. As an entity object in a physical schema, the metadata and properties of a physical schema table also describe how the Loader Service stores the table’s data in shared storage and how the Analytics Loader service loads the table in to memory. Using the Table Editor, you can both create and edit the metadata for a physical schema table. This metadata includes the definition of Columns for structured data.
About a physical schema table column
A physical schema table column defines an attribute of an entity in terms of the attribute’s name, data type, function, and encryption.
Properties of a physical schema table column
A physical schema table column has several configurable properties. Some properties determine how Incorta stores the column data. Other properties influence the column’s representation and default behavior in tools such as the Analyzer and Formula Builder.
The Table Editor shows the properties of a column as follows:
Property | Control | Description |
---|---|---|
Show in Analyzer | checkbox | Select to show as a selectable column in the Data Panel |
Name | read only | The column’s name; used in the fully qualified name of the column |
Label | text box | Enter a user friendly name |
Type | drop down list | Select the Incorta Data Type of the column |
Function | drop down list | Select the Incorta Function of the column |
Encrypt | drop down list | Select if the column is encrypted in shared storage |
The following properties affect how the Loader Services stores data:
- Type
- Function
- Encrypt
Type property
For a new physical schema table, Incorta infers the column data type. For a table that is of type File System or Data Lake, Incorta samples one or more Text (csv, tsv, tab, txt) files to determine the most appropriate data type for each column. For a table that is of the type of SQL Database, Incorta uses the information from the JDBC connection to determine the most appropriate data type for each column.
In this manner, the Type property of a column identifies the data type for the column. The Loader Service extracts data from the data source and writes to
shared storage the columnar data in Apache Parquet file format. The Loader Services also writes data in a proprietary Direct Data Mapping file format. Incorta does not store data for a column with a Type of Null
.
As a schema developer, you can override the Type for a given column using the Table Editor. Please consider the following when selecting the most appropriate data type.
If you change the Type property, you will need to perform a full load of the table. If the column is used in a join, you must also perform a full load for all related tables, or, you must perform a full load of the physical schema.
For a multi-source table, the same named columns need to return the value of the same type. In the Table Editor, use Manage Outputs to select the common data type for the column.
Integer and Long
Incorta does not support unsigned data types. An Integer stores whole numbers from -2,147,483,648
to 2,147,483,647
. A Long
stores whole numbers from -9,223,372,036,854,775,808
to 9,223,372,036,854,775,807
. Whenever possible, select Integer. The data size for in-memory and disk usage is less than half that of Long.
Double
A Double stores a fractional value and is sufficient for storing up values with up to 15 decimal digits. For a table that is of type File System or Data Lake, Incorta may infer a Double value as an Integer or Long during sampling. During table load, rows that contain a Double value will be rejected. In the Load Job Viewer, review load job errors to identify the reasons for reject rows.
String and Text
For tables of the type SQL Database, the JDBC protocol may identify a column as Text. Instead, use String. Conceptually, there is no difference between the Text and String in terms of storage size. However, the Analyzer does not support adding a column of the type Text to a Grouping Dimension tray in the Insight panel.
Timestamp and Date
If you require a timestamp, use the Timestamp data type.
Some databases such as Oracle and SQL Server will store a time component for a column of the type DATE
. For this reason, Incorta will infer a Timestamp data type for a database DATE column.
If you need to cast or convert a Timestamp into a Date, you have several options:
- Change the column type to Date and the Loader Service will cast the timestamp to a date
- As many SQL databases offer built-in cast and conversion functions for converting a timestamp to date natively, consider modifying the SQL statement for the Query property and add a SQL built-in function that converts the timestamp to a date, that is if the data source is of the type SQL Database
- Create a physical schema table formula column of the type Date and create a formula expression that removes the time from the timestamp using the
removeTime()
built-in function.
Function property
The Function property of a physical schema table column not only determines how the Loader Service stores data for the column, but also influences how tools such as the Analyzer assign a specific behavior to the column. For example, once select in the Data panel of the Analyzer, the Analyzer will automatically add a column that functions as a Measure to the Measure tray.
There are three values possible for the Function property:
- Key
- Dimension
- Measure
Function: Key
A column that functions as key defines a uniqueness constraint. For a physical schema table, you can specify zero or more key columns. Two or more key columns defines a composite key for the given physical schema table.
As a unique constraint, the Loader Service will enforce the uniqueness of rows based on the defined key column or composite keys. The Loader Service creates a Direct Data Mapping index file (.ZTI
) for a table’s key column or key columns.
In the Data Panel of the Analyzer and the Formula Builder, a key column shows as having a key icon.
If you change the Function property of a column to Key, you will need to perform a full load of the table. If the column is used in a join, you must also perform a full load for all related tables, or, you must perform a full load of the physical schema.
Function: Dimension
A dimension categorizes, describes, or groups data. Examples of a dimension column are State, Year, or City. Dimension columns do not require unique values.
A dimension column has an associated behavior in the Analyzer. When a user double-clicks a dimension column in the Data panel, the Analyzer automatically adds the column to the Grouping Dimension tray in the Insight panel.
As an example, a bar chart insight plots the Year grouping dimension along the x-axis. A coloring dimension for months splits the x-axis into years with subsections of months.
Function: Measure
A measure is a measurement of data. Examples of a measure column are Revenue, Quantity, and Cost.
A measure column has an associated behavior in the Analyzer. When a user double-clicks a measure column in the Data panel, the Analyzer automatically adds the column to the Measure tray in the Insight panel.
In the prior bar chart insight example, the chart displays Revenue as a measurement along the y-axis. The chart groups and colors Revenue per Month and Year by the corresponding dimensions along the x-axis.
Encryption property
Using the Table Editor, you can explicitly specify a column to be encrypted.
For the given table, the Loader Service extracts the unencrypted values from the data source and writes the encrypted values in shared storage in Apache Parquet file format. When reading the Apache Parquet files for the given table, the Analytics Service decrypts the encrypted column values.
The Loader Service encrypts data using built-in, 128-Bit AES encryption.
Apache Spark can read the encrypted value, but is unable to decrypt the value of an encrypted column. The encrypted value is of the type String.
If you change the Encryption property for a column, you will need to perform a full load of the table. If the column is used in a join, you must also perform a full load for all related tables, or, you must perform a full load of the physical schema.
Additional Considerations
Here are some additional considerations about how various tools support a physical schema table column.
A physical schema table column in the Analyzer
When using the Analyzer, you select a column from the Data panel to add it to the Insight panel.
In the Data panel, a column will show a visual icon that represents its Data Type such as a hashtag for a double or a clock for a timestamp. The Data panel also shows the label for the table column.
Although there is a default behavior for a dimension and a measure in the Analyzer, you can select a column from the Data panel and manually drag and drop the column to any tray in the Insight panel.
In the Data panel of the Analyzer, you can click on the column details icon to view the column properties and sample data.
A physical schema table column in the Formula Builder
Just like the Analyzer, when using the Formula Builder, you select a table column from the Data panel to add the column to the formula expression in the formula editor.
In the Data panel, a column will show a visual icon that represents its Data Type such as a hashtag for a double or a clock for a timestamp. The Data panel also shows the label for the table column.
The formula editor displays the fully qualified table column name in the expression and not the column label.