About the Notebook Editor
The Notebook Editor is an interactive Apache Zeppelin notebook environment that allows you to explore, manipulate, and transform data for a materialized view in a physical schema. The Notebook Editor supports several languages including PySpark, R, Scala, Spark SQL, and PostgreSQL. You can use the Notebook Editor to iteratively code and explore your data before saving code for the materialized view.
- A notebook consists of one or more paragraphs.
- A paragraph consists of a code section and a result section.
- In the code section, you can use a language-specific editor to write code in the following languages: Spark SQL, Spark Scala, Spark Python, Spark R, and Incorta Postgres SQL. You can execute code in the code section using paragraph commands.
- When there are executed results, you can view the output in the result section of the paragraph.
The Notebook Add-on service runs as an application in Apache Spark and manages the paragraph execution request. When running more than one paragraph, the Notebook Add-on service application processes each paragraph sequentially: when the first paragraph completes, the second is started.
Notebook Requirements
There are several requirements for implementing the Notebook Integration:
- The Linux operating system must be supported.
- Apache Spark 2.4.3 or a later version of Spark 2 must be running and properly configured for the Incorta Cluster instance.
- An Incorta Cluster can only have a single Notebook Add-on.
- The Incorta Node hosting the Notebook Add-on requires Python 2.7, Python 3.6, or Python 3.7. Python 3.8 is not yet supported. The requests Python module should be included in the installation.
- To use the R language in materialized views, the following should be installed: R 3.4 or above, Stringi, Stringr, SparkR, and Knitr.
- On the Incorta Node hosting the notebook, the default port 5500 must be open or the configured port must be open.
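The Python requirements above can be checked on the node before creating the add-on. The following is a minimal sketch, not an Incorta tool: the function names `is_supported_python` and `has_module` are hypothetical, and the version set simply mirrors the list above.

```python
import importlib
import sys

# Python versions listed in the requirements above, as (major, minor).
SUPPORTED_PYTHON = {(2, 7), (3, 6), (3, 7)}


def is_supported_python(version_info=None):
    """Return True if the given (or current) interpreter version is in the list."""
    if version_info is None:
        version_info = sys.version_info
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def has_module(name):
    """Return True if the named module can be imported by this interpreter."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    print("Python version supported: {}".format(is_supported_python()))
    print("requests module available: {}".format(has_module("requests")))
```

Run the script with the same interpreter that the Incorta Node uses, since a different Python on the path may have a different version or set of installed modules.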
Notebook Editor Access Permissions
A user that belongs to a group with the Schema Manager or the SuperRole role can access the Notebook Editor.
Code Execution Language for a Notebook
When creating a materialized view, in the Data Source dialog, you must select a Language. The choices are: Spark SQL, Spark Scala, Spark Python, Spark R, and Incorta Postgres SQL.
Apache Spark executes all materialized views and natively runs Spark SQL queries using columnar data stored as Apache Parquet files in Shared Storage (Staging).
Notebook Editor Anatomy
- Action bar
- Toolbar
- Paragraph
- Footer bar
Action Bar
The Action bar is located at the top of the Notebook Editor. Use the Action bar to:
- View the notebook language for export to the materialized view.
- Close the Notebook Editor by selecting the X.
- Maximize or minimize the Notebook Editor dialog by selecting the arrows.
Toolbar
The Toolbar is located directly below the Action bar in the Notebook Editor. Use the Toolbar to:
- Run all paragraphs (Play icon)
- Show/hide the code in all paragraphs (Four arrow icon)
- Show/hide the output of all paragraphs (Book icon)
- Clear the output of all paragraphs (Eraser icon)
- Search the code of all paragraphs (Magnifying glass icon)
- View the list of keyboard shortcuts (Keyboard icon)
Paragraph
The Paragraph section is located below the Toolbar, and contains one or more paragraphs. Use the Paragraph to write code and perform the following operations:
- Run the paragraph (Play icon). Before it is run, the paragraph has a status of READY. After it is run, the paragraph has a status of FINISHED.
- Hide the editor in the paragraph (Four arrow icon)
- Show the output of the paragraph (Book icon)
- View additional paragraph settings (Gear icon). The following settings are available:
- Width. Adjust the frame width of the selected paragraph.
- Font size. Adjust the text font size in the selected paragraph.
- Move up. This option is visible only when there is a paragraph above the selected paragraph. Move the selected paragraph up one position.
- Move down. This option is visible only when there is a paragraph below the selected paragraph. Move the selected paragraph down one position.
- Insert new. Insert a new paragraph below the selected paragraph.
- Run all below. This option is visible only when there is a paragraph below the selected paragraph. Run the code in the selected paragraph and all subsequent paragraphs.
- Run all above. This option is visible only when there is a paragraph above the selected paragraph. Run the code in the selected paragraph and all preceding paragraphs.
- Clone paragraph. Add a copy of the selected paragraph directly below it.
- Show/hide title. Show or hide the paragraph title.
- Show/hide line numbers. Show or hide the paragraph line numbers.
- Disable/Enable run. Hide or show the run icon for the selected paragraph.
- Link this paragraph. Open the output of the paragraph on a separate browser tab.
- Clear output. Clear the output of the selected paragraph.
- Remove. Delete the selected paragraph.
Footer Bar
The Footer bar is located at the bottom of the Notebook Editor.
- Select Cancel to close the Notebook Editor without saving changes.
- Select Done to save the materialized view.
Notebook Integration Process
Before using the Notebook add-on, a CMC Administrator must first integrate the Notebook into an Incorta Cluster. Notebook Integration requires the completion of several key tasks in the CMC:
- Create the Notebook Add-on service.
- Set the Notebook integration properties in Server Configurations.
- Enable the Notebook Integration.
- Start the Notebook service.
Create the Notebook Add-On
You can install the Notebook Add-on during a new installation or after installation.
There are two types of cluster installations:
- Single Host is a standalone instance using the Typical installation method.
- Multi-host requires a Custom installation.
Both cluster topologies support Incorta Notebooks.
To configure and install a Notebook Add-on during a Single Host (typical) Installation:
- In the Configuration Wizard, for Add-ons, specify the Notebook port value (the default value is 5500).
- Select Next to continue the configuration review.
- Select Create.
Here are the steps to configure and install a Notebook Add-on after a Single Host (Typical) or Multi-host (Custom) installation:
After a Single Host (Typical) Installation:
- In the navigation bar, select Nodes.
- In the nodes list, select the localNode.
- In the canvas, select the Add-ons tab.
- In the Add-ons header, select + (Add) to create a Notebook.
- In the Create a new notebook dialog, enter the Port number. The default value is 5500.
- Select Save.
After Multi-host (Custom) Installation:
- In the navigation bar, select Nodes.
- In the nodes list, select an Incorta Node.
- In the canvas, select the Add-ons tab.
- In the Add-ons header, select + (Add) to create a Notebook.
- In the Create a new notebook dialog, enter the Notebook Name and the Port number. The default value is 5500.
- Select Save.
Set Materialized View Properties
The Materialized View settings are global to all tenants in a cluster configuration. For the selected Cluster, you can set values for the following settings:
- Materialized view application cores: The number of CPU cores reserved for use by a materialized view. The default value is 1. The allocated cores for all running Spark applications cannot exceed the dedicated cores for the cluster unless Dynamic Allocation is enabled. When Dynamic Allocation is enabled, this value is used to compute the CPU cores for the initial executors.
- Materialized view application memory: The maximum memory, in gigabytes, to use for a materialized view. The default is 1 GB. The memory for all Spark applications combined cannot exceed the cluster memory (in gigabytes).
- Materialized view application executors: The maximum number of executors that a single materialized view application can spawn. Each executor is allocated a share of the cores defined in sql.spark.mv.cores and consumes a share of the memory defined in sql.spark.mv.memory. Cores and memory are assigned equally to every executor, so the number of executors should be a divisor of both sql.spark.mv.cores and sql.spark.mv.memory. For example, configuring an application with cores=4, memory=8, and executors=2 spawns 2 executors, each consuming 2 cores and 4 GB from the cluster.
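The per-executor arithmetic in the executors setting can be sketched as follows. The helper `per_executor_resources` is hypothetical and only illustrates the even split. It treats the "should be a divisor" guidance as a hard check, which is an assumption; this document does not describe how Incorta handles values that do not divide evenly.

```python
def per_executor_resources(cores, memory_gb, executors):
    """Split total cores and memory evenly across executors.

    Raises ValueError when executors does not divide both totals,
    which is this sketch's stricter reading of the divisor guidance.
    """
    if executors < 1:
        raise ValueError("executors must be at least 1")
    if cores % executors or memory_gb % executors:
        raise ValueError("executors should evenly divide both cores and memory")
    return cores // executors, memory_gb // executors


# The example from the setting description: cores=4, memory=8, executors=2.
cores_each, memory_each = per_executor_resources(cores=4, memory_gb=8, executors=2)
print("{} cores / {} GB per executor".format(cores_each, memory_each))
# prints "2 cores / 4 GB per executor"
```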
To modify these settings:
- In the navigation bar, select Clusters.
- In the cluster list, select a Cluster name.
- In the canvas tabs, select Cluster Configurations.
- In the panel tabs, select Server Configurations.
- In the left pane, select Spark Integration.
- Set the value(s) for Materialized view application cores, Materialized view application memory and/or Materialized view application executors.
- Select Save.
Enable the Notebook Integration
After you set the Notebook Integration properties, you can enable the Incorta Labs Notebook feature.
To enable Notebook Integration as the default tenant configuration in the CMC:
- In the navigation bar, select Clusters.
- In the cluster list, select a Cluster name.
- In the canvas tabs, select Cluster Configurations.
- In the panel tabs, select Default Tenant Configurations.
- In the left pane, select Incorta Labs.
- In the right pane, toggle Notebook Integration to enable.
- Select Save.
To enable Notebook Integration for a specific tenant configuration in the CMC:
- In the navigation bar, select Clusters.
- In the cluster list, select a Cluster name.
- In the canvas tabs, select the Tenants tab.
- In the Tenant list, select Configure for the given Tenant.
- In the left pane, select Incorta Labs.
- In the right pane, toggle Notebook Integration to enable.
- Select Save.
Start, Stop, and Restart Notebook
To start, stop, or restart a Notebook:
- In the navigation bar, select Clusters.
- In the cluster list, select a Cluster name.
- In the canvas tabs, select Add-ons.
- In the nodes list, select the Notebook name.
- In Notebook details, select Restart, Stop, or Start.
Edit the Notebook Port
- In the navigation bar, select Clusters.
- In the cluster list, select a Cluster name.
- In the canvas tabs, select Add-ons.
- In the nodes list, select the Notebook name.
- In Notebook details, select Edit (Pencil icon).
- Change the Port value.
- Select Update.
After choosing a different Notebook port, you must restart the Notebook for the change to take effect.
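After a port change and restart, one way to confirm that something is listening on the new port is a plain TCP connection test. This is a generic sketch using Python's standard socket module, not an Incorta utility. The function name `port_is_open` is hypothetical, and the host and port in the usage lines are placeholders.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Placeholders: substitute the node's hostname and the configured Notebook port.
    print(port_is_open("localhost", 5500))
```

A successful connection only shows that the port accepts TCP connections. It does not confirm that the Notebook service itself is healthy.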
Create a Materialized View with the Notebook Editor
Here are the steps to create a materialized view with the Notebook Editor:
- For the given schema in Schema Designer, in the Action bar, select + New.
- In the Add New menu, select Derived Table → Materialized View.
- In the Data Source dialog, select a Language.
- In Script, select Edit in Notebook.
- In one or more paragraphs, enter the code for the materialized view.
- Select Done.
- To specify additional materialized view properties, select Add Property.
- Select Add.
- Specify a Table Name.
- In the Action bar, select Done.
Test with Notebook Sampling
Use the notebook sampling feature to test a notebook against a subset of the data in a large table so that execution is faster. Here are the notebook sampling properties you can add:
Property | Description |
---|---|
notebook.dataframe.limit | Enter a value for the dataframe number of rows |
notebook.dataframe.sampling.percentage | Enter the percentage of dataframe sampling. Valid values are between 1 and 100. |
notebook.dataframe.sampling.seed | Optionally enter the seed used in sampling |
The notebook sampling properties you add will be applied to every dataframe, but will not affect the execution of the materialized views.
Notebook Configurations Precedence
If you add both the notebook.dataframe.limit and the notebook.dataframe.sampling.percentage properties, the notebook.dataframe.sampling.percentage property will be applied first, and the notebook.dataframe.limit property will be applied second.
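The precedence described above can be sketched as a function that samples first and limits second. This is an illustration under assumptions, not Incorta's implementation: the function name `apply_notebook_sampling` is hypothetical, and mapping the properties onto the Spark DataFrame methods `sample` and `limit` is a guess about how such sampling could be expressed.

```python
def apply_notebook_sampling(df, properties):
    """Apply percentage sampling first, then the row limit.

    `df` is any object with Spark-style sample(withReplacement, fraction, seed)
    and limit(n) methods, such as a PySpark DataFrame. `properties` maps the
    notebook sampling property names to their string values.
    """
    percentage = properties.get("notebook.dataframe.sampling.percentage")
    seed = properties.get("notebook.dataframe.sampling.seed")
    limit = properties.get("notebook.dataframe.limit")

    # Step 1: percentage sampling, with the optional seed.
    if percentage is not None:
        pct = float(percentage)
        if not 1 <= pct <= 100:
            raise ValueError("sampling percentage must be between 1 and 100")
        seed_value = int(seed) if seed is not None else None
        df = df.sample(withReplacement=False, fraction=pct / 100.0, seed=seed_value)

    # Step 2: the row limit, applied to the already sampled dataframe.
    if limit is not None:
        df = df.limit(int(limit))

    return df
```

With a 10 percent sample and a limit of 100, the limit caps the sampled rows, so the result holds at most 100 rows drawn from roughly 10 percent of the data rather than a sample of the first 100 rows.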
Add a Notebook Sampling Property
Here are the steps to add a notebook sampling property to your materialized view:
- Within the schema, open the materialized view.
- In the Data Source dialog, under Properties:, select Add Property.
- In key:, enter the notebook sampling property name.
- In value:, enter the notebook sampling property value.
- Select Validate.
- In the Action bar, select Done.
Additional Considerations
- In certain cases, a notebook paragraph in R will show a status of FINISHED even though the paragraph output reports an error. Some errors will show in the Data Source dialog. Check the application logs for the stack trace and its root cause.
- A SparkR notebook has the %r declaration. You must call the save(dataframe) method to persist the materialized view.