Use Spark with Incorta

Use the following steps to configure the Spark environment.

  1. Backup the Spark environment Configuration file SPARK_HOME/conf/
  2. Add the following configuration information and run the file.
export SPARK_HIVE=true
export CLASSPATH="\$CLASSPATH:/Users/incorta/incorta_home/server/lib/ojdbc7.jar"
  1. Create the SPARK_HOME/conf/hive-site.xml Hive configuration file with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<description>metadata is stored in a MySQL server</description>
<description>MySQL JDBC driver class</description>
<description>user name for connecting to mysql server</description>
<description>password for connecting to mysql server</description>


<?xml version="1.0" encoding="UTF-8"?>
<description>metadata is stored in an Oracle server</description>
<description>Oracle JDBC driver class</description>
<description>user name for connecting to Oracle server</description>
<description>password for connecting to Oracle server</description>
  1. From the Incorta Admin console, modify the memory allocated to Spark. The default is 4GB.
  2. Verify that the following files use the correct machine name, localhost, or IP address.

    • spark/conf/
    • spark/conf/spark-defaults.conf
    • Incorta/
  3. Update the .bashrc and .bash_profile files with the following:
export JAVA_HOME=/usr/java/jdk64-1.8.0_91
CentOS only


Start Spark

After determining the Spark ports, use the following steps to modify the Incorta file to integrate the Incorta and Spark instances.

  1. Navigate to /Users/<user-name>/incorta_home/incorta.
  2. Edit the file, adding the following values:

    • spark.home – the location of the Spark installation to be used with Incorta
    • spark.master.url – the location url for the Spark Master web page
    • - created during installation. Do not edit.
  3. Restart the Tomcat server using the ./ and ./ commands.
  4. Start the Spark server
  5. Review the Master Server log, the location of which is broadcast at the terminal.
  6. Verify the ports being used by the Spark server by viewing the Spark log files with the following command: cat ../logs/<spark instance>.apache.spark.deploy.master.Master-1-<machine>.local.out
  7. Review the file output for the Spark master ports:

    • SparkMaster machine and port number: spark://<hostname>.local:7077
    • Master port number: port 7077
    • MasterWebUI IP address and port number: 8080
  8. Start the Spark slave with the following command (Linux): udo ./ spark://<master-spark-url>:7077. The workerWebUI is port is 8081.
  9. Log into the Admin UI (http://<server>:<port>/incorta/admin) and complete the following steps:
  10. In the ‘Server Config’ tab confirm if Always Compact is “On”. For any dataset you intend to enrich with Spark, you need to load (full load) data again after turning on compaction.
  11. Make sure that the Spark master node url is set, you can get that from the log file created when Spark starts.

Spark Notes

You can use default ports for both Spark and Incorta. These ports can be changed. Incorta uses 8080 as a default so the Spark port must be changed. Check to see these ports are not being used by other applications.

  • Port for service to listen on (default: 7077 for master, random for worker)
  • Port for web UI (default: 8080 for master, 8081 for worker)

For more information, visit:

When configuring Spark, verify Incorta isn’t running against the same incorta_home as the Spark server.

tail -f server/logs/catalina.out

Spark can be used in standalone mode for:

  • Start and stopping a cluster
  • Reading log files
  • Use the cat command

Validate the Spark Setup

Use the following steps to validate your Spark setup.

  1. Create a new Materialized View in an existing schema. For Language select ‘SQL’ or ‘Python’. Not all Python libraries are available and SparkSQL can only query from one table. See Troubleshooting Tips to get more Python libraries installed.
  2. If you select SQL add a valid SQL query referencing the Incorta schema table
  3. If you get and  ‘ERROR Executor: Exception in task’ try decreasing the number of executors to 1 (the default is 2) in the spark/conf/ directory.
  4. If you select Python then you need to write a PySpark script and must include the read() and save() commands. Example of using Spark and ML

Troubleshoot Spark

If you receive an error message regarding missing Python libraries or modules you may need to install new libraries/modules. To resolve this issue:

from pip /bin

sudo pip install <module>

from miniconda /bin

sudo conda install <module>

Configure and Validate Drill

Use the following steps to configure and validate your Drill installation:

  1. Confirm Drill has been allotted enough memory
  2. Navigate to the <incorta-home>/drill/conf directory and modify the following lines to increase Drill’s memory to 16G or 32G (or whatever is required):

    • export DRILL_HEAP=\${DRILL_HEAP:-“32G”}
  3. Ensure Java compaction is enabled from the Admin UI:
  4. Go to Admin UI - http://<server>:<port>/incorta/admin/ and in the Server Config tab confirm if Always Compact is “On”. For any dataset you intend to enrich with Spark, you need to >load (full load) data again after turning on compaction.
  5. Run <incorta home>/ to start Drill (and all other servers).
  6. Validate Drill by creating a Drill data source connection in Incorta.  The datasource connection should use a Postgres driver and have a connection string like: jdbc:postgresql://localhost:5438/<tenant_name>.

Use the following sample SQL to test your Drill connection by creating a table using the following example SQL syntax. Names for schemas, tables, and columns are case-sensitive.

-- requires stock Sales demo database.  Be sure to replace the <schema-name> tag
with the name of your specific schema

SELECT SUM(AMOUNT_SOLD) AS TotalSales, CUSTOMER_ID FROM <schema-name>.sales

-- requires AdventureWorks dataset loaded.  Be sure to replace the <schema-name>
tag with the name of your specific schema

SELECT soh.CustomerID, COUNT(DISTINCT soh.SalesOrderNumber) AS

NumOrders, COUNT(DISTINCT sod.ProductID) AS NumProducts, COUNT(DISTINCT

p.ProductSubcategoryID) AS NumSubCat,

UNIX_TIMESTAMP(NOW(),'yyyy-MM-dd HH:mm:ss.SSS') -
UNIX_TIMESTAMP(max(soh.OrderDate),'yyyy-MM-dd') AS SecondsAgoShopped,

SUM(sod.OrderQty) AS ItemsBought,

SUM(sod.UnitPrice) AS AmtSpent

FROM <schema-name>.salesorderdetail sod, <schema-name>.salesorderheader soh,
<schema-name>.product p

WHERE sod.SalesOrderID = soh.SalesOrderID AND sod.ProductID = p.ProductID

GROUP BY soh.CustomerID

© Incorta, Inc. All Rights Reserved.