Use Spark with Incorta
Use the following steps to configure the Spark environment.
- Backup the Spark environment Configuration file
SPARK_HOME/conf/spark-env.sh
. - Add the following configuration information and run the file.
MySql
SPARK_HIVE=true
export
CLASSPATH="\$CLASSPATH:/Users/incorta/incorta_home/server/lib/mysql-connector-java-5.1.25-bin.jar"
export SPARK_CLASSPATH=\$CLASSPATH
export SPARK_SUBMIT_CLASSPATH=\$CLASSPATH
Oracle
export SPARK_HIVE=true
export CLASSPATH="\$CLASSPATH:/Users/incorta/incorta_home/server/lib/ojdbc7.jar"
export SPARK_CLASSPATH=\$CLASSPATH
export SPARK_SUBMIT_CLASSPATH=\$CLASSPATH
- Create the
SPARK_HOME/conf/hive-site.xml
Hive configuration file with the following content:
MySQL
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:Mosul://localhost:3306/mydb?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>1234</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>
Oracle
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:oracle:thin:@localhost:1521:xe</value>
<description>metadata is stored in an Oracle server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>oracle.jdbc.driver.OracleDriver</value>
<description>Oracle JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
<description>user name for connecting to Oracle server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hiveuserpassword</value>
<description>password for connecting to Oracle server</description>
</property>
</configuration>
- From the Incorta Admin console, modify the memory allocated to Spark. The default is 4GB.
-
Verify that the following files use the correct machine name, localhost, or IP address.
startSpark.sh
spark/conf/spark-env.sh
spark/conf/spark-defaults.conf
Incorta/server.properties
- Update the
.bashrc
and.bash_profile
files with the following:
export JAVA_HOME=/usr/java/jdk64-1.8.0_91
CentOS only
export
PATH=$PATH:$JAVA_HOME/bin_JAVA_OPTIONS="-Dorg.xerial.snappy.tempdir=/home/incorta/tmp"
export \_JAVA_OPTIONS
Start Spark
After determining the Spark ports, use the following steps to modify the Incorta server.properties file to integrate the Incorta and Spark instances.
- Navigate to
/Users/<user-name>/incorta_home/incorta
. -
Edit the server.properties file, adding the following values:
spark.home
– the location of the Spark installation to be used with Incortaspark.master.url
– the location url for the Spark Master web pagenode.name
- created during installation. Do not edit.
- Restart the Tomcat server using the
./stop.sh
and./start.sh
commands. - Start the Spark server
- Review the Master Server log, the location of which is broadcast at the terminal.
- Verify the ports being used by the Spark server by viewing the Spark log files with the following command:
cat
../logs/<spark instance>.apache.spark.deploy.master.Master-1-<machine>.local.out
-
Review the file output for the Spark master ports:
- SparkMaster machine and port number:
spark://<hostname>.local:7077
- Master port number: port 7077
- MasterWebUI IP address and port number: 8080
- SparkMaster machine and port number:
- Start the Spark slave with the following command (Linux):
udo ./start-slave.sh spark://<master-spark-url>:7077
. The workerWebUI is port is 8081. - Log into the Admin UI (
http://<server>:<port>/incorta/admin
) and complete the following steps: - In the ‘Server Config’ tab confirm if Always Compact is “On”. For any dataset you intend to enrich with Spark, you need to load (full load) data again after turning on compaction.
- Make sure that the Spark master node url is set, you can get that from the log file created when Spark starts.
Spark Notes
You can use default ports for both Spark and Incorta. These ports can be changed. Incorta uses 8080 as a default so the Spark port must be changed. Check to see these ports are not being used by other applications.
- Port for service to listen on (default: 7077 for master, random for worker)
- Port for web UI (default: 8080 for master, 8081 for worker)
For more information, visit: http://spark.apache.org/docs/latest/spark-standalone.html
When configuring Spark, verify Incorta isn’t running against the same incorta_home as the Spark server.
tail -f server/logs/catalina.out
Spark can be used in standalone mode for:
- Start and stopping a cluster
- Reading log files
- Use the cat command
Validate the Spark Setup
Use the following steps to validate your Spark setup.
- Create a new Materialized View in an existing schema. For Language select ‘SQL’ or ‘Python’. Not all Python libraries are available and SparkSQL can only query from one table. See Troubleshooting Tips to get more Python libraries installed.
- If you select SQL add a valid SQL query referencing the Incorta schema table
- If you get and ‘ERROR Executor: Exception in task’ try decreasing the number of executors to 1 (the default is 2) in the
spark/conf/spark-env.sh
directory. - If you select Python then you need to write a PySpark script and must include the read() and save() commands. Example of using Spark and ML https://spark.apache.org/docs/latest/ml-guide.html
Troubleshoot Spark
If you receive an error message regarding missing Python libraries or modules you may need to install new libraries/modules. To resolve this issue:
from pip /bin
sudo pip install <module>
from miniconda /bin
sudo conda install <module>
Configure and Validate Drill
Use the following steps to configure and validate your Drill installation:
- Confirm Drill has been allotted enough memory
-
Navigate to the
<incorta-home>/drill/conf
directory and modify the following lines to increase Drill’s memory to 16G or 32G (or whatever is required):export DRILL_HEAP=\${DRILL_HEAP:-“32G”}
export DRILL_MAX_DIRECT_MEMORY=\${DRILL_MAX_DIRECT_MEMORY:-“32G”}
- Ensure Java compaction is enabled from the Admin UI:
- Go to Admin UI -
http://<server>:<port>/incorta/admin/
and in the Server Config tab confirm if Always Compact is “On”. For any dataset you intend to enrich with Spark, you need to >load (full load) data again after turning on compaction. - Run
<incorta home>/startall.sh
to start Drill (and all other servers). - Validate Drill by creating a Drill data source connection in Incorta. The
datasource connection should use a Postgres driver and have a connection string
like:
jdbc:postgresql://localhost:5438/<tenant_name>
.
Use the following sample SQL to test your Drill connection by creating a table using the following example SQL syntax. Names for schemas, tables, and columns are case-sensitive.
-- requires stock Sales demo database. Be sure to replace the <schema-name> tag
with the name of your specific schema
SELECT SUM(AMOUNT_SOLD) AS TotalSales, CUSTOMER_ID FROM <schema-name>.sales
GROUP BY CUSTOMER_ID
-- requires AdventureWorks dataset loaded. Be sure to replace the <schema-name>
tag with the name of your specific schema
SELECT soh.CustomerID, COUNT(DISTINCT soh.SalesOrderNumber) AS
NumOrders, COUNT(DISTINCT sod.ProductID) AS NumProducts, COUNT(DISTINCT
p.ProductSubcategoryID) AS NumSubCat,
UNIX_TIMESTAMP(NOW(),'yyyy-MM-dd HH:mm:ss.SSS') -
UNIX_TIMESTAMP(max(soh.OrderDate),'yyyy-MM-dd') AS SecondsAgoShopped,
SUM(sod.OrderQty) AS ItemsBought,
SUM(sod.UnitPrice) AS AmtSpent
FROM <schema-name>.salesorderdetail sod, <schema-name>.salesorderheader soh,
<schema-name>.product p
WHERE sod.SalesOrderID = soh.SalesOrderID AND sod.ProductID = p.ProductID
GROUP BY soh.CustomerID