Troubleshoot Spark

Troubleshoot common issues with using Spark with Incorta.

You might experience one or more of the following issues when you run queries in Incorta using Spark.

Before you troubleshoot using the common issues on this page, verify that you configured Spark and Incorta according to the minimum configuration standards.

Configuration Issues

The following issues can occur with Spark when you use Spark in Incorta.

Problem: Disk Space

The disk on one or more workers is full, so queries fail.

Symptoms

You send a query to Spark.
The query fails with one of the following:
“No disk space left on device” error
Failure to get the metadata for a query due to a “No disk space left on device” error

Side effect: the Spark application restarts automatically, which may free some disk space for the next run.

Solution

You may encounter this problem in one of the following scenarios:

Spark is badly configured. For example, the executor is given too little memory, so it spills to disk often and writes data over and over again.
The Spark application has been running for a long time without cleanup, so it has accumulated a lot of logs and metadata.
The disk space assigned to the worker machine is too small for the query at hand.

Potential fixes

Inspect which worker disk is full, and then do one or more of the following:

Check that the Spark working directory is on a disk with enough available space. This configuration can be found in $SPARK_HOME/conf/spark-defaults.conf and $SPARK_HOME/conf/spark-env.sh. For more details, check this section.
Free some space by deleting unneeded logs and metadata.
Mount a new disk and add it as a Spark working directory.
Tune the Spark configuration to be less prone to spilling to disk. Keep in mind that other applications may be running on the same Spark instance, so avoid consuming resources greedily.
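
To find which worker disk is full, it can help to check how full the filesystem behind each candidate directory is. The following is a minimal sketch rather than an Incorta or Spark tool: the helper names disk_use_pct and warn_if_full and the default threshold of 90 percent are assumptions made for illustration.

```shell
# Print the usage percentage (0-100) of the filesystem that holds a directory.
# disk_use_pct is a hypothetical helper, not part of Spark or Incorta.
disk_use_pct() {
    # df -P prints one POSIX-format line per filesystem; column 5 is "NN%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem is at or above a threshold (default 90, an assumed value).
warn_if_full() {
    dir=$1
    limit=${2:-90}
    pct=$(disk_use_pct "$dir")
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $dir is on a filesystem that is ${pct}% full"
        return 1
    fi
    echo "OK: $dir is on a filesystem that is ${pct}% full"
}
```

For example, run warn_if_full on each Spark working directory configured on the worker, and on the directory that holds the Spark logs.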

Problem: Spark Did Not Start

The Spark application failed to start due to a port binding problem. By default, Spark binds to port 25925 (or a port you configured, see Spark Integration configurations). If this port is busy (another process is bound to it) or is not enabled, Incorta cannot start the Spark application.

Symptoms

You send a query to Spark
Returned error: “Connection error: [org.postgresql.Driver.connect]”

Solution

Possible causes

Another process is bound to the port
An earlier Spark application did not close cleanly (probably while Incorta was stopping or restarting), so it became a zombie process that occupies the port but is not connected to the running Incorta process
The port is not enabled

Potential fixes

Make sure the port is enabled
Check whether a Spark application is running and occupying the port by running the following command in the machine terminal:
netstat -tupln | grep 25925 # or the port number configured
If an instance is found, kill it using:
kill PID # replacing PID with the process ID
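
The PID has to be read out of the last column of the netstat output, which has the form PID/program. As a convenience, the sketch below extracts it. The helper name pids_on_port is hypothetical, and the column positions assume the Linux net-tools layout of netstat -tupln.

```shell
# Read `netstat -tupln` output on stdin and print the PID of each process
# listening on the given TCP port. pids_on_port is a hypothetical helper.
pids_on_port() {
    awk -v port="$1" '
        # Column 4 is the local address (ends in :PORT); column 7 is "PID/program".
        $1 ~ /^tcp/ && $6 == "LISTEN" && $4 ~ (":" port "$") {
            split($7, parts, "/")
            print parts[1]
        }'
}
```

A typical call is netstat -tupln | pids_on_port 25925. If netstat runs without sufficient privileges, the PID column contains a dash instead of a number, so review the output before passing it to kill.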

Problem: Failure Due to Memory Shortage

Queries sent to Spark may fail due to an out-of-memory problem in one of the executors.

Symptoms

Users run a query through Spark
Returned error: “Out of Memory Exception”

Side effects

Spark will kill the affected executor(s)

Solution

Possible causes

The Spark configuration is not suitable for the queries being run. For example, executors are given too little memory to handle the query.

Potential fixes

Adjust the Spark executor memory configuration to a value suitable for the query being run
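
Before changing the value, it is useful to see what is currently configured. Spark reads executor memory from the spark.executor.memory property, which is commonly set in spark-defaults.conf. The sketch below prints one property from such a file. The helper name get_spark_conf is hypothetical, and the parsing assumes simple "key value" or "key=value" lines with no spaces inside the value.

```shell
# Print the value of one property from a spark-defaults.conf style file.
# get_spark_conf is a hypothetical helper, not a Spark command.
get_spark_conf() {
    key=$1
    file=$2
    # Fields are separated by spaces, tabs, or "="; the last matching line wins.
    awk -F'[ \t=]+' -v key="$key" '
        $1 == key { value = $2 }
        END { if (value != "") print value }' "$file"
}
```

For example, get_spark_conf spark.executor.memory followed by the path to your spark-defaults.conf prints the configured value, or nothing when the property is not set and Spark falls back to its default.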

Problem: External Shuffle Service Is Not Enabled

Dynamic allocation (elastic scaling) is enabled for Spark executors, but queries running against Spark fail.

Symptoms

A user sends a query that Spark is expected to fulfill
Query fails

Solution

Possible causes

External shuffle service is not running

Potential fixes

Run external shuffle service using: /path/to/spark/sbin/start-shuffle-service.sh
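
Starting the service is one half of the setup. In the Spark 2.x releases this page refers to, dynamic allocation (spark.dynamicAllocation.enabled) also relies on spark.shuffle.service.enabled being set to true. The sketch below checks a configuration file for that combination. The helper name check_shuffle_conf is hypothetical, and it assumes both properties, if present, live in the same spark-defaults.conf style file.

```shell
# Return 0 when the file is consistent: dynamic allocation is off, or both
# dynamic allocation and the external shuffle service are switched on.
# check_shuffle_conf is a hypothetical helper, not a Spark command.
check_shuffle_conf() {
    awk -F'[ \t=]+' '
        $1 == "spark.dynamicAllocation.enabled" { dyn = $2 }
        $1 == "spark.shuffle.service.enabled"   { shuf = $2 }
        END { if (dyn == "true" && shuf != "true") exit 1 }' "$1"
}
```

A nonzero exit status means dynamic allocation is enabled in the file without the shuffle service property, which matches the failure described in this problem.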

Problem: The Connection Attempt Failed

Incorta cannot connect to Spark.

Symptoms

Queries fail with a “Connection attempt failed” error.

Solution

Possible causes

The Spark machine host cannot be resolved
The OS limit for number of processes is lower than required
The OS limit for number of open files is lower than required

Potential fixes

Make sure the Spark master host is reachable. You may want to check /etc/hosts.
Check the OS limits:
For the number of processes, use: ulimit -u
For the number of open files, use: ulimit -n
If either limit is too small for the current workload, increase it. On Ubuntu, the file that controls these values is /etc/security/limits.conf
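
The two ulimit checks can be combined into one report. The sketch below is illustrative only: the helper names limit_ok and report_limits are hypothetical, the default minimums of 4096 processes and 65536 open files are placeholder values rather than Incorta or Spark requirements, and report_limits assumes a shell such as bash where ulimit -u is supported.

```shell
# Succeed when a limit meets the required minimum; "unlimited" always passes.
# limit_ok is a hypothetical helper.
limit_ok() {
    current=$1
    required=$2
    [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]
}

# Print a line for each limit that is below its minimum.
# The default minimums are placeholders; pass your own as arguments.
report_limits() {
    min_procs=${1:-4096}
    min_files=${2:-65536}
    limit_ok "$(ulimit -u)" "$min_procs" || echo "max user processes too low: $(ulimit -u) < $min_procs"
    limit_ok "$(ulimit -n)" "$min_files" || echo "max open files too low: $(ulimit -n) < $min_files"
}
```

Run report_limits as the user that starts the Spark master and worker processes, because limits are applied per user session.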

Problem: Missing Python Libraries error

If you receive an error message about missing Python libraries or modules, you may need to install them. To resolve this issue:

# from pip /bin
sudo pip install <module>
# from miniconda /bin
sudo conda install <module>

Troubleshoot the Spark Environment

The following issues can occur with the Spark environment when you use Spark in Incorta.

Problem: All Materialized View jobs are failing

In the Incorta UI, you see the following error (or similar):

Transformation error: INC_005005001:Failed to load data from spark://frc-incortatest05:7077 at <Materialized view> with properties [error, 2019-01-18 18:39:07 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform… using builtin-java classes where applicable]

Solution

Try one of the following solutions to fix the issue:

Use the Incorta cluster management console (CMC) to see if the Spark cluster is running on the same server as Incorta.
Check whether Incorta and Spark can access the tenant folder. Log in to the Spark server machine to verify.
Check whether Incorta and Spark can connect to each other. Ping one from the other to find out.
Check that the ports, such as 7077, are open.
If a machine only supports IPv6, verify that Spark and Incorta can access each other using IPv6.
Check that you are using the same version of Spark that was shipped with your instance of Incorta. Run <Spark Home>/bin/spark-shell --master to verify.
The SPARK_HOME variable determines which Spark installation is used on a machine. Ensure that it is set in the .bash_profile file of the user that installs and starts Incorta on the Incorta server. Also set it for the user that installs and launches the Spark master and worker processes.
To use IPv6, set the variable in .bash_profile: export _JAVA_OPTIONS='-Djava.net.preferIPv6Addresses=true'
Spark 2.3.0 and 2.3.1 require an additional fix to support IPv6:
1. Navigate to <Spark Home>/python/lib.
2. Create a folder named tmp.
3. Unzip the file py4j-0.10.7-src.zip.
4. View the java_gateway.py file.
5. Replace 127.0.0.1 with ::1
6. Zip the py4j folder back to the zip file.
7. Move the fixed zip file back.
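
The seven steps above can be sketched as a script. This is an illustration under stated assumptions, not a supported Incorta procedure: the helper names patch_loopback and fix_py4j_ipv6 are hypothetical, the archive is assumed to contain py4j/java_gateway.py at its top level, zip and unzip are assumed to be installed, and the backup copy is an addition that the steps do not mention.

```shell
# Step 5: rewrite every 127.0.0.1 in a file to ::1 (avoids non-portable sed -i).
# patch_loopback is a hypothetical helper.
patch_loopback() {
    sed 's/127\.0\.0\.1/::1/g' "$1" > "$1.patched" && mv "$1.patched" "$1"
}

# Steps 1-7 in a subshell, so the caller's working directory is unchanged.
# fix_py4j_ipv6 is a hypothetical helper; pass the Spark home directory.
fix_py4j_ipv6() (
    cd "$1/python/lib" || exit 1                           # step 1
    mkdir -p tmp                                           # step 2
    cp py4j-0.10.7-src.zip py4j-0.10.7-src.zip.bak || exit 1  # backup (added)
    unzip -q -o py4j-0.10.7-src.zip -d tmp || exit 1       # step 3
    patch_loopback tmp/py4j/java_gateway.py || exit 1      # steps 4 and 5
    (cd tmp && zip -q -r ../py4j-fixed.zip py4j) || exit 1 # step 6
    mv py4j-fixed.zip py4j-0.10.7-src.zip                  # step 7
)
```

Calling fix_py4j_ipv6 with your Spark home as the argument applies the change in place. Restore the .bak file to undo it.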

Problem: You need to kill a Spark job

Solution

When you need to kill a materialized view Spark job that has already started, kill the schema load job from the Incorta UI. If you kill the Spark job from the Spark Web UI and not in Incorta, the driver process can continue running on the Incorta machine.

Problem: It is not clear if a Materialized View is running

Solution

In the Spark Web UI, you can see Spark jobs in two modes: WAITING or RUNNING. Running jobs produce log messages in the stderr file. Check the latest timestamp to see if the Spark job is running. If the Spark job is in WAITING mode, the job could be waiting for an available resource. Check that the materialized view defines the executor core, max core, and executor memory. If it does not, check the defaults defined in the spark-defaults.conf file. You can either wait, or kill the Incorta schema job from the Incorta UI, adjust the definitions, and try again.

Problem: Materialized View started, but is not visible in the Spark Web UI

You can see that the materialized view job started in the Incorta UI, but there is no corresponding Spark job visible in the Spark Web UI.

Solution

If you set the Always Compact option to off, materialized view jobs that show a “Started” status in the Incorta UI do not display in the Spark Web UI because Incorta is running compaction. To monitor compaction status, view the Incorta tenant log. The Incorta tenant log shows when the compaction started. Compaction builds indexes, which can take a long time for large tables. For compaction issues, try to add resources, like memory, to compaction jobs, and ensure that the Spill to Disk option is off.

Problem: Unhelpful Error Saving or Running a Materialized View

Solution

Where the error occurs determines how you address the issue:

An error occurs on the Incorta job and job history page. When Incorta displays a red plus sign in the loader UI, click on it to see the error message. If you do not see a red plus sign, or if you cannot click on it to view an error message, navigate to Schemas > <Schema> > Last Load Status > Select a Job > Check the Job Details and select the red plus sign.
An error displays in the Incorta tenant log. Use the grep command to extract the specific log entries for a schema table.
An error displays in the Spark Web UI. The Spark Web UI runs on the same machine as the Spark master. Find the Spark master URL you are using by navigating in the CMC to Select Clusters > <cluster_name> > CMC > Spark Integration. In a browser, try navigating to http://<spark master host>:9091 to view the Spark Master Web UI. Click on the Application ID of the task you are debugging. Click on stdout to display the log file.
An error displays in the Spark Master and Spark log files from the Spark machine. The issue may be caused by an environment issue. For example, the worker process crashed or is not connected to master. Navigate to the Spark home > logs directory to see the log files.
Check if Spark executors are created in the Spark machine by using the following command: ps -ef | grep spark
Check if the Spark driver is created. Check that the Python program is on the Loader machine. For a new materialized view, save the new materialized view, then check that the Python program is on the Analytics machine.
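
For the tenant log case in the list above, the grep step can look like the following sketch. The helper name table_log_lines is hypothetical, and this page does not state where the tenant log is stored, so pass the log path used by your installation.

```shell
# Print tenant-log lines that mention a schema table, prefixed with line numbers.
# table_log_lines is a hypothetical helper.
table_log_lines() {
    log_file=$1
    table=$2
    # -F treats the table name as a literal string, so the dot is not a wildcard.
    grep -n -F -- "$table" "$log_file"
}
```

Pass the log file first and the table name second, for example a name in the form schema.table. Pipe the output to tail to see only the most recent entries.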

Problem: Missing table

Solution

Spark MV and Incorta Spark SQL run against parquet files in the compacted folder under <Incorta Tenant directory>/compacted. Check the parquet files in the compacted folder if you are missing a table.
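
A quick way to check is to count the parquet files under the folder in question. The sketch below is an illustration: the helper name count_parquet is hypothetical, and because this page does not describe the folder layout inside compacted, it simply searches whichever directory you pass.

```shell
# Count parquet files anywhere under a directory.
# count_parquet is a hypothetical helper.
count_parquet() {
    find "$1" -type f -name '*.parquet' | wc -l | tr -d ' '
}
```

A count of 0 for the folder you expect to hold the table suggests that the table has no compacted parquet files for Spark to read.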

Problem: A Spark materialized view displays differently in the Incorta UI and the Spark Web UI

Solution

Check the following areas:

Incorta Job
Spark Driver process
Spark executor process

A job may be created but not yet submitted to Spark.
