Configure Spark to Work With Hadoop on Windows

To use Spark with Hadoop on a Windows machine, you must perform the following configuration steps. On a Linux machine, no additional steps are required to use Spark.

Requirements:

  • Microsoft Visual C++ 2015 Redistributable (VC++ 2015)
  • Spark 2.4.3 (the "without Hadoop" build)
  • Hadoop 3.2

To configure external Spark to work with Hadoop on Windows:

  1. Install Incorta, but do not start the Cluster Management Console (CMC).
  2. Copy winutils.exe and hadoop.dll to the bin folder of Hadoop 3.2.
  3. Set the HADOOP_HOME environment variable to the Hadoop 3.2 folder.
  4. Add %HADOOP_HOME%\bin to the PATH environment variable.
  5. In the terminal, browse to the Hadoop 3.2 bin directory, then run the hadoop classpath command.
  6. Copy the classpath value to a text file.
  7. Run the hostname command.
  8. Copy the output to a text file. You will use this value wherever (hostname) appears below. (A Command Prompt sketch of steps 3 through 8 follows the file contents below.)
  9. Copy the settings below to spark-env.sh, using the hostname from step 8 and the classpath from step 6. The file should look like this:
set SPARK_PUBLIC_DNS=(hostname)
set SPARK_MASTER_IP=(hostname)
set SPARK_MASTER_PORT=7077
set SPARK_MASTER_WEBUI_PORT=9091
set SPARK_WORKER_PORT=7078
set SPARK_WORKER_WEBUI_PORT=9092
set SPARK_WORKER_MEMORY=8g
set SPARK_DIST_CLASSPATH=(value of hadoop classpath copied as is)
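
For reference, here is a minimal Command Prompt sketch of steps 3 through 8. The Hadoop location C:\hadoop-3.2.0 and the output file names are assumptions for illustration; substitute your own paths.

:: Sketch of steps 3-8 in a Windows Command Prompt (assumed paths).
:: Step 3: point HADOOP_HOME at the Hadoop 3.2 folder (current session only;
:: use setx or the System Properties dialog to make the change permanent).
set HADOOP_HOME=C:\hadoop-3.2.0

:: Step 4: add the Hadoop bin folder to PATH.
set PATH=%PATH%;%HADOOP_HOME%\bin

:: Steps 5-6: print the Hadoop classpath and save it to a text file.
cd /d %HADOOP_HOME%\bin
hadoop classpath > %USERPROFILE%\hadoop-classpath.txt

:: Steps 7-8: print the machine name and save it to a text file.
hostname > %USERPROFILE%\spark-hostname.txt
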
  10. In the sbin folder of Spark 2.4.3, create two cmd files (see the sketch after these steps):

    • Name: start-master.cmd. Content: ../bin/spark-class org.apache.spark.deploy.master.Master
    • Name: start-slave.cmd. Content: ../bin/spark-class org.apache.spark.deploy.worker.Worker spark://(hostname):7077
  11. Run start-master.cmd.
  12. Run start-slave.cmd.
  13. In the CMC, install the Loader and Analytics services.
  14. In the Spark settings, select the external version and use spark://(hostname):7077 as the master URL.
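
As a rough guide, the two files from step 10 might look like the sketch below. It assumes both files sit in the sbin folder of the Spark 2.4.3 distribution, that the relative path ..\bin therefore points to the Spark bin folder, and that (hostname) is replaced with the value saved in step 8.

:: sbin\start-master.cmd -- starts the standalone Spark master (sketch).
..\bin\spark-class org.apache.spark.deploy.master.Master

:: sbin\start-slave.cmd -- starts a worker and registers it with the master (sketch).
:: Replace (hostname) with the value saved in step 8.
..\bin\spark-class org.apache.spark.deploy.worker.Worker spark://(hostname):7077

Run start-master.cmd first, then start-slave.cmd from the same folder. The master web UI on port 9091 (set in spark-env above) should list the worker once it registers.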
