Install Apache Spark on Windows 10
Some people think that installing Apache Spark on a PC is rocket science. It is not, and by the end of this article you will see how easy it is: below is a simple step-by-step guide to installing Apache Spark on Windows 10.
Apache Spark Prerequisites
Before we start, below are some prerequisites that you need to have:
- A user account with administrator permissions,
- Java installed,
- Python installed,
- 7-Zip or any similar tool to extract .tgz archives, and
- Windows 10.
Once you make sure that you have all the prerequisites, you can proceed with installing Apache Spark on Windows 10.
A Step-by-Step Guide to Install Apache Spark on Windows 10
Step 1. Installing Java
Apache Spark requires Java 8 or later. To check which version you have, open the command prompt, type the command below, and hit Enter:
java -version
- It will display the installed Java version. If Java is not installed on your PC, you can download it from the official Java website.
- Click the Java Download button to go to the downloads section, hit the Agree and Start Free Download button, and choose where to save the Java setup file.
- Once downloaded, you just need to double-click it and install it to your preferred location.
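If you prefer scripting this check, a short Python snippet can look for Java on the PATH. This is an optional sketch, not part of the official installer; it uses only the standard library and assumes nothing about where Java was installed:

```python
import shutil
import subprocess

# shutil.which reports whether "java" is on the PATH without running it
java_path = shutil.which("java")
if java_path is None:
    print("Java not found on PATH")
else:
    # Note: "java -version" writes its report to stderr, not stdout
    result = subprocess.run(["java", "-version"],
                            capture_output=True, text=True)
    print(result.stderr.splitlines()[0])
```

If the snippet prints "Java not found on PATH", install Java first before continuing.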
Step 2. Installing Python
Python is another requirement for using Apache Spark on your Windows 10 PC. Installing Python is easy, just follow the steps below:
- Just head to the official Python website.
- Scroll down to the very end, and under the downloads section click on Windows.
- The webpage lists two Python lines, Python 2 and Python 3.
- Head to the latest Python 3 release (recommended).
- Scroll down to the Files heading, where you will see the list of files available to download.
- Check whether your system is 32-bit or 64-bit and download the matching Windows installer by clicking its hyperlink.
- Run Python setup by double-clicking the file you just downloaded.
- On the first screen, tick the ‘Add Python to PATH’ checkbox.
- Then continue installation to your preferred location by clicking on Customize installation. Alternatively, use the default location by clicking install now.
- Now select the box saying “Install for all users” and click install.
- Once the installation is complete, click on the ‘Disable path length limit’ option then close it.
- Now, to check whether Python is installed correctly, head to the command prompt. Type the following command and hit Enter:
python --version
- The installed Python version should be displayed, confirming that Python was installed correctly.
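The same check can be done from inside Python itself. The snippet below is an illustrative sketch using only the standard library; it prints the interpreter version and confirms it is a Python 3 release (which recent Spark versions require):

```python
import sys

# sys.version holds the full version string; the first token is "3.x.y"
print("Python", sys.version.split()[0])

# sys.version_info is a named tuple (major, minor, micro, ...)
assert sys.version_info.major == 3, "A Python 3 release is required"
```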
Step 3. Download Apache Spark
To download Apache Spark, head to the website https://spark.apache.org.
- Head to the Downloads section by clicking on Download Spark.
- Select the latest version of Apache Spark from the Choose a Spark release dropdown menu.
- Next, select a ‘Pre-built for Apache Hadoop’ option from the Choose a package type dropdown menu.
- Click the hyperlink following Download Spark. A page will open with mirror links for downloading Apache Spark.
- You can select any mirror server and continue to download Spark.
Step 4. Verify Apache Spark Files
In this step, we will verify the integrity of the download via its checksum. This step is recommended because it confirms that the downloaded file is not corrupted.
- Get back to the Download page and open the checksums link in the fourth point in a new tab.
- Now open the command prompt and type the below command then hit enter:
certutil -hashfile c:\users\Alex\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512
[Replace ‘Alex’ with your PC’s username and the file name with the Spark version you downloaded.]
- You will now see a long alphanumeric hash in the cmd window.
- Match it against the checksum shown on the website. If the two match, you are good to go: the file is not corrupted.
- If they do not match, download Spark again by following Step 3.
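If certutil is unavailable, the same SHA-512 check can be done with Python's standard hashlib module. The sketch below hashes a small demo file; on your machine you would pass the path of the downloaded spark-*.tgz instead:

```python
import hashlib

def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo file standing in for the real Spark archive
with open("demo.bin", "wb") as f:
    f.write(b"example data")

print(sha512_of_file("demo.bin"))  # a 128-character hex string
```

Compare the printed digest with the website's checksum; the comparison should ignore upper/lower case.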
Step 5. Install Apache Spark
To install Apache Spark, you need to extract the downloaded file. Here’s how to do it:
- Head to the C drive or system drive and create a new folder, say ‘Spark’. [You can choose any other name if you like.]
- Now open the folder where you downloaded the file.
- Right-click on the file and extract it to C:\Spark using 7-zip or any similar compressed file extraction tool.
- Now head to C:\Spark\, here you will find another folder with all the necessary Spark files inside it.
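The extraction can also be scripted with Python's standard tarfile module instead of 7-Zip. The sketch below demonstrates the idea on a tiny throwaway archive; for the real thing you would pass your downloaded spark-*.tgz and a target such as C:\Spark:

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed .tgz archive into dest_dir (created if missing)."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)

# Build a tiny demo archive standing in for the Spark download
work = tempfile.mkdtemp()
demo_file = os.path.join(work, "README.txt")
with open(demo_file, "w") as f:
    f.write("hello")
archive = os.path.join(work, "demo.tgz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(demo_file, arcname="demo/README.txt")

extract_archive(archive, os.path.join(work, "out"))
print(os.path.exists(os.path.join(work, "out", "demo", "README.txt")))  # True
```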
Step 6. Other Necessary File
You need to download winutils.exe before proceeding; Spark relies on it for Hadoop's file-system operations on Windows. Follow the steps below to download winutils.exe:
- Just head to https://github.com/cdarlint/winutils.
- Here, find the folder matching the Hadoop version of your Spark package (for example, a hadoop-2.7.x folder for a build pre-built for Hadoop 2.7) and open its bin folder.
- From here download the winutils.exe file.
- Now open C:\ and create a folder called ‘hadoop’.
- Create a folder named bin in it.
- Now move the downloaded winutils.exe file to C:\hadoop\bin.
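Steps like creating the folders and moving the file can be scripted as well. This is an optional Python sketch; it demonstrates the move in a temporary directory, whereas on a real machine the arguments would be your Downloads copy of winutils.exe and C:\hadoop:

```python
import os
import shutil
import tempfile

def install_winutils(downloaded_exe, hadoop_root):
    """Create hadoop_root\\bin and move winutils.exe into it; return the final path."""
    bin_dir = os.path.join(hadoop_root, "bin")
    os.makedirs(bin_dir, exist_ok=True)
    return shutil.move(downloaded_exe, os.path.join(bin_dir, "winutils.exe"))

# Demo with a placeholder file in a temp directory
work = tempfile.mkdtemp()
fake_download = os.path.join(work, "winutils.exe")
open(fake_download, "wb").close()

final_path = install_winutils(fake_download, os.path.join(work, "hadoop"))
print(os.path.exists(final_path))  # True
```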
Step 7. Set Environment Variables
Now you just need to set the environment variables to start using Apache Spark. So now to set the path, just follow the steps below:
- Right-click ‘This PC’ on the desktop and head to its properties.
- Then click Advanced system settings. A pop-up will appear.
- Click on Environment Variables.
- Under the System Variables section, hit the ‘New’ button.
- Set the variable name to SPARK_HOME (this is the name Spark's tooling conventionally expects, though any name will work for the Path entry below).
- Then, for the variable value, type C:\Spark\spark-2.4.5-bin-hadoop2.7 and hit OK.
[Replace the version above with the one you installed, and if you extracted Spark to a custom location, use that path.]
- Now again under System variables select path and click on edit.
- A new dialogue box will appear, click on New.
- Type %SPARK_HOME%\bin or type the full path. We recommend the variable form to avoid path-related errors.
[Replace SPARK_HOME with whatever variable name you used in the steps above.]
- Now repeat the same process for Hadoop and Java.
- Create a new variable under System variables.
- Name it HADOOP_HOME (Spark locates winutils.exe through this name) and set the variable value to C:\hadoop. [Note that the value is the hadoop folder itself, not its bin subfolder.]
- Then select Path under System variables and edit it by adding a new entry, %HADOOP_HOME%\bin.
- Finally, get back to System variables and add Java the same way: create a variable named JAVA_HOME with the value C:\Program Files\Java\jdk1.8.0_251 if you used the default location (otherwise enter the path of the folder containing the JDK), then add %JAVA_HOME%\bin to Path.
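If you prefer the command line over the GUI dialogs, the same variables can be set with the Windows setx command. The paths below assume the locations used in this guide, so adjust them to match your machine; note that setx only affects newly opened command prompts:

```shell
rem Point SPARK_HOME, HADOOP_HOME, and JAVA_HOME at the install folders
setx SPARK_HOME "C:\Spark\spark-2.4.5-bin-hadoop2.7"
setx HADOOP_HOME "C:\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_251"

rem Append the bin folders to the user PATH (literal paths are used here
rem because setx stores values without later variable expansion)
setx PATH "%PATH%;C:\Spark\spark-2.4.5-bin-hadoop2.7\bin;C:\hadoop\bin;C:\Program Files\Java\jdk1.8.0_251\bin"
```

After opening a new command prompt, running spark-shell should start the Spark REPL, which confirms the whole setup works.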
This article was all about installing and setting up Apache Spark on your Windows 10 PC. We hope this guide was helpful. Confused about something written here? Spotted a mistake? Let us know in the comments.