
Quickstart Guide to Apache Spark, SparkSQL, and PySpark on WSL Ubuntu

Getting Started with PySpark on WSL Ubuntu

So you’ve heard about PySpark and you’re thinking, “Alright, I want to try this out.” Maybe you saw it in a job description, or someone on your team mentioned it during a meeting and you nodded like you knew what they were talking about (we’ve all been there).

This guide is for you. We’ll go through how to set up PySpark on your laptop using WSL Ubuntu. No need to spin up a massive cluster or set your hair on fire with Java errors. Just simple steps, running locally, with Python.

We’re gonna:

  • Install PySpark
  • Set up a clean virtual environment
  • Grab a dataset from Kaggle
  • Do some actual data exploration using DataFrames

That’s it. Nothing fancy. No complicated theory. Just practical stuff you can run and learn from.


What Is Spark and Why Should You Care?

Here’s the quick version. Spark is a distributed processing engine. It’s used when you’re working with large amounts of data and want things done fast. Like, really fast. Not “wait 10 minutes for your pandas join to finish” fast; we’re talking cluster-level speed.

The good news? You don’t need to know all the internals right away. What you do need to know are these:

  • Apache Spark is the engine
  • SparkSQL lets you write SQL queries inside Spark
  • PySpark is how you use Spark with Python
  • RDDs and DataFrames are how data moves around inside Spark

We’re sticking with PySpark for now, and we’re not going to install the full Spark runtime. Why? Because we want to keep things lean and easy. PySpark by itself is perfect for learning and running experiments locally.


What We’ll Be Working With

We’re using the “Spotify Most Streamed Songs” dataset. It’s available on Kaggle, and it’s great for playing around with basic analysis.

We’ll do some quick data exploration and a few basic transformations using PySpark DataFrames. The idea is to get a feel for how Spark works without getting buried in setup or cluster configs.

And if you’re on macOS, don’t worry. Most of this still applies. I’ll drop a few side notes where things differ.


RDDs vs DataFrames (But We’re Picking a Side)

Let me put it simply.

  • RDDs are like raw building blocks. Super flexible, very powerful, but a bit of a pain if you’re just starting. They’re great if you want full control over how things are processed.
  • DataFrames are the friendlier version. They’re optimised, easier to write, and a lot closer to SQL or pandas. If you’re just trying to get stuff done, use DataFrames.

In this tutorial, we’re going with DataFrames. No point wrestling with RDDs unless you’ve got a specific need for them. Spark handles the tricky bits behind the scenes, so you can focus on writing logic and getting results.
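If you want to see the difference in code, here’s a minimal sketch on made-up toy data. It assumes you’re inside the PySpark shell we set up later in this guide, where the spark variable already exists:

# Same aggregation two ways, on a tiny made-up dataset.

# RDD style: you manage the (key, value) tuples and the reduce logic yourself.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(rdd.reduceByKey(lambda x, y: x + y).collect())

# DataFrame style: declare what you want and Spark's optimiser handles the rest.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()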


Why We’re Using Just PySpark

Honestly? Because it’s enough. You don’t need to set up the full Apache Spark beast to start learning. PySpark is quick to install, works well locally, and lets you access nearly everything you’d use in production, from SQL-like queries to transformations and aggregations.

Later, when you’re ready to scale to clusters or run things in the cloud, you’ll already have a solid foundation.

Setting Up the Environment


Prerequisites

  • WSL Ubuntu: You need to have WSL running with Ubuntu already installed on your Windows system. This guide won’t walk through installing WSL itself, but if you’ve done that, you’re good to go.
  • Python 3: Python should already be installed inside your WSL environment. But just in case, here’s how to check and install it:

    sudo apt-get update
    sudo apt-get install python3

    Check the version to confirm it worked:

    python3 --version

Installing PySpark

Let’s begin by setting up a virtual environment and installing PySpark. Don’t worry, it’s not as daunting as it sounds – think of it as creating a sandbox where you can play around without messing up your main system.

Step 1: Create a Project Directory

Let’s make a folder to keep everything tidy. Open your terminal and run:

mkdir pyspark_shell
cd pyspark_shell

Step 2: Create and Activate Virtual Environment

We want to isolate this project so we don’t mess with other Python stuff on your system. To do that, we’ll use a virtual environment.

First, make sure you’ve got the tool to create one:

sudo apt-get install python3-venv

Then create and activate the environment:

python3 -m venv pysparkshell_env
source pysparkshell_env/bin/activate

Once it’s activated, you’ll see the environment name show up at the start of your terminal line (pysparkshell_env). That means you’re working inside the environment now.

Step 3: Install Java (Yep, Spark Needs It)

Spark runs on Java, so we’ve got to install that first.

Run this:

sudo apt-get install default-jdk

After that, you’ll need to let the system know where Java lives. Run this to set the JAVA_HOME path:

export JAVA_HOME=/usr/lib/jvm/default-java

This helps Spark find the Java runtime it needs. Keep in mind that export only applies to your current terminal session; if you want it to stick around, add that same line to the end of your ~/.bashrc.

Step 4: Install PySpark (Finally)

Before we go for PySpark, there’s one quick thing. Sometimes pip complains about not being able to build a wheel for the package. Easy fix. Install this first:

pip install wheel

Then go ahead and install PySpark:

pip install pyspark

That should pull in everything you need. To double-check it worked, run:

pyspark --version

You should see the version info printed out. If so, you’re ready to start writing some code.

A Quick Note for macOS Users

If you’re on macOS, most of these steps are pretty much the same. The main difference is you’ll probably use Homebrew to install Python or Java instead of apt-get. But once Python and Java are sorted, you can follow the same virtual environment and PySpark installation steps.

Downloading the Dataset from Kaggle

For this little project, we’re using a dataset called “Most Streamed Songs” from Spotify. It’s available on Kaggle, and it’s a good one for beginners. Clean format, not too massive, and has just enough detail to do something interesting with it.

If you’ve already got a Kaggle account, great. Head over there, search for the dataset, and download it. If you don’t, you’ll need to sign up first. It’s free, but yeah, it’s one more thing to do.

To save you a bit of time, we’ve also included the CSV file right here in the post. So if you just want to skip the whole Kaggle step, go ahead and grab the file directly. No login. No token. No drama.

👉 [Download the Spotify Most Streamed Songs CSV]
Just click, save it, and you’re done.

Once you’ve got the file, drop it into your project folder. That’s the pyspark_shell directory you created earlier. Keeping it all in one place makes life easier when we start writing code.

Firing Up PySpark and Getting the Data In

Alright, you’ve made it this far. PySpark’s installed, the dataset is sitting in your project folder, and you’re ready to run some code. Time to open up the PySpark shell and start exploring.

Step 1: Launch the PySpark Shell

To kick things off, open your terminal and just type:

pyspark

That’ll launch the interactive PySpark shell. It’s kind of like a Python shell, but for Spark. You can type commands in there and see results right away.
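One handy detail: the shell creates a SparkSession for you and exposes it as the variable spark, which is why none of the snippets below build one. If you later move this code into a standalone .py script, you’d create the session yourself, roughly like this (the app name is just a label you pick):

# Only needed in a standalone script; the pyspark shell already gives you `spark`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spotify_exploration").getOrCreate()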

Step 2: Load the Dataset

Now that you’re inside the shell, let’s load the CSV file into a DataFrame.

Think of a DataFrame like an Excel sheet, but with superpowers. You can slice it, filter it, query it, and Spark handles the heavy lifting for you.

Paste this into your shell:

# Load the CSV dataset
songs_df = spark.read.csv("Spotify_Most_Streamed_Songs.csv", header=True, inferSchema=True)

# Show the first few rows of the DataFrame
songs_df.show(5)

Here’s what’s happening:

  • We’re reading the CSV into a DataFrame called songs_df
  • header=True makes sure the column names are included
  • inferSchema=True lets Spark figure out the data types on its own
  • show(5) displays the first five rows so we can get a quick look
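Before moving on, you can also peek at just a handful of columns instead of the full table. The column names below are assumptions about how this dataset is laid out, so confirm them against printSchema() in the next step:

# Show a few columns only; the column names here are assumptions, check printSchema() first.
songs_df.select("track_name", "artist(s)_name", "streams").show(5, truncate=False)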

Step 3: Perform Basic Data Analysis

Let’s keep things simple but useful. We just want to understand what this data looks like and maybe run a few queries.

3.1: Inspect Schema

Want to know what columns you’re dealing with and what types they are?

Run:

songs_df.printSchema()

3.2: Count Number of Records

Quick question — how many records are there?

Just run:

print(f"Total records: {songs_df.count()}")

3.3: Basic Queries Using SparkSQL

Spark lets you write SQL too. First, register the DataFrame as a temporary view, then run a query to find the top five artists with the most streams:

songs_df.createOrReplaceTempView("songs")

# Example query to find the top 5 artists with the highest number of streams
artist_streams = spark.sql("""
    SELECT `artist(s)_name`, SUM(streams) AS total_streams
    FROM songs
    GROUP BY `artist(s)_name`
    ORDER BY total_streams DESC
    LIMIT 5
""")
artist_streams.show()

That query does exactly what it says. It groups songs by artist name, sums the streams, sorts them in descending order, and shows the top five.

You just ran SQL inside Spark. Not bad, right?
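If you’d rather stay in the DataFrame API, the same result can be written without any SQL at all. Here’s the equivalent chain of DataFrame calls, assuming the same column names as the query above:

from pyspark.sql import functions as F

# Same top-5-artists query, expressed with DataFrame methods instead of SQL.
songs_df.groupBy("artist(s)_name") \
    .agg(F.sum("streams").alias("total_streams")) \
    .orderBy(F.desc("total_streams")) \
    .limit(5) \
    .show()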

What We’ve Done So Far

Let’s take a step back and see what we’ve covered:

  • Set up a clean virtual environment and installed PySpark
  • Downloaded a dataset and saved it to your project folder
  • Opened the PySpark shell and loaded the data into a DataFrame
  • Ran some basic commands and even a SQL query to explore the data

This is already a solid foundation. You’ve got a working setup and real results. From here, you can try filtering by year, plotting results with Python tools outside Spark, or even exploring simple machine learning models later if that’s your thing.
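For example, a filter by year might look something like this. The released_year column name is an assumption about this dataset, so check printSchema() before relying on it:

# Keep only songs released in 2023 (released_year is an assumed column name; verify with printSchema()).
recent_df = songs_df.filter(songs_df["released_year"] == 2023)
recent_df.show(5)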
