In this tutorial, I’ll guide you through the process of setting up and working with PySpark to perform data analysis on a dataset from Kaggle. This guide is designed for beginners looking to get hands-on experience with PySpark. We will be using the WSL Ubuntu environment, focusing on practical steps to get started. You will learn how to install PySpark, set up a virtual environment, and run scripts that can be easily replicated.
Introduction to Apache Spark, SparkSQL, PySpark, RDDs, and DataFrames
Apache Spark is an open-source, distributed processing system designed to handle big data workloads efficiently and analyze vast amounts of data at scale. SparkSQL, PySpark, RDDs, and DataFrames are the key components that let users interact with that data effectively.
- Apache Spark: A distributed computing framework that provides the ability to process big data in parallel.
- SparkSQL: The SQL module of Apache Spark that allows querying structured data using SQL.
- PySpark: The Python API for Apache Spark, which lets us use Spark’s capabilities from Python. In this tutorial, we will use only PySpark, without installing the full Spark distribution, because we aim to focus on the practical use of Spark through Python scripting.
Project Overview
Our project will focus on analyzing the “Most Streamed Songs” dataset from Spotify, which can be downloaded from Kaggle. We’ll dive into data exploration and conduct some straightforward analytics using PySpark, making the most of its powerful capabilities to get meaningful insights. The environment used is WSL Ubuntu for Windows users, but I will also include suggestions for macOS users.
What Are RDDs and DataFrames?
Before diving into the analysis, let’s touch upon two key concepts in Apache Spark: RDDs and DataFrames.
- RDD (Resilient Distributed Dataset): RDD is the core data structure of Apache Spark, designed to handle distributed data efficiently and allow parallel processing. Think of it as a collection of data that is distributed across a cluster and can be processed in parallel. RDDs are great for low-level operations and full control, but they can be a bit of a handful to manage for beginners.
- DataFrames: DataFrames are a higher-level abstraction over RDDs, offering optimised querying capabilities and ease of use, much like working with tables in SQL. In this tutorial, we’ll use DataFrames because they are much easier to work with, and Spark handles a lot of the heavy lifting under the hood. Who needs the stress of managing low-level operations when you can let Spark do the work, right?
By focusing on DataFrames, we’ll simplify our code and get results faster, without worrying too much about what’s happening in the background. Think of RDDs as the foundation and DataFrames as the comfy house built on top.
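To make the difference concrete, here is a tiny sketch (it assumes an active SparkSession named spark, which the PySpark shell we set up later provides, and uses made-up example data):
# RDD: low-level, you spell out *how* to process the data
rdd = spark.sparkContext.parallelize([("Song A", 100), ("Song B", 250)])
print(rdd.map(lambda pair: pair[1]).sum())  # 350
# DataFrame: higher-level, you describe *what* you want
df = spark.createDataFrame([("Song A", 100), ("Song B", 250)], ["title", "streams"])
df.selectExpr("SUM(streams) AS total_streams").show()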
Why PySpark Only?
Instead of installing the full Apache Spark, we use PySpark because it is easier to set up, especially when working in a local development environment. PySpark gives you direct access to Spark’s features using Python, making it ideal for learning Spark without a complex distributed setup. Since our primary focus is learning, a local setup using PySpark suffices.
Setting Up the Environment
Prerequisites
- WSL Ubuntu: Make sure WSL (Windows Subsystem for Linux) is already set up on your Windows machine. This guide will not cover the installation of WSL Ubuntu.
- Python 3: Python 3 must be installed at the system level in WSL. If Python 3 is not installed yet, you can install it by running:
sudo apt-get update
sudo apt-get install python3
Once installed, you can confirm the installation by running:
python3 --version
Installing PySpark
Let’s begin by setting up a virtual environment and installing PySpark. Don’t worry, it’s not as daunting as it sounds – think of it as creating a sandbox where you can play around without messing up your main system.
Step 1: Create a Project Directory
First, let’s create a directory where all our project files will be stored. Open your terminal and execute the following commands to get started. Don’t worry, it’s like organising your desk – keeping all the files in one place makes life easier!
mkdir pyspark_shell
cd pyspark_shell
Step 2: Create and Activate Virtual Environment
Next, we’ll set up a virtual environment to keep our project dependencies neatly contained.
Make sure the python3-venv package is installed before setting up the virtual environment. This package provides the tools needed to create virtual environments. Run this command to install it:
sudo apt-get install python3-venv
Then create and activate the virtual environment:
python3 -m venv pysparkshell_env
source pysparkshell_env/bin/activate
This will create and activate a virtual environment named pysparkshell_env. You should see (pysparkshell_env) at the start of your command line, indicating that the environment is active.
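For reference, when you finish a working session you can leave the environment and re-activate it the next time from the project directory:
# Leave the virtual environment when you are done
deactivate
# Re-activate it next time from inside pyspark_shell
source pysparkshell_env/bin/activate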
Step 3: Install Java
Before installing PySpark, we need to install Java, as it is required for running Spark. Use the following command to install Java:
sudo apt-get install default-jdk
This command installs the default Java Development Kit (JDK), which provides the Java runtime environment that PySpark needs. Once Java is installed, set the JAVA_HOME environment variable so that PySpark can locate the Java installation:
export JAVA_HOME=/usr/lib/jvm/default-java
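Note that export only applies to the current terminal session. If you want JAVA_HOME to persist across sessions, you can append the line to your shell profile; the default-java path is the usual location on Ubuntu, but you can check where Java actually lives on your system first:
# Optional: confirm where Java is installed on your system
readlink -f $(which java)
# Persist JAVA_HOME for future shell sessions
echo 'export JAVA_HOME=/usr/lib/jvm/default-java' >> ~/.bashrc
source ~/.bashrc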
Step 4: Install PySpark
Now that the virtual environment is ready and Java is installed, we can install PySpark via pip. Sometimes, you may encounter an error regarding bdist_wheel when installing PySpark. To avoid this, first install wheel:
pip install wheel
After installing wheel, proceed to install PySpark. It’s like adding the final ingredient to our recipe – time to bring it all together!
pip install pyspark
This command will install PySpark and its dependencies. To verify the installation, run:
pyspark --version
You should see information about the PySpark version that has been installed.
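You can also double-check that the package is importable from Python inside the virtual environment; this should print the same version number:
python3 -c "import pyspark; print(pyspark.__version__)"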
macOS Users
If you’re using macOS, the steps are largely similar. Open your terminal and follow the same steps for creating a project directory, virtual environment, and installing PySpark. Make sure that Homebrew is installed to manage packages like Python if needed.
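As a rough sketch for macOS (assuming Homebrew is already installed; exact package names and paths can vary between machines), installing Python and a JDK would look like this, after which you should follow the post-install instructions Homebrew prints so that PySpark can find Java:
# macOS sketch: install Python and a JDK via Homebrew
brew install python openjdk
# Follow Homebrew's printed caveats (symlink and/or JAVA_HOME) to expose the JDK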
Downloading the Dataset from Kaggle
Our project will use the “Most Streamed Songs” dataset from Spotify, which can be downloaded from this link. You will need to create an account on Kaggle if you haven’t already.
To make things easier for you (because who doesn’t like a shortcut?), we are also providing the CSV file attached with this post, so you can download it directly without needing to go through Kaggle.
Once you have the file, place the CSV in your project directory (i.e., the pyspark_shell directory).
Starting PySpark Shell and Loading the Dataset
Once PySpark is installed and your dataset is ready, we can begin analyzing it using PySpark Shell.
Step 1: Open PySpark Shell
To open the PySpark shell, simply run:
pyspark
This command will open the PySpark interactive shell, where you can run Spark commands interactively.
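The shell starts with a ready-made SparkSession called spark and a SparkContext called sc, so you don’t need to create anything yourself. A quick sanity check once the prompt appears:
# These objects are pre-created by the PySpark shell
print(spark.version)
print(sc.master)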
Step 2: Load the Dataset
First, we’ll load the dataset into a Spark DataFrame – think of a DataFrame like an Excel spreadsheet, but cooler and way more powerful. Enter the following commands in the PySpark shell:
# Load the CSV dataset
songs_df = spark.read.csv("Spotify_Most_Streamed_Songs.csv", header=True, inferSchema=True)
# Show the first few rows of the DataFrame
songs_df.show(5)
In this code:
- We load the CSV dataset into a DataFrame named songs_df using read.csv() with header=True to include column names and inferSchema=True to infer data types.
- Finally, we use show(5) to display the first few rows of the DataFrame.
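If you’d rather run these steps as a standalone script instead of typing them into the interactive shell, you need to create the SparkSession yourself, since it is only pre-built inside the shell. A minimal sketch (the file name analysis.py is just an example):
# analysis.py: standalone version of the steps above
from pyspark.sql import SparkSession
# Outside the PySpark shell there is no ready-made `spark` object, so build one
spark = SparkSession.builder.appName("SpotifyMostStreamedSongs").getOrCreate()
# Load the CSV dataset and show the first few rows
songs_df = spark.read.csv("Spotify_Most_Streamed_Songs.csv", header=True, inferSchema=True)
songs_df.show(5)
spark.stop()
Run it from the project directory, with the virtual environment active, using python3 analysis.py.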
Step 3: Perform Basic Data Analysis
Now that we have the data loaded, let’s do some basic data analysis to understand its structure.
3.1: Inspect Schema
To inspect the schema of the dataset (i.e., column names and data types), run:
songs_df.printSchema()
3.2: Count Number of Records
To count the number of records in the dataset:
print(f"Total records: {songs_df.count()}")
3.3: Basic Queries Using SparkSQL
We can also register the DataFrame as a temporary view to use SQL queries for data analysis:
songs_df.createOrReplaceTempView("songs")
# Example query to find the top 5 artists with the highest number of streams
artist_streams = spark.sql("""
    SELECT `artist(s)_name`, SUM(streams) AS total_streams
    FROM songs
    GROUP BY `artist(s)_name`
    ORDER BY total_streams DESC
    LIMIT 5
""")
artist_streams.show()
This code uses createOrReplaceTempView to register the DataFrame as a temporary view named songs, which allows us to run SQL queries against the DataFrame. The example query finds the top 5 artists with the highest number of streams.
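For comparison, the same result can be expressed with the DataFrame API instead of SQL. This is just one possible sketch; it also casts streams to a numeric type in case the column was inferred as a string:
from pyspark.sql import functions as F
# Same top-5 query, written with DataFrame operations instead of SQL
top_artists = (
    songs_df
    .groupBy(songs_df["artist(s)_name"])
    .agg(F.sum(songs_df["streams"].cast("long")).alias("total_streams"))
    .orderBy(F.desc("total_streams"))
    .limit(5)
)
top_artists.show()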
Summary
In this tutorial, we:
- Set up a Python virtual environment and installed PySpark on WSL Ubuntu.
- Downloaded a dataset from Kaggle and loaded it into PySpark.
- Used PySpark and SparkSQL to perform some basic data analysis.