
Change column values based on conditions in PySpark

Che Kulhan
3 min read · Jun 22, 2022


The when() and otherwise() functions can be used together rather nicely in PySpark to solve many everyday problems. This article demonstrates a neat technique that focuses on code readability and maintainability by separating a condition from the place where it is applied.
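As a minimal sketch of that idea (the DataFrame, column names, and values below are purely hypothetical), the condition can live in its own variable and be applied later with when() and otherwise():

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data, only to illustrate the pattern.
df = spark.createDataFrame([("A", 80), ("B", 150)], ["item", "price"])

# Define the condition on its own...
is_budget = col("price") < 100

# ...and apply it separately, so the withColumn() call stays short and readable.
df = df.withColumn("price_band", when(is_budget, "budget").otherwise("premium"))
df.show()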

Problem statement: given the following company information, we have been asked to produce a report displaying the total number of companies in each sector. To make things a little more complex, we have also been asked to:

  • Change the “Finance” description to “Financial Services”
  • If the sector is “n/a”, the description should be “No sector available”
Initial report without the business changes

First of all, let's import the required PySpark functions, fetch some CSV data, and read it into a DataFrame. If you are using Google Colab, this code will look very familiar and be easy to run.

from pyspark.sql.functions import when, col, count

!curl https://raw.githubusercontent.com/modakanalytics/tutorials/master/example/sample_data/companies.csv --output companies.csv

companies_df = spark.read.option("header", True).csv("companies.csv")
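Building on the DataFrame loaded above, here is one possible sketch of both the initial report and the two requested changes. The Sector column name is my assumption about the CSV, and each condition is kept in its own variable, in line with the technique described earlier:

from pyspark.sql.functions import when, col, count

# Initial report: total number of companies per sector, before the business rules.
companies_df.groupBy("Sector").agg(count("*").alias("Total")).show()

# Keep each condition separate from where it is applied (Sector is an assumed column name).
is_finance = col("Sector") == "Finance"
is_missing = col("Sector") == "n/a"

# Apply the two business rules with when()/otherwise(), then regroup for the final report.
report_df = (
    companies_df
    .withColumn(
        "Sector",
        when(is_finance, "Financial Services")
        .when(is_missing, "No sector available")
        .otherwise(col("Sector")),
    )
    .groupBy("Sector")
    .agg(count("*").alias("Total"))
)

report_df.show()

Because is_finance and is_missing are plain Column expressions, they can be reused or tested independently of the report that applies them.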
