Change column values based on conditions in PySpark
The when() and otherwise() functions work together nicely in PySpark to solve many everyday problems. This article demonstrates a neat technique for improving code readability and maintainability: separating a condition from the place where it is applied.
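As a quick preview of the idea, here is a minimal sketch using toy data and an invented column name (not the dataset used below). The condition is stored in its own variable and then handed to when()/otherwise() where the new column is built:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

# Toy data purely for illustration
toy_df = spark.createDataFrame([("Finance",), ("Retail",), ("n/a",)], ["sector"])

# The condition lives on its own line, separate from where it is applied...
sector_is_missing = col("sector") == "n/a"

# ...and is then used when deriving the new column
toy_df = toy_df.withColumn(
    "sector_description",
    when(sector_is_missing, "No sector available").otherwise(col("sector")),
)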
Problem statement. Given the following company information, we have been asked to produce a report showing the total number of companies in each sector. To make things slightly more complex, we have also been asked to:
- Change the “Finance” description to “Financial Services”
- If the sector is “n/a”, the description should be “No sector available”
First of all, let’s import the required PySpark functions, download some CSV data and read it into a DataFrame. If you are using Google Colab, this code will look very familiar and is easy to run.
from pyspark.sql.functions import when, col, count

!curl https://raw.githubusercontent.com/modakanalytics/tutorials/master/example/sample_data/companies.csv --output companies.csv

companies_df = spark.read.option("header", True).csv("companies.csv")
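A quick sanity check that the header row was picked up and the data looks as expected never hurts; something along these lines works (the actual column names depend on the CSV file):

# Inspect the inferred columns and a few sample rows
companies_df.printSchema()
companies_df.show(5, truncate=False)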