Reduce data breaches with PySpark

Che Kulhan
4 min read · Nov 3, 2022

A plethora of data breaches has been in the news recently; Optus and Medicare are two Australian cases. Preventing data breaches draws on many facets of cybersecurity, such as privacy, access policies, training, and phishing protection. Within that mix, data engineers may find masking or encrypting personally identifiable information a valuable technique: it can be applied at the column level of a dataset using a few PySpark functions and some logic.

Image courtesy of https://money.com/what-is-a-data-breach/

Introduction

Starting with a simple, manually created dataframe lets us practice some basic techniques and become familiar with the functions and logic used for masking and encrypting before applying them to real-world datasets.

The aim of this article is to mask or encrypt email addresses, given that an email can be linked to a particular user and is considered personally identifiable information (PII) under privacy laws such as the General Data Protection Regulation (GDPR). Editor's note: if you would like to copy and paste this code, please see the Resources section at the end of this article.

Initial code setup

lit() function for masking

The simplest way to mask data is the lit() function, which replaces the email value with a fixed string, “***Masked***” in this case. Note that this is masking rather than encryption: the original value is discarded and cannot be recovered. In the following example, the when() condition leaves NULL values unmasked:

conditions_mask = when(col("email").isNotNull(), lit("***Masked***")).otherwise(col("email"))
df_emails = df_emails.withColumn("email", conditions_mask)
df_emails.show(5, False)
Masking using lit()

I personally like this technique as it is simple and fast, yet it gives you the flexibility to apply different conditions to the data, e.g. applying the mask only if the string contains an @ symbol.

A better mask function
