Reduce data breaches with PySpark
Data breaches have been in the news a lot recently, with Optus and Medicare as two Australian cases. Protecting against data breaches involves many facets of cybersecurity, such as privacy, access policies, training and phishing protection. Within that mix, data engineers may find masking or encrypting personally identifiable information a valuable technique: it can be applied at the column level of a dataset, using a few PySpark functions and some logic.

Introduction
By starting with a simple, manually made dataframe, we can practice some basic techniques and become familiar with the functions and logic used for masking and encrypting, before applying them to real-world datasets.
The aim of this article is to mask or encrypt email addresses, given that an email address can be linked to a particular user and is considered personally identifiable information (PII) under privacy regulations such as the General Data Protection Regulation (GDPR). EDITOR'S NOTE: If you would like to copy/paste this code, please see the Resources section at the end of this article.

lit() function for masking
The simplest way to mask data is to use the lit() function, which replaces the email value with a fixed string, “***Masked***” in this case. In the following example, the when() condition leaves NULL values unmasked:
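from pyspark.sql.functions import col, lit, when
# Replace every non-NULL email with a fixed mask; NULL values stay NULL
conditions_mask = when(col("email").isNotNull(), lit("***Masked***")).otherwise(col("email"))
df_emails = df_emails.withColumn("email", conditions_mask)
df_emails.show(5, False)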

I personally like this technique as it is simple and fast, yet it gives you the flexibility to apply different conditions to the data, e.g. applying the mask only if the string contains an @ symbol.