Reduce data breaches with PySpark
Data breaches have been in the news a great deal recently, with Optus and Medicare as two Australian cases. Preventing data breaches draws on many facets of cybersecurity, such as privacy, access policies, training and phishing protection. Within that, data engineers may find masking or encrypting personally identifiable information a valuable technique. It can be applied at the column level of a dataset using a few PySpark functions and some logic.
Introduction
By starting with a simple, manually made DataFrame, we can practice some basic techniques and become familiar with the functions and logic used for masking and encrypting before applying them to real-world datasets.
The aim of this article is to mask or encrypt email addresses, because an email address can be linked to a particular user and is considered personally identifiable information (PII) under privacy laws such as the General Data Protection Regulation (GDPR).
Editor's note: If you would like to copy/paste this code, please see the Resources section at the end of this article.
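To make the goal concrete, here is a minimal sketch of the two ideas in plain Python, so it runs without a Spark session. It is not the article's own code: the helper names `mask_email` and `hash_email`, the sample addresses and the salt are all hypothetical. The second helper uses a one-way SHA-256 hash, which is pseudonymization rather than reversible encryption. In PySpark, similar column-level logic could likely be expressed with `regexp_replace` and `sha2` from `pyspark.sql.functions`.

```python
import hashlib


def mask_email(email: str, visible: int = 1, mask_char: str = "*") -> str:
    """Keep the first `visible` characters of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep:
        # No "@" found: mask the whole value rather than leak any of it.
        return mask_char * len(email)
    kept = local[:visible]
    return kept + mask_char * (len(local) - len(kept)) + "@" + domain


def hash_email(email: str, salt: str = "") -> str:
    """Return a SHA-256 hex digest of the normalised address (one-way, not reversible)."""
    normalised = email.strip().lower()
    return hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()


# Hypothetical sample values standing in for an email column.
emails = ["jane.doe@example.com", "bob@example.org"]
masked = [mask_email(e) for e in emails]
hashed = [hash_email(e, salt="demo-salt") for e in emails]
```

Masking keeps the value readable enough for debugging or support work, while hashing still lets two datasets be joined on the same email without exposing it. Which one suits a given column depends on how the data is used downstream.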