
Reduce data breaches with PySpark

Che Kulhan
4 min read · Nov 3, 2022


There has recently been a plethora of data breaches in the news, with Optus and Medicare being two Australian cases. Preventing data breaches draws on many facets of cybersecurity, such as privacy, access policies, training and phishing protection. Within that mix, data engineers may find masking or encrypting personally identifiable information a valuable technique. It can be applied at the column level of a dataset using a few PySpark functions and some logic.

Image courtesy of https://money.com/what-is-a-data-breach/

Introduction

By starting with a simple, manually created DataFrame, we can practice some basic techniques and become familiar with the functions and logic used for masking and encrypting before applying them to real-world datasets.

The aim of this article is to mask or encrypt email addresses, given that an email can be linked to a particular user and is considered personally identifiable information (PII) under privacy laws such as the General Data Protection Regulation (GDPR). Editor's note: if you would like to copy and paste this code, please see the Resources section at the end of this article.

Initial code setup

lit() function for masking
