Implement a random test case on millions or billions of data rows using PySpark

Che Kulhan
3 min read · Oct 29, 2022

PySpark’s sampling method allows you to retrieve a (random) subset of the data for further analysis. This article provides some simple examples to get you started and then describes a real-world business use case for testing big datasets.

Image courtesy of https://www.analyticsvidhya.com/blog/2019/09/data-scientists-guide-8-types-of-sampling-techniques/

Using functionality such as dataframe.show() is fine for extracting data for simple analysis. Nevertheless, there are often times when you require a good, thorough random sample of data, which is where dataframe.sample() becomes useful.

Introduction

Grab some data and load it into a dataframe:

!curl https://raw.githubusercontent.com/vamsikrishnaprasad/predictive-Analytics-for-Retail-Banking/master/bank.csv --output bank.csv

bank_df = spark.read.option("header", True).csv("bank.csv")

If we execute the following line of code multiple times, we receive the exact same rows every time:

bank_df.show(5, False)

By passing the sample() function a fraction value between 0 and 1 to indicate the (approximate) percentage of rows to return, you get a new set of rows, in a different order, every time you execute the line:

FRACTION = 0.1 # sampling 10% of rows
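
A minimal sketch of the call itself, assuming the bank_df dataframe loaded earlier (sample() takes the fraction and, optionally, a seed):

# Draw an approximate 10% random sample and display the first 5 rows
bank_df.sample(fraction=FRACTION).show(5, False)

Because no seed is supplied, repeated executions draw different rows; pass a fixed seed to sample() if you need a reproducible subset.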

