Implement a random test case on millions or billions of data rows using PySpark
PySpark’s sampling method allows you to retrieve a (random) subset of the data for further analysis. This article provides some simple examples to get you started and then describes a real-world business use case for testing big datasets.
Functionality such as dataframe.show() is fine for pulling out a few rows for simple analysis. Often, however, you need a genuinely random sample of the data, and that is where dataframe.sample() becomes useful.
Introduction
Grab some data and load it into a dataframe:
!curl https://raw.githubusercontent.com/vamsikrishnaprasad/predictive-Analytics-for-Retail-Banking/master/bank.csv --output bank.csv

bank_df = spark.read.option("header", True).csv("bank.csv")
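The snippet above assumes a SparkSession is already available as spark, as it is in a Databricks or PySpark-shell notebook. If you are running a plain Python script instead, a minimal setup might look like this sketch (the app name "bank-sampling" is just a placeholder):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for the script
spark = SparkSession.builder.appName("bank-sampling").getOrCreate()

# Read the downloaded CSV, treating the first line as a header
bank_df = spark.read.option("header", True).csv("bank.csv")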
If we execute the following line of code multiple times, we receive the exact same rows every time:
bank_df.show(5, False)
The sample() function takes a fraction value between 0 and 1, indicating the (approximate) proportion of rows to return. Every time you execute the line, different rows, in a different order, are returned:
FRACTION = 0.1 # sampling 10% of rows
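As a quick sketch of how the fraction is used, sample() takes it as an argument, and an optional seed makes the result reproducible (the seed value 42 below is just an illustration):

# Roughly 10% of rows; different rows come back on each run (no seed)
bank_df.sample(fraction=FRACTION).show(5, False)

# Pass a seed if you need the same sample on every run
bank_df.sample(fraction=FRACTION, seed=42).show(5, False)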