Share PySpark data via JSON (and Pandas)

Che Kulhan
5 min read · Mar 24, 2022


You’ve got some big data which you have filtered and transformed into something “manageable” and would like to display it on a web page or mobile phone. So how can you get the data into a format which can then be consumed via a web service or another application? The answer is JSON … or is it Pandas … or both?

Image courtesy of https://www.bigdataschool.ru/blog/rest-api-call-on-spark-with-python-udf.html

JavaScript Object Notation (JSON) is currently the de-facto file format for sharing and communicating data on the web, so it can be used to easily send data via REST from one application to another application, including data that comes from big data repositories such as Hadoop and Hive via PySpark. In this article, I will discuss a number of options available to convert PySpark data to JSON.

Demonstration data

If you are following along using Google Colab or a similar notebook, grab some demo CSV data like this:

!curl https://raw.githubusercontent.com/vamsikrishnaprasad/predictive-Analytics-for-Retail-Banking/master/bank.csv --output bank.csv

and then read it into a PySpark dataframe:

bank_df = spark.read.option("header",True).csv("bank.csv")
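This assumes a SparkSession named spark already exists. If you are following along in a plain Colab notebook, a minimal setup sketch (assuming pyspark has been installed, e.g. via !pip install pyspark) looks like this:

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession for the demo
spark = SparkSession.builder.appName("json-demo").getOrCreate()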

Firstly, I would like to transform the dataset into something more manageable. In this case, I have executed a transformation that counts the occurrences of each job. Using a groupBy() clause and renaming the job field to occupation adds some variety and interest to the transformation 😄.

field = "job"jobs_df = (bank_df
.groupBy(col(field))
.count()
.select(col(field).alias("occupation"), col("count"))
)
jobs_df.show(10, False)

The result is a small two-column table of occupations and their counts.

Using Pandas to_json()

And what do I mean by manageable? Well, according to the Apache PySpark documentation, the toPandas() method “should only be used if the resulting Pandas’s DataFrame is expected to be small, as all the data is loaded into the driver’s memory.” In any case, a web or mobile app probably shouldn’t display millions of rows to a user anyway: the user experience (UX) would suffer unless pagination or a similar mechanism limited the number of results shown at a time.
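If you are not certain the result is small, one defensive sketch is to cap the rows before collecting them (limit() is a standard DataFrame method; the 1000 here is an arbitrary example value):

# Cap the number of rows pulled into the driver before converting
safe_pandas_df = jobs_df.limit(1000).toPandas()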

Now that the PySpark results are manageable (i.e. not too many rows), we can convert the PySpark data into a pandas dataframe and then convert the pandas dataframe into a JSON formatted string. Here is the workflow:

PySpark >> Pandas >> JSON

# Collect the (small) PySpark result to the driver as a pandas DataFrame,
# then serialise it to a JSON string
pandas_df = jobs_df.toPandas()
s = pandas_df.to_json()
print(s)

What do you notice about the results? (Pasting them into a JSON viewer such as https://jsonformatter.org/json-viewer or https://codebeautify.org/jsonviewer displays the JSON notation in an easy-to-read graphical format.) That’s right! Because to_json() defaults to orient="columns", all the occupation values are listed first, keyed by row index, followed by their corresponding counts. What I really need is the data in a row-oriented format that is easy to traverse and display, such as rows of a table.
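You can confirm the column-oriented layout by parsing the string back and inspecting the top-level keys (a quick check using the standard json module):

import json

# With the default orient="columns", the top-level keys are the column names
parsed = json.loads(s)
print(list(parsed.keys()))  # ['occupation', 'count']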

By adding the orient parameter to the to_json() function, the results are as follows:

pandas_df = jobs_df.toPandas()
s = pandas_df.to_json(orient="records")
print(s)
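Now each row is its own JSON object. One quick way to verify the shape is to parse the string back: every element should come back as a dict with an occupation and a count:

import json

# With orient="records", the result parses into a list of row dicts
rows = json.loads(s)
for row in rows:
    print(row["occupation"], row["count"])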

The JSON data could also be used as a REST API response object, and then be processed within a website or mobile application using Python’s requests module. A simple demonstration of the post method can be found here.
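As a minimal sketch of the sending side (the endpoint URL below is a placeholder, not a real service), the JSON string can be posted with requests like this:

import requests

# Hypothetical endpoint; replace with your own API's URL
url = "https://example.com/api/occupations"
response = requests.post(url, data=s, headers={"Content-Type": "application/json"})
print(response.status_code)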

Once in a website, the results can be retrieved from the REST request as an array, which makes them much easier to traverse, for example with a forEach() loop in JavaScript. The following image uses the Chrome DevTools console to loop over and print every element in the array.

Chrome DevTools to demonstrate a loop over the JSON results

Using Pandas to_html()

Another option is to convert the dataset directly to an HTML table, using the to_html() function. This has the disadvantage of being clumsy: rendered HTML is awkward to transfer through application communication APIs such as REST. Nonetheless, it does provide easy, readable access to an HTML table.

to_html() function
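A minimal sketch of the call (index=False simply drops the pandas row index from the generated table):

# Render the pandas DataFrame as an HTML <table> string
html = pandas_df.to_html(index=False)
print(html)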

Using RDDs

Yet another option is the dataframe’s toJSON() method. Perhaps surprisingly, the Apache Spark documentation notes that the returned object is not a dataframe but a Resilient Distributed Dataset (RDD): “Each row is turned into a JSON document as one element in the returned RDD”. This has the disadvantage of being slower, and it requires the developer to work with a different abstraction of the data (i.e. not a dataframe).

# toJSON() returns an RDD of strings, one JSON document per row
rdd = jobs_df.toJSON()
result = rdd.collect()  # a Python list of JSON strings
print(result)

In addition, collect() returns a Python list of separate JSON strings rather than a single JSON document, so the printed output (with its extra quotes and commas) cannot be parsed by the web’s JSON viewers. This could create difficulties further down the line when building a RESTful application.
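A small sketch of a workaround: join the collected strings into one valid JSON array before handing the result to a viewer or an API:

# Each element of result is a standalone JSON document (one per row);
# joining them inside brackets yields a single valid JSON array string
json_array = "[" + ",".join(result) + "]"
print(json_array)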

pyspark.sql.functions

Finally, the pyspark.sql.functions module also contains a to_json() helper that converts data into JSON format without leaving the dataframe:
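A minimal sketch: to_json() serialises a struct column to a JSON string, so struct() is used first to pack each row’s columns together:

from pyspark.sql.functions import to_json, struct

# Pack each row's columns into a struct, then serialise it to a JSON string column
json_df = jobs_df.select(to_json(struct("occupation", "count")).alias("json"))
json_df.show(10, False)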

Conclusion

This article aimed to demonstrate a number of options for using JSON to share big data on the web, and introduced a number of tools at our disposal, such as Google Colab, JSON viewers and Chrome’s DevTools.

Do you know of other ways to share big data?

