Anyone can start developing pipelines with Apache Beam in the cloud

Che Kulhan
3 min read · Mar 4, 2022


Google Colab gives us a ready-to-use notebook for developing ETL pipelines with Apache Beam, the open-source programming model behind Google Cloud's Dataflow service.

Apache Beam

Practical Introduction

First, install the open-source Apache Beam SDK via pip in your Google Colab notebook:

!pip install apache-beam
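
To confirm the install worked, you can import the SDK and print its version (a quick sanity check; the exact version depends on which release pip resolves when you run the cell):

import apache_beam as beam
print(beam.__version__)  # e.g. 2.x, whatever release was installed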

Then get some CSV data using the curl command. I've chosen some film data from a GitHub repository:

!curl https://raw.githubusercontent.com/Rajeshwari-Rudra/apache_beam-python/main/netflix_titles.csv --output netflix.csv
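
Before building the pipeline, it's worth peeking at the first few lines to see the column layout (a quick check; the columns shown depend on the file in that repository):

!head -n 3 netflix.csv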

Now let's create a simple pipeline which reads the CSV file now sitting in the notebook's working directory and writes it back out as another file. Beam's Pipeline object is the driver that holds all the steps in your data processing pipeline:

import apache_beam as beam

pipeline = beam.Pipeline()

netflix = (
    pipeline
    | beam.io.ReadFromText("netflix.csv", skip_header_lines=1)
    | beam.io.WriteToText("results.txt")
)
pipeline.run()
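
Note that WriteToText shards its output by default, so on the local runner the result typically lands in a single file with a shard suffix. Assuming one shard, you can inspect it with:

!head results.txt-00000-of-00001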

As you can see, this pipeline doesn't add any value to the data yet: there is no transformation or filtering. Nonetheless, it demonstrates the basic program structure and shows how easy it is to ingest input data and write it back out.

Transformations

Now we would like to apply some kind of transformation, such as filtering the data by type (the second column in the CSV file). Effectively, I'd like to keep only the rows whose type is "Movie".

Apply filters on the Comma Separated Values (CSV) data

To do this, we will add two transforms between ReadFromText and WriteToText. Firstly, Map applies Python's split() to every line of the CSV file, turning each line into a Python list. Secondly, Beam's Filter does exactly what its name suggests: it checks the second element (remember that Python lists are zero-indexed) and keeps only the rows whose type equals "Movie".

import apache_beam as beam

pipeline = beam.Pipeline()

netflix = (
    pipeline
    | beam.io.ReadFromText("netflix.csv", skip_header_lines=1)
    | beam.Map(lambda line: line.split(","))
    | beam.Filter(lambda line: line[1] == "Movie")
    | beam.io.WriteToText("results.txt")
)
pipeline.run()
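
One caveat with split(","): fields such as cast lists and descriptions in CSV files like this can contain commas inside quoted strings, which a naive split breaks apart. A more robust parse, sketched here with a helper of my own rather than anything from the original pipeline, feeds each line through Python's csv module; you could then use it in beam.Map in place of the lambda:

import csv

def parse_csv_line(line):
    # csv.reader understands quoted fields, so embedded commas stay in one field
    return next(csv.reader([line]))

# ...
# | beam.Map(parse_csv_line)
# | beam.Filter(lambda row: row[1] == "Movie")
# ...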

We can make the filter a little more intuitive by writing functions that select specific movies, ratings or release years, as in the following example. Named functions make the code easier to read and therefore to maintain:

SHOW_ID = 0
TYPE = 1
RELEASE_YEAR = 2
RATING = 3

def is_ReleaseYear(film):
    return film[RELEASE_YEAR] == "2020"

To filter using this function instead of a lambda, use the following transforms in the pipeline:

| beam.Map(lambda line: line.split(","))
| beam.Filter(is_ReleaseYear)
| beam.Map(print)
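
Put together, a full version of this pipeline looks something like the sketch below. This is my own assembly following the article's pattern, and the RELEASE_YEAR index assumes the column positions defined above for this particular file:

import apache_beam as beam

RELEASE_YEAR = 2

def is_ReleaseYear(film):
    return film[RELEASE_YEAR] == "2020"

pipeline = beam.Pipeline()

films_2020 = (
    pipeline
    | beam.io.ReadFromText("netflix.csv", skip_header_lines=1)
    | beam.Map(lambda line: line.split(","))
    | beam.Filter(is_ReleaseYear)
    | beam.Map(print)
)
pipeline.run()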

The official Apache Beam website has some great examples that apply the same filtering principles to in-memory, JSON-like data, which makes them even easier to read and a great way to get some simple examples going.
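
In the same spirit, here is a minimal in-memory sketch (the records are illustrative, not taken from the Netflix file): beam.Create turns a Python list of dicts into a PCollection, and Filter works exactly as before:

import apache_beam as beam

films = [
    {"title": "Film A", "type": "Movie", "release_year": "2020"},
    {"title": "Show B", "type": "TV Show", "release_year": "2019"},
]

# The "with" form runs the pipeline automatically on exit
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(films)
        | beam.Filter(lambda film: film["type"] == "Movie")
        | beam.Map(print)
    )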

Conclusion

This article didn't delve into the details of the PCollection object, which is akin to a dataframe in PySpark. Nonetheless, we have already used PCollections without realising it inside our pipeline, and they would be the next theoretical step in learning Apache Beam.
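
For a sense of where PCollections appear, here is the same pipeline as before with each intermediate PCollection named explicitly (a sketch for illustration; the behaviour is identical to the chained version):

import apache_beam as beam

pipeline = beam.Pipeline()

# Every transform consumes and produces a PCollection,
# Beam's immutable, distributed collection of elements.
lines = pipeline | beam.io.ReadFromText("netflix.csv", skip_header_lines=1)
rows = lines | beam.Map(lambda line: line.split(","))
movies = rows | beam.Filter(lambda line: line[1] == "Movie")
movies | beam.io.WriteToText("results.txt")

pipeline.run()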

The question I have is: should I dedicate my time and resources to building pipelines with Apache Beam, and therefore Google Cloud Dataflow, or do I already have enough skills under my belt with toolkits such as Apache Spark (PySpark) and GCP's Dataproc? Let me know what your opinion is…

References:

https://beam.apache.org/documentation/programming-guide/
