Introduction to big data with Google’s BigQuery

Che Kulhan
4 min readFeb 27, 2022

--

Google produced a whitepaper way back in 2010, describing its experiences with “massively parallel computing done using shared clusters of commodity machines.” This white paper introduced us to the Dremel system, built on Google’s experiences with web search and parallel database management systems (DBMS). However, what really interested me, as a user of data, is the ability to use a high-level SQL-like language to carry out queries on datasets with gigabytes or terabytes of data. Fortunately, nowadays we all have access to Dremel’s technology through Google’s BigQuery tool.

Dremel inspired BigQuery

Google BigQuery has a sandbox which is free to use without providing a credit card or using a billing account. The video describes how to access the GCP (Google Cloud Platform).

One the best ways to get accustomed to BigQuery is to use some of the public datasets available via Google Cloud Public Dataset Program.

“A public dataset is any dataset that is stored in BigQuery and made available to the general public.”

This program allows you to access data such as Google trends or analytics, or other public data such as weather, census, crime or financial information including cryptocurrencies. For this article, I will be querying International Debt data sources from the World Bank.

Explore public datasets in BigQuery

Before carrying out a query on a dataset, I would suggest spending a few moments getting yourself accustomed to the schema, where you can find information about each of the columns, number of rows and table created date, as well as a sample preview of the data. The preview saves you from having to write a SELECT * type query to view the first few rows of the dataset, especially important when having to consider analysis pricing models in business contexts.

View the schema, details and sample data before carrying out a query

One of the aspects of the BigQuery interface that you will also notice at this point is the hierarchical structure of files in the Explorer on the left pane. In the above example, my project contains the world_bank_intl_debt dataset, and a number of tables under this, such as international_debt and country_summary.

BigQuery offers 2 types of SQL queries — standard SQL and Legacy SQL , so pay attention to the differences when carrying out Internet web searches. As always, my suggestion is to migrate from legacy to standard. A typical example to illustrate the differences would be the use of square brackets in legacy SQL code:

SELECT * FROM [bigquery-public-data:world_bank_intl_debt.international_debt]
LIMIT 10

Whereas standard SQL requires a reverse single quotation, as well as the dot notation instead of the colon:

SELECT * FROM `bigquery-public-data.world_bank_intl_debt.international_debt`
LIMIT 10

However, this can confuse users especially when viewing the table information within BigQuery, where the colon is used in the Table ID.

The LIMIT keyword is probably the biggest difference when querying big data, compared to more traditional database management systems, and defines the number of rows to return. I’ve got into the habit of using it in most of my development or data analysis queries by default, especially when I am profiling the data. Although a table is fully scanned, the engine does introduce performance benefits when returning results that span multiple workers.

Now we can start carry out more complex queries, such as GROUP BYs, see execution details and view the query results in table format or even JSON.

SELECT country_name , count(*) AS Country_Count
FROM `bigquery-public-data.world_bank_intl_debt.international_debt`
GROUP BY country_name
LIMIT 10

One of the great things you will also notice is that BigQuery caches the results of the query in a temporary table, which is available for up to 24 hours later. Therefore, I would access this temporary table in the FROM clause of my subsequent queries, due to the fact that, according to Google, “bytes scanned as part of reads from temporary tables are free and do not count” towards data usage free tiers.

This article aimed to introduce us to the use of Google BigQuery for data analysis. To carry out queries on big datasets, navigate to https://console.cloud.google.com/bigquery and try it yourself.

References

Dremel: Interactive Analysis of Web-Scale Datasets (https://static.googleusercontent.com/media/research.google.com/es//pubs/archive/36632.pdf)

--

--

Che Kulhan
Che Kulhan

No responses yet