How to Use Window Functions to Create a Moving Average in BigQuery
I typically use this blog to discuss Google Analytics 4, but I wanted to switch it up today and discuss a fresh new topic that we don't hear about too often these days: coronavirus.
A moving average is useful for smoothing highly volatile data points, and it's become the key metric that municipalities use to determine if new cases of COVID-19 are on the rise or decline. In this post I will explain how to use the Over clause to create a moving average. As a bonus, I'll provide a tutorial to use a Custom Query in Looker Studio to generate the COVID-19 chart shown below.
-
Changed all references to “data studio” to “looker studio”
Where The Data Comes From
Google is currently hosting several COVID-19 public datasets, but this chart uses data aggregated by the New York Times from US Health Agency reports (click here to see the data in BigQuery).
Ironically, I actually began pulling this data myself after reading about how comically incompetent my own Health Agency is here in Georgia (more on this topic if you're interested). In the end I'm stuck using the same questionable data, but at least I can control how it's visualized.
How the OVER Clause Calculates Moving Average
The OVER clause is an analytic function that creates a window of rows around the row being evaluated. For example, to calculate the 7 day moving average of new cases for March 15, 2020, I need to create a window that includes the prior 6 rows and then take the average of the "new_cases" column.
The SQL for this is quite simple:
AVG(new_cases) OVER (ORDER BY date ASC ROWS 6 PRECEDING)
If you'd like to try it out and impress your friends with a cool custom COVID-19 chart, keep reading.
Tutorial: Creating Your Custom COVID-19 Chart
If you are familiar with Looker Studio and BigQuery, just skip to step 3.
Step 1: Set Up a Billing Project in Google Cloud Platform
Okay, so if you'd like to follow along with this tutorial you will need to start by creating a Google Cloud Platform project HERE. Although this is a public dataset, it is possible that you will query it so many times that you exceed BigQuery's free tier and need to be invoiced (this is extremely unlikely), so you will need to enter a credit card and set up a billing account.
Step 2: Create Your Data Source in Looker Studio
In theory you could run this query from just about any data visualization tool, but to make all of those lovely lines and colors above you'll need to go to <https: data-preserve-html-node="true"//datastudio.google.com/>, click "Create" and select "Data Source".
Next you will select "BigQuery" from the list of connectors. This will bring up a window that looks something the one below. You can click "CUSTOM QUERY" and select the project you just created from the list of projects.
Step 3: Copy/Paste this Query
SELECT c.date AS date, c.confirmed_cases AS cumulative_cases, c.confirmed_cases - p.prior_day_cumulative_cases AS new_cases, AVG(c.confirmed_cases - p.prior_day_cumulative_cases) OVER (ORDER BY c.date ASC ROWS 6 PRECEDING) AS day7_avg_new_cases, c.deaths - p.prior_day_deaths AS new_deaths, c.deaths AS cumulative_deaths, @yellow_alert AS yellow_alert, @red_alert AS red_alert FROM `bigquery-public-data.covid19_nyt.us_counties` c JOIN ( SELECT date as prior_day, confirmed_cases AS prior_day_cumulative_cases, deaths AS prior_day_deaths FROM `bigquery-public-data.covid19_nyt.us_counties` WHERE state_name=@state_name AND county=@county) p ON c.date = DATE_SUB(p.prior_day, INTERVAL -1 DAY) WHERE state_name=@state_name AND county=@county
You may have noticed that this is a parameterized query that sets 4 variables:
“state_name” and “county”
If you prefer you can remove these and add a filter to your dashboard to modify the State and County, but the query runs MUCH faster this way. The downside is that only people with EDIT access to your dashboard can modify those values.“yellow_alert”
When the 7 day moving average exceeds this line, my family starts limiting our trips.“red_alert”
When the 7 day moving average exceeds this line, we go back into full on lockdown.
To use these, you will need to click “Add Parameter” as shown in the screenshot below.
Once this is done you can hit "Save" and follow the prompts to create the data source.
Step 4: Copy My Sample Dashboard
Finally, click THIS link to view my dashboard in Looker Studio and create a copy. When you do this, you'll replace my data source with the one you created in Step 3.
That's it. Any time you view this chart it will retrieve the data from BigQuery that is currently being updated daily by the NY Times.
Other Resources for BigQuery and GA4
Here are a few other posts that I’ve created for using BigQuery with data from Google Analytics 4:
- Cross-Stream First-Touch Attribution with Google Analytics 4 and BigQuery
- How the WITH Clause Can Simplify Your SQL Queries
- BigQuery + GA4: How To Get the First or Most Recent Event for a User
- How to Use the Over clause to Create a Moving Average in BigQuery
- Data Studio + BigQuery + GA4: Viewing a Parameter Across Multiple Events
- Data Studio + BigQuery + GA4: How to View Multiple Parameters on the Same Event
- How to Flatten a BigQuery Table with UNNEST