How to Use the Over clause to Create a Moving Average in BigQuery

A moving average for COVID cases.

I typically use this blog to discuss Google Analytics 4, but I wanted to switch it up today and discuss a fresh new topic that we don't hear about too often these days: coronavirus.

A moving average is useful for smoothing highly volatile data points, and it's become the key metric that municipalities use to determine if new cases of COVID-19 are on the rise or decline. In this post I will explain how to use the Over clause to create a moving average. As a bonus, I'll provide a tutorial to use a Custom Query in Looker Studio to generate the COVID-19 chart shown below.

  • Changed all references to “data studio” to “looker studio”

Where The Data Comes From

Google is currently hosting several COVID-19 public datasets, but this chart uses data aggregated by the New York Times from US Health Agency reports (click here to see the data in BigQuery).

Ironically, I actually began pulling this data myself after reading about how comically incompetent my own Health Agency is here in Georgia (more on this topic if you're interested). In the end I'm stuck using the same questionable data, but at least I can control how it's visualized.

How the OVER Clause Calculates Moving Average

The OVER clause is an analytic function that creates a window of rows around the row being evaluated. For example, to calculate the 7 day moving average of new cases for March 15, 2020, I need to create a window that includes the prior 6 rows and then take the average of the "new_cases" column.

Moving average

The SQL for this is quite simple:

AVG(new_cases) OVER (ORDER BY date ASC ROWS 6 PRECEDING)

If you'd like to try it out and impress your friends with a cool custom COVID-19 chart, keep reading.

Tutorial: Creating Your Custom COVID-19 Chart

If you are familiar with Looker Studio and BigQuery, just skip to step 3.

Step 1: Set Up a Billing Project in Google Cloud Platform

Okay, so if you'd like to follow along with this tutorial you will need to start by creating a Google Cloud Platform project HERE. Although this is a public dataset, it is possible that you will query it so many times that you exceed BigQuery's free tier and need to be invoiced (this is extremely unlikely), so you will need to enter a credit card and set up a billing account.

Step 2: Create Your Data Source in Looker Studio

In theory you could run this query from just about any data visualization tool, but to make all of those lovely lines and colors above you'll need to go to <https: data-preserve-html-node="true"//datastudio.google.com/>, click "Create" and select "Data Source".

Create Data Source

Next you will select "BigQuery" from the list of connectors. This will bring up a window that looks something the one below. You can click "CUSTOM QUERY" and select the project you just created from the list of projects.

Custom query project

Step 3: Copy/Paste this Query

SELECT
  c.date AS date,
  c.confirmed_cases AS cumulative_cases,
  c.confirmed_cases - p.prior_day_cumulative_cases AS new_cases,
  AVG(c.confirmed_cases - p.prior_day_cumulative_cases) OVER (ORDER BY c.date ASC ROWS 6 PRECEDING) AS day7_avg_new_cases,
  c.deaths - p.prior_day_deaths AS new_deaths,
  c.deaths AS cumulative_deaths,
  @yellow_alert AS yellow_alert,
  @red_alert AS red_alert
FROM `bigquery-public-data.covid19_nyt.us_counties` c

JOIN (
SELECT
  date as prior_day,
  confirmed_cases AS prior_day_cumulative_cases,
  deaths AS prior_day_deaths
FROM `bigquery-public-data.covid19_nyt.us_counties`
WHERE
  state_name=@state_name
  AND county=@county) p
ON c.date = DATE_SUB(p.prior_day, INTERVAL -1 DAY)
WHERE
  state_name=@state_name
  AND county=@county

You may have noticed that this is a parameterized query that sets 4 variables:

  • “state_name” and “county”
    If you prefer you can remove these and add a filter to your dashboard to modify the State and County, but the query runs MUCH faster this way. The downside is that only people with EDIT access to your dashboard can modify those values.

  • “yellow_alert”
    When the 7 day moving average exceeds this line, my family starts limiting our trips.

  • “red_alert”
    When the 7 day moving average exceeds this line, we go back into full on lockdown.

To use these, you will need to click “Add Parameter” as shown in the screenshot below.

Parameterized queries

Once this is done you can hit "Save" and follow the prompts to create the data source.

Step 4: Copy My Sample Dashboard

Finally, click THIS link to view my dashboard in Looker Studio and create a copy. When you do this, you'll replace my data source with the one you created in Step 3.

Copy dashboard
Previous
Previous

5 Best Practices for Creating Your Events and Parameters in Google Analytics 4

Next
Next

Translating Website Tags from Universal Analytics to Google Analytics 4