When working with Google Analytics 4 data in BigQuery, it is common to analyze the first or most recent activity for a user. For example:
- How long has it been since each user has logged in?
- For each user, how much time has passed between their first and most recent sessions?
The purpose of this quick post is to share how this can be done using the OVER clause.
Let me start by sharing a full example, and then I will explain the syntax:
# For each user_id, pull data from the most recent login event SELECT user_id, most_recent_login FROM ( SELECT user_id, event_date AS most_recent_login, ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY event_timestamp DESC) AS row_num FROM `<project id>.<dataset>.events_<date>` WHERE event_name="login") WHERE row_num=1
The goal of this query is to find the most recent login date for each user. To accomplish this, the query completes the following things:
- Pulls all events where the event_name is “login”
- Groups them by the user_id (PARTITION BY…)
- Uses the event_timestamp to order the events in the sequence that they occurred (ORDER BY…) and gives them a row number (with the most recent first)
- Then drops all of the rows where the row number is not equal to 1
Syntax of the OVER Clause
You may recognize the OVER clause from other posts on my blog (such as THIS ONE on how to calculate a moving average). The OVER clause is very useful for analyzing the event-level data that is generated by Google Analytics 4, because it helps you apply a function to only the rows that you specify, and in the order that you specify.
Here is what the syntax looks like:
<function> OVER ( [PARTITION BY] [ORDER BY] )
PARTITION BY = This is the column that you would like to use to create subsets of your data for analysis (sort of like using GROUP BY).
In my example I’ve partitioned by the user_id (the Google Analytics 4 identifier for a logged-in user), but if you are analyzing date for users are not logged-in you should replace this with user_pseudo_id.
ORDER BY = Just like it sounds, this allows you to control the order that rows will be evaluated by your function.
In my case I’ve set this to use the event_timestamp DESC so that the most recent timestamp would be first, but you could replace “DESC” with “ASC” to list your rows in ascending order.
NOTE: The OVER clause also lets you specify a subset of rows within the partition that you would like to apply the function to. I’m not covering it here because I didn’t use it for this example, but you can read more about it HERE.
Other Resources for BigQuery and GA4
Here are a few other posts that I’ve created for using BigQuery with data from Google Analytics 4:
- Cross-Stream First-Touch Attribution with Google Analytics 4 and BigQuery
- How the WITH Clause Can Simplify Your SQL Queries
- BigQuery + GA4: How To Get the First or Most Recent Event for a User
- How to Use the Over clause to Create a Moving Average in BigQuery
- Data Studio + BigQuery + GA4: Viewing a Parameter Across Multiple Events
- Data Studio + BigQuery + GA4: How to View Multiple Parameters on the Same Event
- How to Flatten a BigQuery Table with UNNEST