Run Dataform queries for the latest GA4 daily table export

Why I wrote this blog

One of the exciting new opportunities that GA4 brings over Universal Analytics is the export to Google BigQuery, which is now available for free GA4 properties as well. It will undoubtedly cause a huge increase in the usage of Google BigQuery by marketers.

However, working with GA4 data in BigQuery also has some challenges. One of these challenges is that the timing of the daily export is pretty random. Another challenge is that when a project gets bigger, it gets increasingly difficult to keep the project (cost) efficient, well-organized, and maintainable.

Luckily for us, both challenges have solutions. For example, we can build a small pipeline that executes precisely when a new GA4 export lands in BigQuery, regardless of its timing. And tools like dbt and Dataform are perfect for managing growing BigQuery projects.

Taneli Salonen wrote an excellent blog about triggering Dataform queries immediately after the GA4 BigQuery export happens. My blog builds on his work. Please read his blog before continuing with my small addition.

I feel one small but very useful thing is missing in Taneli’s tutorial though. We can trigger Dataform when a table is added to BigQuery, but I also want to know which table GA4 added!

This is important because if you want to build an efficient project in Dataform, you’ll want to build incremental tables based only on the new data from the latest export. It’s often not necessary to reprocess all previous days again.

There are other blogs about this challenge. Simo Ahava (who else) wrote a blog that explains how to run a scheduled query with the latest ‘table_id’. His solution uses a Cloud Function that extracts the tableId from the BigQuery logs. He’s not using Dataform though. And Artem Korneev from gtm-gear.com wrote an elaborate three-part blog series about triggering Dataform, including a passage about extracting the tableId from the logs.

In this blog, I want to combine Taneli’s instructions with learnings from Artem’s blog series. I’ll explain how you can run Dataform queries immediately after the GA4 BigQuery export to build incremental tables based on the latest addition to the GA4 dataset.

Set up Cloud Logging, PubSub, and Workflows

Of course, I could rewrite Taneli’s blog, but that feels a bit redundant. His tutorial is excellent, so I’d recommend heading over to his website and following the instructions there.

We’ll need to do one thing differently compared to his instructions. Instead of the Workflow definition by Taneli, we’ll use this one instead: (continue reading to the end for an even better version)
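The embedded definition isn’t reproduced here, but a minimal sketch of the modified Workflow could look like this, assuming the Pub/Sub-triggered setup from Taneli’s tutorial. PROJECT_ID, LOCATION, and REPOSITORY are placeholders, `main` is an assumed Git branch, and the log path assumes the BigQuery jobCompletedEvent audit log coming from the sink:

```yaml
main:
  params: [event]
  steps:
    - extract_table_id:
        assign:
          # Decode the Pub/Sub message produced by the Cloud Logging sink
          - log_entry: ${json.decode(base64.decode(event.data.message.data))}
          # The daily table GA4 just created, e.g. "events_20230901"
          - ga4_table: ${log_entry.protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.destinationTable.tableId}
    - compile_dataform:
        call: http.post
        args:
          url: https://dataform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/repositories/REPOSITORY/compilationResults
          auth:
            type: OAuth2
          body:
            gitCommitish: main
            codeCompilationConfig:
              vars:
                GA4_TABLE: ${ga4_table}
        result: compilation_result
    - invoke_dataform:
        call: http.post
        args:
          url: https://dataform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/repositories/REPOSITORY/workflowInvocations
          auth:
            type: OAuth2
          body:
            compilationResult: ${compilation_result.body.name}
```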

As you can see, I’ve added a couple of lines. I’m taking the content of the log sink, decoding it, extracting the tableId, and adding it to a variable called GA4_TABLE.

Now we can move over to Dataform and see how we can use this variable in our project.

Build incremental tables in Dataform

This part of my blog builds on Artem’s blog series on Dataform. Scattered across the three articles, he explains how to run Dataform incrementally or do a full refresh. Building tables incrementally means that new rows are added to an existing table instead of replacing the entire table. This is a lot more efficient than reloading all the data each time.

The trick here is to use the latest daily table for an incremental run, but use the entire table for a full refresh. By combining all the instructions we can do just that!

Add a variable to dataform.json

In your dataform.json file, add the “GA4_TABLE” variable and give it a default value “events_*”. This will add a projectConfig variable to your Dataform project.
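For example (the project and location values are placeholders; the vars block is the relevant part):

```json
{
  "warehouse": "bigquery",
  "defaultDatabase": "my-gcp-project",
  "defaultLocation": "EU",
  "defaultSchema": "dataform",
  "vars": {
    "GA4_TABLE": "events_*"
  }
}
```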

Add the projectConfig variable to your table declaration

A best practice for organizing a Dataform project is to define your source data with declarations. In our case, this will be the GA4 data in BigQuery. In the declaration, we’ll add the database and dataset from the GA4 data, but instead of adding the table ID directly, we’ll add the variable from our dataform.json file.
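A sketch of such a declaration, assuming placeholder project and dataset names:

```javascript
// definitions/sources/ga4.js
// The table name comes from the GA4_TABLE variable: a compilation with
// GA4_TABLE=events_20230901 points this source at that daily table, while
// the default value points it at the events_* wildcard.
declare({
  database: "my-gcp-project",
  schema: "analytics_123456789",
  name: dataform.projectConfig.vars.GA4_TABLE,
});
```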

How to use the projectConfig variable in a SQL query in Dataform

To use the projectConfig variable in a SQL query in Dataform, you can reference it using the following syntax: dataform.projectConfig.vars.VARIABLE_NAME. In this case, we want to use the GA4_TABLE variable to build incremental tables based on the latest addition to the GA4 dataset.

In Dataform, you have the flexibility to choose between two approaches when updating your datasets: incremental and full refresh. With incremental updates, only the latest changes are added to existing tables. We can use the expression ${when(incremental(), ...)} for an incremental update and ${when(!incremental(), ...)} for a full refresh. These expressions are like little guides that tell Dataform what SQL magic to perform, depending on whether you want to add data from the latest daily table (think events_20230901) or replace all the data for a full refresh (e.g. events_*). Dataform will automatically adjust the SQL accordingly.

Here’s an example SQL query for Dataform that calculates the number of sessions per day, demonstrating how to incorporate the projectConfig variable:
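The original snippet isn’t reproduced here, but a minimal sketch could look like this (the project and dataset names in the full-refresh branch are placeholders):

```sqlx
config {
  type: "incremental",
  schema: "dataform"
}

SELECT
  PARSE_DATE('%Y%m%d', event_date) AS date,
  COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    CAST((SELECT value.int_value FROM UNNEST(event_params)
          WHERE key = 'ga_session_id') AS STRING)
  )) AS sessions
FROM
  -- Incremental run: only the daily table passed in by the Workflow
  ${when(incremental(), ref(dataform.projectConfig.vars.GA4_TABLE))}
  -- Full refresh: the entire GA4 dataset via the wildcard table
  ${when(!incremental(), "`my-gcp-project.analytics_123456789.events_*`")}
GROUP BY date
```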

The important part is the FROM clause. The ${when(incremental(), ref(dataform.projectConfig.vars.GA4_TABLE))} expression ensures that when running an incremental build, Dataform will use the table specified by the GA4_TABLE variable. On the other hand, when performing a full refresh, the query falls back to using the entire GA4 dataset.

Dataform’s ability to build tables incrementally is a powerful feature. By adding only one day of data at a time, you can efficiently update your tables without the need to reprocess all previous days. This not only saves processing time but also helps to keep your project cost-efficient and maintainable.

Downsides

The major downside of this approach is that the Workflow has to succeed every time in order to build a complete table. If the Workflow fails, we’ll have a gap in our data. Unfortunately, I’ve noticed that this Workflow fails quite often when it can’t connect to a connected GitHub repository. (See the screenshot below.)

The Workflow failed because the GitHub repo could not be reached.

A possible workaround for this is to add “retry steps” to the Workflow that will execute a step again if it fails. After adding retry steps to the Workflow above, we get the Workflow I’m currently using.
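Again, the full definition isn’t reproduced here, but with the try/retry syntax of Workflows each Dataform call from the sketch above could be wrapped like this (the default HTTP retry predicate and the backoff values are assumptions):

```yaml
    - compile_dataform:
        try:
          call: http.post
          args:
            url: https://dataform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/repositories/REPOSITORY/compilationResults
            auth:
              type: OAuth2
            body:
              gitCommitish: main
              codeCompilationConfig:
                vars:
                  GA4_TABLE: ${ga4_table}
          result: compilation_result
        # Retry up to 5 times with exponential backoff before giving up
        retry:
          predicate: ${http.default_retry_predicate}
          max_retries: 5
          backoff:
            initial_delay: 10
            max_delay: 300
            multiplier: 2
```

The invocation step gets the same treatment.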

Wrap up

In conclusion, leveraging Dataform’s incremental build capabilities enables you to efficiently manage your BigQuery projects by adding new data incrementally, optimizing costs, and maintaining a well-organized project structure. By incorporating the projectConfig variable in your SQL queries, you can dynamically reference the latest daily table, ensuring that your incremental builds are always up to date.

An alternative Criteo Tag template for server-side GTM

One of the clients we work for at Adwise uses Criteo for some of their advertising. And just like all the other advertising vendors, Criteo has a tracking script to measure and report campaign performance. Not all advertising vendors have a Tag template for Google server-side GTM though. So I was happily surprised that Criteo does. Unfortunately, that Tag template has some dangerous flaws that prompted me to (heavily) modify it.

In this blog post, I’ll explain what’s wrong with Criteo’s template and I’ll share my updated version.

[2023-03-10] Update: After having a conversation with Criteo, I learned that my original template was using a timestamp in local time instead of UTC. Unfortunately, this was not made clear in the documentation. As a result, I have updated my code to use a UNIX timestamp in seconds that is based on UTC.
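For reference, a UTC-based UNIX timestamp in seconds can be derived in sandboxed GTM JavaScript like this (a sketch, not necessarily the exact code in the template):

```javascript
const getTimestampMillis = require('getTimestampMillis');
const Math = require('Math');

// getTimestampMillis() counts milliseconds since the Unix epoch (UTC),
// so dividing by 1000 yields a UTC-based UNIX timestamp in seconds.
const eventTimestamp = Math.round(getTimestampMillis() / 1000);
```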

The official Criteo Tag Template

If you search in the Google Tag Template gallery, you’ll quickly find the official Criteo Tag template. The fact that it is in the official gallery gives it a bit of (false) credibility.

The Community Template Gallery

So you might be wondering what is wrong with the official Criteo template. Let me show you. The code in the screenshot below is from the official template:

A code snippet from the official Criteo Tag Template

On line 22 they use the method getAllEventData() to get all the data from the Event Data Object and store it in the variable postBodyData.
Then on lines 32 to 39 they remove some data if the user has a specific cookie. This is the cookie ‘crto_is_user_optout’ (not visible in the screenshot). Note that this only works if you have a Criteo opt-out cookie. It does not work with your own Consent Management Platform, such as Cookiebot or OneTrust.
On line 41 the entire Event Data Object is stringified, and on line 44 the function that sends all the data to Criteo is called.

Because the official template uses getAllEventData() they will likely collect more data than needed. It's almost as if they want you to accidentally breach the GDPR…

Of course, you could be very strict by making sure that the Event Data Object only contains data that Criteo needs when the tag is fired. So I reached out to Criteo about this and asked them if they have any documentation about what data is expected in the Event Data Object. After a quite lengthy email conversation, this is the most helpful response I got:

In the case the customer wants to use the GTM S2S template we get all the details of the events from the default function getAllEventData … Once an event is triggered we get all the information from GTM.

They’re just repeating my point. Not helpful at all. Time for plan B!

My unofficial Criteo Tag Template

Criteo does have documentation about their server-to-server endpoint. (https://guides.criteotilt.com/onetag/s2s/) One interesting distinction between this documentation and the official template above is that the endpoint differs. Criteo’s template sends data to https://sslwidget.criteo.com/gtm/event?mappingId=..., while their documentation mentions https://widget.criteo.com/m/event?version=s2s_v0.

I heavily modded the official template to do three things:

  • Send data to the endpoint as described in the documentation.
  • Translate standard GA4 event names to Criteo event names. 
  • Only include the data that Criteo needs.

These standard GA4 event names are mapped to Criteo events:

  • page_view (when page_location is “/”) → viewHome
  • page_view (all other pages) → viewPage
  • view_item_list → viewList
  • view_item → viewItem
  • add_to_cart → addToCart
  • view_cart → viewBasket
  • begin_checkout → beginCheckout
  • add_payment_info → addPaymentInfo
  • purchase → trackTransaction
  • login → login
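
Inside the template, this translation boils down to a small lookup. A sketch (the field and function names are illustrative, not necessarily those in the actual template):

```javascript
const getEventData = require('getEventData');
const parseUrl = require('parseUrl');

// GA4 event name -> Criteo event name
const GA4_TO_CRITEO = {
  view_item_list: 'viewList',
  view_item: 'viewItem',
  add_to_cart: 'addToCart',
  view_cart: 'viewBasket',
  begin_checkout: 'beginCheckout',
  add_payment_info: 'addPaymentInfo',
  purchase: 'trackTransaction',
  login: 'login'
};

function toCriteoEvent() {
  const eventName = getEventData('event_name');
  if (eventName === 'page_view') {
    // The homepage maps to viewHome, every other page to viewPage
    const url = parseUrl(getEventData('page_location'));
    return url && url.pathname === '/' ? 'viewHome' : 'viewPage';
  }
  return GA4_TO_CRITEO[eventName];
}
```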

The new tag should be triggered on these events:

page_view|view_item_list|view_item|add_to_cart|view_cart|begin_checkout|add_payment_info|purchase|login

How to use this template

  1. Download the template.tpl from GitHub
  2. Create a new custom Tag template in your server-side Google Tag Manager container, and import the template.tpl file
  3. Add a new Tag using the template you just added and configure it correctly. You might need to reach out to your contact at Criteo to get all the necessary data, such as Application ID and caller ID.

Just as with the official template from Criteo, you’ll also need to add the “Criteo User Identification” Tag template to your web container. You can find this template in the Community Gallery for web containers.

How to block duplicate transactions with server-side GTM and Firestore

Duplicate transactions can be a big issue for clean and correct measurement in web analytics. There are several solutions to block duplicate transactions with Google Tag Manager, but most use cookies to store transaction ids that have already been sent. I wanted a pure server-side solution for this without the use of cookies. This article describes how I used server-side GTM and Cloud Firestore to do just this.

Google Firestore

Google Firestore is a database. Or more precisely, it is a serverless document database that uses NoSQL and it is very suitable for (near) real-time applications. That also makes it a perfect companion for server-side GTM.

It’s probably most convenient to create a new Firestore database under the same Google Cloud project as your server-side GTM container. Open Google Cloud by going to console.cloud.google.com and then select the project that contains your server-side GTM container. You can find Google Firestore by using the search bar on the top or by navigating to console.cloud.google.com/firestore.

Costs

Google Firestore is not free, unfortunately. But fortunately, there is a generous free tier that includes 50,000 read actions and 20,000 write actions per day. This means that your shop can have up to 20,000 transactions per day before this would cost any money. And if you have that many transactions, then the costs for Firestore will probably not be your biggest concern.

Upgrade to Firestore Native

When you go to the Firestore console for the first time you might see a notification that says: “This project uses another database service”. This means that your Google Cloud project is using the older Cloud Datastore service or that your project is running Firestore in Datastore mode. Either way, you should upgrade this to run Firestore in Native mode.

This project uses another database service notification

To upgrade Firestore to Native mode you can click the button and follow the steps.

Google server-side GTM

Server-side tagging means that you can run tags, such as the Google Analytics tag, in a server environment instead of in the client. This has several advantages such as improved website performance and better data security. We can use it to verify and clean data before it is forwarded to a vendor. And we can also use it to verify if the transaction we want to send to Google Analytics has been sent before.

The idea is to store every transaction ID in Cloud Firestore and only send transactions to vendors when the transaction ID is not found in our database.

Write to Firestore

The first step is to store the transaction ID of every transaction in Firestore. We need to write a custom Tag template for this that uses the Firestore.write function to write data to a Firestore document or collection.
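A sketch of what the core of such a template could look like (the data.* field names are illustrative template fields, not a published API):

```javascript
const Firestore = require('Firestore');
const getEventData = require('getEventData');
const getTimestampMillis = require('getTimestampMillis');

const transactionId = getEventData('transaction_id');

// Store one document per transaction ID, e.g. "transactions/ORDER12345"
Firestore.write(data.collectionPath + '/' + transactionId, {
  transaction_id: transactionId,
  timestamp: getTimestampMillis()
}, {
  projectId: data.gcpProjectId,
  merge: true
}).then(data.gtmOnSuccess).catch(data.gtmOnFailure);
```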

Create a Tag template

You can create a new Tag template by navigating to the ‘Templates’ tab on the left-hand side in your server-side GTM container and clicking the ‘New’ button under Tag Templates. Here you can configure your new Tag, but you can also import my (or someone else’s) template as a starting point. To import a Tag Template, click the three dots in the top-right corner and click import. Download [this .tpl-file] and import it to GTM. And don’t forget to click the Save button after importing.

Create a Tag from a template

Now that we’ve created a new Tag Template, we can use it as a Tag. If you’re using my template, you’ll see that there are two fields to fill in. The first field is the ID of the Google Cloud project that the Firestore database lives under. The second field should point to a document path. You also need to add the same fields to the Permissions tab in the template.

Create a Tag

After creating a Tag Template, we can create a Tag from it. This Tag should be triggered every time a transaction takes place. If everything goes as planned, you should see new entries appear in Firestore for every transaction.

New entries in Firestore

Read from Firestore

There’s no need to create a custom template to read from Firestore because there’s a variable for this already available in every server-side GTM container. When you create a new variable in your server container, simply choose the ‘Firestore Lookup’ variable from the options.

In the variable settings, you should enter the same Collection Path as you’ve used in the Tag that you created earlier. Under that, you can enter your query conditions. In our case, we want to check if an incoming transaction_id has been registered in the database already. If the transaction_id is found, then the transaction should be blocked. And if it is not found, it can be forwarded.

The Firestore Lookup variable

Block duplicate transactions

Blocking a transaction is now a matter of adding a blocking trigger to the tags. The blocking trigger should trigger on every event where the value of the variable that reads from Firestore is not undefined. Or in other words: the blocking trigger should fire if the transaction_id was found in the Firestore database.

The blocking trigger