Objectives
- Try to understand what kind of metrics we can push to GA
- Is it doable for production usage?
- What metrics can we get from other cookie-less analytics (Cloudflare and PanelBear)?
Methodology
- Use this personal website as a playground
- Use Cloudflare Workers to send data to GA
- Measure, Fix, and Repeat
Measurement Protocol
Measurement Protocol is the name GA gives to its HTTP API for pushing data directly to the analytics engine, without the client-side SDK. There are two main versions:
- GA4 - Latest version, in beta, not well documented, and only a small number of events can be captured using the server-to-server API. Not very stable. I have discarded this version.
- GA3 - Universal Analytics - Seems to be the relevant option for server-to-server analytics as of April 2021. The API is documented.
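To give an idea of what a Universal Analytics hit needs before it is worth sending, here is a minimal sketch. It assumes the documented rule that every hit carries `v`, `tid` and `t`, plus either `cid` or `uid`. The helper name `missingRequiredParams` and the tracking ID are made up for illustration.

```typescript
// Parameters every Universal Analytics hit needs: protocol version, tracking ID, hit type.
const ALWAYS_REQUIRED = ['v', 'tid', 't'] as const

// Returns the names of required parameters that are missing or empty.
// `missingRequiredParams` is a hypothetical helper, not part of any GA SDK.
function missingRequiredParams(params: URLSearchParams): string[] {
  const missing: string[] = ALWAYS_REQUIRED.filter((name) => !params.get(name))
  // A hit must identify the visitor with either a client id or a user id.
  if (!params.get('cid') && !params.get('uid')) {
    missing.push('cid')
  }
  return missing
}

// Placeholder tracking ID; the client id has not been set on this hit yet.
const draftHit = new URLSearchParams({ v: '1', tid: 'UA-XXXXX-Y', t: 'pageview' })
console.log(missingRequiredParams(draftHit))
```

The missing piece in a server-side setup is the client id, which is what the rest of this post is about.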
Implementation
The Measurement Protocol API requires a Client Id reference. This Id is used to track user actions; by default, the GA SDK generates the value for you using some black magic. In the current case, everything is server-side, so some information from the request needs to be used to generate this Client Id. The good news is that Cloudflare Workers provides a lot of information. Let’s review how unique each value is.
Header Values
- accept: low - the same for a given browser; all Chrome users are likely to have the same value
- accept-language: low - the same for a given browser; all Chrome users are likely to have the same value
- cf-connecting-ip: high - the connecting IP; it can be shared by the same household or the same enterprise
- user-agent: low - the same for a given browser; all Chrome users are likely to have the same value
CF Extra Values
- country: low - the country related to the IP
- tlsCipher: low - the cipher used for the TLS connection
- tlsVersion: low - the version used for the TLS connection, most likely 1.2 or 1.3
- city: low - based on the related IP
- continent: low - based on the related IP
- postalCode: low - based on the related IP
- timezone: low - based on the related IP
However, merging all those values into one hash can produce an identifier that is unique enough, for a limited time, to measure the traffic on the website. Using only these values will not correctly handle:
- People on the go: their IP or location will change over time, so they will be counted as a new client.
- Big corporate networks: they are likely to use the same outbound IP, although this is not certain, as they can rely on service providers with many outbound connections. In any case, traffic sharing the same IP is likely to get the same ClientID.
- ISPs: they may also share an IP across multiple clients.
Pseudocode for getClientId could be:
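    async function getClientId(request: Request): Promise<string> {
      let value = ""
      // Request headers that feed the fingerprint
      const fromHeaders = [
        'accept',
        'accept-language',
        'cf-connecting-ip',
        'cf-ipcountry',
        'dnt',
        'user-agent'
      ]
      // A missing header contributes an empty string instead of "null"
      fromHeaders.forEach((name) => value += request.headers.get(name) ?? "")
      // request.cf is only set when the code runs on Cloudflare
      const cf = request.cf
      if (cf) {
        const fromCf: (keyof IncomingRequestCfProperties)[] = [
          'country',
          'tlsCipher',
          'tlsVersion',
          'city',
          'continent',
          'postalCode',
          'timezone',
        ]
        fromCf.forEach((name) => value += String(cf[name] ?? ""))
      }
      // Hash the concatenated values with SHA-256
      const myText = new TextEncoder().encode(value)
      const myDigest = await crypto.subtle.digest("SHA-256", myText)
      // Return the digest as a lowercase hex string
      return Array.from(new Uint8Array(myDigest))
        .map((byte) => byte.toString(16).padStart(2, "0"))
        .join("")
    }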
The value can also contain a date, for instance YYYYMMDD.ClientID, so the date acts as a prefix that avoids duplicate information across days and makes sure the information is correctly split. This can be adjusted depending on the need.
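The date-prefixed format could be produced along these lines. This is only a sketch: the helper names `toDayStamp` and `withDayPrefix` are made up, and using UTC for the day boundary is an assumption made here so that every edge location agrees on the current day.

```typescript
// Formats a date as YYYYMMDD in UTC.
function toDayStamp(date: Date): string {
  // toISOString() yields e.g. "2021-04-15T00:00:00.000Z"; keep the date part, drop the dashes.
  return date.toISOString().slice(0, 10).replace(/-/g, '')
}

// Prefixes a client id with the day stamp, giving the YYYYMMDD.ClientID shape.
// `withDayPrefix` is a hypothetical helper name.
function withDayPrefix(clientId: string, date: Date = new Date()): string {
  return `${toDayStamp(date)}.${clientId}`
}

// Example with a shortened, made-up client id.
console.log(withDayPrefix('3f2a9c', new Date(Date.UTC(2021, 3, 15))))
```

With a daily prefix, the same visitor gets a different Client Id each day, so a coarser stamp (week or month) may fit better when returning visitors matter.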
The code to send the data looks like this:
    // Assumption: languageParser is the accept-language-parser npm package
    import * as languageParser from 'accept-language-parser'

    async function analytics_g3(request: Request) {
      const measurement_id = 'XXXXXX' // GA property (tracking) ID
      const url = new URL(request.url)
      const params = new URLSearchParams({
        'v': '1', // protocol version
        't': 'pageview', // hit type
        'tid': measurement_id,
        'cid': await getClientId(request), // generated client id
        'dl': request.url, // document location
        'dr': request.headers.get('referer') || "",
        'ds': 'web', // datasource
        'ua': request.headers.get('User-Agent') || "",
        'geoid': request.cf?.country || "", // cf is undefined outside Cloudflare
        'cs': url.searchParams.get('utm_source') || "(direct)",
        'cm': url.searchParams.get('utm_medium') || "organic",
        'cn': url.searchParams.get('utm_campaign') || "(direct)",
        'ck': url.searchParams.get('utm_term') || "",
        'cc': url.searchParams.get('utm_content') || "",
        'ci': url.searchParams.get('utm_id') || "",
        'gclid': url.searchParams.get('gclid') || "",
        'dclid': url.searchParams.get('dclid') || "",
      })
      // use the visitor's preferred language, when one is sent
      const langs = languageParser.parse(request.headers.get("accept-language") || "")
      if (langs.length > 0) {
        params.append('ul', langs[0].code)
      }
      // send the hit to GA
      return fetch(`https://www.google-analytics.com/collect?${params}`)
    }
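One possible way to call this from a Worker without slowing down the page is to serve the response as usual and report the hit in the background with `waitUntil`. This is a sketch, not necessarily how this site is wired. `trackAndServe` and `FetchEventLike` are names invented here, and the event shape is declared locally so the example does not depend on the Cloudflare type definitions.

```typescript
// Minimal shape of the Workers fetch event used below.
interface FetchEventLike<Req, Res> {
  request: Req
  respondWith(response: Promise<Res>): void
  waitUntil(task: Promise<unknown>): void
}

// Serves the page and reports the hit in the background.
// `sendHit` stands in for analytics_g3 and `fetchOrigin` for the global fetch.
function trackAndServe<Req, Res>(
  event: FetchEventLike<Req, Res>,
  sendHit: (request: Req) => Promise<unknown>,
  fetchOrigin: (request: Req) => Promise<Res>,
): void {
  // Hand the page response to the runtime; it does not wait for the analytics call.
  event.respondWith(fetchOrigin(event.request))
  // Keep the Worker alive until the hit is sent, and swallow failures so that a
  // GA outage never surfaces as an unhandled rejection.
  event.waitUntil(sendHit(event.request).catch(() => undefined))
}
```

In a Worker using the service-worker syntax, this could be registered with something like `addEventListener('fetch', (event) => trackAndServe(event, analytics_g3, fetch))`.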
I am not a GA expert, so some values might not be relevant or might generate wrong data.
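One way to check the parameters is the validation endpoint documented for the Universal Analytics Measurement Protocol: the same hit sent to `/debug/collect` is parsed and reported on in a JSON response instead of being recorded. A minimal sketch, where `toDebugUrl` is a made-up helper name:

```typescript
// Rewrites a /collect hit URL so it targets the validation endpoint instead.
// `toDebugUrl` is a hypothetical helper name.
function toDebugUrl(collectUrl: string): string {
  const url = new URL(collectUrl)
  // Only the path changes; the query string carrying the hit is kept as is.
  url.pathname = '/debug/collect'
  return url.toString()
}

console.log(toDebugUrl('https://www.google-analytics.com/collect?v=1&t=pageview'))
```

Fetching the resulting URL while developing should show which parameters GA accepts or rejects, without polluting the real reports.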
The code has been running on this website for a couple of days now and some metrics are showing up in GA.
Of course, don’t expect any values from user interactions to be available.
The code also logs all requests to GA, so traffic from bots is also available and will add noise to the metrics.
From a developer’s point of view, it can be nice to see bots, but for other people, this is just noise. It is possible to detect bots at the Cloudflare Workers level and adjust the behaviour: either add a new attribute or send the bot traffic to a dedicated GA account.
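Routing bot traffic to a dedicated GA property could look roughly like this. The user-agent pattern is a crude assumption chosen for illustration, the tracking IDs are placeholders, and `isLikelyBot` and `pickTrackingId` are made-up names.

```typescript
// Crude user-agent heuristic; the pattern is an illustrative assumption and will
// miss bots that present a regular browser user agent.
const BOT_PATTERN = /bot|crawler|spider|crawling|curl|wget/i

function isLikelyBot(userAgent: string | null): boolean {
  // A missing user agent is treated as a bot, since mainstream browsers send one.
  return !userAgent || BOT_PATTERN.test(userAgent)
}

// Picks the GA property for a request: bot traffic goes to a dedicated one.
// Both tracking IDs are placeholders.
function pickTrackingId(userAgent: string | null): string {
  return isLikelyBot(userAgent) ? 'UA-XXXXX-2' : 'UA-XXXXX-1'
}

console.log(pickTrackingId('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'))
```

The result of `pickTrackingId(request.headers.get('user-agent'))` would then replace the hard-coded `tid` value when building the hit.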
So the prototype is working!
Comparison with others
At the same time, I have also deployed 2 other cookie-less systems to compare numbers. Those solutions are JavaScript-based and run in the browser.
PanelBear
Over the last 7 days, the solution has recorded 44 visits! To be honest, I don’t expect more for this kind of website.
Cloudflare Analytics
Over the last 7 days, the solution has recorded 105 visits! That is more than double PanelBear’s count. Why? I have not investigated yet; both rely on JavaScript. So either Cloudflare Analytics is not yet blocked, or the two solutions count unique visits differently.
Both solutions are pretty far away from the GA values.
Can we use it in production? Probably.
Can we use the values as is? Probably not.
However, it is probably time to iterate and improve the implementation based on what has been learned so far.
Pandora’s box
The indirect but obvious finding of this small POC is that Cloudflare Workers provides, at no cost, an easy way to track people invisibly (if you accept that the generated client id might not be unique within a group of users).
This opens a lot of questions, but not for today…