Probabilistic Identifiers and the Problem With ID Matching

By Gavin Dunaway December 30, 2015


Device fragmentation has had a complicated effect on publisher efforts to understand and target their audiences. Where once a publisher could easily track user behavior and deliver targeted advertising across a site – or network of sites – with the help of HTTP cookies, that tool is virtually useless today outside web browsers. The same person may appear as three different users when they access publisher content through desktop Internet, a smartphone or a tablet.

We shouldn’t go so far as to say the HTTP cookie is “dead,” even though so many of the trade pubs have felt perfectly comfortable saying it is. It is more accurate to say that cookies perform a limited function as an identifier, and that good ID matching requires looking at a whole range of other identifiers. These identifiers fall into two camps: deterministic and probabilistic IDs.

Deterministic identifiers are straightforward enough. These are based on some kind of identifiable data: for example, a log-in to a site, behind which is likely a name and email address, and maybe some other information shared between user and collector (see sidebar on registration data). They also may include offline customer data or IDs. The point is that these IDs directly relate to specific users, though the ID itself is coded into a long string of integers to alleviate privacy concerns. Of course, privacy fears remain because theoretically the IDs could reveal personally identifiable information.

For prominent examples, consider software- or platform-based IDs, like those of Facebook, Google, or Twitter; mobile software IDs, like Apple’s IDFA and Google’s Android ID; or publisher-based IDs, like those of Amazon, The Weather Company, or AOL.

Probabilistic IDs are a little more complicated, but they play a crucial role in cross-platform ID matching.

So, What Is a Probabilistic ID?

Probabilistic identifiers use a wide range of signals—sometimes hundreds—to build cross-channel user profiles. What kind of signals? Publicly available JavaScript commands, ad-serving data, piggybacked iFrames—information providers deem “innocuous.” As an industry expert told AdMonsters, “If you started to take away the data that we use to do the probability-based ID, pages would not load, videos will not stream.”

Basically, probabilistic ID providers use software that analyzes all of these regularly occurring digital media signals (anything from current browser version to device type to country, time zone and shared IP addresses) to build user profiles across platforms. Each provider has its own special formula for constructing its IDs. Occasionally this process is called “fingerprinting,” though the term has a negative connotation and providers try to avoid it.
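To make the idea concrete, here is a minimal sketch of signal-based matching. Everything in it is an assumption for illustration: the `Observation` fields, the weights and the threshold are made up, and real providers use far more signals and their own proprietary formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Signals observed for one device (field names are illustrative)."""
    device_id: str
    ip_address: str
    time_zone: str
    browser_version: str
    device_type: str  # kept for context; not scored below


# Hypothetical weights (points out of 100) for each signal two devices share.
SIGNAL_WEIGHTS = {"ip_address": 60, "time_zone": 25, "browser_version": 15}
# Hypothetical cut-off: at or above this score, treat two devices as one user.
MATCH_THRESHOLD = 80


def match_score(a: Observation, b: Observation) -> int:
    """Add up the weights of the signals on which both observations agree."""
    return sum(
        weight
        for signal, weight in SIGNAL_WEIGHTS.items()
        if getattr(a, signal) == getattr(b, signal)
    )


def probably_same_user(a: Observation, b: Observation) -> bool:
    """A probabilistic call: likely the same user, never a certainty."""
    return match_score(a, b) >= MATCH_THRESHOLD


laptop = Observation("laptop-1", "203.0.113.7", "America/New_York", "Chrome 47", "desktop")
phone = Observation("phone-1", "203.0.113.7", "America/New_York", "Safari 9", "smartphone")
stranger = Observation("tablet-9", "198.51.100.4", "America/New_York", "Safari 9", "tablet")

print(probably_same_user(laptop, phone))    # shared IP and time zone
print(probably_same_user(phone, stranger))  # only time zone and browser agree
```

In this toy setup the laptop and phone share an IP address and time zone, which is enough to clear the threshold, while the stranger’s tablet is not. Moving the threshold trades reach for accuracy.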

Unlike deterministic IDs, probabilistic IDs are typically not tied to hard identifying information (e.g., emails or customer IDs). Deterministic IDs such as mobile software or platform IDs are sometimes, though not always, used as seed data in building them, and a finished probabilistic ID can still be matched with deterministic IDs. Ironically enough, providers try to skate around privacy concerns by highlighting the IDs’ lack of certainty, hence the name “probabilistic ID.” Companies worth their salt boast 70% to 95% accuracy rates (based on comparisons to deterministic IDs).
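The excerpt does not say how providers calculate those accuracy rates. One plausible reading, sketched below purely as an assumption, is the share of probabilistic device pairings that a deterministic truth set (such as log-in data) confirms. The function names and sample pairs are hypothetical, and the sketch assumes every device in a predicted pair is also covered by the truth set.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same device pairing."""
    return {frozenset(pair) for pair in pairs}


def confirmed_share(predicted_pairs, truth_pairs):
    """Fraction of predicted pairings that the deterministic truth set confirms."""
    predicted = normalize(predicted_pairs)
    truth = normalize(truth_pairs)
    if not predicted:
        raise ValueError("no predicted pairs to evaluate")
    return len(predicted & truth) / len(predicted)


# Pairings produced by a (hypothetical) probabilistic model.
predicted = [
    ("laptop-1", "phone-1"),
    ("laptop-2", "tablet-2"),
    ("phone-3", "tablet-3"),
    ("laptop-4", "phone-9"),
]
# Pairings known from log-ins on both devices.
truth = [
    ("phone-1", "laptop-1"),
    ("laptop-2", "tablet-2"),
    ("phone-3", "tablet-3"),
    ("laptop-4", "phone-4"),
]

print(confirmed_share(predicted, truth))
```

Three of the four predicted pairings are confirmed here, so this toy model would report a 75% rate.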

Though not as precise as deterministic IDs, the probabilistic method offers a scalable way to map out user behaviors across devices with limited to no reliance on PII. Profiles can be built that unify users across platforms, applications and even operating systems. Finally, the probabilistic technology features learning algorithms that grow more accurate over time.

Walled Gardens and ID Matching

When different sites, platforms, mobile operating systems and tech providers track and target using proprietary deterministic ID systems, they can’t interpret each other’s IDs. The same person on Facebook and Amazon might register as two different users to an advertiser.

This is a conundrum known as “walled gardens.” Basically, a deterministic ID can be used almost exclusively to target and understand the audience on the platform or publisher ecosystem where it is native. The ID facilitator could also use its IDs for audience extension or retargeting users on third-party locales (e.g., websites, mobile apps).

Walled gardens aren’t terrible: first and foremost, they bolster data privacy by not sharing. However, for a deterministic ID to scale, the provider must have a huge audience—e.g., Facebook, Amazon, Apple iOS. Then advertisers are rightly wary about putting most of their spend through a single platform or ecosystem—they want to diversify and meet their audience in a variety of places. Relying on deterministic IDs, they could be reaching the same audience on multiple publishers and applications and have no way of knowing. Publishers, on the other hand, are nervous about working with a single platform’s ID because it may limit advertiser spend and these platforms have ulterior revenue motives that can create conflicts of interest.

Matching deterministic IDs is a key practice in de-duplicating data and identifiers within a DMP or CRM system. Linking is accomplished by finding a common key, an exact match between two data components. This works great on a small scale, within a closed system where specific business rules can pinpoint components or unique identifiers.
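Here is a minimal sketch of that kind of exact-match linking, assuming the common key is a SHA-256 hash of a normalized email address. The record layouts, field names and the choice of hash are illustrative assumptions, not a description of any particular DMP or CRM.

```python
import hashlib


def email_key(email: str) -> str:
    """Normalize an email and hash it so the raw address is never compared directly."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def link_records(crm_records, dmp_records):
    """Pair CRM customers with DMP profiles whose keys match exactly."""
    crm_by_key = {email_key(rec["email"]): rec for rec in crm_records}
    linked = []
    for profile in dmp_records:
        customer = crm_by_key.get(profile["email_hash"])
        if customer is not None:
            linked.append((customer["customer_id"], profile["dmp_id"]))
    return linked


# Hypothetical CRM rows holding raw emails, and DMP rows holding only hashes.
crm = [
    {"customer_id": "C1", "email": " Ana@Example.com "},
    {"customer_id": "C2", "email": "bo@example.com"},
]
dmp = [
    {"dmp_id": "D9", "email_hash": email_key("ana@example.com")},
    {"dmp_id": "D4", "email_hash": email_key("cy@example.com")},
]

print(link_records(crm, dmp))
```

Only the first CRM customer finds a partner, because an exact match is all or nothing: a record with no common key simply stays unlinked.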

Unfortunately, the process of matching proprietary deterministic IDs is not as easy or widespread as cookie syncing, which enables scalable programmatic targeting between DSPs and SSPs on desktop Internet. In that case, SSP cookies are passed back through JavaScript during RTB-enabled transactions, and can be passively read and matched by DSPs. Proprietary deterministic identifiers cannot simply be passed back in this way; they must be purposefully shared.
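The matching side of cookie syncing boils down to a lookup table, and a simplified sketch might look like the one below. The `MatchTable` class, its method names and the sample IDs are all hypothetical, and the sketch leaves out how the two cookie values actually reach the DSP.

```python
class MatchTable:
    """A DSP-side table mapping an SSP's cookie ID to the DSP's own cookie ID."""

    def __init__(self):
        self._dsp_id_by_ssp_id = {}

    def record_sync(self, ssp_user_id: str, dsp_user_id: str) -> None:
        """Store the pairing learned when both cookies were seen together."""
        self._dsp_id_by_ssp_id[ssp_user_id] = dsp_user_id

    def resolve(self, ssp_user_id: str):
        """Translate the SSP ID in a bid request, or return None if never synced."""
        return self._dsp_id_by_ssp_id.get(ssp_user_id)


table = MatchTable()
table.record_sync("ssp-cookie-123", "dsp-cookie-abc")

print(table.resolve("ssp-cookie-123"))  # a known user the DSP can target
print(table.resolve("ssp-cookie-999"))  # an unknown user: no match
```

The table works only because both parties can read their own cookies in the same browser. A proprietary deterministic ID has no such passive channel, so a pairing exists only if the owner chooses to share it.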

A smart strategy will put both deterministic and probabilistic methods to work in building cross-channel user profiles. Instead of creating “universal IDs,” providers will build graphs that link together users’ disparate IDs across channels and devices.
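One simple way to picture such a graph is as a set of links between IDs, where any chain of links puts the IDs in the same profile. The sketch below uses a union-find structure for that purpose. This is an illustrative choice, not a method the playbook prescribes, and the ID formats are made up.

```python
class IdentityGraph:
    """Groups IDs into profiles by following links between them (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier: str) -> str:
        """Return the representative ID of the profile this identifier belongs to."""
        self._parent.setdefault(identifier, identifier)
        while self._parent[identifier] != identifier:
            # Path halving: point at the grandparent to keep chains short.
            self._parent[identifier] = self._parent[self._parent[identifier]]
            identifier = self._parent[identifier]
        return identifier

    def link(self, a: str, b: str) -> None:
        """Record that two IDs belong to the same user, whatever the evidence."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_user(self, a: str, b: str) -> bool:
        return self._find(a) == self._find(b)

    def profile(self, identifier: str) -> set:
        """All IDs currently linked, directly or indirectly, to this one."""
        root = self._find(identifier)
        return {i for i in list(self._parent) if self._find(i) == root}


graph = IdentityGraph()
graph.link("cookie:abc", "login:user42")   # deterministic: logged in on desktop
graph.link("idfa:XYZ", "login:user42")     # deterministic: logged in on the app
graph.link("cookie:abc", "tablet:fp-77")   # probabilistic: signals matched

print(graph.same_user("idfa:XYZ", "tablet:fp-77"))
print(sorted(graph.profile("login:user42")))
```

The phone and the tablet were never linked directly, yet they land in one profile because each connects to something the other also connects to. A production graph would likely keep the source and confidence of each link so that weak probabilistic links can be revisited.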

This is an excerpt from the AdMonsters Playbook: Cross-Channel Data.
