Anyone can make a guess about why it’s taken 20 years or so for the digital media industry to make standardizing data nomenclature a priority. Admittedly, “data nomenclature” doesn’t really roll off the tongue, and if anyone thinks spreadsheets are sexy, they’re probably a pretty niche group.
Yet data nomenclature is having a moment, thanks to some high-profile deals where data flow is a key factor. The IAB has launched the Data Nomenclature Initiative, aimed at creating standards around data field headers industry-wide, across publishers, agencies and buyers. Mediasmith CEO David L. Smith hatched the idea for the new initiative, and he spoke with AdMonsters about the challenges of ETL (extracting, transforming and loading data sets) when so many players in the space are using different terms for the same data, and about what he wants this initiative to accomplish now.
Smith will be talking more about this at our upcoming OPS in New York City on June 7, but until then, let him explain why it’s high time we talk about and try to unravel the inefficiencies that come from data nomenclature as we’ve known it.
What involvement do you have with the IAB’s Data Nomenclature Initiative, and why is it a meaningful issue for you to address right now?
I came up with the initiative and presented the concept to the IAB and AdMonsters, both of whom have endorsed it. It’s a meaningful issue as I want to reduce friction in the flow of data in our industry and for my company, Mediasmith, a mid-major media agency. Agencies, advertisers and publishers all access data from others. Each vendor has different names for the same data fields. Too much time is spent transforming the names, normalizing them by hand.
Currently, there are many deals being made to facilitate the flow of data, such as the recently announced Mediaocean and Rubicon agreement. Rubicon already provides similar data to STRATA, but the field names Mediaocean uses differ from STRATA’s. Each of these deals will have custom solutions that will take months to develop. Every month that goes by sees more invested by various parties in one-off solutions. A quick resolution to this issue will not only reduce friction in day-to-day analytics, but also make post-deal execution easier.
What are the stakes in solving or not solving the problem of a lack of standards for data field headers? More specifically, are there dollar figures you can pin on the inefficiency in how data fields are combined now?
I cannot put a dollar number on it. Suffice it to say that ETL, of which this is a major component, is one of the greatest inefficiencies in agency media work today. AdExchanger estimates that 80% of an analyst’s time is spent on ETL and only 20% on actual analytics.
Doing this now also gives us the chance to fix a 20-year-old problem of using legacy terms to inaccurately portray the data in the field. For example, “estimate” comes from broadcast, but the more accurate term for us would be “authorization.”
An unintended but positive outcome of this initiative could be the facilitation of other uses of data, such as the failed industry initiative for e-invoicing, where the failure of systems to understand each other was a major hurdle.
Who is most affected by data nomenclature problems, and what can those parties do to smooth the process and make real progress on their own?
Agencies, advertisers, publishers and aggregators all suffer from this. The big entities employ database professionals to handle this problem. But it is simply not affordable for the mid-major agencies and publishers.
Tell us more about what ETL (extracting, transforming, loading data) means on a day-to-day level. What happens during that process and what are the challenges in it?
There is a growing number of companies facilitating the automation of this. They too see the inefficiency. Extracting means giving the agency or advertiser direct hooks into publisher data, often through an API.
Transforming is the often manual process of normalizing the data so each publisher or data source (third-party ad server, DSP, etc.) has its own tab on a spreadsheet. Once normalized, these tabs can be totaled to represent overall campaign performance.
Loading has to do with combining the newly added data with prior campaign data and placing it in a reporting mechanism, sometimes a spreadsheet or PPT, but increasingly a BI engine or dashboard.
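To make the transforming and loading steps concrete, here is a minimal Python sketch of what header normalization might look like. It is an illustration only, not something Smith or the IAB has published: the alias table, the standard names ("impressions", "clicks", "cost") and the function names are all hypothetical, and real vendor feeds would carry many more fields.

```python
from collections import defaultdict

# Hypothetical alias table: each vendor-specific header maps to one standard
# name. These names are assumptions for illustration, not an IAB standard.
FIELD_ALIASES = {
    "imps": "impressions",
    "impressions delivered": "impressions",
    "clicks": "clicks",
    "click count": "clicks",
    "spend": "cost",
    "media cost": "cost",
}


def normalize_row(row):
    """Transform: rename one vendor row's headers to the standard names."""
    normalized = {}
    for header, value in row.items():
        key = header.strip().lower()
        if key not in FIELD_ALIASES:
            # Fail loudly rather than silently dropping an unknown field.
            raise KeyError(f"No standard name for header: {header!r}")
        normalized[FIELD_ALIASES[key]] = value
    return normalized


def total_campaign(rows_by_source):
    """Sum every normalized metric across all sources (the 'tabs')."""
    totals = defaultdict(float)
    for rows in rows_by_source.values():
        for row in rows:
            for field, value in normalize_row(row).items():
                totals[field] += value
    return dict(totals)


def load(history, rows_by_source):
    """Load: append newly normalized rows, tagged by source, to prior data."""
    combined = list(history)  # copy so the prior data is left untouched
    for source, rows in rows_by_source.items():
        for row in rows:
            combined.append({"source": source, **normalize_row(row)})
    return combined


# Two sources reporting the same metrics under different headers.
reports = {
    "publisher_a": [{"Imps": 1000, "Click Count": 12, "Media Cost": 50.0}],
    "dsp_b": [{"Impressions Delivered": 2500, "Clicks": 30, "Spend": 110.0}],
}
history = [
    {"source": "publisher_a", "impressions": 900, "clicks": 10, "cost": 45.0}
]

totals = total_campaign(reports)
combined = load(history, reports)
```

With shared field names across vendors, the alias table shrinks toward nothing, which is the point of the initiative: the renaming step is where the manual effort goes today.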
What’s your utopian vision for this initiative? What’s the outcome you want to see?
Automation of ETL. The result we want is real-time or near real-time (daily) availability of updated data in dashboards for all constituents to see. For many, reporting is done weekly, as that’s how long it can take to pull an ongoing campaign’s data together. With automated ETL, APIs can extract the data each night, port it over to a cloud service such as AWS where previous data is stored, then download the newly combined data to a BI or dashboard server. That makes it possible for analysts to have updated information each day when they arrive at work. Then they can spend their time improving campaigns rather than jockeying data.
It will also significantly reduce the cost of using outside ETL and analytics companies and make these services affordable to a much broader range of advertisers. Easier and more affordable flow of data will help the entire industry.