Releasing search queries is a really really really terrible idea.
In which I address a recent New York Times opinion piece.
In August of 2005, the Bush Administration requested a month of search queries from Google and the other search engines. This was ostensibly to help revive the then-moribund Child Online Protection Act, and Google (my employer at the time) successfully resisted the subpoena. Like many employed by Google at the time, I was pretty proud that we had stood up to what I thought was such vast overreach.
Right around this time, there was a huge amount of excitement in the machine learning community around AOL's release of a dataset of queries for researchers to pore over. The idea was that by working with real data, they could build the next generation of recommendation and matching algorithms. Later, in 2006, Netflix launched the Netflix Prize, which would award a million dollars to the person or team who ‘beat’ Netflix’s recommendation algorithm1 given a sample dataset of peoples’ viewing and rating habits on that service.
In August of 2006, I worked peripherally with Peter Norvig and the research crew to release the n-gram dataset from Google’s books operation. At the time I was helping run2 the Search API, which allowed researchers to get some insight into search and would later evolve into Google Trends. It was an exciting time in machine learning!
What about the subpoena? Requesting a month's worth of raw queries, even anonymized, was not just a terrible idea technically; we knew there was very little chance anyone could comply with the subpoena without doing grievous damage to user privacy.
The NY Times proved this point for us when they took a look at the AOL data release, which had been “anonymized”. After a cursory look at the data, they were able to track down a number of users and contact them about the release. Further still, Netflix canceled the second round of the Netflix Prize contest, as the first had already released enough information to make it possible to derive who was watching what in the dataset.
Search is *intensely* personal. It’s not that the data isn’t valuable to those looking to snoop on Americans, or to compete with Google; it’s that it’s just too personal to share. This is why I was honestly surprised to see Julia Angwin’s opinion piece “Breaking Up Google Isn’t Nearly Enough” in this week’s New York Times.
Now, before you say “look at the former Googler defending his tech-daddy”, I will say that I can actually think of a number of interesting and effective ways to break up my former employer, somewhat along the lines of how AT&T was broken up in the 80s.
But this proposal, in which Ms. Angwin calls for the forced sharing of user query data, is wildly disrespectful and destructive of human privacy. She’s not wrong about the value of the query data: it’s intensely valuable! It’s why so many countries have required Google and other companies to expire such data after 18 months (or less). It’s been the target of every state actor since Google’s rise to prominence.
Releasing query data would be a danger to privacy so vast, so damaging and so deeply personal that I can’t imagine what the point of the article is. She even says in the article:
“Of course, privacy considerations are involved in releasing data about search queries. Back in 2006, AOL released some search queries to the public for research purposes, and some of the information allowed users to be identified. We have better and more sophisticated privacy protections these days.”
I’d go so far as to assure Ms. Angwin and the readers of the NY Times that “we” do not have better and more sophisticated privacy protections that would also preserve the value of the data. You can’t have it both ways.3
Furthermore, releasing this data stream would do such grievous harm to American citizens’ privacy that it would be well-nigh unrecoverable. If the actual goal here is the end of privacy, then this is one way to do it. Do you think police overreach, identity theft, scams or predatory corporate behavior is bad now? It’s gonna get so much worse if you release this data.
Imagine you’re a young person searching for resources online to help understand your sexuality, and having that pipelined to your local preacher or pastor. Kevin Bacon winning over an Oklahoma town with the power of dance makes for a fun movie, but in real life, you should be very careful about who is looking over your shoulder as you search.
So, I can only ask that people step back from this precipitous cliff and find better ways of countering Alphabet’s monopoly power than destroying the privacy of the very citizens we feel are victimized by it.
That is, for suggesting which movie or show to watch next. At the time, Netflix was very interested in getting a user to watch the next episode, show or movie and retaining the user on the service. In later years, we would talk about how best to recommend videos on YouTube to encourage folks to stick around. But that’s for another article.
…or had taken over from the inestimable Nelson Minar; I’m fuzzy on the timeline.
When you look at mechanisms like differential privacy, you find that the better they are at providing privacy, the less personal the resulting data becomes. The mechanism deliberately destroys or homogenizes data in order to protect privacy, so it wouldn’t be useful for releasing the query stream.
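To make that tradeoff concrete, here’s a minimal sketch of the Laplace mechanism, the textbook way to release a single count under ε-differential privacy. (The function names and example numbers are mine, purely illustrative.) The noise scale is inversely proportional to ε, so the stronger the privacy guarantee — smaller ε — the noisier, and less useful, the released number:

```python
import math
import random

def laplace_sample(scale: float) -> float:
    """Draw from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    A counting query has sensitivity 1 (one person changes the count by
    at most 1), so the noise scale is sensitivity / epsilon.
    Smaller epsilon = stronger privacy = more noise = less useful answer.
    """
    return true_count + laplace_sample(sensitivity / epsilon)

# With epsilon = 1.0 the typical error on a count is about one user;
# with epsilon = 0.01 it balloons to about a hundred users.
noisy = private_count(1000, epsilon=1.0)
```

At ε = 1 you can still tell roughly how many people searched a term; at the ε values you’d want for genuinely personal queries, the answer is mostly noise — which is exactly the homogenization the footnote describes.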
That wasn't the only way AOL goofed. Elsewhere in the Google, I have it on reasonable authority, people looked at the released data and reverse-engineered reasonable guesses about AOL's costs, enabling Google to better compete on ad pricing against the House that Case built. Data is/are never one thing.