I don't have time to be detailed. I already know this is going to go long, and I can't make multiple passes. So, here we go. Got a couple big bases to cover, because I can't just write about the anomaly without getting into what it means and what it doesn't. I won't edit for form or structure, but spelling or grammar errors are fair game if spotted.
J/k, I've added things, for reasons that are listed. These things are the paragraphs in full italics.
Update: As this is getting some discussion again now that my reconstruction has proven true, I've realized I do my readers a disservice by merely suggesting "you too can do this if you're only more Bayesian!" which is not quite true. I've now written something to this effect and you might want to read it. It also serves as a brief introduction to Bayes' Theorem and the so-called ICP method of observing the world, aka intuitive conditional probability.
Basically, the other day, in my Twitter feed was an ad for Orphan Black. (It's a TV show.) Now, the degree to which this was anomalous is only really comprehensible to a certain type of Rationalist, but for those Rationalists, I will lay out a few things. For the rest who've stumbled here unwittingly, well…hopefully by the end you'll want to learn the tradecraft involved, I guess?
Nearly all ads I see on Twitter have an obvious and logical (albeit simple) origin, although I still ignore most of them, which means they've got work to do. Ads on Twitter are meant to be targeted, and I personally prefer this. As should you. Sorta.
This requires an aside, because most of my friends still don't quite have a grip on this, and I assume some will come here. And this is pretty important. (Not ads per se, but the line of reasoning involved.)
In a perfect world, ads and content discovery are synonymous. Mostly, anyway. Things that I don't know about, but would like if I did, are the things I want to be shown.
Adtech also wants this, ultimately. They want products in the eyeballs of exactly those who would want them, if they had the time to look, and no one else. Perfect product-market fit. Anything else is more expensive. More expensive is bad for them, and bad for us, because then shit costs more, and we as a species use a bunch of shit to build even fancier shit, and somewhere down the line, we're a spacefaring superspecies, and believe it or not, cheaper, faster iPads do contribute to this.
Pity the fool who uses his tricorder to play Angry Birds, but if you are not glad of him too, then you are a fool. The scores of simple, happy folk who buy devices yet never really use their potential allow the scale of manufacture that makes them cheap and ubiquitous for you, and me, and researchers who now have fucking tricorders.
So, most ads, I know where they came from. They're based on my social graph, and Twitter ad partners, which are many things. On devices you may get a unique "ad ID" that collects various things, and this info is then used to create more accurate ads. Which is good. These things are cryptographically pseudoanonymous, unless they're broken, anyway, and hopefully most people are getting the general idea of the caps & limits of this sort of pseudoanonymity. (Insert link to pretty much anything about DPR, or Carl Mark Force IV & friends, here.)
But anyway, using (and attempting to connect) pseudonyms is normally just an algorithmic tool. Really simple, simple AI. And adtech has learned to not show off much, because that freaks people out. (If you read no other links here, you should read that one. And if you are intrigued, and want the source of the Forbes piece, but don't think the NYT is worth the time and money to subscribe to—I certainly don't—the excerpt of Duhigg's book they ran is here.)
I am not one of those people, by the way, because I know that Target is not going to come to my door and arrest me for being a pregnant teen, even if that were illegal. But if Westboro Baptist took over the government and made that a felony, I'd be understandably unhappy about this. Because they will beg, borrow, steal, subvert, and do anything to get such data.
But people forget who to blame. The problem isn't solved by Target learning to be more subtle with its algorithms. An even worse idea would be to stop using them, which is what many people think they want. The former is like leeches for cancer. The latter dooms us to an extinction event, sooner or later.
The problem is that there exist people who are willing to use force, even deadly force, to uphold their idiotic "laws", however unimportant, irrational, or unethical those laws may be.
For many reasons—and you'll just have to trust me a bit or this will get really long—the probability that I would be served that Orphan Black ad is extremely small, given the data available to Twitter. Sure, it could happen, maybe. And normally I wouldn't notice— well, I'd notice, and it'd be severe enough that I'd choose "This ad is not relevant" to spank the algorithm a bit, but that'd be all. That happens. At this point, it's usually because of something that's paying for really heavy, broad promotion. Awareness campaigns, more than ad campaigns.
If you soon find yourself wishing to depart the bus, citing unwarranted conclusions based on far too little data, well... first, maybe finish it on faith, so you have the whole point in your mind in case it starts to make sense later; second, at the bottom, I will link to a thing that may or may not help, if it still does not.
So here's what's interesting. Until the day before this, I had no idea what Orphan Black even was. But I went to a meeting at an associate's house, which is also her office. It's in a normal, residential area, and a 45-minute drive from where I normally go, most days. I've been there before, several times over a year or so. This time, her previous meeting ran long, so I hung out in the living room.
Orphan Black was on, and I "watched" about 20 minutes of it. Well, I read shit on my phone and tweeted for 20 minutes while Orphan Black played in the background. BTW, it isn't good, and definitely isn't something I'd ever watch, and probably should stand out as anomalous against a dataset of the things I do watch.
So what is the posterior probability of getting this ad, one day after this event, coincidentally, vs. the probability given audio fingerprinting? Basically, we're asking: what are the odds of a coincidence of this magnitude? It's incredibly small, btw.
(Clarification: "Incredibly small" as in, given the circulation of the tweet — see below — and my knowledge of what my social graph and online profile "look like" datawise, what is the probability I received this PT without a novel/unexpected "mystery" data source to assist the graph, like audio fingerprinting or audio watermark detection.)
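To make the shape of that calculation concrete, here is a minimal sketch of the update in Python. Every prior and likelihood below is an invented, illustrative number, not a measurement; the point is the structure, not the digits.

```python
# Toy Bayes update: which hypothesis better explains being served the ad?
# All numbers below are invented for illustration, not measurements.

def posteriors(prior_a, like_a, prior_b, like_b):
    """Posterior P(A|obs), P(B|obs) for two exhaustive hypotheses."""
    evidence = prior_a * like_a + prior_b * like_b
    return prior_a * like_a / evidence, prior_b * like_b / evidence

# H1: pure coincidence (the graph/targeting served it by chance).
# H2: some audio-derived signal linked my device to the broadcast.
p_h1, p_h2 = 0.999, 0.001   # prior: assume audio snooping is rare
l_h1, l_h2 = 0.0001, 0.5    # P(this ad, the next day | hypothesis)

post_h1, post_h2 = posteriors(p_h1, l_h1, p_h2, l_h2)
print(f"P(coincidence | ad) = {post_h1:.2f}")   # ~0.17
print(f"P(audio route | ad) = {post_h2:.2f}")   # ~0.83
```

The takeaway isn't the exact output; it's that even a tiny prior can lose to a likelihood gap of a few orders of magnitude.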
I figured I'd hunt down the PT in question, to further qualify the posterior probabilities for the reader. So, here's the tweet. I wish I could see "Impressions" but Twitter Analytics for others' tweets isn't public. The RTs + Favs give an idea of its potential range, though; and, we might now compare it to other PTs, etc.
Thinking like an algorithm, these are all things — the location, my patterns, etc. — that make this high-quality data. To an algorithm, even a pretty sophisticated one that includes all that, the probability that I did not choose to watch this show, and choose it over many other shows, is very small. And it doesn't "look" like a waiting room; it's a residence, probably a friend or family member's, if you're looking at the data.
Now, the probability of this coincidence if my device does some unknown amount of "listening to my surroundings", however, is very high—tempered only by adtech's adaptive desire to not show their hand, and generally avoid looking creepy. The latter is best explained as an algorithmic failure—algos don't do much to account for g-factor, not directly. I'm part of a very small group of people who would notice this and then also arrive at this conclusion. Non-Bayesians are probably screaming that you can't even do this. (But you can.) So, absent a better explanation, we will discuss the viability of something involving microphone audio as the latent variable.
It might be anomalous vs. my social graph and my interests, but the data quality was high enough that, to an algorithm, I should like to be reminded of this new (to me) show, and that was enough. To normal folks there are plenty of other explanations, if they noticed at all.
The algorithmic probability that I chose to check out this show is even higher given this. (We mutually follow each other, and there are 10 mutual points of commonality in those we each follow, a couple of which we both frequently interact with.) The effect is to raise the probability of any outcome where I see an Orphan Black PT.
Moreover, if I tried to formalize some numbers here, p(seenBasedOnTwitterGraph) is higher, but the result, while less severe, is still significantly anomalous—any decent algo will be confident that I do not own a TV, very rarely watch any, and will know with near-certainty that when I do, it's online or downloaded, never "live" or "appointment" programming.
(Although it's unlikely the algorithm is set up to consider all of these details. It will make use of keywords like Orphan Black to strengthen the link between the above "fav" with my identity, but I highly doubt it would consider or know that the PT is for a season premiere and devalue it based on my live-TV preferences.)
But it also means something like p(watchedOnPurpose) could go from, say, low-to-medium confidence to something higher. Plenty high enough to be comfortable concluding yes, if I were the algo designer anyway.
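If you want to see how an algo designer might get there, here's one hedged sketch: accumulate evidence as likelihood ratios in odds form. Every ratio below is invented for illustration; a real system would learn these weights from data.

```python
# Sketch: several weak signals pushing p(watchedOnPurpose) upward.
# Each likelihood ratio below is an invented, illustrative number.

def update(prior_prob, likelihood_ratios):
    """Multiply prior odds by each evidence ratio; return a probability."""
    odds = prior_prob / (1.0 - prior_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

signals = [
    4.0,   # device stationary in a residence during the broadcast window
    3.0,   # repeated visits to that same location over a year
    2.5,   # mutual-follow overlap with an account tweeting about the show
]
p = update(0.05, signals)
print(f"{p:.2f}")   # a 5% prior ends up around 0.61
```

No single signal is decisive; stacked, they carry a skeptical prior past the comfortable-yes line.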
So it could be just Twitter itself—broadcast shows want you to Tweet and show hashtags, and for people who watch broadcast or cable TV at all, the device is probably still the "second screen", and Twitter is used while watching TV, not the other way around.
But it would really make more sense for it to be Google. I have 「"OK Google" everywhere」 turned on, on this standard Nexus 5 rocking stock Lollipop 5.1. And Google is really good at this stuff. In any case, they can know I watched a broadcast-origin TV show.
Because I was using Twitter at the time, and Twitter holds the microphone permission, and social apps like Twitter do seem to heat up the phone more than they should, I can't rule it out. On the other hand, it'd be a lot easier, on the tech side, if this were Google. Overall it's too close to make a good guess.
As for the tech, I'm curious whether the show's audio is watermarked; it probably is for anti-piracy reasons already. It'd make sense to make use of it for this too. Trying to actually listen and decipher stuff is maybe doable—but the watermark way would require much less computation.
If watermarked, it's really 50/50 on whether Google is involved. If it isn't watermarked, the meter swings in Google's direction a bit. At least 90/10; maybe more. Google has algos for Content ID on YouTube already in their toolkit. That'd be out of Twitter's league, and suggests the device actively listens and fingerprints.
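For flavor, here's roughly what "listen and fingerprint" reduces to at its crudest: pick the dominant frequency bin per window and compare the sequences. This toy (numpy only) is nowhere near Shazam-grade or Content ID, just the skeleton of the idea.

```python
import numpy as np

def toy_fingerprint(samples, window=1024):
    """Loudest frequency bin per window: a crude spectral fingerprint."""
    peaks = []
    for start in range(0, len(samples) - window + 1, window):
        spectrum = np.abs(np.fft.rfft(samples[start:start + window]))
        peaks.append(int(np.argmax(spectrum[1:])) + 1)  # skip the DC bin
    return peaks

rate = 8000
t = np.arange(2 * rate) / rate            # two seconds of fake audio
show_audio = np.sin(2 * np.pi * 440 * t)  # stand-in for the broadcast
other_audio = np.sin(2 * np.pi * 880 * t)

# Matching is just comparing mic-capture fingerprints against a library.
print(toy_fingerprint(show_audio) == toy_fingerprint(show_audio.copy()))  # True
print(toy_fingerprint(show_audio) == toy_fingerprint(other_audio))        # False
```

Production systems hash constellations of peaks to survive noise and time offsets, but the computation-per-second budget is why a pre-embedded watermark is so much cheaper than this.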
Probably not very closely all the time, for battery reasons; and if not all the time, then perhaps when apps with the mic permission are open. (Which would help explain why every damn app requests that permission even when it definitely doesn't need it, and definitely doesn't have the resources to be doing anything with audio data itself.)
Without knowing yes or no on the watermark, there's not much point in discussing more specific options than the above.
*Update: I have realized a more conventional route — i.e. one that does not necessarily require the microphone as a data source to correlate my device and advertiser ID with the show — can potentially satisfy the observed data, thanks to a question. I elaborated on this in that Twitter thread. The audio fingerprinting route seems less probable, in relative terms.*
No less probable in absolute terms, but it is no longer clearly the "most probable" proposed explanation that accounts for the observed world. It does not have the utility of better explaining phone heating and battery usage, or the ubiquitous microphone permissions in apps that have no user-facing mic functions. The former can potentially be explained by apps making inefficient use of the radio when sending in-app usage data and text-field logging. (Which I thought was pretty well-known, but this came up today.) The latter I had previously attributed to (that is, my best guess was) devs importing code libraries and thereby ending up required to list permissions for code their app doesn't actually use. (Which is unfortunate, because it makes ignoring the permissions list a norm by necessity. Every app asks for everything.) This remains no less valid. We could discuss how much we should prefer a single theory (audio fingerprints) vs. a patchwork of them, but in any case it's not enough that I'd be comfortable sticking to the single theory without discussing this alternative.
The end result being the same, and as the apps still could be doing this from a tech standpoint, how you feel about this information is up to you. Hopefully the article can at least raise debate around the capability and the implications. Either route is equally interesting to me, because of the implied extent to which different companies share and combine graphs and data, their accuracy, and how quickly. In the simplest version, viewership data from Comcast (or AT&T) and my presence — via background wi-fi signals and their strengths, looked up against Google's database during the Twitter session, which holds the required permissions; more accurate than GPS in wi-fi-dense areas, and it actually works indoors — are correlated in time and space. Not bad. Less clever than audio watermarks — you can't tell I was personally in front of the TV, but you could tell that I was using my phone and not moving around (signal variation and other sources). It'd also be clever if the light sensor is used at all. Done right, you could often be pretty certain a TV is being viewed in most lighting.
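That simplest version is basically a database join. Here's a sketch, with every field name and timestamp invented, of correlating a viewership log against a device-presence log on shared place and overlapping time:

```python
# Hypothetical join of a viewership log (cable/ISP side) against a
# device-presence log (wi-fi geolocation side). All names/times invented.

from datetime import datetime

def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open time intervals intersect."""
    return a_start < b_end and b_start < a_end

viewership = [  # (household, tuned-in window for the show's broadcast)
    ("household_X", datetime(2015, 4, 18, 21, 0), datetime(2015, 4, 18, 22, 0)),
]
presence = [    # (advertiser ID, location via wi-fi signals, window seen)
    ("ad_id_123", "household_X",
     datetime(2015, 4, 18, 21, 10), datetime(2015, 4, 18, 21, 35)),
]

matches = [(ad_id, hh)
           for hh, v_start, v_end in viewership
           for ad_id, loc, p_start, p_end in presence
           if loc == hh and overlaps(v_start, v_end, p_start, p_end)]
print(matches)  # -> [('ad_id_123', 'household_X')]
```

No microphone required: just two datasets, a shared notion of place, and overlapping clocks.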
Further update: As my original theory is now proven true, we might revisit the above. When I wrote the first update, I didn't bother to nitpick and discuss the security of her wifi (WPA2, which I've never auth'd to) and my device configuration (I have an automator which toggles my wifi off after a certain threshold of movement, like driving, or walking some distance, like at a mall, unless overridden for a specific MAC address). Mostly because this is super long already and that's not really smoking-gun tier at discounting the geolocation theory. But the effect is that I still remained a bit more certain of the audioprinting theory in my head than I was willing to support on paper. Also, if I were reviewing this again, I would now be factoring what we have learned about the frequency of drive-by persistent spyware / malware installations on Android (it's frequent), and the persistence of all the auto-restarting daemons every bloody app always installs "to shorten app startup speed". The effect of THIS is, I would guess Google is too clever for this, and Twitter's ads are too DUMB for this overall, so it's probably neither of them firsthand. Rather, some other app (or even spyware installing itself) using a daemon to listen, perhaps when device is stationary but with the user, combines this with my Android advertiser ID and my profile data (including Twitter user ID) to provide data to advertisers, rather than Twitter itself.
No point beyond, hey, FYI, this level-of-detail in correlations is happening today.
(And for the sake of the debate, FYI, if there is no audio fingerprinting, this is fairly trivial at this point, technically speaking; and therefore it will come sooner or later. Per legal contacts, the EULAs would in fact allow either version, as long as encryption and pseudoanonymization were used. And according to John McAfee, banking apps like Bank of America's access the mic and record realtime audio for up to 30 minutes after certain events, for anti-fraud reasons. That necessarily means full audio.)
It hardly matters whether it's the Twitter-only version, which is less likely and a medium deal, or a Google version, which is more likely and a bigger deal (because of their skill at analysis, and their reach). The difference between those two versions is only about two years of progress anyway, and in general, we don't operate under the assumption of either quite yet. And we must.
I don't blame Target, or Google, even when they're actually handing the data over; and to the extent that I do, they're not really any more culpable than the taxpayers themselves who fund, aid, and abet governmental human rights violations.
There's been a push for more "opportunistic encryption" — i.e. encrypt everything, whenever you can, just to make correlations and attacks harder or more costly, which is good — and the recent database hacks (that we know about) illustrate the low level of security we're dealing with here. From Target to Chick-fil-A, to Slack, etc... to actual fiber-optic cable splitting by the NSA and friends, you must assume that any information collected and stored, often even when it's supposedly encrypted, can be had with some effort by a determined attacker. With perhaps some rare exceptions of Google's caliber, where server-client encryption is involved, the NSA has it already, through one or more exploits or configuration mistakes of varying stupidity.
It would be naive to assume that, even without an implant—because all bets are off then—your surroundings, including audio, are not analysed to a significant extent. How often, when, by whom, and to what extent are more realistic questions than if. Within just the category describable as "metadata", that could include your proximity to other devices (correlated soundscapes), therefore people, and a whole bunch of other really cool — or really scary, in the "wrong" hands — stuff that the data analysis would unearth as more graphs and pseudonyms are correlated in space and time, etc.
If you saw the seventh Fast & Furious film, the "god's eye" device — well, while certainly not chip-sized, with compromised devices, that "god's eye" shit could be done today, although it'd require nation-state resources and a ton of horsepower. With non-compromised devices, still almost all of that can be done. In either case, doing it realtime on just a couple targets with any reliability and accuracy is still a few years out.
And it's only a movie, but the concept is less incredible than you might think. It was incredible 16 years ago, in 1999 (Enemy of the State). AI wasn't involved, but bulk surveillance was laughable tinfoil-hat stuff.
It was incredible in 2008 (Eagle Eye), for the AI, and because the cars (and construction equipment) of 2008 weren't really hackable yet, because the systems didn't have that kind of electronics.
But in 2015, a lot of cars really are hackable. Most construction gear probably too. Maybe it's just the exponential-improvement curve, but Furious 7's "god's eye" device feels closer than 16 years away.
We're not talking actual tracking, like in Eagle Eye, but probabilistic tracking, like in F7. Realtime Markov-chain analysis of a shit-ton of data sources, to calculate what most probably happened. Consider the scenes where the device reports successful kills and then has to re-examine how it was fooled. This is the sort of tracking that will be possible for nation-states in maybe 3 to 5 years. Again, for high-profile targets, and "it" will be in vast datacenters. If there's ever a chip like in F7, it would be an interface, not doing the analysis. That miniaturization would require huge advancements in quantum compute and more, and by that point we'd better have long solved the privacy paradox.
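To make "calculate what most probably happened" concrete: the textbook version is a hidden Markov model forward pass over noisy reports. Toy numbers throughout; a real system would have millions of states and learned parameters rather than two neighborhoods and guessed probabilities.

```python
# Toy probabilistic tracking: infer a target's likely location from noisy
# sensor reports via the HMM forward algorithm. All numbers invented.

def forward(states, trans, emit, observations, prior):
    """Return P(current state | observations so far), normalized."""
    belief = dict(prior)
    for obs in observations:
        new = {}
        for s in states:
            # predict: where could the target have moved from?
            predicted = sum(belief[p] * trans[p][s] for p in states)
            # update: weight by how well state s explains the report
            new[s] = predicted * emit[s][obs]
        total = sum(new.values())
        belief = {s: v / total for s, v in new.items()}
    return belief

states = ["downtown", "suburb"]
trans = {"downtown": {"downtown": 0.8, "suburb": 0.2},
         "suburb":   {"downtown": 0.3, "suburb": 0.7}}
emit = {"downtown": {"cam_hit": 0.6, "no_hit": 0.4},
        "suburb":   {"cam_hit": 0.1, "no_hit": 0.9}}

belief = forward(states, trans, emit,
                 ["cam_hit", "cam_hit", "no_hit"],
                 {"downtown": 0.5, "suburb": 0.5})
print(belief)  # two camera hits then a miss still favor downtown
```

The "re-examine how it was fooled" scenes are just this loop: a report arrives that the current belief explains badly, and the posterior mass shifts.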
But either way, this sort of thing is important to consider if you are a subversive and a "Westboro" is in power. Because another 3 to 5 years after that, and it'll be available to be used against whole populations, or close. For comparison, 16 years ago, the tech that is now ubiquitous, mainstream, and abused like crazy — IMSI-catchers like the Stingray — barely existed; it required a vanful of equipment and was experimental. Exponential progress and all.
As for even right now, if they aren't looking for keywords yet, it's not a huge stretch for that to happen. The tech exists. Only privacy paranoia prevents it — and some of it well-founded, because idiots still pay taxes en masse. But Tech wants to do this, and we want them to, in order to get smarter at understanding your voice in various environments, so that Siri etc is more accurate, and more useful.
Speaking of cars, my father is a car guy. He recently installed a new transmission on a project car, and a bolt slipped or whatever. He found that while his static strength was enough to brace it, he couldn't manipulate it, and static strength doesn't hold out forever. Basically, this muscle situation, but where the weight can't be made to slide off, and with no spotter. He managed to somehow hold down the Home button on his iPhone with his elbow, summon Siri, and call my brother, who was upstairs, and came downstairs to rescue him. We then drank beer.
My father loves his iPhone, and he resisted it like crazy before we just got him one once, and he's been attached to it ever since. He texted while driving before, and still does now, only now much safer, because it's dictation. And he doesn't check the results. I can always spot a Sirified message.
What we want, beyond, like, a JARVIS that knows you're in trouble already, is to be able to grunt out "Siri… ambulance" and have Siri be able to tell that you did in fact say exactly that (and not "Siri, sandwich" or something), and that now is not the time to fuck around and ask questions, and be right about it.
Fuck, like, if not for HIPAA, big data could cure a ton of shit, or at least get researchers a lot closer. A study of 200 participants isn't ever going to cure anything interesting. But a few hundred million participants whose histories can be analysed for shit that even House would miss? You don't want that?
The problem isn't Big Data, and it isn't privacy paranoia exactly. But the paranoia is not properly identifying the source of the cancer. The unholy, incestuous, polyamorous fuckfest between Big Pharma, Insurance megacorps, lobbyists, and the different awful branches of government, where all of them contribute to the problem here, is the problem.
At the heart of it all is the mythical social construct, and the Hobbesian Leviathan 1.0. That worked pre-internet, but now it's got to die. We're building a new, more perfect Leviathan, and it's quite the opposite of the one described by Hobbes, amusingly.
Our Leviathan, Leviathan 2.0, is a voluntaryist, hivemind-fueled, AI-augmented Leviathan.
Well, there's a loooottt of power behind the old Leviathan. This is gonna be, at least metaphorically, a pretty epic struggle as power changes hands. Ideally, most of this power will change hands in the form of social and monetary capital, but when pressed, we should not be surprised to see the old Leviathan lash out with physical violence.
We must try to prevent or subvert or mitigate the more violent possible timelines. The longer people misplace their privacy paranoia, the harder the transition will be, because the "god's eye" is coming either way, and the more power still held by the old Leviathan, the more violent this will be.
In the meantime, if you're more than an NPC but less than a Protagonist, the best advice is to be more random. Describing exactly what that means would be another post, and this one is far too long already. And no promises; I'm usually too busy to write, so you're better off forcing my hand and asking me directly.
And, if all this "probability" and "possible timeline" shit makes sense, and you think in those terms as well, even if informally (as you should; trying to put numbers to such probabilities, except in the form of Fermi estimates, is a Bad Idea), hit me up, especially if you're young.
Chaos Army is recruiting.
For anyone 20ish or older, especially a college graduate, and especially if that was at a Western university... if you made it this far, bravo, because after all, none of this proves or seems to suggest a bloody thing, at least not according to how Science is supposed to be done, right?
The best I can do is suggest you go here and really get deep into it. And most of that is about doing this where you have real, measured priors. Doing it with estimated Fermi numbers is even more tricky. But I don't have time to do anything more than send you there, for now.
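And since Fermi numbers came up: here's what the crudest Fermi pass at the original question might look like, with every factor an order-of-magnitude guess of mine (none measured). The value isn't the answer; it's being forced to state which factor carries the conclusion.

```python
# Fermi-style estimate of the ad coincidence. Every number is a guess.

ads_seen_per_day = 30          # promoted tweets scrolled past daily
p_pt_is_tv_show = 0.1          # fraction of PTs that are for TV shows
n_shows_promoted = 100         # shows running big campaigns at once
p_offprofile_served = 0.03     # targeting serves an off-profile ad anyway

p_chance_per_day = (ads_seen_per_day * p_pt_is_tv_show
                    / n_shows_promoted * p_offprofile_served)
print(f"~1 in {1 / p_chance_per_day:.0f} per day by pure coincidence")
```

Argue with any of the four numbers; that argument is the exercise.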