Thinking About BIPA and Machine Learning

One article that really caught my attention recently discussed the use of Creative Commons-licensed images from Flickr as part of the MegaFace data set for training facial recognition algorithms. Despite its aggressive (but not untrue) title, it highlights the many sides of the questions that we the people, and we the companies building products with these technologies, confront.

Focusing on the licensing, Flickr truly expanded the available commons of openly licensed images by allowing its community to choose Creative Commons (CC) licenses. Interestingly, the latest version of the most permissive CC license expressly does not license "publicity, privacy, and/or other similar personality rights", yet the licensor agrees not to assert such rights to the extent necessary to support the rest of the license. However, previous versions of this or other CC licenses probably apply to many photos in the data set, and not all of those licenses contain this language. For the Creative Commons licenses specifically, unlocking default copyright restrictions so users can create derivative works is the whole point; yet the organization acknowledges the difficulties when other rights overlap. A CC-BY-ML license, where the "ML" means you have permission to reuse the media for machine learning, has a nice ring to it, but it would probably be a stretch given the requirements of the Illinois Biometric Information Privacy Act (BIPA). The biometric issues aside, any other ML uses I can imagine seem covered by the existing licenses already. But who knows what the future might bring.

Since the statute clearly contemplates business and corporate transactions in its legislative findings, and it goes on to cover "A private entity in possession of biometric identifiers or biometric information...," I agree with the university official quoted in the article that they'd escape a BIPA claim for including the photos in the data set, though I'm sure there's been internal discussion about whether this qualifies as research on human subjects. Should the companies or other users of the data set be responsible to the subjects of the photos, then? The article mentions but doesn't dig into one definitional problem I find with the statute. BIPA places restrictions on various things a private entity can do with "biometric identifiers" or "biometric information." From the statute:
"Biometric identifier" means a retina or iris scan, fingerprint, voiceprint, or scan of hand or face geometry. Biometric identifiers do not include ... photographs ...
There are more examples listed in the statute, and some other exclusions. Notably, MRIs are excluded, although they've recently been used to identify people. The definitions continue:
"Biometric information" means any information, regardless of how it is captured, converted, stored, or shared, based on an individual's biometric identifier used to identify an individual. Biometric information does not include information derived from items or procedures excluded under the definition of biometric identifiers.
When I first read these, I immediately wondered why the definitions didn't come up recently in Patel v. Facebook (summary, case), which introduced a circuit split over whether a BIPA violation confers Article III standing to sue, a separate issue from the definitional one discussed here. In Patel, it's assumed that the face signatures created from analyzing uploaded images are a "scan" of "face geometry." Facebook uses these to match uploaded photos for its image tagging features. Note where the scan appears in the statute's definition: in a list of methods that take information directly from a live person. Photos are then excluded, and the definition of "biometric information" excludes information "derived" from the excluded items (the photos). Since Facebook's face signatures are derived from photos, rather than from something like a one-time scan from a live camera that directly produces the signature, the argument goes that they should be excluded by definition.

Wrong, according to various courts that have considered the same argument in the past. Here's one decision involving Shutterfly from a few years ago. Here lies another. It doesn't seem like the courts are going back to split hairs at this point. As a very privacy-conscious consumer of online services, I'm glad to see these laws given some teeth, considering how silently intrusive and ethically questionable many uses of these technologies have become. At the same time, under the expansive readings of "face geometry," I don't want to see the courts go too far and start putting a chill on all intermediate or stored ML outputs for fear of litigation. Maybe it's OK for now, as we've seen many abuses of biometrics, and one can't just reset them the way one resets a password.

Compliance with the law's Section 15 doesn't seem so onerous, but is it just one more area where only the larger players will remain aware and have the resources to comply? The NYT article notes some features in image-related apps that were disabled for Illinois users, presumably because of BIPA. MegaFace highlighted images, but it's easy to imagine the same happening with voice. Under the interpretations above, a data set of spectrograms produced from audio clips would likely be considered "voiceprints." The same logic could extend to any data set derived from recordings of people, and I wonder about secondary liability for using a pretrained model that you knew was trained on such a data set.
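To make the spectrogram point concrete, here is a minimal sketch of how a spectrogram is derived from an audio clip, using only the Python standard library. It is an illustration under stated assumptions, not how any particular data set or product works: the clip is a synthetic 1000 Hz tone standing in for a recording, and the names `synth_tone`, `spectrogram`, `frame_size`, and `hop` are hypothetical.

```python
import cmath
import math


def synth_tone(freq_hz, sample_rate, duration_s):
    """Generate a pure sine tone as a stand-in for a recorded audio clip."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame is a list of magnitudes per frequency bin."""
    # A Hann window tapers each frame to reduce spectral leakage at its edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1))
              for i in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        mags = []
        # Naive discrete Fourier transform over the non-negative frequency bins.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


SAMPLE_RATE = 8000   # assumed sample rate, in Hz
FRAME_SIZE = 64      # assumed samples per frame

clip = synth_tone(1000.0, SAMPLE_RATE, 0.05)      # 400 samples of "audio"
spec = spectrogram(clip, frame_size=FRAME_SIZE)   # the derived representation

# Find the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames;", "peak at", peak_hz, "Hz")
```

The output is just a grid of numbers computed from the clip, with the strongest bin sitting at the tone's frequency. Nothing here identifies a speaker on its own, which is why the breadth of the courts' reading matters: the question is whether such a derived, stored intermediate already counts as a "voiceprint."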

We still haven't figured out what to do about the Flickr images in the MegaFace data set. Are the privacy nihilists mentioned in the article right? Maybe one of the entities involved can set up a site where you upload an image of yourself, and if a model trained on the MegaFace data set recognizes you with very high confidence, you can request removal (of both images). Is the Web ready for a DMCA regime for privacy? A similar idea was actually tried with the ImageNet Roulette project. Now offline, it allowed you to upload your picture and see how it was categorized according to labels generated from the ImageNet data set's "person" categories. Read all about it here. MegaFace is still available for download, and I think that's the right outcome. The despicable uses of AI for surveillance or oppression seem an unfortunate consequence of the potential for evil use of any technology. One can imagine plenty of other uses falling squarely within the terms of the CC licenses that a licensor might not like but would be powerless to stop. I'd like to see clearer legislation and communication with users; at least BIPA gets part of the way there.

Update: Looks like The Glass Room San Francisco recently featured "a custom built facial recognition system to search for your face amongst the millions of images used for training facial recognition algorithms." Check it out here.
