Open Police Data Re-identification Risks

May 17, 2016

Last week I spoke at a White House event, “Opportunities & Challenges: Open Police Data and Ensuring the Safety and Security of Victims of Intimate Partner Violence and Sexual Assault.” This event brought together representatives from government agencies, police departments, and advocacy groups to discuss the potential safety and privacy impact of open police data initiatives.

The White House launched the Police Data Initiative last year, encouraging police departments to make data sets available to the public in electronic formats that can be downloaded, searched, and analyzed. The initiative encourages police departments to release data on use of force, pedestrian and vehicle stops, officer-involved shootings, and more to build community trust and strengthen accountability. Last week the Administration announced that 53 jurisdictions have committed to the Police Data Initiative and over 90 data sets have already been released.

Data visualization map of continental U.S. police departments participating in the White House Police Data Initiative.

Open police data initiatives are enabling increased transparency and citizen oversight. However, when records are readily accessible and easily searchable, there may be some undesirable consequences. Of particular concern is the possibility that people who access open police data may be able to identify crime victims or reveal their locations. For victims of domestic violence and sexual assault, this could put their safety and security at risk.

At the White House event, I spoke on a panel with Simson Garfinkel, who recently authored a NIST report on the de-identification of personal information (if you want to learn more about this topic, this report is a great starting point). I discussed the risk to crime victims from the release of police data sets and described some of the ways that victims may be re-identified, even if data about them has been de-identified. I encouraged the Police Data Initiative team to work with experts in privacy and statistics to better understand the risk and to develop guidelines that police departments can use as they decide what data to release publicly and what steps they should take to de-identify data.

The Data Identifiability Spectrum from the October 2015 National Institute of Standards and Technology Internal Report [NISTIR] 8053. As shown in this figure, all data exist on an identifiability spectrum. At one end (the left) are data that are not related to individuals (for example, historical weather records) and therefore pose no privacy risk. At the other end (the right) are data that are linked directly to specific individuals. Between these two endpoints are data that can be linked with effort, that can only be linked to groups of people, and that are based on individuals but cannot be linked back. In general, de-identification approaches are designed to push data to the left while retaining some desired utility, lowering the risk of distributing de-identified data to a broader population or the general public.

In response to the Administration’s initiative, a number of jurisdictions are already making police records available to the public online. I took a look at some of these public data sets last week. In one city I looked at, these records include complainant names, addresses, and ages. While some records, including those related to sexual assaults, have been removed, the remaining records appear to contain fully identified victim information. In another city names are removed, but detailed location information is still included. And another city removes the victim’s name and house number, but retains the name of their street. Looking at open police data archives available today, it is often unclear what de-identification process is being used or how rigorously it is being applied.

People often assume that if names and obvious identifiers such as street addresses and Social Security numbers are removed from records, those records cannot be re-identified. However, researchers have shown repeatedly that such supposedly de-identified records may, in fact, be re-identified.

One of the simplest ways to re-identify records is to look for information that was inadvertently left behind when they were de-identified. Social Security numbers and other identifiers are sometimes inadvertently left in court documents that are released to the public. When records contain narrative text, it may be especially easy to overlook information that might identify someone. Narratives may include descriptions of people or places that may be readily identified, even without explicitly including their names or postal addresses. For example, knowing that a location is “across the street from the elementary school” may identify the location in the context of a particular report, even without providing an address. Completely scrubbing narratives of identifiable information frequently requires someone to read the narrative. Entirely automated techniques are likely to miss some information that allows humans to identify people and locations in context.
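To make the limits of automated scrubbing concrete, here is a minimal Python sketch. The two patterns, the `scrub` function, and the sample narrative are all hypothetical and invented for illustration; they are not drawn from any police department's actual process. The sketch only shows that pattern matching can catch well-formatted identifiers while leaving contextual descriptions untouched.

```python
import re

# Hypothetical pattern: Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Hypothetical pattern: simple street addresses such as "742 Mulberry Street".
ADDRESS_PATTERN = re.compile(r"\b\d+\s+[A-Z][a-z]+\s+(?:Street|Avenue|Road)\b")


def scrub(narrative: str) -> str:
    """Replace pattern-matched identifiers with placeholders."""
    narrative = SSN_PATTERN.sub("[SSN]", narrative)
    return ADDRESS_PATTERN.sub("[ADDRESS]", narrative)


# Fabricated narrative text, for illustration only.
report = (
    "Complainant (SSN 123-45-6789) of 742 Mulberry Street stated the "
    "incident occurred across the street from the elementary school."
)

scrubbed = scrub(report)
print(scrubbed)
```

In this sketch the number and the street address are replaced, but the phrase about the elementary school passes through unchanged, because no simple pattern describes it. That is the kind of contextual clue a human reader would still need to catch.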

Geographic information is commonly used to re-identify people. Many uses of public data depend on geographic information, so its granularity is often reduced to protect privacy rather than removed altogether. This creates a dilemma: the more specific the geographic information, the more useful the data, but also the more identifiable.

Knowing only the state associated with a police report allows for the compilation of state-level crime statistics, but does not provide information about crime rates or policing patterns in specific communities.

Providing zip codes makes the information more useful, but it is more likely to be identifiable. While some zip codes are fairly heterogeneous, others are not. The zip code of a college town might include an unusually large number of residents aged 18 to 24. A police report might mention a 20-year-old female living in the zip code without much risk of identifying that individual. However, a police report that mentioned an 80-year-old female might inadvertently identify that individual, since there may be only a very small number of 80-year-old females living in that zip code. Likewise, a young person who lives next door to a retirement community may be one of only a small number of people in their age bracket in that zip code. In a zip code where most people are of the same race, those of a different race may also be readily identified in records that mention race.

At the same time, zip code may not be granular enough to identify community trends. It may be important to know on what block an incident occurred, not just the zip code. In a large city, a block containing high-rise apartment buildings may include hundreds of residents. However, in the suburbs or rural areas, only a handful of people may live on some blocks. Thus when we report that a crime occurred on the 700 block of a particular street, we may narrow down the address of the victim considerably. If other characteristics of the victim are revealed, such as gender, age, or even approximate age, the victim may be uniquely identified. How many 37-year-old women live on the 700 block of Mulberry Street?
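One rough way to ask "how many people share these characteristics?" is to count residents in each combination of area, age bracket, and gender and flag the combinations that are rare. The Python sketch below does this over a tiny, fabricated resident list. The zip code `99999`, the ten-year age brackets, and the threshold `k` are all assumptions chosen for illustration, and a real assessment would need actual population counts rather than a toy list.

```python
from collections import Counter

# Fabricated residents of one hypothetical zip code: (zip_code, age, gender).
residents = [
    ("99999", 20, "F"), ("99999", 20, "F"), ("99999", 21, "F"),
    ("99999", 20, "F"), ("99999", 80, "F"), ("99999", 19, "M"),
    ("99999", 19, "M"),
]


def age_bracket(age: int, width: int = 10) -> str:
    """Coarsen an exact age into a bracket such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def small_groups(rows, k):
    """Return the attribute combinations shared by fewer than k residents."""
    counts = Counter((z, age_bracket(a), g) for z, a, g in rows)
    return {key: n for key, n in counts.items() if n < k}


# Combinations matched by fewer than 2 residents point to a single person.
risky = small_groups(residents, k=2)
print(risky)
```

With this toy data, the women in their twenties blend into a group of four, while the one woman in the 80-89 bracket stands alone, so a report mentioning her age and zip code would point to her. Choosing `k` and the bracket width is a policy judgment, not something this sketch can settle.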

Another approach to re-identifying data is to use auxiliary databases. If common information exists in different databases, people may be matched across databases. Is there enough information in the de-identified database to find the corresponding record in a database of registered voters or real estate owners? Could de-identified information be matched to public profiles on social media?
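The matching described above can be sketched in a few lines of Python. Everything here is fabricated for illustration: the de-identified records, the stand-in "voter list," the placeholder names, and the choice of zip code, birth year, and gender as the shared fields. The sketch indexes the auxiliary database by those shared fields and then looks up each de-identified record.

```python
# Fabricated de-identified records: names removed, other fields retained.
deidentified = [
    {"record_id": 1, "zip": "99999", "birth_year": 1979, "gender": "F"},
    {"record_id": 2, "zip": "99999", "birth_year": 1995, "gender": "M"},
]

# Fabricated auxiliary database, standing in for something like a voter list.
voters = [
    {"name": "Resident A", "zip": "99999", "birth_year": 1979, "gender": "F"},
    {"name": "Resident B", "zip": "99999", "birth_year": 1995, "gender": "M"},
    {"name": "Resident C", "zip": "99999", "birth_year": 1995, "gender": "M"},
]

# Fields assumed to appear in both databases.
KEYS = ("zip", "birth_year", "gender")


def link(records, auxiliary, keys=KEYS):
    """Map each record_id to the auxiliary names sharing its key fields."""
    index = {}
    for person in auxiliary:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])
    return {r["record_id"]: index.get(tuple(r[k] for k in keys), []) for r in records}


matches = link(deidentified, voters)
# A record with exactly one candidate is effectively re-identified.
unique = {rid: names[0] for rid, names in matches.items() if len(names) == 1}
print(unique)
```

In this toy example the first record matches exactly one person in the auxiliary list, while the second matches two and so remains ambiguous. The more fields the two databases share, the more records tend to fall into the first category.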

When de-identified records are released publicly, it is important to consider what other databases might be used in conjunction with the de-identified records. However, it may be difficult to anticipate what new databases may be made available in the future by both government entities and private organizations. Data that appears de-identified today may be re-identified at some future date when new databases become available.

The information used to re-identify records may also come from sources other than databases. For example, in 2014 a data analyst was able to match published paparazzi photos of celebrities getting in and out of taxis with trip data from New York City taxis and determine the start and end points of these celebrities’ trips. However, only 11 celebrity trips out of 173 million trips in the dataset were re-identified in this manner. Perhaps more troubling, the analyst was also able to determine residential addresses corresponding to the end points of trips that began in front of adult entertainment clubs, thus identifying patrons of those clubs.

Harvard Professor (and former FTC Chief Technologist) Latanya Sweeney demonstrated that she could match published news reports about accidents with de-identified patient-level health data released by the State of Washington to re-identify these medical records.

Sometimes when records are de-identified, names are replaced with a pseudonym, generally a random-looking number. This allows multiple records related to the same person to be tied together, without identifying that person. However, a series of records about the same individual makes it easier to re-identify that individual. For example, if records contain location information and the same location appears repeatedly in multiple records, that location is likely the person’s home or workplace.
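A short Python sketch shows how little effort this inference takes once records share a pseudonym. The pseudonyms and locations below are fabricated, and the rule "the most frequent location is probably home or work" is a simplifying assumption for illustration, not a validated method.

```python
from collections import Counter, defaultdict

# Fabricated pseudonymized records: (pseudonym, reported location).
records = [
    ("7f3a", "700 block of Mulberry Street"),
    ("7f3a", "700 block of Mulberry Street"),
    ("7f3a", "100 block of Main Street"),
    ("7f3a", "700 block of Mulberry Street"),
    ("c21e", "300 block of Oak Avenue"),
]


def most_frequent_location(rows):
    """For each pseudonym, return its most common location and the count."""
    by_pseudonym = defaultdict(Counter)
    for pseudonym, location in rows:
        by_pseudonym[pseudonym][location] += 1
    return {p: counts.most_common(1)[0] for p, counts in by_pseudonym.items()}


likely_anchor = most_frequent_location(records)
print(likely_anchor)
```

Here the pseudonym `7f3a` is tied to the same block three times out of four, which narrows a search considerably even though no name appears anywhere in the data.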

In addition, the mapping from names to numbers can often be reversed. Generally it relies on a computer algorithm called a hash function. The idea is that you take the name or other identifier, such as a phone number or Social Security number (or in the case of the New York City taxi data, medallion and license number), and run it through the algorithm to produce a number. Every time you put in the same input, you will get the same result. But if you have only the output, you cannot directly compute what input produced it. Unfortunately, this is not foolproof. Someone who wants to re-identify people based on these hashes may simply start with a long list of names or other IDs, run them all through the hash function, and generate a table that maps each ID to its pseudonym. Then all they have to do is look up the pseudonyms of interest in the table to re-identify them.
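The lookup-table attack described above can be sketched in Python with the standard `hashlib` module. The choice of SHA-256, the `pseudonymize` function, and the four-character ID format (a digit, a letter, and two digits) are assumptions made for illustration; they are not a description of how any particular data set was actually processed.

```python
import hashlib
import string


def pseudonymize(identifier: str) -> str:
    """Hash an identifier with no secret key or salt (the weak approach)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


# Pseudonyms as they might appear in a released data set (hypothetical IDs).
released = [pseudonymize("5X55"), pseudonymize("9Y99")]

# The attacker enumerates every ID in the assumed format: digit, letter, 2 digits.
candidates = [
    f"{digit}{letter}{number:02d}"
    for digit in string.digits
    for letter in string.ascii_uppercase
    for number in range(100)
]

# Build the table once, keyed by hash so each pseudonym can be looked up directly.
lookup = {pseudonymize(candidate): candidate for candidate in candidates}

recovered = [lookup.get(h) for h in released]
print(len(candidates), recovered)
```

The attack works here because the space of possible inputs is small: only 26,000 candidates need to be hashed, which takes a fraction of a second. The same approach applies to any identifier drawn from a list short enough to enumerate, such as phone numbers or Social Security numbers.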

From this brief introduction to re-identification, hopefully it is clear that de-identification is tricky business. One last point: once data is released to the public, it may be impossible to take it back. Once data is downloaded, even once, the downloader may re-distribute it to others and there may be little or no legal recourse. Thus it is critical that we think through data re-identification issues before releasing data to the public.

The author’s views are his or her own, and do not necessarily represent the views of the Commission or any Commissioner.

This article was originally posted on the Tech@FTC blog by Lorrie Cranor, Chief Technologist at the Federal Trade Commission (FTC).