5 Q’s for Kristian Lum, Lead Statistician at the Human Rights Data Analysis Group
The Center for Data Innovation spoke with Kristian Lum, lead statistician at the Human Rights Data Analysis Group (HRDAG), an international human rights nonprofit. Lum discussed the challenges of getting accurate data from conflict zones, as well as her concerns about predictive policing if law enforcement gets it wrong.
This interview has been lightly edited.
Joshua New: Can you discuss the value of accurate data when it comes to protecting human rights around the world?
Kristian Lum: When we talk about the accuracy of data, it is useful to break this down into at least three categories: lack or minimization of recording or recall errors, representativity, and completeness. The first category is what most people think of when they think of “accurate” data. In the context of data on the human rights violations we study (typically killings), this pertains to whether the narrative detailing the killing was factually correct, whether the name of the victim was spelled correctly, whether the person reporting the killing remembered the precise date and location of the death, and, of course, whether the person recording this information wrote it down without introducing any errors. Accurate data in this sense of the word is vitally important for memorializing specific victims and providing the types of moving narratives that journalists, activists, and lawyers use to make cases regarding specific incidents.
The second category—representativity—is a less common interpretation of this word but equally important. To ask whether the data is representative is to ask whether all victims of the conflict are equally likely to be recorded. For example, if there is a conflict that goes on for two years, if 90 percent of the victims were killed in the first year, then about 90 percent of the victims recorded on a representative dataset would have been killed in the first year. A dataset in which 50 percent of the victims were killed in the first year and 50 percent in the second year would be non-representative because people killed in the first year were less likely to be recorded than those killed in the second year.
Using non-representative datasets to make inferences about the dynamics of the conflict is problematic. In the simple example I’ve given, if you used the non-representative dataset to make inferences about the conflict you would conclude that there were as many deaths in the second year as in the first. Drawing conclusions on the basis of non-representative conflict data has potentially large ramifications, as the spatio-temporal pattern of killings is often used in retrospective policy analysis. For example, if some new policy were introduced at the beginning of the second year, on the basis of the non-representative dataset, we’d likely conclude that it was ineffective because the dataset we are using to measure the number of killings indicates no reduction in killings following its implementation. Thus, spatio-temporally representative knowledge of the number of killings is very important to constructing an accurate historical narrative of the conflict as well as facilitating future understanding of the effectiveness of various interventions.
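The two-year example can be worked through with a few lines of arithmetic. The following is a minimal sketch in which every number is hypothetical: the true death counts and the per-year recording probabilities are assumptions chosen only to reproduce the 90 percent versus 50 percent split described above.

```python
# Hypothetical true deaths: 90 percent of victims were killed in year 1.
true_deaths = {"year_1": 900, "year_2": 100}

# Assumed recording probabilities: year-1 killings are far less likely
# to be documented than year-2 killings, so the dataset is non-representative.
record_prob = {"year_1": 0.10, "year_2": 0.90}

# Expected number of recorded deaths per year.
recorded = {y: round(true_deaths[y] * record_prob[y]) for y in true_deaths}


def shares(counts):
    """Return each year's fraction of the total count."""
    total = sum(counts.values())
    return {year: n / total for year, n in counts.items()}


def relative_change(counts):
    """Relative change in deaths from year 1 to year 2."""
    return (counts["year_2"] - counts["year_1"]) / counts["year_1"]


true_share = shares(true_deaths)
recorded_share = shares(recorded)

true_change = relative_change(true_deaths)    # large drop after the policy
recorded_change = relative_change(recorded)   # no visible drop at all

print("true share in year 1:    ", true_share["year_1"])
print("recorded share in year 1:", recorded_share["year_1"])
print("true change year 1 -> 2: ", true_change)
print("recorded change:         ", recorded_change)
```

Under these assumed numbers, the recorded data shows no change between the two years even though true deaths fell sharply, which is how a policy introduced at the start of the second year could be wrongly judged ineffective.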
Representative data on killings is also important to establishing the demographic profile of the victims and whether specific groups were targeted for a genocide. Returning to the former example, if ethnicity “A” made up 90 percent of the victims but only 50 percent of those reported, on the basis of this data one might conclude that ethnicity “A” was not targeted when in fact they were. So, again, having “accurate” information in the form of representative data, or estimates thereof, is of the utmost importance to establishing the targeting of specific groups and holding those who target those groups accountable.
In the context in which we work, a dataset that is accurate in the sense of being complete would require the documenting group to have knowledge of every single killing in a timely manner. It’s worth noting that a complete dataset is necessarily representative because every victim has the same chance, 100 percent, of being recorded. In our experience, this is not possible. Perpetrators of human rights violations try to conceal the number of killings taking place, and the families of the victims may be reluctant to come forward because they fear retaliation. Although it’s not possible to obtain in one dataset, accuracy in this sense of the word is important for establishing the magnitude of deaths in the conflict.
New: For several years, HRDAG has been producing statistics about death tolls in Syria. Can you explain how you deal with the challenges of getting accurate data from a country in the midst of a chaotic civil war?
Lum: As discussed in the previous question, we never assume we have accurate data during a time of conflict. To support understanding of the conflict, we use several incomplete, potentially non-representative datasets to obtain estimates of the death toll and how it has varied over time and geography. This allows us to correct for the incompleteness and non-representativity in the data collected on the ground.
To do this, we use statistical models that are designed to combine such datasets to produce an estimate of the total number of deaths not documented on any of the lists. These models calculate the overlaps among the lists and, very, very roughly speaking, work on the intuition that if there are a lot of overlaps among the lists then the number of killings likely isn’t a whole lot larger than what has been observed across all of the lists, because each list has mostly found victims that the others also found. If the lists do not overlap very much, then we infer that there are likely many killings that have not been recorded because each list is finding whole new sets of victims. Of course, this process is much more mathematically rigorous and principled than I just described, but that’s the crux of it. If you’re interested in following this research thread, these methods are called “capture-recapture,” “multiple recapture,” or “multiple systems estimation.”
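The overlap intuition can be illustrated with the simplest capture-recapture case, the two-list Lincoln-Petersen estimator. This is only a sketch and not HRDAG's actual models, which the interview describes as more rigorous and which combine several lists. It assumes the two lists are independent, that records are matched across lists without error, and that every victim is equally likely to appear on a given list. The victim identifiers below are hypothetical.

```python
def lincoln_petersen(list_a, list_b):
    """Estimate total population size from two overlapping lists.

    Estimate = |A| * |B| / |A and B|. Large overlap pulls the estimate
    toward the observed count; small overlap pushes it far above.
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between lists; estimate is undefined")
    return len(a) * len(b) / overlap


def observed(list_a, list_b):
    """Number of distinct victims documented on at least one list."""
    return len(set(list_a) | set(list_b))


# Hypothetical victim identifiers. Each list holds 100 records.
high_a, high_b = range(0, 100), range(20, 120)   # 80 records in common
low_a, low_b = range(0, 100), range(90, 190)     # 10 records in common

high_estimate = lincoln_petersen(high_a, high_b)
low_estimate = lincoln_petersen(low_a, low_b)

print("high overlap: observed", observed(high_a, high_b),
      "estimated total", high_estimate)
print("low overlap:  observed", observed(low_a, low_b),
      "estimated total", low_estimate)
```

With heavy overlap the estimated total sits just above the number of victims observed, while with little overlap it is several times larger, matching the intuition that lists finding "whole new sets of victims" imply many undocumented killings.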
New: HRDAG is working with groups in Mexico to use machine learning to identify hidden mass graves. How do you go about this?
Lum: This is a relatively new project that my colleague, Patrick Ball, is getting off the ground, but so far it’s looking really interesting. He’s using data on demographic, economic, and infrastructure-related variables for several small regions in Mexico to train a machine learning classifier that predicts whether a mass grave has been found, or not found, in that region. For regions where an exhaustive search hasn’t been undertaken yet, these models can then be used to process the demographic, economic, and infrastructure variables for the unknown regions to predict whether a mass grave will be found there. These models may be used to expedite the search for more mass graves by helping to narrow down the regions where they are most likely to be found.
Interpreting these predictions is a little bit tricky. In the search for mass graves, in the dataset, absence of presence is not the presence of absence. That is to say, just because no grave has been found in a region does not mean that one does not exist. Because data on where people have located mass graves is used to train the models, predictions from these models pertain to where people are likely to find mass graves. In some cases, the models may fail to predict a mass grave in a region not because none exists, but because the mass graves in similar regions are more difficult to find or fewer resources have gone into searching those regions, and so the model learned that graves are unlikely to be found there.
New: HRDAG makes a lot of the data it uses publicly available, but warns against using raw data due to concerns about selection bias. Can you explain these concerns?
Lum: Unfortunately, even in the best of times, obtaining error-free data that is complete or at least representative is difficult. In times of conflict, it is even harder. Regarding the first interpretation of accuracy, we don’t assume or require that any individual report is completely error-free. Sometimes names are misspelled or the precise date of death is unknown. We handle these sorts of errors by matching individual reports across multiple datasets to find which information is most consistent in the reports of the same death. In order for us to draw accurate conclusions from data with some errors in reporting, a large portion of the data must not be fabricated. Small errors are manageable by leveraging multiple sources, as described.
In terms of representativity and completeness, we never assume that any dataset we are given is a representative sample. Obtaining a representative sample is difficult and does not just happen by chance. It requires a strong concerted effort. The mandate of the organizations collecting the data is not to collect a representative sample, but rather, to document the information they receive and remember the dead. Given this mandate and the difficult conditions in which all of the data collection groups work, they are all doing an excellent job at performing the task they set out to accomplish. Inaccuracy in any of the interpretations discussed isn’t evidence of a job poorly done but of a different goal. We build on their outstanding work to take incomplete, non-representative data and apply statistical methods to produce scientifically defensible estimates of the dynamics and magnitude of killings in the conflict.
New: HRDAG has cautioned against overreliance on predictive policing technologies to avoid reinforcing biases hidden in the data they are trained on. However, relying less on humans also creates the opportunity to remove human bias from the decision-making process. How should law enforcement agencies approach this technology to make sure they use it responsibly?
Lum: The point we have tried to make in our work on predictive policing is that relying on such technologies does not actually remove human bias from the decision-making process; it automates it.
Machine learning models in general learn patterns in the data they are given, regardless of whether that data is representative. Predictive policing, speaking broadly, is machine learning applied to police records. Similar to the above discussion of human rights groups who collect data, the police are not tasked with obtaining a perfectly representative sample of crimes. Representative samples are hard to obtain, and they do not happen by chance. Police cannot refuse to document a crime that has been reported to them simply because similar crimes reported by similar people are over-reported in proportion to other crimes in their jurisdiction. So, because there is variation in the rate of reporting crimes, their data will necessarily be non-representative.
It may also be non-representative for other reasons. Police may have allocated more resources to patrolling certain communities and police may have concentrated their efforts on people who conform to their preconceived notions of what a criminal looks like. I believe this is the human bias that your question refers to eliminating from the decision-making process. If either of these is true, this will appear in the police data as more records for crimes committed in the more heavily patrolled regions or by people who are more likely to be investigated. That is, the human bias we’d like to eliminate makes its way into and is encoded in the police data. Predictive policing algorithms built on this data will learn these biases and reproduce them in their predictions. Oversimplifying this quite a bit for the sake of brevity, if a minority community has been over-policed and thus has more records of crimes committed there than another less policed neighborhood with a similar level of criminality, predictive policing models will infer that the minority community is more criminal and suggest more police spend time there. In this way, the predictive policing model would be reinforcing the human bias that previous police practices encoded in the data rather than eliminating it.
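The reinforcement described above can be sketched as a toy simulation. This is a deliberately oversimplified illustration, not a model of any real predictive policing product: the neighborhood names, crime counts, and patrol shares are hypothetical, recorded crime is assumed to equal true crime multiplied by the local patrol share, and the "model" simply allocates patrols in proportion to cumulative records.

```python
def simulate(initial_patrol_share, true_crime, rounds=10):
    """Toy feedback loop between patrol allocation and crime records.

    Each round, crimes recorded in a neighborhood equal its true crime
    count times its current patrol share. The next round's patrol share
    is then set proportional to cumulative recorded crime.
    """
    share = dict(initial_patrol_share)
    records = {name: 0.0 for name in share}
    for _ in range(rounds):
        for name in share:
            records[name] += true_crime[name] * share[name]
        total = sum(records.values())
        share = {name: records[name] / total for name in records}
    return share, records


# Two hypothetical neighborhoods with identical true crime levels.
true_crime = {"A": 100, "B": 100}

# Historical bias: neighborhood A starts with 60 percent of patrols.
biased_start = {"A": 0.6, "B": 0.4}

final_share, final_records = simulate(biased_start, true_crime)

print("final patrol share:", final_share)
print("cumulative records:", final_records)
```

Under these assumptions the data never corrects the initial imbalance: neighborhood A keeps accumulating more records and keeps receiving more patrols, even though both neighborhoods have the same underlying level of crime.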
I should be clear that I don’t think predictive policing is a very good idea, at least not without some serious thought into the potential negative impacts an automated system that reinforces and in some cases amplifies historical police biases could cause. However, if it is inevitable that more police departments will adopt the technology, the question of how it can be used responsibly is a good one. Finding ways to incorporate other sources of data (maybe data from social workers or hospitals) may add a different perspective on where crime is taking place and who is committing it than models that use only police data. Another avenue that is being pursued is to train models only on data that is community-reported. This raises the question of whether we want policing to be a customer-response service, as predictions made on the basis of this type of data are predictions of where police response will be requested. These will not be general-purpose predictions of where crime is happening, as they will not predict crimes that occur but do not result in a call for the police. To be used responsibly, it will be very important for police to understand this distinction.