Statistical test of 2118 fatalities involving law enforcement

After the events in Ferguson, MO last year, police-related fatalities have become a major talking point in the United States. One of the most common claims made during this discussion is that black people are more likely to be killed in police altercations. This claim is almost always backed up by listing the many cases that have been seen in the news recently.

While these lists might hint that there is a racial bias in these fatalities, they are in no way proper statistical evidence. This is the equivalent of accusing someone of cheating at a game of dice and listing 5 or 6 of their recent wins as evidence. It is possible that they are a completely honest player on a hot streak. For proper evidence, we need to complete a hypothesis test.

What is a hypothesis test?

A hypothesis test is like a courtroom for numbers. All claims are assumed false until proven true beyond reasonable doubt.

For example, we can test whether a die is loaded by rolling it 600 times. Assuming the die is fair, we would expect to roll each side 100 times. After completing 600 rolls, we get this distribution.

Clearly this die is rolling more 6s and fewer 1s than it should be, but that doesn’t instantly mean that it is loaded. We complete a statistical test to answer the question: “What is the probability a fair die would produce a distribution as divergent or more divergent from the perfect 100-100-100-100-100-100 distribution than this one?” I won’t bore you with the math; the probability in this case is .07. This means that a fair die has a 7% chance of producing a distribution at least as divergent as this one. 7% is a large enough chance that we have reasonable doubt that the die is loaded, so it is not strong enough evidence to draw any conclusions. Therefore, since a claim is false until proven true, we treat this die as not loaded.
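For anyone who does want a peek at the math, here is a minimal sketch of one way to answer the question above: simulate a fair die many times and count how often it looks at least as lopsided as the die we observed. The counts below are hypothetical (they are not the counts from my 600 rolls), and the function names, the number of trials, and the seed are all my own choices for illustration. With these made-up counts the estimate should land somewhere near the 7% figure, but treat that as approximate since it comes from a simulation.

```python
import random


def divergence(observed, expected):
    """Sum of squared gaps between observed and expected counts, scaled by expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulated_p_value(observed, trials=5000, seed=0):
    """Estimate how often a fair die looks at least as lopsided as `observed`."""
    rng = random.Random(seed)
    rolls = sum(observed)
    sides = len(observed)
    expected = [rolls / sides] * sides
    observed_score = divergence(observed, expected)

    at_least_as_extreme = 0
    for _ in range(trials):
        # Roll a fair die `rolls` times and tally each face.
        counts = [0] * sides
        for face in rng.choices(range(sides), k=rolls):
            counts[face] += 1
        # Small tolerance so floating-point ties still count as "as divergent".
        if divergence(counts, expected) >= observed_score - 1e-9:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials


# Hypothetical counts for faces 1 through 6 (NOT the counts from the original chart).
hypothetical_counts = [78, 95, 100, 100, 105, 122]
fair_counts = [100] * 6

score = divergence(hypothetical_counts, fair_counts)
p_value = simulated_p_value(hypothetical_counts)
print(f"divergence score: {score:.2f}")
print(f"estimated p-value: {p_value:.3f}")
```

The divergence score here is the usual chi-square statistic. A simulation is only practical for probabilities that are not too tiny; a very small probability needs the analytic chi-square distribution instead.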

Testing police-related fatality data with respect to race

The data for this test comes from a crowd-sourced database of fatalities involving law enforcement. This data lets us create a scenario very similar to the dice test we just completed. To find the observed data values, we simply count how many people of each race are in the database. The expected values require a little more thought.

We could have just taken the national U.S. racial distribution and extrapolated from there, but that would not have been accurate because these incidents are not evenly distributed around the U.S. Instead, we can take the zip code that each death occurred in and add the local racial distribution to the total. For example, Ferguson, MO has a 36% White, 61% Black, 1.6% Latino/Hispanic, .6% Asian and .4% Native American population, so for each entry in Ferguson we would add .36 to White expected, .61 to Black expected, .016 to Latino/Hispanic expected, and so on.
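The bookkeeping described above can be sketched in a few lines. This is an illustration rather than the code I actually ran: the Ferguson shares come from the paragraph above, while the "ferguson_mo" key, the function name, and the two-entry input list are made up to keep the example small. A real run would look up each death's zip code in a population table.

```python
from collections import defaultdict

# Local racial distribution by area. The Ferguson shares are the ones quoted
# in the text; the "ferguson_mo" key is a made-up stand-in for a zip code lookup.
local_distribution = {
    "ferguson_mo": {
        "White": 0.36,
        "Black": 0.61,
        "Latino/Hispanic": 0.016,
        "Asian": 0.006,
        "Native American": 0.004,
    },
}


def expected_counts(death_locations, distribution_by_area):
    """Add each death's local racial distribution to the running expected totals."""
    totals = defaultdict(float)
    for area in death_locations:
        for race, share in distribution_by_area[area].items():
            totals[race] += share
    return dict(totals)


# Two hypothetical entries in Ferguson: each adds .36 to White, .61 to Black, and so on.
expected = expected_counts(["ferguson_mo", "ferguson_mo"], local_distribution)
for race, count in expected.items():
    print(f"{race}: {count:.3f}")
```

Each death contributes a total of about 1 to the expected counts, spread across races in proportion to the local population, so the expected and observed totals stay comparable.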

This is the result of that process.


There seems to be a bias toward black individuals in the data, but we must still complete a proper test to be sure. The answer to the question: “What is the probability that we would see a distribution as or more extreme than this one assuming that race plays no factor in police-related deaths?” is on the order of 10^-82. For reference, the chance of a person getting struck by lightning 13 times this year is on the order of 10^-79. This is very strong statistical evidence that there is a systematic difference in the race of individuals killed by police actions.


Of course, there is the obligatory “correlation does not mean causation” reminder. This evidence tells us that a discrepancy exists, but it says nothing about why the discrepancy exists. Also, I am putting faith in the legitimacy of the crowd-sourced data. If this data is biased in any way, then any test done on it will also be inaccurate.


How Good are we at Giving?

The season of giving is here, so it is a fitting time to analyze the way Americans donate.

One of the biggest viral events of 2014 was the Ice Bucket Challenge, which drastically increased donations to the ALS Association. While donations to any cause are a wonderful thing, it is debatable whether ALS was the best cause to receive this influx of money. According to the ALS Association, only 5,600 Americans are diagnosed with ALS per year. This number is dwarfed by the estimated 224,210 Americans who were diagnosed with lung cancer this year.

This discrepancy interested me, so I took information from 9 of the biggest health-related charities and compared not just donations, but donations relative to the number of Americans their disease affects.
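To make "donations relative to the number of people affected" concrete, here is a small sketch. The two diagnosis counts are the ones quoted above. The donation amount is a placeholder, not a real figure: I give both causes the same hypothetical $1,000,000 so that the only thing driving the difference is how many people are diagnosed. Using yearly diagnoses as the measure of "people affected" is also an assumption made for this example.

```python
def dollars_per_diagnosis(donations, diagnoses_per_year):
    """Donations divided by the number of people diagnosed per year."""
    return donations / diagnoses_per_year


# Diagnosis counts quoted in the text above.
diagnoses = {"ALS": 5_600, "Lung cancer": 224_210}

# Placeholder amount, NOT a real donation total: the same for both causes.
PLACEHOLDER_DONATIONS = 1_000_000

per_diagnosis = {
    disease: dollars_per_diagnosis(PLACEHOLDER_DONATIONS, count)
    for disease, count in diagnoses.items()
}

for disease, amount in per_diagnosis.items():
    print(f"{disease}: ${amount:,.2f} per diagnosis")
```

With equal donations, each ALS diagnosis is backed by about 40 times as much money as each lung cancer diagnosis, simply because lung cancer is about 40 times as common.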

[Charts: total donations, and donations relative to the number of Americans affected, for 9 health-related charities]

It is pretty clear that we could be doing better. I think that the disconnect you see is due to marketing campaigns such as the Ice Bucket Challenge, MLB using pink bats on Mother’s Day, and Movember. Another cause could be the reputation a disease has. Lung cancer is caused mainly by smoking, and smoking is also a risk factor for pancreatic cancer, so people may feel less sympathy for those affected.

If you are planning on donating to charity this holiday season, it is really easy to just pick a big name charity and be done with it. However, I encourage you to spend a few minutes researching to find out what causes are in the most dire need of help. If everyone donated this way, the charts above might be a little more balanced.

A Year of the Group Chat

Everyone knows that our digital lives are tracked. When we click ‘I agree’ for a service like Facebook, we expect it to store some of our data. Despite this, the amount of data in your Facebook archive might still shock you.

I do not know why Facebook needs a complete history of my pokes, over 20 MB of my messages, every single event I have ever attended, and a year-long log of my Facebook logins (including timestamps, IP addresses, and even the browser type used). However, you can do some neat things with this amount of data, so I’ll put my tin foil hat away for now.

I wrote some code to analyze the largest of my message threads, which is a year-old group chat with 20 of my high school friends and over 50,000 messages. After counting word frequencies and removing very common words (e.g. a, and, the), I was able to make a giant word cloud. The size of each word is proportional to how frequently it was used.
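The counting step can be sketched roughly like this. It is not my original code: the sample messages are invented, the stop-word list holds only the three examples mentioned above (a real list would be much longer), and splitting words with a simple regular expression is one choice among several.

```python
import re
from collections import Counter

# Only the three examples from the text; a real stop-word list would be longer.
STOP_WORDS = {"a", "and", "the"}


def word_frequencies(messages, stop_words=STOP_WORDS):
    """Count how often each word appears, ignoring case and very common words."""
    counts = Counter()
    for text in messages:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in stop_words)
    return counts


# Invented messages, standing in for the real group chat.
sample_messages = [
    "The prom is in a month",
    "Prom tickets and the after party",
]

frequencies = word_frequencies(sample_messages)
for word, count in frequencies.most_common(3):
    print(word, count)
```

The resulting counts are what a word cloud tool needs: each word is drawn at a size proportional to its count.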


Counting message frequency by name was also a simple task. Here is a leaderboard of who sends the most messages.
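A leaderboard like this takes only a few lines. The sketch below assumes each message is a (sender, text) pair, which is a simplification and not necessarily how the Facebook archive stores them; the names and messages are invented.

```python
from collections import Counter


def leaderboard(messages):
    """Rank senders by how many messages they sent, most active first."""
    return Counter(sender for sender, _text in messages).most_common()


# Invented thread, standing in for the real group chat.
sample_thread = [
    ("Alex", "hi"),
    ("Sam", "hey"),
    ("Alex", "prom?"),
]

ranking = leaderboard(sample_thread)
for name, count in ranking:
    print(f"{name}: {count}")
```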


There is much more analysis that can be done. I didn’t even touch the message timestamps, which could help visualize sleep schedules and answer random questions (e.g., How much did message traffic slow down when we all went off to college? How much more frequent was the word ‘prom’ in the months leading up to it?). However, it’s winter holiday and I am refusing to code any longer. That will be a project for another day.