White Paper

From Columbine to Pensacola: Two Decades of Mass Shooting Violence in the United States, 1999 to 2019

Motivation

For my final project in the Introduction to Data Visualization and Design Fundamentals course, I decided to analyze two decades of mass shooting violence in the United States, from 1999 to 2019. I chose this topic because mass shootings are widely discussed in the United States today and appear to be becoming more common. I wanted to determine whether there were any trends in the number of mass shootings and victims over time, which locations have the most total victims, and what the predominant race and gender of mass shooters are.

Data

The data used for this analysis came from Mother Jones. The “Mother Jones – Mass Shooting Database, 1982 to 2019” is a database that Mother Jones keeps up to date for public use. To be included in this database, a mass shooting must meet strict criteria set by Mother Jones.

For a mass shooting to be included in this database: the perpetrator must have killed at least four people (after 2013 this was revised to three people), the killings must have been carried out by a lone gunman (except in the case of the Columbine massacre and the Westside Middle School killings), the shooting must have occurred in a public place (except in the cases of a party on private property in Crandon, Wisconsin, and in Seattle, Washington, where there was a public crowd), and the shooting must not have been a result of gang activity, armed robbery, or mass killings in private homes. The victim tallies do not include any perpetrators who were injured or killed during the attack.
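The inclusion rules above can be sketched as a small filter. This is only an illustration: the record layout, field names, and exception labels are hypothetical and are not the columns of the Mother Jones database, and reading “after 2013” as years later than 2013 is an assumption about where the boundary falls.

```python
from dataclasses import dataclass


# Hypothetical record layout; these field names are illustrative only
# and do not match the actual database columns.
@dataclass
class Incident:
    case: str
    year: int
    fatalities: int         # victims killed, excluding the perpetrator
    lone_gunman: bool
    public_place: bool
    excluded_context: bool  # gang activity, armed robbery, or private-home killing


# Cases the text names as exceptions (labels are illustrative).
LONE_GUNMAN_EXCEPTIONS = {"Columbine", "Westside Middle School"}
PUBLIC_PLACE_EXCEPTIONS = {"Crandon", "Seattle"}


def meets_criteria(incident: Incident) -> bool:
    """Return True if the incident satisfies the inclusion rules."""
    # Assumption: "after 2013" is read as any year later than 2013.
    min_fatalities = 3 if incident.year > 2013 else 4
    lone_ok = incident.lone_gunman or incident.case in LONE_GUNMAN_EXCEPTIONS
    public_ok = incident.public_place or incident.case in PUBLIC_PLACE_EXCEPTIONS
    return (
        incident.fatalities >= min_fatalities
        and lone_ok
        and public_ok
        and not incident.excluded_context
    )
```

Keeping the named exceptions in explicit sets makes it clear that they are individual editorial decisions rather than part of the general rule.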

I used the case, date, fatalities, injured, total victims, location type, shooter, race, and gender variables from the database in my analysis. The data did not require much cleaning since it was built as a ready-to-use database, but I did edit the location types to be more specific and I had to create the shooter variable.

The Visualizations

I created a total of seven visualizations in the course of my analysis.

The first visualization is a timeline of the mass shootings. I chose this visualization because I wanted to represent each data point over time. I decided to split the timeline in half to represent a decade each. This way the timelines could be used to compare the decades to each other.

The second and third visualizations are scatter plots with trend lines. I chose scatter plots for these visualizations because they were the easiest way to represent and see relationships between the variables of interest. The first scatter plot shows the number of mass shootings over time. The trend line for this visualization is a 3rd order polynomial, which suggests that the number of mass shootings per year grows at an accelerating (polynomial, not exponential) rate over time. The second scatter plot shows the total number of victims over time. The trend line for this visualization is also a 3rd order polynomial, which suggests that total victims per year also grow at an accelerating rate over time.

The fourth visualization is a tree map. I chose this visualization because it is a “part of a whole” type visualization. The size of each square is proportional to that case’s share of the total number of victims across all cases, so the larger the square, the more victims that mass shooting had.

The fifth visualization is a stacked bar chart. I chose this visualization because it was the best way to show the total number of victims for each location type while breaking that total down by case. It helps us see which location types have the most and least shootings, as well as which location types have the most total victims, without having to split this into two visualizations.

The sixth and seventh visualizations are waffle grids. I chose this visualization because it is a “part of a whole” type visualization. Each square in the visualization represents 1%, so it is a great way to represent percentages and compare categories. The first waffle grid compares black mass shooters to white mass shooters and shows that mass shooters are predominantly white (55%), compared with 15% who are black. The second waffle grid compares male and female mass shooters and shows that mass shooters are predominantly male (95%), compared with 3% who are female and 2% that are male and female duos.
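Because each square stands for 1%, the category counts have to be converted into exactly 100 whole squares. One way to do this, sketched below, is the largest remainder method; this is a swapped-in illustration and not necessarily how the grids in this analysis were built. The function name and the sample counts are hypothetical, chosen only to exercise the rounding, and are not taken from the database.

```python
def waffle_cells(counts: dict[str, int], total_cells: int = 100) -> dict[str, int]:
    """Split total_cells squares across categories in proportion to counts."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")

    # Exact (fractional) number of squares each category deserves.
    exact = {name: value * total_cells / total for name, value in counts.items()}

    # Start by giving every category the whole-number part of its share.
    cells = {name: int(share) for name, share in exact.items()}

    # Hand out the squares lost to rounding, largest fractional part first.
    leftover = total_cells - sum(cells.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - cells[name], reverse=True)
    for name in by_remainder[:leftover]:
        cells[name] += 1
    return cells


# Illustrative counts only (not from the database).
example = waffle_cells({"male": 57, "female": 2, "male and female": 1})
```

Rounding each percentage independently can leave the grid with 99 or 101 squares, which is why the leftover squares are assigned explicitly.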

Closing Remarks

While the tree map is probably the visualization with the most visual impact in this analysis, it is important to note that it only compares the number of victims of each mass shooting. A larger total victim count does not mean that a shooting was more devastating. Each mass shooting is devastating in its own right and has had impacts that we could not visualize, since they are subjective and we are only looking at the objective data. Due to the nature of the topic, it is hard to measure the impact of each mass shooting, since it affects families, friends, communities, and in some or most instances the public as a whole.

Next Steps

Given the time and the data, I would have loved to take a deeper dive into this topic. Specifically, I would have loved to look at whether the mass shooters had a history of mental illness, what weapons they used, and whether those weapons were obtained legally.

I would have also liked to investigate a specific location type more closely to determine if there are trends that have been overlooked when looking at the overview of all mass shooting location types.

If I had a list of victims of each mass shooting, I would have loved to create a word cloud for each mass shooting to recognize the victims of these shootings. The victims of these senseless acts of violence deserve to be recognized more than their killers.

From Columbine to Pensacola: Two Decades of Mass Shooting Violence in the United States, 1999 to 2019

Mass shooting violence in the United States is an ongoing problem that seems to get worse as time goes by. There have been many pushes for adequate gun control laws to limit access to semiautomatic weapons, which are the most common type of weapon used in mass shootings, but the problem remains.

For this project, I used the “Mother Jones – Mass Shooting Database, 1982 to 2019,” a database compiled and kept up to date by Mother Jones, to investigate mass shootings in the United States from 1999 to 2019. I was interested in determining whether there are trends in the shooter’s gender and race, as well as trends in the number of shootings and the number of victims over time.

The information garnered from these visualizations and this investigation overall could be used to show what types of locations are more prone to mass shooting violence so that those places could improve security and police presence. It could also be used to bring attention to mass shooting violence in the United States and create a push to, hopefully, stop this type of violence.

To start off my investigation, I created two timelines that each depict one decade of shooting violence to determine trends in mass shooting violence over time. The first timeline is from 1999 to 2009 and the second timeline is from 2010 to 2019. By comparing the two timelines, we can easily see that there was much more shooting violence in the second decade, and in recent years, than there was in the first. It is also important to note that 2002 was the only year in which there were no mass shootings.

As a follow-up to the timeline, I wanted to analyze the number of mass shootings and the total number of victims over time. To do this, I created two scatter plots and added trend lines to them. The first is a scatter plot of the number of shootings over time with a polynomial trend line of degree three. It shows that as time increases so does the number of shootings. This trend is significant at the 1% level, as the p-value of the trend line is less than 0.0001. The second is a scatter plot of the number of victims over time with a polynomial trend line of degree three. It shows that as time increases so does the number of victims. This trend is significant at the 5% level, as the p-value of the trend line is less than 0.05. It is important to note that there are no entity fixed effects in the regression analyses above. This could have affected the coefficients as well as the p-values and standard errors.
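A degree-three trend line of this kind is an ordinary least-squares fit of a cubic to the yearly counts. The sketch below shows one way to compute such a fit using only the standard library; it is not the tool used to produce the charts in this analysis. The function names are hypothetical, and the series is synthetic (generated from a known cubic so the fit can be checked), not the Mother Jones data. Years are rescaled to the range 0 to 1 to keep the arithmetic well conditioned.

```python
def fit_polynomial(xs, ys, degree=3):
    """Least-squares polynomial fit; returns coefficients [c0, c1, ..., c_degree]."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y, where V[i][j] = xs[i] ** j.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]

    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - known) / a[row][row]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Synthetic data only: years 1999-2019 rescaled to [0, 1], with values
# generated from a known cubic rather than taken from the database.
TRUE_COEFFS = [2.0, 1.0, -3.0, 6.0]
xs = [(year - 1999) / 20 for year in range(1999, 2020)]
ys = [evaluate(TRUE_COEFFS, x) for x in xs]
fitted = fit_polynomial(xs, ys, degree=3)
```

On real counts the fitted coefficients would not match any "true" values; the synthetic series is used here only so that the result can be verified.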

I then created a tree map to analyze the total amount of victims as a result of each instance of mass shooting violence. In doing so, I determined that the Las Vegas Strip massacre perpetrated by Stephen Paddock had the most victims by far. To round out the top three largest mass shooting cases, it was followed by the Orlando nightclub massacre perpetrated by Omar Mateen and the Aurora theater shooting perpetrated by James Holmes.

Next, I wanted to investigate the location types of mass shootings. To do this, I created a stacked bar chart that shows the total number of victims for each location type. It shows that the most victims resulted from concert mass shootings, followed by other types of mass shootings, school mass shootings, workplace mass shootings, nightclub mass shootings, religious mass shootings, military mass shootings, multiple (spree) mass shootings, and festival shootings. The concert location type had the most victims, even though it only accounted for two shootings, because of the Las Vegas Strip massacre.

Finally, I wanted to analyze the gender and race of the mass shooters. To do this, I created two visualizations, both of which were waffle grids. The first waffle grid was used to compare black mass shooters to white mass shooters. It showed that mass shooters were predominantly white, as compared to black mass shooters and mass shooters of other races. The second waffle grid was used to compare male mass shooters to female mass shooters. It showed that mass shooters were predominantly male, as compared to female mass shooters and male and female duo mass shooters.

Given the time, I would have liked to take a closer look at a single category and do an in-depth analysis of just that category of mass shootings to determine if there were any trends in gender, race, total victims, amount of shootings, or any other trends that may have gotten overlooked in the overview of mass shootings that I have done in this instance.

Quantified Self

Quantified self refers to the use of self-tracking methods with either physical or technological means to help you gain more insight into yourself. It has recently been made more popular by wearable technologies, such as Fitbit, which have the ability to track a multitude of personal data such as heart rate, sleeping patterns, and steps, among other things. In the past it was most commonly used in the healthcare field to help patients and doctors keep track of vital statistics to assess the health of a patient. By keeping track of personal data, we are able to identify trends and can use the data collected to make decisions that could improve our quality of life.

In this project, I chose to create a data set by keeping track of each purchase, whether in cash or credit, that I made in the week of October 17th, 2019 to October 23rd, 2019. My data set contains the date, which is the date of the purchase, the store where the purchase was made, the type of store (coffee shop, grocery store, restaurant, metro card), the item type (food, metro card), and the amount I spent on the purchase.
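The data set described above can be sketched as a list of simple records with a helper that totals spending by any field. The class, function, store names, and amounts below are hypothetical and are not the purchases I actually recorded; they only illustrate the structure.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


# One row of the spending log; field names mirror the variables in the text.
@dataclass
class Purchase:
    day: date
    store: str
    store_type: str  # coffee shop, grocery store, restaurant, metro card
    item_type: str   # food, metro card
    amount: float    # dollars spent


def total_by(purchases, field):
    """Sum the amount spent, grouped by the given Purchase field name."""
    totals = defaultdict(float)
    for purchase in purchases:
        totals[getattr(purchase, field)] += purchase.amount
    return dict(totals)


# Hypothetical sample rows (not the real log).
sample = [
    Purchase(date(2019, 10, 17), "Coffee Shop A", "coffee shop", "food", 4.25),
    Purchase(date(2019, 10, 17), "Transit Kiosk", "metro card", "metro card", 10.00),
    Purchase(date(2019, 10, 18), "Grocery B", "grocery store", "food", 12.40),
    Purchase(date(2019, 10, 18), "Coffee Shop A", "coffee shop", "food", 3.75),
]

spend_by_item_type = total_by(sample, "item_type")
spend_by_store = total_by(sample, "store")
```

The same helper can produce the totals by day, store type, item type, or store by changing only the field name.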

I chose to keep track of my spending habits so that I could visualize what I am spending my money on, since I normally do not keep track of my spending. I thought it would be a good idea to see where I am spending the most money and determine whether I need to spend that much.

The first visualization is a line chart that I created to show how much money I spent each day.

In the second visualization, I created a pie chart to show how much money I am spending at the types of stores I frequent. I then created a third visualization, which is a second pie chart, to show how much money I am spending on the types of items I buy. I found that even though I frequent four types of stores I am essentially buying only two types of items, food and metro cards. I spent the most money on food items by far and of that the most was spent at coffee shops.

In the fourth visualization, I created a tree map to show how much money I am spending at each store. I determined that I spent the most money at Starbucks with $29.68 being spent there in a week, followed by Metro cards at $25.50, and the Halal Cart with $18.00 to round out the top three.

In the last visualization, I created a stacked bar chart to show how much I spent at each store per day. I created this visualization to determine if there were any trends in my spending. From this visualization, it seems that I spend the most on weekdays, although the number of stores I visit and the amount I spend at each store vary from day to day.

Given the time, I would have liked to expand the data set to include a longer range of time to determine if there actually are any trends in my spending as well as determine what I spend the most money on in a month. Based on the pie chart and tree map, I will assume the answer will most likely be coffee.

Illness Caused by Drinking Water in NYC from 2010 to 2019

Waterborne illnesses are caused by drinking contaminated or dirty water that has been tainted with disease-causing bacteria or pathogens, and they account for approximately 3.4 million deaths each year worldwide. These types of illnesses are most common in developing nations, also known as “third-world” countries, as these nations lack the adequate water filtration systems that are necessary to provide safe and clean water to their inhabitants.

The United States, as a developed nation or “first-world” country, has a low rate of waterborne illness from contaminated drinking water since we have adequate filtration systems in place, but there are still problems regarding water quality, as most famously demonstrated in Flint, Michigan.

Using the 311 data, I decided to investigate the incidences of waterborne illness and the water quality in New York City. I wanted to determine whether there was a relationship between the two, or whether they are unrelated and the incidences of waterborne illness were isolated incidents caused by other factors.

The information garnered from these visualizations and this investigation overall could be used to help city and government agencies, specifically the Department of Health and Mental Hygiene and the Department of Environmental Protection, determine which areas are in need of water quality improvement. It could also help them determine whether there are any factors that contribute to waterborne illness in the boroughs or areas where the rate of waterborne illness is high or clustered, and take measures to prevent outbreaks.

To start off my investigation, I created a visualization that maps the incidences of waterborne illness, separated by year, to show how these incidences change by location and whether any of them are clustered.

I then created several visualizations to show the relationship between waterborne illness and water quality. The first two are pie charts that show the percentage of complaints each borough has made for waterborne illnesses and water quality, respectively. This visualization type was chosen because it made it easiest to see which borough the most incidences and complaints come from. The pie charts do not seem to support the hypothesis that waterborne illness and water quality are correlated, since Brooklyn has the most incidences of waterborne illness and Queens has the most water quality complaints.

The second two visualizations are line charts that show the number of complaints per year for waterborne illnesses and water quality, respectively. This visualization type was chosen because it made it easiest to see trends over time. These graphs also do not seem to support the hypothesis that waterborne illness and water quality are correlated. Waterborne illnesses decrease from 2010 to 2011, then increase until 2016, and then decrease to the present, while water quality complaints are stable from 2010 to 2015, then increase until 2018, and then decrease to the present.

The last visualization is a scatter plot showing the relationship between waterborne illness and water quality. To make this graph, I transformed the data into counts of waterborne illnesses and water quality complaints for each year and borough, which gave a total of 50 data points instead of 10 (by year only) or 5 (by borough only). After plotting the 50 data points, I added a linear trend line to determine if there was a relationship between waterborne illness and water quality complaints. The relationship was found to be Waterborne Illness = 17.3682 + 0.00235605*Water Quality with a p-value of 0.148401, meaning this relationship is only significant at the 15% level, and an R-squared value of 0.043014, meaning only 4.3014% of the variance of waterborne illness is explained by water quality. These measures show that this is not a very significant relationship, which was further supported when I calculated the correlation coefficient between waterborne illness and water quality: it was 0.207, indicative of a weak positive relationship.

It is important to note that there are no entity or time fixed effects in the regression analysis above. This could have affected the coefficient on Water Quality as well as the p-values and standard errors.
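The reshaping and the simple linear fit described above can be sketched as follows. The function names are hypothetical and no complaint counts are reproduced here; only the coefficients, R-squared value, and correlation reported in the text are used. For a regression with a single predictor, R-squared equals the square of the correlation coefficient, so the reported values can be checked against each other.

```python
import math

BOROUGHS = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]
YEARS = range(2010, 2020)

# One data point per (year, borough) pair: 10 years x 5 boroughs = 50 points.
GROUPS = [(year, borough) for year in YEARS for borough in BOROUGHS]


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; also returns r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def predicted_illness(water_quality_complaints):
    """Fitted line as reported in the text."""
    return 17.3682 + 0.00235605 * water_quality_complaints


# Consistency check on the reported figures: sqrt(R-squared) should match
# the reported correlation coefficient of 0.207 (the slope is positive).
r_from_r_squared = math.sqrt(0.043014)
```

The square root of 0.043014 rounds to 0.207, which agrees with the correlation coefficient reported above.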

In the future, I would like to expand the time range of the data set to determine if there is a stronger relationship between waterborne illness and water quality with more data points. I would also like to have demographic information to determine if race plays a part in the instances of waterborne illnesses since some races are naturally resistant to some illnesses and some are more sensitive to them.