Not a complete car crash (hopefully)

Analysis of City of Chicago crash data 2013–2021

Hsin Hin Lim
Towards Data Science


Believe it or not — such an occurrence would be influential in predicting an injury. (Image from Unsplash)

For the third module project of my Flatiron data science course, I imagined that we had initially been approached by the City of Chicago to analyse the traffic crash data they accumulated between 2013 and 2021. We were asked first to identify problems (if any) with road safety and, if so, to learn what kinds of improvements should be targeted and where (geographically) those improvements might be needed on the ground.

As a side question, the City acknowledges that more people may want to cycle going forward and wants to encourage that; hence, looking ahead, it would like a brief look at crashes (and injuries) associated with doorings.

First, let's take the raw data and analyse the time series of injuries: according to the data, is there a problem with road safety?

Initial Trend Analysis

Turning the data into a time series indexed by the date of each crash recorded in the data, we were able to observe this annual trend in injuries between 2013 and 2021:

It can be seen that all kinds of injuries, especially fatal injuries, are trending upwards. Unless there is some exogenous explanation, such as changes in the way the same data was recorded between 2016 and 2018, it would appear that Chicago's road safety deteriorated dramatically in those years, and we can conclude that there is a problem with road safety.
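A minimal sketch of how such an annual trend can be extracted with pandas, assuming the column names of the City of Chicago crash dataset (CRASH_DATE, INJURIES_TOTAL, INJURIES_FATAL); the cleaning in the actual project involved more steps:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Column names assumed from the City of Chicago "Traffic Crashes" dataset
crashes = pd.read_csv("Traffic_Crashes_-_Crashes.csv",
                      parse_dates=["CRASH_DATE"])

# Index by crash date so the injury counts can be resampled over time
ts = crashes.set_index("CRASH_DATE").sort_index()

# Annual totals for each injury type between 2013 and 2021
annual = (ts.loc["2013":"2021", ["INJURIES_TOTAL", "INJURIES_FATAL"]]
            .resample("Y")
            .sum())

annual.plot(subplots=True, title="Annual injuries, 2013-2021")
plt.show()
```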

Playing around with the time series, we can spot other interesting patterns with regard to injuries, like the difference in type of injury suffered according to the time of day when the crash occurred:

These subplots plot each hour of the day against the number of a specific type of injury recorded at that hour. There is an interesting difference between fatal injuries (which appear more likely to happen between 10pm and 4am) and other types of injuries (which appear more likely to happen from around 4pm onwards).

This corresponds to how we intuitively think about fatal injuries: they tend to happen in accidents at speed, and it is much easier to drive at speed on the empty roads of the early hours. Traffic volume peaking at or around 4pm (roughly rush hour) would suggest that the greater the number of journeys made, the higher the chance of any injury (except fatal injuries, presumably because rush-hour traffic moves too slowly for those).
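The hour-of-day breakdown is a short groupby on the same index; a sketch continuing from the time series above, with the injury column names again assumed from the Chicago dataset:

```python
# Sum each injury type by the hour of day at which the crash occurred
injury_cols = ["INJURIES_FATAL", "INJURIES_INCAPACITATING",
               "INJURIES_NON_INCAPACITATING"]
by_hour = ts.groupby(ts.index.hour)[injury_cols].sum()

by_hour.plot(subplots=True, xlabel="Hour of day", ylabel="Injuries")
plt.show()
```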

A quick look at Doorings

What is a dooring, you might ask? It is what happens when the occupant of a parked vehicle opens a door into the path of an oncoming cyclist.

Ouch!

There were 1,053 recorded doorings in the entire dataset. As a first step, we thought it would be interesting to leverage the time series again and see how injuries caused by doorings are distributed across the times of day at which they occur:

Rush hour is the worst (Visualisation by author)

To the extent that doorings cause any injury (and they cause at least one injury in 70% of the doorings in the dataset), the vast majority of those injuries are non-incapacitating; they may not be serious, but they are no less painful for that.

The danger time for doorings must surely be 1800hrs (6pm) each day: rush hour, when vehicle usage and cyclist traffic are both high, resulting in more doorings generally. The one dooring fatality in the dataset was recorded at this time.
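Filtering for doorings is straightforward if, as assumed here, the crash data flags them with a DOORING_I column; a sketch continuing from the time series above:

```python
# DOORING_I == "Y" is assumed to flag a dooring in the crash data
doorings = ts[ts["DOORING_I"] == "Y"]

# Injuries caused by doorings, broken down by hour of day
(doorings.groupby(doorings.index.hour)["INJURIES_TOTAL"]
         .sum()
         .plot(kind="bar", xlabel="Hour of day",
               ylabel="Injuries from doorings"))
plt.show()
```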

If the City wanted to put in cycle lanes to encourage cycling as a form of commuting, it might want to analyse this map, which shows doorings that occurred on the city streets in 2019 (the last full year for which traffic was “normal”). Hotspots can be identified from this map and, combined with further research on the ground, can help identify ideal routes for cycle lanes.

Dooring incidents in 2019 (Visualisation by author)
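A map like this one can be sketched with folium, assuming the LATITUDE and LONGITUDE columns of the crash dataset and continuing the dooring sketch above:

```python
import folium

# Doorings in 2019 with usable coordinates
d2019 = doorings.loc["2019"].dropna(subset=["LATITUDE", "LONGITUDE"])

m = folium.Map(location=[41.88, -87.63], zoom_start=11)  # central Chicago
for _, row in d2019.iterrows():
    folium.CircleMarker(location=[row["LATITUDE"], row["LONGITUDE"]],
                        radius=2, color="red").add_to(m)
m.save("doorings_2019.html")
```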

Approach to the project

Now that we have ascertained that there is an increasing annual trend in injuries, the analysis moves on to identifying what factors lie behind crashes resulting in injury and, just as importantly, where (geographically) the hotspots for traffic injuries are in Chicago.

Our approach to finding out what factors may be influential in crashes is to structure the problem as one of binary classification, such that every crash recorded in the data is assigned a value of 1 if it resulted in an injury (however severe) and 0 if it did not. We will then fit a variety of different models to see which one best predicts a crash resulting in an injury. We can then investigate that model further to see which features are most influential; it is this investigation which will reveal the practical direction for improving road safety. The features identified in this manner will tell the City what it needs to consider if it wants to break the trend of increasing traffic injuries.
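Constructing the target is a one-liner; the sketch below assumes the INJURIES_TOTAL column described earlier:

```python
# 1 if the crash resulted in any injury (however severe), 0 otherwise
crashes["injury"] = (crashes["INJURIES_TOTAL"] > 0).astype(int)

# Injury crashes are the minority class -- this ratio matters later
print(crashes["injury"].mean())
```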

Before all that, which model can be said to best predict crashes resulting in injury? How do we know if a model is any good anyway?

The Best Model: Precision vs Recall

At the heart of the question of how we know a model is any good is the Precision vs Recall trade-off. You can learn more about the two metrics here:

Taking a step back, the model we need to look at will have to reduce as far as possible the proportion of false negatives predicted. A false negative in this case would be a prediction that a crash would not cause any injury when, in reality, such a crash did actually result in injury. Each such crash missed by the model is itself a missed opportunity to learn what factors cause such a crash and equally, a missed opportunity to avoid injury (and save lives) as a result of such a crash.

Hence, the metric we really need to focus on is Recall. This is because the higher the Recall score, the lower the number of false negatives obtained in predicting crashes that cause injury.

By contrast, the metric which is less important is Precision. This is because the lower the Precision score, the higher the number of false positives obtained in predicting crashes that cause injury.

At a high level, it is far better to wrongly predict crashes that did not actually result in any injury (false positives, but arguably no harm done there) than to miss crashes which actually did cause injury but not predict them as doing so (false negatives). Hence the best model will be the one which maximises Recall.
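In scikit-learn terms, with hypothetical labels and predictions:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # 1 = the crash actually caused an injury
y_pred = [1, 0, 0, 1, 1, 1]   # hypothetical model predictions

# Recall = TP / (TP + FN): higher recall means fewer injury crashes missed
print(recall_score(y_true, y_pred))     # 0.75 (one injury crash missed)

# Precision = TP / (TP + FP): lower precision means more false alarms
print(precision_score(y_true, y_pred))  # 0.75 (one harmless false alarm)
```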

The Best Model: Benchmarking against dummy classifiers

But how do we know if the model is any good at all? One way is to come at the question from a different angle: does our model do better at predicting injuries than a strategy of simply guessing whether a crash is going to result in an injury? Simply summarised, any model which is “good” enough for further analysis is one with a Precision score at least as high as a guessing strategy’s, and a Recall score which far exceeds it.
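scikit-learn's DummyClassifier implements exactly these guessing strategies; a sketch of how the baselines can be scored, with X and y standing in for a cleaned feature matrix and the injury target:

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# X, y: cleaned feature matrix and binary injury target (assumed defined)
for strategy in ["most_frequent", "prior", "stratified", "uniform"]:
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    scores = cross_validate(dummy, X, y, scoring=["precision", "recall"])
    print(f"{strategy}: "
          f"precision={scores['test_precision'].mean():.2f}, "
          f"recall={scores['test_recall'].mean():.2f}")
```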

Modelling & Evaluation: High level summary

Now that we have set out the benchmarks by which we will pick out the model(s) for further investigation, we downloaded the crash and vehicle datasets from here to work on:

We first cleaned each dataset of null values and simplified the (mostly) categorical features. A crucial step in this process was recognising that the binary classification problem we were solving was also a class imbalance problem (crashes resulting in injury are, fortunately, a minority class amongst all crashes). Next, we trained a few models from scikit-learn’s supervised learning algorithms on each dataset, used those trained models to make predictions, and evaluated the extent to which those predictions were correct. As a final step, each model was scored for Precision and Recall on a cross-validated basis; cross-validation ensures that the Recall metric obtained is robust.
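A sketch of the training loop under those constraints; class_weight and scale_pos_weight are one common way to handle the imbalance, though the original project may have used another, and for brevity both models are shown on a single dataset here:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

# Weighting the minority class is one way to handle the imbalance
neg, pos = (y == 0).sum(), (y == 1).sum()

models = {
    "crash_logreg": LogisticRegression(class_weight="balanced",
                                       max_iter=1000),
    "vehicle_xgb": XGBClassifier(scale_pos_weight=neg / pos),
}

# Cross-validated Precision and Recall for each candidate model
for name, model in models.items():
    scores = cross_validate(model, X, y,
                            scoring=["precision", "recall"], cv=5)
    print(f"{name}: "
          f"precision={scores['test_precision'].mean():.2f}, "
          f"recall={scores['test_recall'].mean():.2f}")
```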

Let’s have a look at the results of the process just described:

And the results are in…..(Data by author)

The first thing to note about the results is the performance of the dummy models in the table above. These are the models with names ending in most_frequent, prior, stratified and uniform, such that crash_most_frequent refers to a dummy model trained on the crash dataset which follows the simple guessing strategy of always predicting that no crash will result in injury.

With this background, it can be seen that for a model to be any “good” it has to beat vehicle_uniform and crash_uniform, which have cross-validated Recall scores of around 0.5 (it also has to have a Precision score better than 0.17). From this perspective, the best models are vehicle_xgb and crash_logreg. These are the models where we peer under the hood to see which features make them tick; these are the features we need to address in any road safety plan we recommend to the City.

Feature Importance: Logistic Regression trained on crash data

The top 10 coefficients taken from the logistic regression model trained on the crash data tell us that speed is influential in predicting whether a crash results in an injury. ‘x15_40’, ‘x15_60’ and ‘x15_80’ are all features indicating whether the speed limit at the scene of the crash was under 40mph, under 60mph or under 80mph respectively. It would therefore appear that the higher the speeds allowed in the area where the crash happened, the more likely that crash is to result in injury (and particularly serious injury at that, given what we know about fatal injuries, their timing and our speculation that speed is behind that pattern).

Speeds have a big impact on injuries (Visualisation by author)
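The coefficients behind this chart can be pulled straight from the fitted model; a sketch continuing from the training loop above, where feature_names is assumed to come from the one-hot encoder used during cleaning:

```python
import pandas as pd

logreg = models["crash_logreg"].fit(X, y)

# Rank coefficients by absolute size; the sign shows the direction of
# the effect on the odds of an injury
coefs = pd.Series(logreg.coef_[0], index=feature_names)
top10 = coefs.reindex(coefs.abs().sort_values(ascending=False).index)[:10]
print(top10)
```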

Feature Importance: XGBoost trained on vehicle data

Lifting the hood on the XGBoost model trained on vehicle data, the first contact point of a vehicle (as implied by the feature name ‘x4_Other’) is influential in predicting whether the crash results in an injury. In the process of cleaning the vehicle dataset, crashes resulting in total destruction of the vehicle (and in damage to the roof) were categorised under this feature. Our interpretation of this model is that where there is extensive damage to a vehicle, or where a vehicle is hit anywhere other than the front, side or rear, the crash is likely to result in an injury.

When the first contact point is “Other”, you are in big trouble….(Visualisation by author)
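XGBoost exposes its own feature importances; a sketch, with X_vehicle, y_vehicle and vehicle_feature_names standing in for the cleaned vehicle dataset and its encoded feature names:

```python
import pandas as pd

xgb = models["vehicle_xgb"].fit(X_vehicle, y_vehicle)

# Built-in importances, aligned with the encoded feature names
importances = pd.Series(xgb.feature_importances_,
                        index=vehicle_feature_names)
print(importances.sort_values(ascending=False).head(10))
```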

Where are the accident hotspots?

From our investigation of the top models, our recommendations to the City should be to incorporate measures that reduce overall speeds and to look at road redesigns aimed at reducing heavy vehicle damage (both factors being influential in predicting crashes resulting in injury). Armed with that knowledge, we need to know where to apply it. The hotspots for injuries and fatalities are set out in the interactive map presented below, with sections of the city organised by police beat number. It looks like beats 331, 815 and 834 are all hotspots which merit further investigation.

Total Injury and Fatality hotspots across Chicago between 2013 and 2021 (Visualisation by author)
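The ranking behind this map is a simple aggregation, assuming the BEAT_OF_OCCURRENCE column of the crash dataset:

```python
# Total injuries and fatalities per police beat, worst first
hotspots = (crashes.groupby("BEAT_OF_OCCURRENCE")
                   [["INJURIES_TOTAL", "INJURIES_FATAL"]]
                   .sum()
                   .sort_values("INJURIES_TOTAL", ascending=False))
print(hotspots.head(10))  # beats 331, 815 and 834 stand out
```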

Conclusions

There was more we could have done with additional data. There is more data to work with at the City of Chicago portal, including an entire dataset relating to drivers, and we would have liked to analyse hit-and-runs to the same depth as doorings. The code for all the exploratory data analysis, modelling and visualisations in this blog can be found in the GitHub repo below:
