Car Crashes in NSW from 2016 to 2020
Click here to see my Tableau report.
Click here to access all analysis files (cleaning data, knn model in Python)
Click here to read my final report
Skills Involved :
Data cleaning in Python
Tableau : sheet switch, forecast, story
K-nn model in Python
Witnesses of an accident can have difficulty communicating because of shock, misinterpretation of injuries, or a language barrier. My question was: can an operator predict whether a car crash caused an injury, without information from a witness?
This project took me 5 days.
Here is a quick presentation of my project. Feel free to read the files on my GitHub and my report for more details.
Part 1 : Clean the data
I had to determine which information an emergency operator can be sure of simply by answering the phone. They know the time of the call and the location of the call. From the time and location, they can deduce the weather, and from the location alone, they can get any information about the road (speed limit, city, …).
Can an operator predict whether a car crash caused an injury based on the location, weather, and time of the accident?
In a Jupyter Notebook, I cleaned my data to keep only the features I needed.
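To illustrate the idea of keeping only what an operator can know, here is a minimal sketch using only Python's standard library. It is not the code from my notebook: the column names (hour, latitude, longitude, weather, speed_limit, driver_age, injury) and the sample rows are assumptions made up for the example, and the real dataset's headers may differ.

```python
import csv
import io

# Hypothetical column names; the real dataset's headers may differ.
OPERATOR_FEATURES = ["hour", "latitude", "longitude", "weather", "speed_limit"]
TARGET = "injury"


def keep_operator_features(rows):
    """Keep only the columns an operator could know, and drop incomplete rows."""
    cleaned = []
    for row in rows:
        selected = {
            name: (row.get(name) or "").strip()
            for name in OPERATOR_FEATURES + [TARGET]
        }
        if all(selected.values()):  # skip rows with any missing value
            cleaned.append(selected)
    return cleaned


# Tiny made-up sample, for illustration only (not real NSW data).
sample_csv = io.StringIO(
    "hour,latitude,longitude,weather,speed_limit,driver_age,injury\n"
    "17,-33.87,151.21,Fine,50,34,1\n"
    "8,-33.90,151.18,,60,51,0\n"
    "23,-32.93,151.78,Raining,100,27,0\n"
)
cleaned_rows = keep_operator_features(csv.DictReader(sample_csv))
print(len(cleaned_rows), sorted(cleaned_rows[0]))
```

In this sketch the second row is dropped because its weather is missing, and the driver_age column is discarded because an operator cannot know it from the call alone.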
Part 2 : First Analysis
I decided to look at the distribution of car crashes over time.
I extracted patterns in the number of accidents and found some likely causes.
More crashes happen during commute time, particularly on the way home from work. Tiredness and work stress increase the risk of a crash.
In another notebook, you can find how I studied seasonality in the number of car crashes.
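A distribution over time like this can be obtained by counting crashes per hour of the day. The sketch below shows one way to do it with the standard library; the timestamps and their format are invented for the example and are not taken from the NSW dataset.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps for illustration; not taken from the NSW dataset.
crash_times = [
    "2019-03-04 08:15", "2019-03-04 17:40", "2019-03-04 17:55",
    "2019-03-05 13:05", "2019-03-05 18:10", "2019-03-06 17:20",
]


def crashes_per_hour(timestamps):
    """Count crashes for each hour of the day (0-23)."""
    hours = (datetime.strptime(ts, "%Y-%m-%d %H:%M").hour for ts in timestamps)
    return Counter(hours)


hourly_counts = crashes_per_hour(crash_times)
busiest_hour, busiest_count = hourly_counts.most_common(1)[0]
print(busiest_hour, busiest_count)
```

The same counting, applied per month or per weekday instead of per hour, is a simple starting point for looking at seasonality.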
Part 3 : Impact of a few features on crash severity
This part was really interesting because it showed me an unexpected result that encourages me to study more data.
As you can see on the chart, crashes that happen on roads limited to 50 km/h or less cause on average more injuries than those that happen on high-speed roads. My first impression had been that crashes at high speed are more dangerous.
Roads limited to 50 km/h or less are mostly in city centres, where there are a lot of pedestrians, and more than 95% of pedestrians who are hit by a car suffer injuries.
You can find all the graphs in the Tableau file.
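The comparison by speed limit can be reproduced outside Tableau by grouping crashes into two speed-limit bands and computing the share of crashes with an injury in each band. This is only a sketch under assumptions: the records below are made up (they are not my results), and measuring severity as "share of crashes with at least one injury" is one possible choice among others.

```python
from collections import defaultdict

# Made-up records: (speed limit in km/h, whether the crash caused an injury).
crashes = [
    (40, True), (50, True), (50, False), (50, True),
    (80, False), (100, True), (110, False), (100, False),
]


def injury_rate_by_speed_band(records, threshold=50):
    """Return the share of crashes with an injury for each speed-limit band."""
    totals = defaultdict(int)
    injured = defaultdict(int)
    for speed_limit, has_injury in records:
        if speed_limit <= threshold:
            band = f"<= {threshold} km/h"
        else:
            band = f"> {threshold} km/h"
        totals[band] += 1
        injured[band] += int(has_injury)
    return {band: injured[band] / totals[band] for band in totals}


rates = injury_rate_by_speed_band(crashes)
print(rates)
```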
Part 4 : Model
I tried to divide crashes into two categories: with or without injuries.
I ran different k-NN models in this Notebook.
I faced a few problems. Because most accidents happen in the Sydney metropolitan area, which is small relative to NSW, longitude and latitude are slightly correlated. So I ran my first model without GPS coordinates, using road classification and road speed limit instead.
But in the end, it was more accurate to say that every crash causes injuries than to use any of my models.
And the more neighbours I added, the more my model looked like the baseline model.
Correlation heat map between features used in my knn model.
Error of my k-NN model on the test dataset.
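Why does a k-NN model drift towards the baseline as neighbours are added? When k reaches the size of the training set, every prediction is a vote over all training labels, so the model always returns the majority class, exactly like a baseline that says "every crash causes injuries". The sketch below shows this with a small pure-Python k-NN written for the example; it is not the code from my notebook, and the toy points and labels are invented.

```python
from collections import Counter


def knn_predict(train_X, train_y, point, k):
    """Predict the majority label among the k nearest training points."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, point)), label)
        for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def accuracy(predictions, truth):
    """Share of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Made-up, imbalanced toy data: label 1 = injury, label 0 = no injury.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0),
           (0.9, 1.1), (1.1, 0.9), (0.5, 0.5)]
train_y = [0, 0, 1, 1, 1, 1, 1]
test_X = [(0.05, 0.05), (1.0, 0.95), (0.6, 0.6)]
test_y = [0, 1, 1]

# Baseline: always predict the most frequent training label.
majority_label = Counter(train_y).most_common(1)[0][0]
baseline_predictions = [majority_label for _ in test_X]

# k-NN with k equal to the full training set size.
full_k_predictions = [
    knn_predict(train_X, train_y, point, k=len(train_X)) for point in test_X
]

print(accuracy(baseline_predictions, test_y))
print(full_k_predictions == baseline_predictions)
```

With intermediate values of k the predictions sit somewhere between the purely local vote (k = 1) and the baseline, which matches what I observed when increasing the number of neighbours.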
CONCLUSION :
Crashes are more likely to happen in cities, where car density is greatest, particularly in the Sydney metropolitan area. As in a few other countries, the commute from work to home is the most dangerous time. But the number of crashes decreased over the years.
Unfortunately, with the features I used, I could not predict the severity of a crash without relying on witness information and the emergency operator's experience.
What I learned and how to go deeper
An unsuccessful model is an answer in itself: having a result is satisfying, but the absence of a conclusive result is also information. In my case, it suggests that driver behaviour matters more than the weather or the location of an accident. Moreover, by observing that crashes on low-speed roads are more likely to cause injury, I discovered that some data are linked differently than expected. That is a very interesting part of data analysis. It is important not to be biased by our first impression, and to try to understand the relationships within the data. Finally, documenting everything little by little is very important. Using a Jupyter notebook let me keep all the information about my data cleaning, but I had to run it again to document it clearly in my report.
As a parallel analysis, I would like to study a dataset with data about the cause of the crash (age of the driver, alcohol, speeding, failure to obey road signs, …), because my model leads me to think that human mistakes have more impact on crashes than external factors.