Shark Fatality Predictor

Improvements and Adjustments

Recency Bias: Because attacks tend to occur in spurts and their locations shift over time, the model is fed 2 additional features: the number of fatalities close to the location, and a score representing how recently those fatalities occurred, computed with a logarithmically decaying weight over a 10-year window. I have also added a post-prediction notoriety boost, which counts all fatalities within 20 km of the subject location over all recorded time and adds a boost that flattens out towards a theoretical maximum for locations or beaches with many recorded fatalities. I tested the model on Reunion specifically, because I had to balance these 2 new ideas against each other. No attacks have been recorded in Reunion since 2019, and that one came after a long gap, which makes the predictor less aggressive towards Reunion. Conversely, Esperance beach, Australia, has had a surge of fatal attacks, so the model is recency biased towards a higher probability of fatality there. New Smyrna Beach, on the other hand, is the most notorious beach for attacks and fatalities, so it gets a notoriety boost.
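A minimal sketch of how the recency features and the notoriety boost described above could be computed. It is not the project's actual code. The 10-year window and the 20 km radius come from the text; everything else is an assumption made for illustration: the exact shape of the logarithmic decay, the use of the same 20 km radius for the recency features, the cap MAX_BOOST, the SATURATION constant, and all function names.

```python
import math

WINDOW_YEARS = 10    # from the text: recency window
RADIUS_KM = 20.0     # from the text: notoriety radius (assumed for recency too)
MAX_BOOST = 0.15     # assumption: theoretical maximum of the notoriety boost
SATURATION = 10.0    # assumption: controls how quickly the boost flattens


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def recency_weight(years_ago):
    """Assumed log decay: 1.0 for the current year, 0.0 at the window edge."""
    if years_ago < 0 or years_ago >= WINDOW_YEARS:
        return 0.0
    return 1.0 - math.log1p(years_ago) / math.log1p(WINDOW_YEARS)


def recency_features(lat, lon, year, fatalities):
    """Return (nearby fatality count, recency score) for one location.

    fatalities is an iterable of (lat, lon, year) tuples.
    """
    count, score = 0, 0.0
    for f_lat, f_lon, f_year in fatalities:
        if haversine_km(lat, lon, f_lat, f_lon) > RADIUS_KM:
            continue
        weight = recency_weight(year - f_year)
        if weight > 0.0:
            count += 1
            score += weight
    return count, score


def notoriety_boost(n_fatalities):
    """Saturating boost: grows with the all-time count, never exceeds MAX_BOOST."""
    return MAX_BOOST * (1.0 - math.exp(-n_fatalities / SATURATION))


def apply_boost(probability, n_fatalities):
    """Add the notoriety boost after prediction, keeping the result at or below 1."""
    return min(1.0, probability + notoriety_boost(n_fatalities))
```

With a decay like this, a location whose last fatality is several years old gets a small recency score, while a location with a recent cluster gets a large one. The notoriety boost depends only on the all-time count, so a long-notorious beach keeps its boost even when recent years are quiet.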
Recall Threshold: The proportion of Fatal to Non Fatal records is quite low, as one can imagine: about 30% (check the stats on the next page). The distribution is therefore skewed towards the non-fatal side, so the learning algorithm produces a model that is more accurate at predicting non-fatalities than fatalities, because it gets about 3 times as many rows to learn non-fatalities from. To compensate, I performed a sweep test, as seen in the model training image above, to determine a boost to the model's predicted probability that raises recall to 0.7. Put simply, the model predicts Fatality if predict_proba is above 0.5, and at that threshold it missed almost half of the real fatalities, with a recall of 0.5. By boosting the output probability a tad, we capture many more actual fatalities from the test data and bring recall up to 0.7, meaning 7 out of 10 fatalities in the test data are 'correctly' predicted, which more closely represents the accuracy that matters here. In other words, I sacrifice a few points of the 90% non-fatality prediction accuracy in order to predict fatalities more accurately, which is in the end the aim of this predictor.
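The sweep described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the project: the function names, the 0.01 step size, and the toy data are hypothetical. In practice the probabilities would come from something like model.predict_proba(X_test)[:, 1]. Adding a boost b to the probability and comparing against 0.5 is equivalent to comparing the raw probability against a threshold of 0.5 - b, so the sketch sweeps the threshold directly.

```python
def recall_at(probs, labels, threshold):
    """Recall for the Fatal class (label 1) when predicting Fatal if p > threshold."""
    true_pos = sum(1 for p, y in zip(probs, labels) if y == 1 and p > threshold)
    actual_pos = sum(1 for y in labels if y == 1)
    return true_pos / actual_pos if actual_pos else 0.0


def sweep_threshold(probs, labels, target_recall=0.7, step=0.01):
    """Lower the threshold from 0.5 in small steps until recall meets the target.

    Returns the first (highest) threshold that reaches target_recall,
    or None if no threshold down to 0.0 does.
    """
    n_steps = int(round(0.5 / step))
    for i in range(n_steps + 1):
        threshold = round(0.5 - i * step, 10)  # round away float drift
        if recall_at(probs, labels, threshold) >= target_recall:
            return threshold
    return None


# Toy data (hypothetical): five fatal cases, three non-fatal cases.
probs = [0.9, 0.6, 0.455, 0.425, 0.3, 0.48, 0.2, 0.1]
labels = [1, 1, 1, 1, 1, 0, 0, 0]

threshold = sweep_threshold(probs, labels, target_recall=0.7)
boost = 0.5 - threshold  # the equivalent boost to add to predict_proba
```

Stopping at the highest threshold that meets the target keeps the boost as small as possible, which limits how many non-fatal cases get flipped to Fatal along the way.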