So one thing that some people know about me is that every week I spend a lot of time sitting in the offices of a university's science department with not much to do, and some days I learn some interesting things. Yesterday was one of those days, as I overheard a conversation between two professors.
The biggest thing that I took out of this conversation, and the thing that all of you should understand, is the saying "all models are wrong, but some are useful." This is not just a saying; it is undeniably a statement of fact. Anyone who has ever looked at a weather forecast and then been surprised by the weather the next day should recognize this: forecasts are models, and they are never 100% accurate.
The problem that came up in the conversation these professors were having was that this saying, which they teach to everyone who takes a 100-level science class, has been forgotten. In fact, in scientific research that is relevant to current events, it has become politically incorrect to suggest that the models many of today's public policies are based on could be wrong, despite the fact that we now have actual data proving that models of this sort from under a year ago were wrong.
I am currently in a data science class. Data science is really just a fancy term for making and analyzing these models, which are always wrong. The scariest thing that I have come to realize in this class is that there is no emphasis on the fact that these models are always wrong. Instead, we are taught to use terms such as "this model has an 85% level of accuracy" or "this projection has a standard error of 4%". Terms like these are misleading and deceptive. Here's an example:
So last week in my data science class we had an assignment where we were given a dataset provided by Spotify that contained the attributes of a ton of songs from the last century and a rating of how popular they are with current Spotify users. We were tasked with building models using this data to predict whether a song would be a hit and whether it would flop. My group produced two models for this assignment, one that had a 74% level of accuracy and one that had an 86% level of accuracy. Guess which one of these I would say is more reliable. If you guessed the one with higher accuracy, you're wrong.
I was the one in my group in charge of creating the model to project which new songs would be total flops. Before I started building any model, I did a few things to visualize and better understand the data I had been given. I found two things, the first of which we had already been warned about. First, Spotify rates songs' popularity based on their current popularity rather than their historic popularity, so the popularity factor in that data was biased towards more recent songs. Second, with the exception of 2020, only the most popular songs of a given year were included in the dataset. I concluded that the data we had was great for a model to predict hits but would be of little use in predicting flops, because with the exception of the 2020 songs nothing in the dataset was a flop. So I based my model on only the 2020 data, then spent two paragraphs explaining why that model should be used with caution.
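To make that concrete, here is a minimal Python sketch of the kind of check described above. Everything in it is made up for illustration: the rows, the years, the "hit" and "flop" labels, and the function names are assumptions, not the actual Spotify data or anyone's actual model. It counts labels per year to expose the imbalance, then shows how a "model" that never predicts a flop can still report a high accuracy.

```python
from collections import Counter

def class_counts_by_year(rows):
    """Count how many songs carry each label in each year."""
    counts = {}
    for year, label in rows:
        counts.setdefault(year, Counter())[label] += 1
    return counts

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Of the songs whose true label is `positive`, the fraction the model caught."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return None  # no examples of this class, so recall is undefined
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical toy data: (year, label). Only 2020 contains any flops,
# mimicking a dataset where older years include only the most popular songs.
rows = (
    [(2018, "hit")] * 40
    + [(2019, "hit")] * 40
    + [(2020, "hit")] * 10
    + [(2020, "flop")] * 10
)

counts = class_counts_by_year(rows)
for year in sorted(counts):
    print(year, dict(counts[year]))

# A "model" that ignores its input and always answers "hit".
y_true = [label for _, label in rows]
y_pred = ["hit"] * len(y_true)

overall_accuracy = accuracy(y_true, y_pred)
flop_recall = recall(y_true, y_pred, positive="flop")
print("accuracy:", overall_accuracy)
print("flop recall:", flop_recall)
```

On this toy data the always-"hit" model is right 90% of the time and still catches none of the flops, which is why an accuracy figure on its own says little about whether a model is fit for the job it was built for.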
The scary part about this was that I was the only one in the class who had spent the time to make these observations, since most people doing that sort of thing (especially students) will just go ahead and assume that the data they have is suitable for what they want to do with it. The even scarier part was that after overhearing that conversation between the two professors about all models being wrong, I realized that I was never taught to explain why certain models should be used with caution. In today's society we are taught to emphasize our strengths and ignore our weaknesses.
All models are wrong, but some are useful. If anyone ever wants you or the government to do something based on a model, be extremely skeptical. Science has always been political, and models are always wrong.