I’m seeking help with predicting COVID-19 data for the upcoming year (June 2021 to May 2022) based on a sample dataset I have. Could anyone assist with forecasting the number of positive and negative test samples? The sample data is attached for reference.
Your quick response would be greatly appreciated. Thanks in advance for your help.
I appreciate your faith in me, but I’m hoping there are other people who are better placed to help you with this one. I’m guessing that you want some form of linear or time series regression, but building models and making predictions isn’t my area of expertise. To my mind there isn’t a whole lot of data there for making good forecasts, but there might be somebody else who knows better than me.
I’m more engineer than scientist… more data wrangler than data predictor… more yellow nodes than green nodes!
I can help out on the wrangling side though, as you might want to begin by sorting your data in actual month order rather than alphabetical month order. I can assist you with that small thing with the following String Manipulation hack:
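For anyone following along outside the workflow, the same idea (sorting by calendar month rather than alphabetically) can be sketched in plain Python. This is only an illustration: the rows and their layout are made up, and your actual column names will differ.

```python
# Hypothetical rows: (month name, year, positive tests). Not the real data.
rows = [
    ("March", 2021, 120),
    ("January", 2021, 80),
    ("April", 2020, 15),
    ("February", 2021, 95),
]

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
# Map each month name to its calendar position, e.g. "March" -> 3.
MONTH_NUMBER = {name: i for i, name in enumerate(MONTHS, start=1)}

def month_key(row):
    """Return a sortable (year, month number) pair for one row."""
    name, year, _ = row
    return (year, MONTH_NUMBER[name])

# Sorting on the month name alone would put April before January;
# sorting on (year, month number) gives true chronological order.
rows_sorted = sorted(rows, key=month_key)
for name, year, positives in rows_sorted:
    print(f"{year} {name}: {positives}")
```

The key function is the whole trick: once each row has a numeric (year, month) pair, any sorter will put the months in the right order.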
Do you want predictions by city or by state? I agree with @takbb. It’s a pretty small dataset by region. There are no predictors, so some form of regression is about all that’s possible. Also, as @takbb said, you’ll need to clean up the dates. @takbb suggested an elegant approach. I’ll show you one of my brute force approaches.
I hate to burst your bubble, but I think your data quality is so poor that you can’t develop a credible model with it, small dataset aside. There are many missing values. In many cases, the total number of tests minus the negative tests does not equal the reported positive tests. The attached workflow shows two ways of assembling dates which may be of some use to you in the future. It flags where the reported and calculated positives don’t match and shows the number of months per state which have a complete dataset. This is a perfect example of why you need to quality check your data before worrying about developing a model.
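The quality check described above can be sketched in plain Python as well. This is not the attached workflow, just a minimal illustration with invented records; the field names (`state`, `month`, `total`, `negative`, `positive`) are assumptions, and "complete" is taken here to mean that none of the three counts is missing.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing value. Not the real data.
records = [
    {"state": "A", "month": "2021-01", "total": 1000, "negative": 900, "positive": 100},
    {"state": "A", "month": "2021-02", "total": 1200, "negative": 1000, "positive": 150},
    {"state": "A", "month": "2021-03", "total": 800, "negative": None, "positive": 60},
    {"state": "B", "month": "2021-01", "total": 500, "negative": 450, "positive": 50},
]

def check(record):
    """Classify one record as 'missing', 'mismatch' or 'ok'."""
    total, negative, positive = record["total"], record["negative"], record["positive"]
    if total is None or negative is None or positive is None:
        return "missing"
    # Calculated positives (total - negative) should equal reported positives.
    if total - negative != positive:
        return "mismatch"
    return "ok"

complete_months = defaultdict(int)  # months per state with no missing values
mismatches = []                     # (state, month) pairs that fail the check
for record in records:
    status = check(record)
    if status != "missing":
        complete_months[record["state"]] += 1
    if status == "mismatch":
        mismatches.append((record["state"], record["month"]))

print(dict(complete_months))
print(mismatches)
```

Running a check like this before any modelling tells you how many usable months each state really has.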
First of all, sorry for uploading the data without cleaning it. Thank you for replying. @rfeigel, your workflow works very smoothly; thanks for that.
As you mentioned, rather than predicting city-wise, I want to predict the state-wise data for the next year, specifically how much the positive rate will increase.
Yes, we need a large dataset, but unfortunately, we only have this one.
Could you suggest which model to use to achieve this result?
@tqAkshay95 the question is always what you are trying to find in the data. With a pandemic there are obviously outside factors that influence the numbers (measures, inoculations). Some illnesses have some seasonality, though with COVID, and depending on the country, that might not be the case.
Also, with pandemics everything might happen in shifted intervals, where you only see effects some weeks later. From what I hear from friends who worked for large health institutions, modeling pandemics is notoriously hard.
As other people have said, a time series approach might be best. You could also try a regression approach where you give the values of the previous months as features. You might treat that as an exercise, though with your very limited data I doubt that the model would really work.
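To make the "previous months as features" idea concrete, here is a minimal sketch in plain Python. The series is invented, the helper names `make_lagged` and `fit_one_lag` are hypothetical, and the one-lag least-squares fit is just the simplest possible stand-in for a regression model, not a recommendation for this dataset.

```python
# Hypothetical monthly positive counts for one state, oldest first.
series = [100, 120, 150, 170, 160, 180, 210, 230]

def make_lagged(values, n_lags):
    """Build (features, target) pairs: the previous n_lags values predict the next one."""
    rows = []
    for t in range(n_lags, len(values)):
        rows.append((values[t - n_lags:t], values[t]))
    return rows

def fit_one_lag(values):
    """Closed-form ordinary least squares for y_t = a + b * y_(t-1)."""
    pairs = make_lagged(values, 1)
    xs = [features[0] for features, _ in pairs]
    ys = [target for _, target in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# With two lags, the first training row uses months 1 and 2 to predict month 3.
print(make_lagged(series, 2)[0])

a, b = fit_one_lag(series)
next_month = a + b * series[-1]  # one-step-ahead forecast from the last observed value
print(round(next_month, 1))
```

Note how quickly the rows run out: eight months with two lags leaves only six training examples, which is the practical reason to be cautious with a dataset this small.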
I’ll let @mlauber71 speak for himself, but I think we’re both very skeptical that you can develop a credible model with this dataset. There are two ways to think about this: can I develop a model, and should I develop a model? You can always go through the exercise of developing a model, but the result can take on a life of its own and engender unwarranted confidence. Here’s a partial list of states with complete (both positive and negative) data from my workflow. Also, as @mlauber71 previously mentioned, there are many extraneous factors which may affect the future.