Shocking revelations while predicting COVID-19 cases using ML
Hey everyone!
I hope you are all staying safe at home.
As a follow-up to my previous blog for the “MSP Developer Stories” contest, organized by the Microsoft Student Partner community exclusively for the India region, this blog is dedicated to model training using machine learning on Azure Notebooks.
So let’s begin by looking at how we generally train a model using Python:
1. First, load a dataset. I’m using the most popular dataset among beginners, the Iris dataset:
from sklearn.datasets import load_iris
iris = load_iris()
2. Then we separate the data from the target we want to predict. For example, the flower measurements are the data and the flower species is the target.
X = iris.data
y = iris.target
3. Now divide the data into two parts: train and test. The training set is what the model learns from, and the test set is used to check how accurately the model has been trained.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
4. Now we have to import a model and train it.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
5. Let’s run the model on the test data and compare the predicted results with the actual ones to see how accurate it is.
y_pred = knn.predict(X_test)
from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))
Whoa! An accuracy of 0.98 is pretty cool.
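For reference, the five steps above can be run as one script. This is a minimal sketch that simply stitches the snippets together; the function name `train_knn_on_iris` is my own and not part of scikit-learn.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def train_knn_on_iris(n_neighbors=3, test_size=0.4, random_state=1):
    """Train a kNN classifier on Iris and return (model, test accuracy)."""
    iris = load_iris()
    X, y = iris.data, iris.target

    # Hold out part of the data so the model is scored on unseen samples.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)

    y_pred = knn.predict(X_test)
    return knn, metrics.accuracy_score(y_test, y_pred)


knn, accuracy = train_knn_on_iris()
print("kNN model accuracy:", accuracy)
```

Changing `n_neighbors` or `test_size` in the call is an easy way to see how much the score depends on those choices.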
Now, I am going to apply this knowledge of model training to my previous project on Covid-19 cases in Italy. I would suggest you pre-load the dataset and rename the columns too, using the steps mentioned in my previous blog.
1. So, we first import the model. I am using the ARIMA model here, because the data I want to train on varies with time and ARIMA is a good fit for time series data.
from statsmodels.tsa.arima_model import ARIMA
2. Keep only the column we want to predict. I want to predict the total number of corona positive cases in the coming days, so I reduce the dataframe to this column.
df = df[['Total amount of positive cases']]
3. Now train the model and check the summary.
model=ARIMA(df,order=(1,1,1))
model_fit=model.fit(disp=0)
print(model_fit.summary())
4. What!! Don’t you all think the error (std err) is too high? I looked into this and found that my time series might not be stationary. So I ran a basic check: split the series into two halves and compare their means and variances, which should be similar for a stationary series.
from numpy import log
x=df.values
split=round(len(x)/2)
x1,x2=x[0:split],x[split:]
mean1,mean2=x1.mean(),x2.mean()
var1,var2=x1.var(),x2.var()
print("mean1=%f, mean2=%f" % (mean1, mean2))
print("variance1=%f,variance2=%f" % (var1, var2))
5. The difference in variances clearly suggests that my data isn’t stationary at all. I also plotted a histogram to verify, and it pointed to obviously non-stationary data.
df.hist()
TIP: You can also perform an ADF test to check whether the time-series is stationary or not.
6. Now that I know my data isn’t stationary, I need to convert it into a stationary series. A log transform can be used to flatten out exponential change back to a linear relationship.
dfl=log(df)
dfl.hist()
dfl.plot()
x=dfl.values #log values
split=round(len(x)/2)
x1,x2=x[0:split],x[split:]
mean1,mean2=x1.mean(),x2.mean()
var1,var2=x1.var(),x2.var()
print("mean1=%f, mean2=%f" % (mean1, mean2))
print("variance1=%f,variance2=%f" % (var1, var2))
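To see why a log transform helps with exponential growth, here is a small sketch on an idealized series y = a * exp(b * t). The values of `a` and `b` are made up. Taking the log gives log(a) + b * t, which is a straight line, so consecutive differences of the log are all equal to `b`.

```python
import numpy as np

a, b = 100.0, 0.2          # hypothetical starting level and growth rate
t = np.arange(30)
y = a * np.exp(b * t)      # idealized exponential growth

log_y = np.log(y)          # equals log(a) + b * t, a straight line in t
steps = np.diff(log_y)     # change between consecutive points of the log

print("raw steps keep growing:", np.diff(y)[:3])
print("log steps stay constant:", steps[:3])
```

Real case counts are only roughly exponential, so on real data the log may straighten the curve without making it fully stationary.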
7. Ohh!! Even the log transform couldn’t turn it into a stationary series. Let’s try the most common method for converting a non-stationary series into a stationary one: differencing. You can implement it explicitly by defining a function, or you can use the built-in method.
(I) Explicitly:
from pandas import Series
from matplotlib import pyplot

def difference(dataset, interval=1):
    # Subtract from each value the one `interval` steps before it
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return Series(diff)

diff = difference(df.values.ravel())  # flatten the single column to 1-D first
pyplot.plot(diff)
pyplot.show()
(II) Built-in method:
diff = df.diff()
pyplot.plot(diff)
pyplot.show()
8. Ohh! Differencing once still doesn’t make the data stationary. Okay, I am going to difference it a few more times using the built-in method. I performed differencing 3 more times.
9. Drop the first 4 rows of diff, because each round of differencing turns one more leading value into NaN.
diff=diff[4:]
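The repeated differencing from steps 8 and 9 isn’t shown above, so here is one way it could look. This is a sketch on a made-up series, assuming four rounds of `diff()` in total; `dropna()` is an alternative to slicing off the first four rows by hand.

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts standing in for the real column (an assumption).
rng = np.random.default_rng(7)
values = np.cumsum(rng.integers(50, 500, size=20)).astype(float)
df = pd.DataFrame({"Total amount of positive cases": values})

diff = df
for _ in range(4):         # the first round plus three more
    diff = diff.diff()     # each round adds one more leading NaN

print(diff.head(6))        # the first four rows are NaN

sliced = diff[4:]          # what the slice in step 9 does
dropped = diff.dropna()    # same rows, without counting NaNs by hand
print(sliced.equals(dropped))
```

With `dropna()` you don’t have to remember how many rounds of differencing were applied.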
10. Let’s train our model again and see what happens this time…
model=ARIMA(diff,order=(1,1,1))
model_fit=model.fit(disp=0)
print(model_fit.summary())
Still, there is a high error in the results!!
What do these errors mean?
That COVID can’t be predicted?
That all the predictions you are hearing around you are false?
That there is no certainty of this virus ending?
No certainty of its peak point?
No certainty of the lockdown period?
I am going to leave you with these questions until my next blog. Meanwhile, we will both try to find answers to them.
Until then, stay safe and stay home.