I recently attended a class where the instructor was teaching us how to create a linear regression model using Python. Here is my linear regression model:
```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import numpy as np
from sklearn.metrics import r2_score

# Define the path for the file
path = r"C:\Users\H\Desktop\Files\Data.xlsx"

# Read the file into a dataframe, ensuring to group by weeks
df = pd.read_excel(path, sheet_name=0)
df = df.groupby(['Week']).sum()
df = df.reset_index()

# Define x and y
x = df['Week']
y = df['Payment Amount Total']

# Draw the scatter plot
plt.scatter(x, y)
plt.show()

# Now we draw the line of linear regression
# First we want to look for these values
slope, intercept, r, p, std_err = stats.linregress(x, y)

# We then create a function
def myfunc(x):
    # Below is y = mx + c
    return slope * x + intercept

# Run each value of the x array through the function.
# This will result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))

# We plot the scatter plot and line
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

# We print the value of r
print(r)

# We predict what the cost will be in week 23
print(myfunc(23))
```
The instructor said we now must use the train/test model to determine how accurate the model above is. This confused me a little as I understood it to mean we will further refine the model above. Or, does it simply mean we will use:
- a normal linear regression model
- a train/test model
and compare the r values the two models yield, as well as the values they predict? Is the train/test model considered a regression model?
I tried to create the train/test model but I'm not sure if it's correct (the packages were imported in the example above). When I run the train/test code I get the following error:
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
Here is the full code:
```python
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]

# I display the training set:
plt.scatter(train_x, train_y)
plt.show()

# I display the testing set:
plt.scatter(test_x, test_y)
plt.show()

mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
myline = np.linspace(0, 6, 100)

plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()

# Let's look at how well my training data fit a polynomial regression
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)

# Now we test the model with the testing data as well
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)

# Now we can use this model to predict new values:
# We predict what the total amount would be in the 23rd week:
print(mymodel(23))
```
You'd be better off splitting into train and test sets using:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
Here `X` is your features dataframe and `y` is the column of your labels. `test_size=0.2` means 20% of the rows go to the test set and the remaining 80% to the training set.
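As a quick sketch of how the proportions work out (using toy stand-in data, since I don't have your file):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for your features and labels: 20 rows, 1 feature
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# test_size=0.2 reserves 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))  # 16 4
```

Unlike slicing by a fixed index, this always produces non-empty train and test sets in the requested proportions, however many rows the dataframe has.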
BTW – the error you are describing is most likely because your dataframe has 80 or fewer rows, leaving `test_x = x[80:]` and `test_y = y[80:]` empty, which is exactly what "Found array with 0 sample(s)" is complaining about.
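Putting it together, a minimal end-to-end sketch might look like the following. The data here is synthetic, standing in for your weekly totals, and the column names are copied from your example:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic weekly data standing in for your spreadsheet
rng = np.random.default_rng(42)
weeks = np.arange(1, 31)
totals = 50 * weeks + rng.normal(0, 20, size=weeks.size)
df = pd.DataFrame({"Week": weeks, "Payment Amount Total": totals})

X = df[["Week"]]                # features must be 2-D for sklearn
y = df["Payment Amount Total"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the training rows only
model = LinearRegression().fit(X_train, y_train)

# r² on the held-out rows tells you how well the fit generalises
print("test r2:", r2_score(y_test, model.predict(X_test)))

# Predict the total for week 23
print("week 23:", model.predict(pd.DataFrame({"Week": [23]}))[0])
```

This answers your question about "refining" the model: train/test splitting doesn't change the regression itself, it just holds back some rows so you can score the same kind of model on data it never saw during fitting.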