[SOLVED] Repeated columns of a single variable when using statsmodels.formula.api package ols function in python

Issue

This Content is from Stack Overflow. Question asked by pythonqueries

I am trying to perform multiple linear regression using the statsmodels.formula.api package in Python. The code I used is listed below.

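import pandas as pd
import statsmodels.formula.api as smf

auto_1 = pd.read_csv("Auto.csv")
# use every column except the first and the last as predictors
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())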

The data consists of the following variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin and name. When the summary is printed, it shows multiple rows for the horsepower column, and the regression results are also incorrect. I’m not sure why.

[Screenshot: regression summary showing repeated horsepower rows]



Solution

It’s likely because of the data type of the horsepower column. If its values are categories or strings, the model applies treatment (dummy) coding to them by default, producing a separate coefficient row for each level other than the reference level, which is what you are seeing. Check the data type (run auto_1.dtypes) and cast the column to a numeric type. It’s best to do this when you first read the CSV file, using the dtype= parameter of the read_csv() method.
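As a rough sketch of that check-and-cast step: the inline CSV below is a hypothetical stand-in for Auto.csv, and the "?" placeholder is an assumption about why a numeric-looking column might be read as strings (the question does not show the raw file). If such a placeholder is present, passing a numeric dtype= alone would raise an error, so the sketch uses na_values= or pd.to_numeric(..., errors="coerce") instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for Auto.csv; the "?" placeholder is an assumption.
csv_text = """mpg,horsepower,weight
18.0,130,3504
25.0,?,2046
24.0,95,2372
"""

# Read with no hints: the stray "?" forces horsepower to a non-numeric (string) dtype.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Option 1: declare the placeholder as missing while reading.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(clean.dtypes)

# Option 2: coerce after reading; entries that cannot be parsed become NaN.
coerced = raw.assign(
    horsepower=pd.to_numeric(raw["horsepower"], errors="coerce")
)
print(coerced.dtypes)
```

Either way, rows with a missing horsepower value are then dropped by the formula interface by default rather than turned into dummy variables.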

Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(
    {
        'mpg': np.random.randint(20, 40, 50),
        'horsepower': np.random.randint(100, 200, 50)
    }
)
# convert integers to strings (or categories)
df['horsepower'] = (
    df['horsepower'].astype('str')  # same result with .astype('category')
)

formula = 'mpg ~ horsepower'

results = smf.ols(formula, df).fit()
print(results.summary())

Output (dummy coding):

OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.778
Model:                            OLS   Adj. R-squared:                 -0.207
Method:                 Least Squares   F-statistic:                    0.7901
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.715
Time:                        20:17:51   Log-Likelihood:                -110.27
No. Observations:                  50   AIC:                             302.5
Df Residuals:                       9   BIC:                             380.9
Df Model:                          40                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            32.0000      5.175      6.184      0.000      20.294      43.706
horsepower[T.103]    -4.0000      7.318     -0.547      0.598     -20.555      12.555
horsepower[T.112]    -1.0000      7.318     -0.137      0.894     -17.555      15.555
horsepower[T.116]    -9.0000      7.318     -1.230      0.250     -25.555       7.555
horsepower[T.117]     6.0000      7.318      0.820      0.433     -10.555      22.555
horsepower[T.118]     2.0000      7.318      0.273      0.791     -14.555      18.555
horsepower[T.120]    -4.0000      6.338     -0.631      0.544     -18.337      10.337

etc.

Now, converting the strings back to integers:

df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')

results = smf.ols(formula, df).fit()
print(results.summary())

Output (as expected):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.5388
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.466
Time:                        20:24:54   Log-Likelihood:                -147.65
No. Observations:                  50   AIC:                             299.3
Df Residuals:                      48   BIC:                             303.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     31.7638      3.663      8.671      0.000      24.398      39.129
horsepower    -0.0176      0.024     -0.734      0.466      -0.066       0.031
==============================================================================
Omnibus:                        3.529   Durbin-Watson:                   1.859
Prob(Omnibus):                  0.171   Jarque-Bera (JB):                1.725
Skew:                           0.068   Prob(JB):                        0.422
Kurtosis:                       2.100   Cond. No.                         834.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
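When the formula is built by joining column names, as in the question, it may help to list the non-numeric columns before fitting. The following is a minimal sketch using a made-up miniature of the data; the frame name auto and its values are purely illustrative.

```python
import pandas as pd

# Hypothetical miniature of the Auto data; values are made up for illustration.
auto = pd.DataFrame(
    {
        "mpg": [18.0, 25.0, 24.0],
        "horsepower": ["130", "88", "95"],  # numbers stored as strings
        "weight": [3504, 2046, 2372],
        "name": ["car a", "car b", "car c"],
    }
)

# Columns that a formula would treat as categorical (dummy-coded)
non_numeric = auto.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Convert the column that should be numeric, then build the formula
# from the numeric columns only (excluding the response, mpg)
auto["horsepower"] = pd.to_numeric(auto["horsepower"])
predictors = [
    c for c in auto.select_dtypes(include="number").columns if c != "mpg"
]
formula = "mpg ~ " + " + ".join(predictors)
print(formula)
```

Any column that shows up in non_numeric but is meant to be a continuous predictor is a candidate for conversion before calling smf.ols().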


This question was asked on Stack Overflow by pythonqueries and answered by AlexK. It is licensed under the terms of CC BY-SA 2.5 - CC BY-SA 3.0 - CC BY-SA 4.0.
