python - Statsmodels throws "overflow in exp" and "divide by zero in log" warnings and pseudo-R squared is -inf


I want to run a logistic regression in Python using statsmodels.

x and y have 750 rows each; y is the binary outcome, and x contains 10 features (including the intercept).

Here are the first 12 rows of x (the last column is the intercept):

           lngdp_      lnpop    sxp      sxp2    gy1    frac  etdo4590  geogia  \
    0    7.367709  16.293980  0.190  0.036100 -1.682   132.0         1   0.916
    1    7.509883  16.436258  0.193  0.037249  2.843   132.0         1   0.916
    2    7.759187  16.589224  0.269  0.072361  4.986   132.0         1   0.916
    3    7.922261  16.742384  0.368  0.135424  3.261   132.0         1   0.916
    4    8.002359  16.901037  0.170  0.028900  1.602   132.0         1   0.916
    5    7.929126  17.034786  0.179  0.032041 -1.465   132.0         1   0.916
    6    6.594413  15.627563  0.360  0.129600 -9.321  4134.0         0   0.648
    7    6.448889  16.037861  0.476  0.226576 -2.356  3822.0         0   0.648
    8    8.520786  16.919334  0.048  0.002304  2.349   434.0         1   0.858
    9    8.637107  16.991980  0.050  0.002500  2.326   434.0         1   0.858
    10   8.708144  17.075489  0.042  0.001764  1.421   465.0         1   0.858
    11   8.780480  17.151779  0.080  0.006400  1.447   496.0         1   0.858

        peace  intercept
    0    24.0        1.0
    1    84.0        1.0
    2   144.0        1.0
    3   204.0        1.0
    4   264.0        1.0
    5   324.0        1.0
    6     1.0        1.0
    7    16.0        1.0
    8   112.0        1.0
    9   172.0        1.0
    10  232.0        1.0
    11  292.0        1.0

This is my code:

    import statsmodels.api as sm

    logit = sm.Logit(y, x, missing='drop')
    result = logit.fit()
    print(result.summary())

This is the output:

    Optimization terminated successfully.
             Current function value: inf
             Iterations 9

    /home/ipattern/anaconda3/lib/python3.6/site-packages/statsmodels/discrete/discrete_model.py:1214: RuntimeWarning: overflow encountered in exp
      return 1/(1+np.exp(-X))

    /home/ipattern/anaconda3/lib/python3.6/site-packages/statsmodels/discrete/discrete_model.py:1264: RuntimeWarning: divide by zero encountered in log
      return np.sum(np.log(self.cdf(q*np.dot(X,params))))

                               Logit Regression Results
    ==============================================================================
    Dep. Variable:                  warsa   No. Observations:                  750
    Model:                          Logit   Df Residuals:                      740
    Method:                           MLE   Df Model:                            9
    Date:                Tue, 12 Sep 2017   Pseudo R-squ.:                    -inf
    Time:                        11:16:58   Log-Likelihood:                   -inf
    converged:                       True   LL-Null:                   -4.6237e+05
                                            LLR p-value:                     1.000
    ==============================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------
    lngdp_        -0.9504      0.245     -3.872      0.000      -1.431      -0.469
    lnpop          0.5105      0.128      3.975      0.000       0.259       0.762
    sxp           16.7734      5.206      3.222      0.001       6.569      26.978
    sxp2         -23.8004     10.040     -2.371      0.018     -43.478      -4.123
    gy1           -0.0980      0.041     -2.362      0.018      -0.179      -0.017
    frac          -0.0002    9.2e-05     -2.695      0.007      -0.000   -6.76e-05
    etdo4590       0.4801      0.328      1.463      0.144      -0.163       1.124
    geogia        -0.9919      0.909     -1.091      0.275      -2.774       0.790
    peace         -0.0038      0.001     -3.808      0.000      -0.006      -0.002
    intercept     -3.4375      2.486     -1.383      0.167      -8.310       1.435
    ==============================================================================

The coefficients, std err, p-values etc. at the bottom are correct (I know because I have the "solution").

But as you can see, the current function value is inf, which I think is wrong.

And there are two warnings: apparently statsmodels computes np.exp(bignumber) somewhere, e.g. np.exp(999), and also np.log(0).
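The two warnings can be reproduced in isolation: once exp overflows, the logistic CDF evaluates to exactly 0.0, and taking the log of that gives -inf. A minimal illustration of the mechanics, using the same sigmoid expression as the warning's source line:

```python
import numpy as np

# The sigmoid 1/(1+exp(-x)) for a very negative linear predictor:
# np.exp(999.0) overflows to inf, so p collapses to exactly 0.0,
# and log(0.0) is -inf (with a "divide by zero" warning).
x_lin = -999.0
p = 1 / (1 + np.exp(-x_lin))   # RuntimeWarning: overflow encountered in exp
ll = np.log(p)                 # RuntimeWarning: divide by zero encountered in log

print(p, ll)                   # 0.0 -inf
```

This is exactly the pair of warnings in the traceback: one from the cdf, one from the log-likelihood summation.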

Also, the pseudo R-squ. is -inf and the log-likelihood is -inf, which I think they shouldn't be.

So what am I doing wrong?

Edit:

x.describe():

               lngdp_       lnpop         sxp        sxp2         gy1
    count  750.000000  750.000000  750.000000  750.000000  750.000000
    mean     7.766948   15.702191    0.155329    0.043837    1.529772
    std      1.045121    1.645154    0.140486    0.082838    3.546621
    min      5.402678   11.900227    0.002000    0.000004  -13.088000
    25%      6.882694   14.723123    0.056000    0.003136   -0.411250
    50%      7.696212   15.680984    0.111000    0.012321    1.801000
    75%      8.669355   16.652981    0.203000    0.041209    3.625750
    max      9.851826   20.908354    0.935000    0.874225   14.409000

                  frac    etdo4590      geogia       peace  intercept
    count   750.000000  750.000000  750.000000  750.000000      750.0
    mean   1812.777333    0.437333    0.600263  348.209333        1.0
    std    1982.106029    0.496388    0.209362  160.941996        0.0
    min      12.000000    0.000000    0.000000    1.000000        1.0
    25%     176.000000    0.000000    0.489250  232.000000        1.0
    50%     864.000000    0.000000    0.608000  352.000000        1.0
    75%    3375.000000    1.000000    0.763000  472.000000        1.0
    max    6975.000000    1.000000    0.971000  592.000000        1.0

logit.loglikeobs(result.params):

    array([ -4.61803704e+01,  -2.26983454e+02,  -2.66741244e+02,
            -2.60206733e+02,  -4.75585266e+02,  -1.76454554e+00,
            -4.86048292e-01,  -8.02300533e-01,             -inf,
                       -inf,             -inf,             -inf,
                       -inf,             -inf,             -inf,
                       -inf,             -inf,             -inf,
                       -inf,             -inf,             -inf,
                       -inf,             -inf,             -inf,
                       -inf,             -inf,             -inf,
                       -inf,             -inf,             -inf,
                       -inf,             -inf,  -6.02780923e+02,
            -4.12209348e+02,  -6.42901288e+02,  -6.94331125e+02,
                       -inf,             -inf,             -inf,
                       -inf,             -inf,             -inf,
                       -inf,             -inf,             -inf,  ...

(logit.exog * np.array(result.params)).min(0):

    array([ -9.36347474,   6.07506083,   0.03354677, -20.80694575,
            -1.41162588,  -1.72895247,   0.        ,  -0.9631801 ,
            -2.23188846,  -3.4374963 ])

Datasets:

x: https://pastebin.com/vrnsepbg

y: https://pastebin.com/j2udyc7m

I'm surprised that it still converges in this case.

There can be convergence problems and overflow in the exp functions used in Logit or Poisson when the x values are large. These can often be avoided by rescaling the regressors.
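As a sketch of what that rescaling might look like (a hypothetical helper, not a statsmodels function), standardize every column except the intercept so that the linear predictor passed to exp stays moderate:

```python
import numpy as np

# Hypothetical helper: standardize all regressors except the intercept
# column, so exp() only sees moderate linear predictors.
def standardize_exog(X, intercept_col):
    Xs = np.asarray(X, dtype=float).copy()
    cols = [j for j in range(Xs.shape[1]) if j != intercept_col]
    Xs[:, cols] = (Xs[:, cols] - Xs[:, cols].mean(0)) / Xs[:, cols].std(0)
    return Xs

# Toy design matrix: one large-scale regressor plus an intercept column.
X = np.column_stack([np.linspace(100.0, 900.0, 50), np.ones(50)])
Xs = standardize_exog(X, intercept_col=1)
print(Xs[:, 0].mean(), Xs[:, 0].std())   # ~0.0 and 1.0
```

Note that (as discussed below) rescaling changes only the scale of the estimated coefficients, not the fitted probabilities, so it helps with numerics during optimization rather than with outliers.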

However, in this case my guess is that there are outliers in x: the 6th column has values like 4134.0 while the others are much smaller.

You can check the log-likelihood of each observation with logit.loglikeobs(result.params) to see which observations might cause problems, where the name logit references the model instance.

Looking at the contribution of each predictor might also help, for example:

np.argmax(np.abs(logit.exog * result.params), 0)

or

(logit.exog * result.params).min(0)

If it's only one or a few observations, then dropping them might help. Rescaling exog will not help with this, because upon convergence it is compensated by a corresponding rescaling of the estimated coefficients.

Also check whether there isn't an encoding error or a large-value placeholder for missing values.

Edit:

Given that the number of -inf values in loglikeobs seems large, I think there might be a more fundamental problem than outliers, in the sense that the Logit model is not a correctly specified maximum likelihood model for this dataset.

Two possibilities in general (because I haven't seen the dataset):

Perfect separation: Logit assumes that the predicted probabilities stay away from zero and one. In some cases an explanatory variable, or a combination of them, allows perfect prediction of the dependent variable. In that case the parameters are either not identified or go off to plus or minus infinity, and the actual parameter estimates depend on the convergence criteria of the optimization. statsmodels' Logit detects some of these cases and raises a PerfectSeparationError exception, but it doesn't detect cases with only partial separation.

Logit, or GLM with a Binomial family, is in the one-parameter linear exponential family. The parameter estimates in this case depend only on the specified mean function and the implied variance; they do not require that the likelihood function be correctly specified. So it is possible to get (consistent) estimates even if the likelihood function is not correct for the given dataset. In that case the solution is a quasi-maximum likelihood estimator, but the reported log-likelihood value is not valid.

This can have the effect that the results, in terms of convergence and numerical stability, depend on computational details of how edge or extreme cases are handled. statsmodels is clipping values to keep them away from the bounds in some cases, but not yet everywhere.

The difficulty is in figuring out the numerical problems, and in avoiding returning "some" numbers without warning the user in cases where the underlying model is inappropriate for, or incompatible with, the data.

Maybe llf = -inf is the "correct" answer in this case, and the finite numbers are just an approximation to -inf. Or maybe it is a numerical problem arising from the way the functions are implemented in double precision.
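As a generic numerics illustration (not statsmodels' actual implementation): the naive log(sigmoid(z)) hits -inf in double precision, while the mathematically equivalent form -logaddexp(0, -z) stays finite for the same inputs:

```python
import numpy as np

z = np.array([800.0, 0.0, -800.0])

# Naive log(sigmoid(z)): exp over/underflows at the extremes, so the
# last entry becomes log(0.0) = -inf.
with np.errstate(over="ignore", divide="ignore"):
    naive = np.log(1.0 / (1.0 + np.exp(-z)))

# Equivalent reformulation: log(sigmoid(z)) == -log(1 + exp(-z))
#                                           == -logaddexp(0, -z)
stable = -np.logaddexp(0.0, -z)

print(naive)    # last entry: -inf
print(stable)   # last entry: -800.0, finite
```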

