python - Statsmodels throws "overflow in exp" and "divide by zero in log" warnings and pseudo-R squared is -inf
I want to run a logistic regression in Python using statsmodels.
x and y have 750 rows each; y is the binary outcome, and x contains 10 features (including the intercept).
Here are the first 12 rows of x (the last column is the intercept):

          lngdp_      lnpop    sxp      sxp2    gy1    frac  etdo4590  geogia  \
    0   7.367709  16.293980  0.190  0.036100 -1.682   132.0         1   0.916
    1   7.509883  16.436258  0.193  0.037249  2.843   132.0         1   0.916
    2   7.759187  16.589224  0.269  0.072361  4.986   132.0         1   0.916
    3   7.922261  16.742384  0.368  0.135424  3.261   132.0         1   0.916
    4   8.002359  16.901037  0.170  0.028900  1.602   132.0         1   0.916
    5   7.929126  17.034786  0.179  0.032041 -1.465   132.0         1   0.916
    6   6.594413  15.627563  0.360  0.129600 -9.321  4134.0         0   0.648
    7   6.448889  16.037861  0.476  0.226576 -2.356  3822.0         0   0.648
    8   8.520786  16.919334  0.048  0.002304  2.349   434.0         1   0.858
    9   8.637107  16.991980  0.050  0.002500  2.326   434.0         1   0.858
    10  8.708144  17.075489  0.042  0.001764  1.421   465.0         1   0.858
    11  8.780480  17.151779  0.080  0.006400  1.447   496.0         1   0.858

        peace  intercept
    0    24.0        1.0
    1    84.0        1.0
    2   144.0        1.0
    3   204.0        1.0
    4   264.0        1.0
    5   324.0        1.0
    6     1.0        1.0
    7    16.0        1.0
    8   112.0        1.0
    9   172.0        1.0
    10  232.0        1.0
    11  292.0        1.0
This is the code:

    import statsmodels.api as sm

    logit = sm.Logit(y, x, missing='drop')
    result = logit.fit()
    print(result.summary())
This is the output:

    Optimization terminated successfully.
             Current function value: inf
             Iterations 9

    /home/ipattern/anaconda3/lib/python3.6/site-packages/statsmodels/discrete/discrete_model.py:1214: RuntimeWarning: overflow encountered in exp
      return 1/(1+np.exp(-X))
    /home/ipattern/anaconda3/lib/python3.6/site-packages/statsmodels/discrete/discrete_model.py:1264: RuntimeWarning: divide by zero encountered in log
      return np.sum(np.log(self.cdf(q*np.dot(X,params))))

                               Logit Regression Results
    ==============================================================================
    Dep. Variable:                  warsa   No. Observations:                  750
    Model:                          Logit   Df Residuals:                      740
    Method:                           MLE   Df Model:                            9
    Date:                Tue, 12 Sep 2017   Pseudo R-squ.:                    -inf
    Time:                        11:16:58   Log-Likelihood:                   -inf
    converged:                       True   LL-Null:                   -4.6237e+05
                                            LLR p-value:                     1.000
    ==============================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------
    lngdp_        -0.9504      0.245     -3.872      0.000      -1.431      -0.469
    lnpop          0.5105      0.128      3.975      0.000       0.259       0.762
    sxp           16.7734      5.206      3.222      0.001       6.569      26.978
    sxp2         -23.8004     10.040     -2.371      0.018     -43.478      -4.123
    gy1           -0.0980      0.041     -2.362      0.018      -0.179      -0.017
    frac          -0.0002    9.2e-05     -2.695      0.007      -0.000   -6.76e-05
    etdo4590       0.4801      0.328      1.463      0.144      -0.163       1.124
    geogia        -0.9919      0.909     -1.091      0.275      -2.774       0.790
    peace         -0.0038      0.001     -3.808      0.000      -0.006      -0.002
    intercept     -3.4375      2.486     -1.383      0.167      -8.310       1.435
    ==============================================================================
The coefficients, std err, p-values etc. at the bottom are correct (I know because I have a "solution").

But as you can see, the current function value is inf, which I think is wrong.

There are also the two warnings: apparently statsmodels evaluates np.exp(bignumber) somewhere, e.g. np.exp(999), and also np.log(0).

The Pseudo R-squ. is -inf and the Log-Likelihood is -inf as well, which I think they shouldn't be.

So what am I doing wrong?
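Both warnings can be reproduced in isolation. A minimal numpy sketch (not the statsmodels internals themselves) of what happens in the logistic cdf when the linear predictor is very large in magnitude:

```python
import numpy as np

# For a very negative linear predictor, exp(-x) overflows to inf,
# the logistic cdf underflows to exactly 0.0, and log(0.0) gives -inf
xb = -1000.0
with np.errstate(over='ignore', divide='ignore'):
    cdf = 1 / (1 + np.exp(-xb))   # exp(1000) -> inf, so cdf -> 0.0
    ll = np.log(cdf)              # log(0.0) -> -inf

print(cdf, ll)  # 0.0 -inf
```

Once a single observation contributes -inf, the summed log-likelihood, and with it the Pseudo R-squ., is -inf as well.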
Edit:

x.describe():

                lngdp_        lnpop         sxp        sxp2         gy1
    count   750.000000   750.000000  750.000000  750.000000  750.000000
    mean      7.766948    15.702191    0.155329    0.043837    1.529772
    std       1.045121     1.645154    0.140486    0.082838    3.546621
    min       5.402678    11.900227    0.002000    0.000004  -13.088000
    25%       6.882694    14.723123    0.056000    0.003136   -0.411250
    50%       7.696212    15.680984    0.111000    0.012321    1.801000
    75%       8.669355    16.652981    0.203000    0.041209    3.625750
    max       9.851826    20.908354    0.935000    0.874225   14.409000

                  frac    etdo4590      geogia       peace  intercept
    count   750.000000  750.000000  750.000000  750.000000      750.0
    mean   1812.777333    0.437333    0.600263  348.209333        1.0
    std    1982.106029    0.496388    0.209362  160.941996        0.0
    min      12.000000    0.000000    0.000000    1.000000        1.0
    25%     176.000000    0.000000    0.489250  232.000000        1.0
    50%     864.000000    0.000000    0.608000  352.000000        1.0
    75%    3375.000000    1.000000    0.763000  472.000000        1.0
    max    6975.000000    1.000000    0.971000  592.000000        1.0
logit.loglikeobs(result.params):

    array([ -4.61803704e+01, -2.26983454e+02, -2.66741244e+02, -2.60206733e+02,
            -4.75585266e+02, -1.76454554e+00, -4.86048292e-01, -8.02300533e-01,
                       -inf,            -inf,            -inf,            -inf,
                       -inf,            -inf,            -inf,            -inf,
                       -inf,            -inf,            -inf,            -inf,
                       -inf,            -inf,            -inf,            -inf,
                       -inf,            -inf,            -inf,            -inf,
                       -inf,            -inf,            -inf, -6.02780923e+02,
            -4.12209348e+02, -6.42901288e+02, -6.94331125e+02,            -inf,
                       -inf,            -inf,            -inf,            -inf,
                       -inf,            -inf,            -inf,            -inf,
            ...
(logit.exog * np.array(result.params)).min(0):

    array([ -9.36347474,   6.07506083,   0.03354677, -20.80694575,
            -1.41162588,  -1.72895247,   0.        ,  -0.9631801 ,
            -2.23188846,  -3.4374963 ])
Datasets:
I am surprised that it still converges in this case.

There can be convergence problems from overflow in the exp functions used in Logit or Poisson when the values of x are large; this can be avoided by rescaling the regressors.

However, in this case my guess is that there are outliers in x. The 6th column (frac) has values like 4134.0 while the others are much smaller.
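As an illustration of rescaling (the numbers below are made up, not taken from the dataset), dividing each non-intercept column by its maximum absolute value keeps the linear predictor small:

```python
import numpy as np

# Made-up design matrix; the last column is the intercept
x = np.array([[7.37,  132.0, 1.0],
              [6.59, 4134.0, 1.0],
              [8.52,  434.0, 1.0]])

scaled = x.copy()
scale = np.abs(scaled[:, :-1]).max(0)  # per-column scale, intercept untouched
scaled[:, :-1] /= scale

print(scaled.max(0))  # every column now has max 1.0
```

The estimated coefficients on the scaled regressors are then the original ones multiplied by `scale`.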
You can check the log-likelihood of each observation with logit.loglikeobs(result.params) to see which observations might cause problems, where the name logit references the model.
Looking at the contribution of each predictor might also help, for example

    np.argmax(np.abs(logit.exog * result.params), 0)

or

    (logit.exog * result.params).min(0)
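A self-contained numpy sketch of these diagnostics; exog, params, and y below are made-up stand-ins for logit.exog, result.params, and the outcome:

```python
import numpy as np

# Made-up stand-ins for logit.exog, result.params, and the outcome y
exog = np.array([[1.0,     0.2],
                 [1.0, 10000.0],   # an outlier in the second predictor
                 [1.0,    -0.3]])
params = np.array([-0.5, 0.1])
y = np.array([1.0, 0.0, 1.0])

# Per-observation log-likelihood of the logit model:
# ll_i = log(sigmoid(q_i * x_i @ params)) with q_i = 2*y_i - 1
q = 2 * y - 1
with np.errstate(over='ignore', divide='ignore'):
    llobs = np.log(1 / (1 + np.exp(-q * exog.dot(params))))

print(llobs)                                # the outlier row comes out as -inf
print(np.argmax(np.abs(exog * params), 0))  # per column: row with the largest contribution
```

Here the middle observation alone drives its log-likelihood term to -inf, which is exactly the pattern that loglikeobs would reveal on the real data.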
If it's one or a few observations, dropping them might help. Rescaling the exog will not help with this, because upon convergence it is compensated by a rescaling of the estimated coefficients.
Also check whether there isn't an encoding error, or a large value used as a placeholder for missing values.
Edit:

Given that the number of -inf values in loglikeobs seems large, I think there might be a more fundamental problem than outliers, in the sense that the logit model is not a correctly specified maximum likelihood model for this dataset.
There are two possibilities in general (because I haven't seen the dataset):
Perfect separation: Logit assumes that the predicted probabilities stay away from zero and one. In some cases, however, an explanatory variable or a combination of them allows perfect prediction of the dependent variable. In that case the parameters are either not identified or go to plus or minus infinity, and the actual parameter estimates depend on the convergence criteria of the optimization. statsmodels' Logit detects some of these cases and raises a PerfectSeparationError exception, but it doesn't detect cases of partial separation.
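A numpy-only sketch of perfect separation on made-up data (not the model above): when y is 1 exactly for x > 0, the log-likelihood keeps improving as the slope grows, so no finite MLE exists:

```python
import numpy as np

# Perfectly separated made-up data: y is 1 exactly when x > 0
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = (x > 0).astype(float)

def loglike(beta):
    """Logit log-likelihood for a single slope (no intercept)."""
    q = 2 * y - 1
    return np.sum(np.log(1 / (1 + np.exp(-q * beta * x))))

# The likelihood improves without bound as beta grows toward infinity
for beta in (1.0, 10.0, 100.0):
    print(beta, loglike(beta))
```

An optimizer stops wherever its convergence criterion happens to trigger, which is why the reported parameter values in separated cases are essentially arbitrary.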
Logit, or GLM-Binomial, is in the one-parameter linear exponential family. The parameter estimates in this case depend only on the specified mean function and the implied variance; they do not require that the likelihood function is correctly specified. So it is possible to get (consistent) estimates even when the likelihood function is not correct for the given dataset. In that case the solution is a quasi-maximum likelihood estimator, but the log-likelihood value is not valid.
This can have the effect that the results, in terms of convergence and numerical stability, depend on computational details of how edge or extreme cases are handled. statsmodels is clipping values to keep them away from the bounds in some cases, but not yet everywhere.
The difficulty is in figuring out what to do about the numerical problems, and how to avoid returning "some" numbers without warning the user when the underlying model is inappropriate for, or incompatible with, the data.
Maybe llf = -inf is the "correct" answer in this case, and any finite number would only be an approximation to -inf. Or maybe it's just a numerical problem because of the way the functions are implemented in double precision.
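A small illustration of the double-precision point (assuming nothing about the statsmodels internals): exp underflows to 0.0 long before its argument reaches -inf, so a naive log(cdf) reports -inf where a numerically stable formulation still returns the finite value:

```python
import numpy as np

with np.errstate(divide='ignore'):
    naive = np.log(np.exp(-1000.0))   # exp underflows to 0.0 -> log gives -inf

# log(sigmoid(-1000)) = -log(1 + exp(1000)), computed stably via logaddexp
stable = -np.logaddexp(0.0, 1000.0)

print(naive, stable)  # -inf -1000.0
```

Whether -inf or -1000.0 is the more useful answer depends on whether one views the log-likelihood as exact arithmetic or as a double-precision computation.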