Regression Modeling in Practice – Test a Logistic Regression Model

Research Question

Is a country’s income per person related to its citizens’ average life expectancy at birth?

Data Set

GapMinder

Sample

143 countries

Variables

Categorical Response Variable

lifeexpectancygroup (representing life expectancy)

0 = 0 to 70 years of age

1 = 70.01 to 90 years of age

Primary Explanatory Variable

incomeperperson (representing income per person in a year in US Dollars)

incomeperperson_c (centered incomeperperson)

Other Explanatory Variables

breastcancerper100th (representing number of breast cancer new cases per 100,000 females)

breastcancerper100th_c (centered breastcancerper100th)

HIVrate (representing HIV rate)

HIVrate_c (centered HIVrate)

 

Summary

Summarize in a few sentences what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary.

After controlling for the number of new breast cancer cases and HIV rate, income per person was significantly and positively associated with longer life expectancy (70.01-90 years of age). For each additional US dollar of income per person, the odds of a country being in the longer life expectancy group were 1.002 times as high (Odds Ratio=1.002, 95% Confidence Interval=1.001 to 1.003, P-value<.0001). Based on this model, we can be 95% confident that the population odds ratio lies between 1.001 and 1.003.
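The income odds ratio is expressed per one US dollar, which is why it sits so close to 1. As a rough worked illustration, assuming the rounded estimate of 1.002 and a hypothetical $1,000 difference in income per person (a figure chosen here only for illustration, not taken from the SAS output):

```latex
\[
\mathrm{OR}_{\$1{,}000} = \left(\mathrm{OR}_{\$1}\right)^{1000} = 1.002^{1000} = e^{1000 \ln 1.002} \approx e^{1.998} \approx 7.4
\]
```

Because 1.002 is itself rounded to three decimals, this figure is only indicative. The confidence limits 1.001 and 1.003 scale the same way, to roughly 2.7 and 20.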

After controlling for income per person and the number of new breast cancer cases, HIV rate was significantly and negatively associated with longer life expectancy (70.01-90 years of age). For each one-unit increase in HIV rate, the odds of a country being in the longer life expectancy group were 0.092 times as high, that is, about 91% lower (Odds Ratio=0.092, 95% Confidence Interval=0.023 to 0.362, P-value=0.0006). Based on this model, we can be 95% confident that the population odds ratio lies between 0.023 and 0.362.
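An odds ratio below 1 is often easier to read as a percent change in the odds. A short worked conversion, using only the reported estimate and confidence limits for HIV rate:

```latex
\[
(\mathrm{OR} - 1) \times 100\% = (0.092 - 1) \times 100\% = -90.8\%
\]
\[
\text{95\% CI: } (0.023 - 1) \times 100\% = -97.7\%, \qquad (0.362 - 1) \times 100\% = -63.8\%
\]
```

In words, each one-unit increase in HIV rate is associated with roughly 91% lower odds of being in the longer life expectancy group, with a plausible range of about 64% to 98% lower.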

After controlling for income per person and HIV rate, the number of new breast cancer cases was NOT significantly associated with life expectancy (Odds Ratio=1.004, 95% Confidence Interval=0.947 to 1.064, P-value=0.9040). The confidence interval includes 1, so there is NOT enough evidence to reject the null hypothesis of no association.

Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable.

The primary explanatory variable in this model is income per person, and the categorical response variable is life expectancy. The results supported the hypothesis of an association between income per person and life expectancy: income per person was significantly and positively associated with life expectancy both before (Odds Ratio=1.001, P-value<.0001) and after (Odds Ratio=1.002, P-value<.0001) controlling for the potential confounding factors, breast cancer and HIV rate.

Discuss whether or not there was evidence of confounding for the association between your primary explanatory variable and the response variable.

There is NO evidence of confounding for the association between income per person and life expectancy. I added the additional explanatory variables, breast cancer and HIV rate, to the model one at a time to check whether either was a confounder. Neither variable confounded the relationship between income and life expectancy: the p-value for income was less than 0.0001 both before and after controlling for these potential confounding factors, and the odds ratio changed only slightly (1.001 to 1.002). Income and life expectancy were significantly associated before and after controlling for the other factors in this model.

 

Logistic Regression Output

Check the Means

[Image: 20160213 check mean]

Logistic Regression Model for All Explanatory Variables and Life Expectancy

[Images: 20160213 log regression 5, 20160213 log regression 6]

Logistic Regression Model for Income (Primary Explanatory Variable) and Life Expectancy

[Images: 20160213 log regression 1, 20160213 log regression 2]

Logistic Regression Model for Income, Breast Cancer, and Life Expectancy

[Images: 20160213 log regression 3, 20160213 log regression 4]

 

SAS Program

/* Start the data step */

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new2; set mydata.gapminder;

 

/* Bin response variable lifeexpectancy into 2 categories */

IF lifeexpectancy LE 70 AND lifeexpectancy GE 0 THEN lifeexpectancygroup=0;

ELSE IF lifeexpectancy LE 90 AND lifeexpectancy GT 70 THEN lifeexpectancygroup=1;

 

/* Center explanatory variables */

incomeperperson_c=incomeperperson-8740.97;

breastcancerper100th_c=breastcancerper100th-37.4028902;

HIVrate_c=HIVrate-1.9354422;

run;

 

/* Check means for centered explanatory variables */

PROC means; var incomeperperson_c breastcancerper100TH_c HIVrate_c;

run;

 

/* Run logistic regression and test confounder*/

PROC LOGISTIC descending; model lifeexpectancygroup=incomeperperson_c;

run;

PROC LOGISTIC descending; model lifeexpectancygroup=incomeperperson_c breastcancerper100th_c;

run;

PROC LOGISTIC descending; model lifeexpectancygroup=incomeperperson_c breastcancerper100th_c HIVrate_c;

run;
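Because income is measured in single US dollars, the per-dollar odds ratio from the model above is very close to 1. A minimal sketch of one way to get a more readable figure, assuming the same new2 data set and using the UNITS statement of PROC LOGISTIC; the 1000-dollar step is an illustrative choice and not part of the original analysis:

```sas
/* Sketch: same full model, with an extra odds ratio reported per 1,000 US dollars */
/* The unit of 1000 is an assumption chosen for readability, not from the original */
PROC LOGISTIC data=new2 descending;
  model lifeexpectancygroup=incomeperperson_c breastcancerper100th_c HIVrate_c;
  units incomeperperson_c=1000;
run;
```

The fitted coefficients and p-values are unchanged by this statement; it only adds a table of odds ratios for the requested change in income.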

Regression Modeling in Practice – Test a Multiple Regression Model

SAS Program

Sample

143 countries

Variables

Quantitative Response Variable: lifeexpectancy (representing life expectancy)

Primary Quantitative Explanatory Variable: incomeperperson (representing income per person in a year in US Dollars)

The Second Quantitative Explanatory Variable: breastcancerper100th (representing number of breast cancer new cases per 100,000 females)

The Third Quantitative Explanatory Variable: HIVrate (representing HIV rate)

SAS Program

/* Start the data step */

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new2; set mydata.gapminder;

 

/* Center explanatory variables */

incomeperperson_c=incomeperperson-8740.97;

breastcancerper100th_c=breastcancerper100th-37.4028902;

HIVrate_c=HIVrate-1.9354422;

 

/* Create secondary variables for quadratic regression */

incomeperperson2=incomeperperson_c*incomeperperson_c;

breastcancerper100th2=breastcancerper100th_c*breastcancerper100th_c;

HIVrate2=HIVrate_c*HIVrate_c;

run;

 

/* Check means for centered explanatory variables */

PROC means; var incomeperperson_c breastcancerper100TH_c HIVrate_c;

run;

 

/* Test confounders for income per person*/

PROC glm; model lifeexpectancy=incomeperperson_c/solution;

run;

PROC glm; model lifeexpectancy=incomeperperson_c breastcancerper100th_c/solution;

run;

PROC glm; model lifeexpectancy=incomeperperson_c breastcancerper100th_c HIVrate_c/solution;

run;

 

/* Test confounders for breast cancer number*/

PROC glm; model lifeexpectancy=breastcancerper100th_c/solution;

run;

PROC glm; model lifeexpectancy=breastcancerper100th_c incomeperperson_c/solution;

run;

PROC glm; model lifeexpectancy=breastcancerper100th_c incomeperperson_c HIVrate_c/solution;

run;

 

/* Test confounders for HIV rate*/

PROC glm; model lifeexpectancy=HIVrate_c/solution;

run;

PROC glm; model lifeexpectancy=HIVrate_c incomeperperson_c/solution;

run;

PROC glm; model lifeexpectancy=HIVrate_c incomeperperson_c breastcancerper100th_c/solution;

run;

 

/* Create residual plot */

PROC glm PLOTS(unpack)=all;

model lifeexpectancy=incomeperperson_c incomeperperson_c*incomeperperson_c breastcancerper100th_c breastcancerper100th_c*breastcancerper100th_c HIVrate_c HIVrate_c*HIVrate_c/solution clparm;

output residual=res student=stdres out=results;

run;

 

/* Generate standardized residuals for all observations */

PROC gplot;

label stdres="Standardized Residual" country="Country";

plot stdres*country/vref=0;

run;

 

/* Test a multiple regression model and generate regression diagnostic plots */

PROC reg plots=partial;

model lifeexpectancy=incomeperperson_c incomeperperson2 breastcancerper100th_c breastcancerper100th2 HIVrate_c HIVrate2/partial;

run;

 

Output

Multiple Regression Model for All Explanatory Variables, Adding Quadratic Term

[Images: 20160207 summary 1, 20160207 summary 2]

Regression Model for Income and Life Expectancy 

[Images: 20160207 confound part 1, 20160207 confound part 2]

Multiple Regression Model for Income, Breast Cancer, and Life Expectancy 

[Images: 20160207 confound part 3, 20160207 confound part 4]

Multiple Regression Model for Income, Breast Cancer, HIV Rate, and Life Expectancy 

[Images: 20160207 confound part 5, 20160207 confound part 6]

Q-Q Plot

[Image: 20160207 q-q plot 1]

Standardized Residuals for All Observations

[Image: 20160207 Standardized Residual Plot 1]

Leverage Plot

[Image: 20160207 leverage plot 1]

Partial Regression Residual Plots

[Images: 20160207 partial plots part 1, 20160207 partial plots part 2]

Summarize what you found. Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary.

According to the multiple regression model with quadratic terms added (the first model in the Output section), income per person (regression coefficient=0.00077510, p<.0001) is significantly and positively associated with life expectancy after controlling for the potential confounding factors, breast cancer per 100,000 females and HIV rate.  HIV rate (regression coefficient=-2.31347872, p<.0001) is significantly and negatively associated with life expectancy after controlling for income per person and breast cancer per 100,000 females.  However, after controlling for income and HIV rate, breast cancer (regression coefficient=0.06891852, p=0.0517) is NOT significantly associated with life expectancy: although the p-value is close to 0.05, there is NOT enough evidence to reject the null hypothesis.

Together, income per person, breast cancer per 100,000 females, and HIV rate explain about 76.55% (R² = 0.7655) of the variability in life expectancy.
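For reference, the quadratic model fitted by the PROC glm step with the squared terms can be written out as below. The symbols β₀ to β₆ and ε are notation introduced here for illustration; the subscript c marks a centered variable (the original value minus its mean), as in the SAS program.

```latex
\[
\begin{aligned}
\text{lifeexpectancy} = {} & \beta_0 + \beta_1\,\text{incomeperperson}_c + \beta_2\,\text{incomeperperson}_c^{2} \\
& + \beta_3\,\text{breastcancerper100th}_c + \beta_4\,\text{breastcancerper100th}_c^{2} \\
& + \beta_5\,\text{HIVrate}_c + \beta_6\,\text{HIVrate}_c^{2} + \varepsilon
\end{aligned}
\]
```

Because every explanatory variable is centered, the intercept β₀ is the expected life expectancy for a country at the mean of all three variables.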

 

Report whether or not your results supported your hypothesis for the association between your primary explanatory and response variable.

The primary explanatory variable in this case is income per person, and the response variable is life expectancy. The results supported the hypothesis of an association between income per person and life expectancy: income per person (regression coefficient=0.00077510, p<.0001) was significantly and positively associated with life expectancy before and after controlling for the potential confounding factors, breast cancer and HIV rate.

 

Discuss whether or not there was evidence of confounding for the association between your primary explanatory and response variable.

There is NO evidence of confounding for the association between income per person and life expectancy. I added the additional explanatory variables, breast cancer and HIV rate, to the model one at a time to check whether either was a confounder. Neither variable confounded the relationship between income and life expectancy: the p-value for income was less than 0.0001 before controlling for these potential confounding factors and 0.0003 after controlling for them. Income and life expectancy are significantly associated before and after controlling for the other factors.

 

Generate regression diagnostic plots and write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outliers.

Q-Q Plot

In the Q-Q Plot (located in the Output section above), the points do not fall along a perfectly straight line. This means that the residuals do not perfectly follow a normal distribution, particularly in the lower and upper tails. It could mean that the curvilinear association is not fully captured by the quadratic income term.

Standardized Residuals for All Observations

In the Standardized Residuals Plot for the 143 observations/countries (located in the Output section above), 4 observations have standardized residuals more than 2.5 standard deviations from the mean, so only 97.2% (less than the expected 99%) of countries fall within 2.5 standard deviations. In addition, 9 observations lie more than 2 standard deviations from the mean, so only 93.7% (less than the expected 95%) of countries fall within 2 standard deviations. In other words, the residuals are close to, but do not perfectly fit, a standard normal distribution.
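The percentages quoted for the standardized residuals follow directly from the outlier counts and the sample size of 143. A short worked check, with the standard normal benchmarks shown for comparison:

```latex
\[
1 - \frac{4}{143} \approx 0.972, \qquad 1 - \frac{9}{143} \approx 0.937
\]
\[
P(|Z| \le 2.5) \approx 0.988, \qquad P(|Z| \le 2) \approx 0.954
\]
```

Under a standard normal distribution, about 143 × 0.012 ≈ 1.8 observations would be expected beyond 2.5 standard deviations and about 143 × 0.046 ≈ 6.5 beyond 2, compared with the 4 and 9 observed.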

Leverage Plot

The leverage plot (located in the Output section above) shows 9 outliers with standardized residuals greater than 2 or less than -2.  7 of these outliers have leverage values close to 0 (between 0 and 0.1).  Only the 2 outliers with higher leverage values have a stronger influence on the estimation of the regression parameters.

Partial Regression Residual Plots

The partial plot for income per person (in the Output section) shows the relationship between life expectancy and income after controlling for the other explanatory variables, breast cancer and HIV rate.  Partial plots for breast cancer and HIV rate are also presented in the Output section. In all three plots, many residuals lie far from the regression line. Although the positive relationship between income and life expectancy and the negative relationship between HIV rate and life expectancy are statistically significant, the associations are weak after controlling for the other explanatory variables.