Regression Modeling in Practice – Test a Logistic Regression Model

Research Question

Is a country’s income per person related to its citizens’ average life expectancy at birth?

Data Set

GapMinder

Sample

143 countries

Variables

Categorical Response Variable

lifeexpectancygroup (representing life expectancy)

0 = 0 to 70 years of age

1 = 70.01 to 90 years of age

Primary Explanatory Variable

incomeperperson (representing income per person in a year in US Dollars)

incomeperperson_c (centered incomeperperson)

Other Explanatory Variables

breastcancerper100th (representing number of new breast cancer cases per 100,000 females)

breastcancerper100th_c (centered breastcancerper100th)

HIVrate (representing HIV rate)

HIVrate_c (centered HIVrate)

 

Summary

Summarize in a few sentences what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary.

After controlling for the number of new breast cancer cases and HIV rate, countries with higher income per person had higher odds of longer life expectancy (70.01-90 years of age) than countries with lower income per person. Each additional US dollar of income per person multiplied the odds of being in the longer life expectancy group by 1.002 (Odds Ratio=1.002, 95% Confidence Interval=1.001 to 1.003, P-value<.0001). Based on this model, we can be 95% confident that the population odds ratio lies between 1.001 and 1.003.
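An odds ratio this close to 1 can look negligible, but it applies to a change of a single dollar. As a rough worked illustration, the model and a rescaled odds ratio can be written as below. The symbols (β, x) and the $1,000 step are notation and choices introduced here, not part of the SAS output. Because the reported odds ratio is rounded to three decimals, the rescaled value is only a ballpark figure.

```latex
% Logistic regression model fitted in this post (notation assumed for illustration)
\[
\log\frac{P(Y=1)}{1-P(Y=1)}
  = \beta_0 + \beta_1 x_{\text{income}} + \beta_2 x_{\text{breast cancer}} + \beta_3 x_{\text{HIV}},
\qquad
\mathrm{OR}_j = e^{\beta_j}
\]

% Odds ratio for a 1,000-dollar increase in income per person,
% using the rounded one-dollar odds ratio of 1.002
\[
\mathrm{OR}_{\text{income}}(\$1{,}000)
  = e^{1000\,\beta_1}
  = \left(e^{\beta_1}\right)^{1000}
  \approx 1.002^{1000}
  \approx 7.4
\]
```

Read this way, a difference of about $1,000 in income per person corresponds to odds of longer life expectancy that are several times higher, holding the other two variables fixed.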

After controlling for income per person and the number of new breast cancer cases, countries with a higher HIV rate had lower odds of longer life expectancy (70.01-90 years of age) than countries with a lower HIV rate. Each one-unit increase in HIV rate multiplied the odds of being in the longer life expectancy group by 0.092 (Odds Ratio=0.092, 95% Confidence Interval=0.023 to 0.362, P-value=0.0006). Based on this model, we can be 95% confident that the population odds ratio lies between 0.023 and 0.362.

After controlling for income per person and HIV rate, the number of new breast cancer cases was NOT significantly associated with life expectancy (Odds Ratio=1.004, 95% Confidence Interval=0.947 to 1.064, P-value=0.9040), so there is NOT enough evidence to reject the null hypothesis. The confidence interval includes 1, which is consistent with no association.

Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable.

The primary explanatory variable in this model is income per person, and the categorical response variable is life expectancy. The results supported the hypothesis of an association between income per person and life expectancy. Income per person was significantly and positively associated with life expectancy both before (Odds Ratio=1.001, P-value<.0001) and after (Odds Ratio=1.002, P-value<.0001) controlling for the potential confounding factors, breast cancer and HIV rate.

Discuss whether or not there was evidence of confounding for the association between your primary explanatory variable and the response variable.

There is NO evidence of confounding for the association between income per person and life expectancy. I added the additional explanatory variables, breast cancer and HIV rate, to the model one at a time to identify whether they were confounding variables. Neither variable confounded the relationship between income and life expectancy. The p-value for income per person was less than 0.0001 before controlling for these potential confounding factors and was still less than 0.0001 after controlling for them. Income and life expectancy were significantly associated before and after controlling for the other factors in this model.

 

Logistic Regression Output

Check the Means

[SAS output image: 20160213 check mean]

Logistic Regression Model for All Explanatory Variables and Life Expectancy

[SAS output images: 20160213 log regression 5, 20160213 log regression 6]

Logistic Regression Model for Income (Primary Explanatory Variable) and Life Expectancy

[SAS output images: 20160213 log regression 1, 20160213 log regression 2]

Logistic Regression Model for Income, Breast Cancer, and Life Expectancy

[SAS output images: 20160213 log regression 3, 20160213 log regression 4]

 

SAS Program

/* Start the data step */

LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;

DATA new2; set mydata.gapminder;

 

/* Bin response variable lifeexpectancy into 2 categories */

IF lifeexpectancy LE 70 AND lifeexpectancy GE 0 THEN lifeexpectancygroup=0;

ELSE IF lifeexpectancy LE 90 AND lifeexpectancy GT 70 THEN lifeexpectancygroup=1;

 

/* Center explanatory variables */

incomeperperson_c=incomeperperson-8740.97;

breastcancerper100th_c=breastcancerper100th-37.4028902;

HIVrate_c=HIVrate-1.9354422;

run;

 

/* Check means for centered explanatory variables */

PROC means; var incomeperperson_c breastcancerper100TH_c HIVrate_c;

run;

 

/* Run logistic regressions, adding one potential confounder at a time */

PROC LOGISTIC descending; model lifeexpectancygroup=incomeperperson_c;

run;

PROC LOGISTIC descending; model lifeexpectancygroup=incomeperperson_c breastcancerper100th_c;

run;

PROC LOGISTIC descending; model lifeexpectancygroup=incomeperperson_c breastcancerper100th_c HIVrate_c;

run;
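The program above hard-codes the sample means used for centering and reports odds ratios per one-unit change. As a possible variation, the sketch below computes the means from the data with PROC SQL and asks PROC LOGISTIC for an odds ratio per 1,000 dollars of income through its UNITS statement. The data set name new3, the variable names ending in _c2, and the 1,000-dollar step are hypothetical choices made for this illustration. The sketch assumes the data set new2 from the data step above already exists, and it has not been run against the course data.

```sas
/* Sketch only: assumes data set new2 was created by the data step above. */

/* Center each explanatory variable on its own sample mean.             */
/* PROC SQL remerges the MEAN() summary back onto every row, so no      */
/* mean has to be typed in by hand. new3 and the _c2 names are          */
/* hypothetical names chosen for this example.                          */
PROC SQL;
  CREATE TABLE new3 AS
  SELECT *,
         incomeperperson      - MEAN(incomeperperson)      AS incomeperperson_c2,
         breastcancerper100th - MEAN(breastcancerper100th) AS breastcancerper100th_c2,
         HIVrate              - MEAN(HIVrate)              AS HIVrate_c2
  FROM new2;
QUIT;

/* Confirm that the newly centered variables have means near zero. */
PROC MEANS DATA=new3;
  VAR incomeperperson_c2 breastcancerper100th_c2 HIVrate_c2;
RUN;

/* Full model. The UNITS statement requests an additional odds ratio   */
/* for a 1,000-dollar change in income per person, alongside the       */
/* default one-unit odds ratios.                                       */
PROC LOGISTIC DATA=new3 DESCENDING;
  MODEL lifeexpectancygroup = incomeperperson_c2 breastcancerper100th_c2 HIVrate_c2;
  UNITS incomeperperson_c2 = 1000;
RUN;
```

Centering shifts only the intercept, so the slope estimates and one-unit odds ratios should match those from the hard-coded version. The UNITS odds ratio is computed from the unrounded coefficient, which makes it more reliable than rescaling the rounded value of 1.002 by hand.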