Regression Modeling in Practice – Test a Multiple Regression Model

Sample

143 countries

Variables

Quantitative Response Variable: lifeexpectancy (representing life expectancy)

Primary Quantitative Explanatory Variable: incomeperperson (representing income per person in a year in US Dollars)

The Second Quantitative Explanatory Variable: breastcancerper100th (representing the number of new breast cancer cases per 100,000 females)

The Third Quantitative Explanatory Variable: HIVrate (representing HIV rate)

SAS Program

/* Start the data step */

LIBNAME mydata "/courses/d1406ae5ba27fe300" access=readonly;

DATA new2; set mydata.gapminder;

 

/* Center explanatory variables */

incomeperperson_c=incomeperperson-8740.97;

breastcancerper100th_c=breastcancerper100th-37.4028902;

HIVrate_c=HIVrate-1.9354422;

 

/* Create squared (second-order) terms for quadratic regression */

incomeperperson2=incomeperperson_c*incomeperperson_c;

breastcancerper100th2=breastcancerper100th_c*breastcancerper100th_c;

HIVrate2=HIVrate_c*HIVrate_c;

run;

 

/* Check means for centered explanatory variables */

PROC means; var incomeperperson_c breastcancerper100TH_c HIVrate_c;

run;

 

/* Test confounders for income per person*/

PROC glm; model lifeexpectancy=incomeperperson_c/solution;

run;

PROC glm; model lifeexpectancy=incomeperperson_c breastcancerper100th_c/solution;

run;

PROC glm; model lifeexpectancy=incomeperperson_c breastcancerper100th_c HIVrate_c/solution;

run;

 

/* Test confounders for breast cancer number*/

PROC glm; model lifeexpectancy=breastcancerper100th_c/solution;

run;

PROC glm; model lifeexpectancy=breastcancerper100th_c incomeperperson_c/solution;

run;

PROC glm; model lifeexpectancy=breastcancerper100th_c incomeperperson_c HIVrate_c/solution;

run;

 

/* Test confounders for HIV rate*/

PROC glm; model lifeexpectancy=HIVrate_c/solution;

run;

PROC glm; model lifeexpectancy=HIVrate_c incomeperperson_c/solution;

run;

PROC glm; model lifeexpectancy=HIVrate_c incomeperperson_c breastcancerper100th_c/solution;

run;

 

/* Create residual plot */

PROC glm PLOTS (unpack)=all;

model lifeexpectancy=incomeperperson_c incomeperperson_c*incomeperperson_c breastcancerper100th_c breastcancerper100th_c*breastcancerper100th_c HIVrate_c HIVrate_c*HIVrate_c/solution clparm;

output residual=res student=stdres out=results;

 

/* Plot standardized residuals for all observations */

PROC gplot;

label stdres="Standardized Residual" country="Country";

plot stdres*country/vref=0;

run;

 

/* Test a multiple regression model and generate regression diagnostic plots */

PROC reg plots=partial;

model lifeexpectancy=incomeperperson_c incomeperperson2 breastcancerper100th_c breastcancerper100th2 HIVrate_c HIVrate2/partial;

run;
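
The centering constants in the program above are typed in by hand. As a rough sketch only, the same centering could be done by letting SAS compute the means. The data set name new2_sql is hypothetical, and the sketch assumes each variable is centered on the mean of all its non-missing values in mydata.gapminder, which may differ slightly from the constants used above.

```sas
/* Sketch: center each explanatory variable on its own mean without
   hard-coding the means. new2_sql is a hypothetical data set name. */
proc sql;
  create table new2_sql as
  select *,
         incomeperperson - mean(incomeperperson) as incomeperperson_c,
         breastcancerper100th - mean(breastcancerper100th) as breastcancerper100th_c,
         HIVrate - mean(HIVrate) as HIVrate_c
  from mydata.gapminder;
quit;

/* Each centered variable should have a mean of approximately zero */
proc means data=new2_sql mean;
  var incomeperperson_c breastcancerper100th_c HIVrate_c;
run;
```

PROC SQL writes a note to the log about remerging summary statistics with the original data; that is expected here.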

 

Output

Multiple Regression Model for All Explanatory Variables, Adding Quadratic Term

[Figure: SAS output, summary parts 1 and 2]

Regression Model for Income and Life Expectancy 

[Figure: SAS output, confounding test parts 1 and 2]

Multiple Regression Model for Income, Breast Cancer, and Life Expectancy 

[Figure: SAS output, confounding test parts 3 and 4]

Multiple Regression Model for Income, Breast Cancer, HIV Rate, and Life Expectancy 

[Figure: SAS output, confounding test parts 5 and 6]

Q-Q Plot

[Figure: Q-Q plot]

Standardized Residuals for All Observations

[Figure: standardized residual plot]

Leverage Plot

[Figure: leverage plot]

Partial Regression Residual Plots

[Figure: partial regression residual plots, parts 1 and 2]

Summarize what you found. Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary.

According to the multiple regression model with quadratic terms added (the first model in the Output section), income per person (regression coefficient=0.00077510, p<.0001) is significantly and positively associated with life expectancy after controlling for the potential confounding factors, breast cancer per 100,000 females and HIV rate.  HIV rate (regression coefficient=-2.31347872, p<.0001) is significantly and negatively associated with life expectancy after controlling for income per person and breast cancer per 100,000 females.  However, after controlling for income and HIV rate, breast cancer (regression coefficient=0.06891852, p=0.0517) is not significantly associated with life expectancy: although the p-value is close to 0.05, there is not enough evidence to reject the null hypothesis.

Together, income per person, breast cancer per 100,000 females, and HIV rate explain about 76.55% (R² = 0.7655) of the variability in life expectancy.

 

Report whether or not your results supported your hypothesis for the association between your primary explanatory and response variable.

The primary explanatory variable in this analysis is income per person, and the response variable is life expectancy. The results above support the hypothesized association between income per person and life expectancy: income per person (regression coefficient=0.00077510, p<.0001) was significantly and positively associated with life expectancy both before and after controlling for the potential confounding factors, breast cancer and HIV rate.

 

Discuss whether or not there was evidence of confounding for the association between your primary explanatory and response variable.

There is no evidence of confounding for the association between income per person and life expectancy. I added the additional explanatory variables, breast cancer and HIV rate, to the model one at a time to check whether either of them confounds the relationship between income and life expectancy. Neither did. The p-value for income was less than 0.0001 before controlling for these potential confounders and 0.0003 after controlling for them. Income and life expectancy are significantly associated both before and after controlling for the other factors.
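
The before-and-after comparison described above can also be collected into one small table. The following is a minimal sketch, not the program that produced the reported results: it assumes the new2 data set from the SAS program, and the names est_unadj, est_adj, compare, and fit are hypothetical. Note that the two models may be fit on different numbers of countries, because observations with missing breast cancer or HIV values drop out of the adjusted model.

```sas
/* Capture the parameter estimates of the unadjusted model */
ods output ParameterEstimates=est_unadj;
proc glm data=new2;
  model lifeexpectancy=incomeperperson_c/solution;
run;
quit;

/* Capture the parameter estimates of the adjusted model */
ods output ParameterEstimates=est_adj;
proc glm data=new2;
  model lifeexpectancy=incomeperperson_c breastcancerper100th_c HIVrate_c/solution;
run;
quit;

/* Keep only the income row from each model and stack the two rows */
data compare;
  length Parameter $32 fit $10;
  set est_unadj(in=unadj where=(upcase(Parameter)='INCOMEPERPERSON_C'))
      est_adj(where=(upcase(Parameter)='INCOMEPERPERSON_C'));
  if unadj then fit='unadjusted';
  else fit='adjusted';
  keep fit Parameter Estimate StdErr Probt;
run;

proc print data=compare noobs;
run;
```

Seeing the two income coefficients side by side shows how much the estimate shifts when the other variables enter the model, which is more informative than comparing p-values alone.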

 

Generate regression diagnostic plots and write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outliers.

Q-Q Plot

The points in the Q-Q plot (in the Output section above) do not follow a perfectly straight line. This means that the residuals are not perfectly normally distributed, particularly in the lower and upper tails. It could also mean that the curvilinear association is not fully captured by the quadratic income term.

Standardized Residuals for All Observations

In the standardized residuals plot for the 143 observations (countries), shown in the Output section above, 4 residuals lie more than 2.5 standard deviations from the mean, so only 97.2% of countries fall within 2.5 standard deviations (less than the roughly 99% expected under a normal distribution). In addition, 9 residuals lie more than 2 standard deviations from the mean, so only 93.7% of countries fall within 2 standard deviations (less than the expected 95%). In other words, the residuals come close to, but do not perfectly fit, a standard normal distribution.
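
The percentages above follow from simple arithmetic on the counts read off the plot. A small sketch of that calculation, set next to the shares a standard normal distribution would place within the same limits (the variable names are illustrative only):

```sas
data _null_;
  n = 143;                         /* countries in the sample */
  within_2_5 = (n - 4) / n;        /* 4 residuals beyond 2.5 SD */
  within_2   = (n - 9) / n;        /* 9 residuals beyond 2 SD */

  /* Share of a standard normal distribution within +/- k SD */
  normal_2_5 = 2 * probnorm(2.5) - 1;
  normal_2   = 2 * probnorm(2) - 1;

  /* Write observed and expected shares to the log as percentages */
  put within_2_5= percent8.1 normal_2_5= percent8.1;
  put within_2= percent8.1 normal_2= percent8.1;
run;
```

The results are written to the SAS log rather than to an output table.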

Leverage Plot

The leverage plot (in the Output section above) shows 9 outliers with standardized residuals greater than 2 or less than -2.  Seven of these outliers have leverage values close to 0 (between 0 and 0.1).  Only the 2 outliers with higher leverage values have a stronger influence on the estimation of the regression parameters.
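
To see which countries these outliers are, the leverage values can be written to a data set alongside the standardized residuals. This is a sketch under the assumption that the new2 data set from the SAS program is available; the names diag, leverage, and cooksd are hypothetical, and the cutoff of 2 simply mirrors the one used in the text.

```sas
/* Refit the quadratic model and save the diagnostics per country */
proc glm data=new2;
  model lifeexpectancy=incomeperperson_c incomeperperson2
        breastcancerper100th_c breastcancerper100th2
        HIVrate_c HIVrate2/solution;
  output out=diag student=stdres h=leverage cookd=cooksd;
run;
quit;

/* Highest-leverage observations first */
proc sort data=diag;
  by descending leverage;
run;

/* List only the outliers: |standardized residual| greater than 2 */
proc print data=diag;
  where abs(stdres) > 2;
  var country stdres leverage cooksd;
run;
```

Cook's distance is included because it combines residual size and leverage into a single measure of influence, which helps separate the two high-leverage outliers from the other seven.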

Partial Regression Residual Plots

The partial plot for income per person (in the Output section) shows the relationship between life expectancy and income after controlling for the other explanatory variables, breast cancer and HIV rate.  Partial plots for breast cancer and HIV rate are also presented in the Output section. In all three plots, many residuals lie far from the regression line. Although the positive relationship between income and life expectancy and the negative relationship between HIV rate and life expectancy are statistically significant, the associations are weak after controlling for the other explanatory variables.