Regression Modeling in Practice – Test a Logistic Regression Model

Research Question

Is a country’s income per person related to its citizens’ average life expectancy at birth?

Data Set

GapMinder

Sample

143 countries

Variables

Categorical Response Variable

lifeexpectancygroup (representing life expectancy)

0 = 0 to 70 years of age

1 = 70.01 to 90 years of age

Primary Explanatory Variable

incomeperperson (representing income per person in a year in US Dollars)

incomeperperson_c (centered incomeperperson)

Other Explanatory Variables

breastcancerper100th (representing number of breast cancer new cases per 100,000 females)

breastcancerper100th_c (centered breastcancerper100th)

HIVrate (representing HIV rate)

HIVrate_c (centered HIVrate)

 

Summary

Summarize in a few sentences what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary.

For each one-dollar increase in income per person, the odds of a country having longer life expectancy (70.01-90 years of age) were multiplied by 1.002, after controlling for the number of new breast cancer cases and HIV rate.  Based on this model, we can be 95% confident that the population odds ratio for a one-dollar increase in income per person lies between 1.001 and 1.003 (Odds Ratio=1.002, 95% Confidence Interval=1.001 to 1.003, P-value<.0001). In other words, countries with higher income per person had higher odds of longer life expectancy than countries with lower income per person.

For each one-unit increase in HIV rate, the odds of a country having longer life expectancy (70.01-90 years of age) were multiplied by 0.092, that is, reduced by about 91%, after controlling for income per person and the number of new breast cancer cases. Based on this model, we can be 95% confident that the population odds ratio for HIV rate lies between 0.023 and 0.362 (Odds Ratio=0.092, 95% Confidence Interval=0.023 to 0.362, P-value=0.0006). Countries with higher HIV rates therefore had lower odds of longer life expectancy than countries with lower HIV rates.

After controlling for income per person and HIV rate, the number of new breast cancer cases was NOT significantly associated with life expectancy, so there is not enough evidence to reject the null hypothesis (Odds Ratio=1.004, 95% Confidence Interval=0.947 to 1.064, P-value=0.9040). Because the confidence interval includes 1, the odds ratio is consistent with no association.
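The reported odds ratios can be put on a more readable scale with a little arithmetic. The sketch below is an illustration only, written in Python rather than SAS. It assumes the odds ratios are per one-unit increase in each predictor (which is how logistic regression reports continuous predictors), and the function names are hypothetical. Because the reported odds ratio of 1.002 is rounded to three decimals, the rescaled figure is only a rough guide.

```python
import math

# Odds ratios as reported in the summary (rounded to three decimals).
or_income_per_dollar = 1.002
or_hiv_per_unit = 0.092


def rescale_odds_ratio(odds_ratio, units):
    """Odds ratio for a change of `units` in the predictor,
    given the odds ratio for a one-unit change."""
    return math.exp(units * math.log(odds_ratio))


def percent_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratio for a 1,000-dollar increase in income per person.
or_income_per_1000 = rescale_odds_ratio(or_income_per_dollar, 1000)

# Percent change in the odds for a one-unit increase in HIV rate.
hiv_percent_change = percent_change_in_odds(or_hiv_per_unit)

print(f"Odds ratio per 1,000 dollars: {or_income_per_1000:.2f}")
print(f"Change in odds per unit of HIV rate: {hiv_percent_change:.1f}%")
```

An odds ratio that looks negligible per dollar becomes substantial per thousand dollars, which is why the unit of the predictor matters when reading the result.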

Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable.

The primary explanatory variable in this model is income per person, and the categorical response variable is life expectancy. The results supported the hypothesis of an association between income per person and life expectancy: income per person was significantly and positively associated with life expectancy both before (Odds Ratio=1.001, P-value<.0001) and after (Odds Ratio=1.002, P-value<.0001) controlling for the potential confounding factors, breast cancer and HIV rate.

Discuss whether or not there was evidence of confounding for the association between your primary explanatory variable and the response variable.

There is NO evidence of confounding for the association between income per person and life expectancy. I added the additional explanatory variables, breast cancer and HIV rate, to the model one at a time to identify whether they were confounding variables. Neither variable confounded the relationship between income and life expectancy. The p-value for income was less than 0.0001 before controlling for these potential confounding factors and was still less than 0.0001 after controlling for them. Income and life expectancy were significantly associated before and after controlling for the other factors in this model.

 

Logistic Regression Output

Check the Means

20160213 check mean

Logistic Regression Model for All Explanatory Variables and Life Expectancy

20160213 log regression 5 20160213 log regression 6

Logistic Regression Model for Income (Primary Explanatory Variable) and Life Expectancy

20160213 log regression 1 20160213 log regression 2

Logistic Regression Model for Income, Breast Cancer, and Life Expectancy

20160213 log regression 3 20160213 log regression 4

 

SAS Program

/* Start the data step */

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new2; set mydata.gapminder;

 

/* Bin response variable lifeexpectancy into 2 categories */

IF lifeexpectancy LE 70 AND lifeexpectancy GE 0 THEN lifeexpectancygroup=0;

ELSE IF lifeexpectancy LE 90 AND lifeexpectancy GT 70 THEN lifeexpectancygroup=1;

 

/* Center explanatory variables */

incomeperperson_c=incomeperperson-8740.97;

breastcancerper100th_c=breastcancerper100th-37.4028902;

HIVrate_c=HIVrate-1.9354422;

run;

 

/* Check means for centered explanatory variables */

PROC means; var incomeperperson_c breastcancerper100TH_c HIVrate_c;

run;

 

/* Run logistic regression and test confounder*/

PROC LOGISTIC descending; model lifeexpectancygroup=incomeperperson_c;

run;

PROC LOGISTIC descending; model lifeexpectancygroup=incomeperperson_c breastcancerper100th_c;

run;

PROC LOGISTIC descending; model lifeexpectancygroup=incomeperperson_c breastcancerper100th_c HIVrate_c;

run;

Regression Modeling in Practice – Test a Multiple Regression Model

Sample

143 countries

Variables

Quantitative Response Variable: lifeexpectancy (representing life expectancy)

Primary Quantitative Explanatory Variable: incomeperperson (representing income per person in a year in US Dollars)

The Second Quantitative Explanatory Variable: breastcancerper100th (representing number of new breast cancer cases per 100,000 females)

The Third Quantitative Explanatory Variable: HIVrate (representing HIV rate)

SAS Program

/* Start the data step */

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new2; set mydata.gapminder;

 

/* Center explanatory variables */

incomeperperson_c=incomeperperson-8740.97;

breastcancerper100th_c=breastcancerper100th-37.4028902;

HIVrate_c=HIVrate-1.9354422;

 

/* Create squared (second-order) variables for quadratic regression */

incomeperperson2=incomeperperson_c*incomeperperson_c;

breastcancerper100th2=breastcancerper100th_c*breastcancerper100th_c;

HIVrate2=HIVrate_c*HIVrate_c;

run;

 

/* Check means for centered explanatory variables */

PROC means; var incomeperperson_c breastcancerper100TH_c HIVrate_c;

run;

 

/* Test confounders for income per person*/

PROC glm; model lifeexpectancy=incomeperperson_c/solution;

run;

PROC glm; model lifeexpectancy=incomeperperson_c breastcancerper100th_c/solution;

run;

PROC glm; model lifeexpectancy=incomeperperson_c breastcancerper100th_c HIVrate_c/solution;

run;

 

/* Test confounders for breast cancer number*/

PROC glm; model lifeexpectancy=breastcancerper100th_c/solution;

run;

PROC glm; model lifeexpectancy=breastcancerper100th_c incomeperperson_c/solution;

run;

PROC glm; model lifeexpectancy=breastcancerper100th_c incomeperperson_c HIVrate_c/solution;

run;

 

/* Test confounders for HIV rate*/

PROC glm; model lifeexpectancy=HIVrate_c/solution;

run;

PROC glm; model lifeexpectancy=HIVrate_c incomeperperson_c/solution;

run;

PROC glm; model lifeexpectancy=HIVrate_c incomeperperson_c breastcancerper100th_c/solution;

run;

 

/* Create residual plot */

PROC glm PLOTS (unpack)=all;

model lifeexpectancy=incomeperperson_c incomeperperson_c*incomeperperson_c breastcancerper100th_c breastcancerper100th_c*breastcancerper100th_c HIVrate_c HIVrate_c*HIVrate_c/solution clparm;

output residual=res student=stdres out=results;

 

/* Generate standardized residuals for all observations */

PROC gplot;

label stdres="Standardized Residual" country="Country";

plot stdres*country/vref=0;

run;

 

/* Test a multiple regression model and generate regression diagnostic plots */

PROC reg plots=partial;

model lifeexpectancy=incomeperperson_c incomeperperson2 breastcancerper100th_c breastcancerper100th2 HIVrate_c HIVrate2/partial;

run;

 

Output

Multiple Regression Model for All Explanatory Variables, Adding Quadratic Term

20160207 summary 1 20160207 summary 2

Regression Model for Income and Life Expectancy 

20160207 confound part 1 20160207 confound part 2

Multiple Regression Model for Income, Breast Cancer, and Life Expectancy 

20160207 confound part 3 20160207 confound part 4

Multiple Regression Model for Income, Breast Cancer, HIV Rate, and Life Expectancy 

20160207 confound part 5 20160207 confound part 6

Q-Q Plot

20160207 q-q plot 1

Standardized Residuals for All Observations

20160207 Standardized Residual Plot 1

Leverage Plot

20160207 leverage plot 1

Partial Regression Residual Plots

20160207 partial plots part 1 20160207 partial plots part 2

Summarize what you found. Discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (Beta coefficients and p-values) in your summary.

According to the multiple regression model with quadratic terms added (the first model in the Output section), after controlling for the potential confounding factors breast cancer per 100,000 females and HIV rate, income per person (regression coefficient=0.00077510, p<.0001) is significantly and positively associated with life expectancy.  HIV rate (regression coefficient=-2.31347872, p<.0001) is significantly and negatively associated with life expectancy after controlling for income per person and breast cancer per 100,000 females.  However, after controlling for income and HIV rate, breast cancer (regression coefficient=0.06891852, p=0.0517) is not significantly associated with life expectancy: there is not enough evidence to reject the null hypothesis, even though the p-value is close to 0.05.

Income per person, breast cancer per 100,000 females, and HIV rate explain about 76.55% (r² = 0.7655 ) of the variability in life expectancy.

 

Report whether or not your results supported your hypothesis for the association between your primary explanatory and response variable.

The primary explanatory variable in this case is income per person, and the response variable is life expectancy. The results supported the hypothesis of an association between income per person and life expectancy: income per person (regression coefficient=0.00077510, p<.0001) was significantly and positively associated with life expectancy before and after controlling for the potential confounding factors, breast cancer and HIV rate.

 

Discuss whether or not there was evidence of confounding for the association between your primary explanatory and response variable.

There is NO evidence of confounding for the association between income per person and life expectancy. I added the additional explanatory variables, breast cancer and HIV rate, to the model one at a time to identify whether they are confounding variables. Neither variable confounds the relationship between income and life expectancy. The p-value for income was less than 0.0001 before controlling for these potential confounding factors and 0.0003 after controlling for them. Income and life expectancy are significantly associated before and after controlling for the other factors.

 

Generate regression diagnostic plots and write a few sentences describing what these plots tell you about your regression model in terms of the distribution of the residuals, model fit, influential observations, and outliers.

Q-Q Plot

The Q-Q plot (in the Output section above) does not follow a perfectly straight line. This means that the residuals do not perfectly follow a normal distribution, particularly at the lower and upper ends. It could mean that the curvilinear association is not fully captured by the quadratic income term.

Standardized Residuals for All Observations

In the standardized residuals plot for the 143 observations/countries (in the Output section above), 4 observations lie more than 2.5 standard deviations from the mean, so only 97.2% (less than 99%) of countries fall within 2.5 standard deviations. In addition, 9 observations lie more than 2 standard deviations from the mean, so only 93.7% (less than 95%) of countries fall within 2 standard deviations. In other words, the residuals are close to, but do not perfectly fit, a standard normal distribution.
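The two percentages quoted above follow directly from the counts. A minimal check in Python, using only the numbers stated in the text (the helper name is hypothetical):

```python
def share_within(n_total, n_outside):
    """Percentage of observations that are NOT beyond the cutoff."""
    return 100.0 * (n_total - n_outside) / n_total


n_countries = 143

# 4 observations lie beyond 2.5 standard deviations.
within_2_5_sd = share_within(n_countries, 4)

# 9 observations lie beyond 2 standard deviations.
within_2_sd = share_within(n_countries, 9)

print(f"Within 2.5 SD: {within_2_5_sd:.1f}%")
print(f"Within 2 SD:   {within_2_sd:.1f}%")
```

Both shares fall slightly short of what a normal distribution would give (roughly 99% and 95%), which matches the reading of the plot.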

Leverage Plot

The leverage plot (in the Output section above) shows 9 outliers with residuals greater than 2 or less than -2.  Seven of these outliers have leverage values close to 0 (between 0 and 0.1).  Only 2 outliers have higher leverage values and therefore a stronger influence on the estimation of the regression parameters.

Partial Regression Residual Plots

The partial plot of income per person (in the Output section) shows the relationship between life expectancy and income after controlling for the other explanatory variables, breast cancer and HIV rate.  Partial plots for breast cancer and HIV rate are also presented in the Output section. In all three plots, many residuals are far from the regression lines. Although the positive relationship between income and life expectancy and the negative relationship between HIV rate and life expectancy are statistically significant, the associations are weak after controlling for the other explanatory variables.

 

Regression Modeling in Practice – Test a Basic Linear Regression Model

Research Question

Is the Gross Domestic Product of a country related to its citizens’ average life expectancy at birth?

Variables

Quantitative Explanatory Variable: incomeperperson (representing annual income per person in US Dollars)

Quantitative Response Variable: lifeexpectancy (representing the average number of years a newborn child would live)

Centered Explanatory Variable: cincomeperperson  (the mean is very close to zero)

Centered Explanatory Variable without Two Extreme Outliers: nocincomeperperson (the mean is very close to zero)

Program

/* Start the data step */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.gapminder;

/* Assign label names for variables */
LABEL incomeperperson="Income Per Person" /*"Income Per Person – 2010 Gross Domestic Product Per Capita in Constant 2000 US$"*/
lifeexpectancy="Life Expectancy" /*"2011 Average Number of Years a Newborn Child Would Live"*/
cincomeperperson="Centered Income Per Person"
nocincomeperperson="Centered Income Per Person without Outliers";

/* set omitted values to missing */
IF incomeperperson=' ' THEN incomeperperson=.;
IF lifeexpectancy=' ' THEN lifeexpectancy=.;

/* create new variable called cincomeperperson to center the explanatory variable (incomeperperson) by subtracting the mean */
cincomeperperson=incomeperperson-8740.9655;

/* create another new variable called nocincomeperperson to center the explanatory variable (incomeperperson) and for removing two extreme outliers*/
nocincomeperperson=incomeperperson-8509.72;
IF country='Equatorial Guinea' THEN nocincomeperperson=.;
ELSE IF country='Luxembourg' THEN nocincomeperperson=.;

PROC SORT; by COUNTRY;

/* Calculate the means of cincomeperperson and nocincomeperperson to check the centering. Means should be zero or very close to zero*/
PROC MEANS; var cincomeperperson;
PROC MEANS; var nocincomeperperson;

Run;

/* Test a linear regression model */
PROC GLM; Model lifeexpectancy=cincomeperperson /solution;
PROC GLM; Model lifeexpectancy=nocincomeperperson /solution;

Run;

Output for Checking the Centering

The quantitative explanatory variable, incomeperperson, was centered by subtracting its mean. The new centered variable, cincomeperperson, has a mean very close to zero. The output of the MEANS procedure is as follows:

20160131cincomemean

To avoid distorting the regression coefficients, another centered variable, nocincomeperperson, was created with two extreme outliers removed. nocincomeperperson also has a mean very close to zero. The output of the MEANS procedure is as follows:

20160131nocincomemean
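Why the centered variables have a mean of (almost) zero can be seen in a few lines. The sketch below is written in Python rather than SAS and uses made-up income values purely for illustration; they are not taken from the GapMinder data.

```python
from statistics import mean


def center(values):
    """Subtract the sample mean from every value."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical incomes per person (illustration only, not GapMinder values).
incomes = [500.0, 1200.0, 8740.0, 25000.0, 39000.0]

centered = center(incomes)

print("Mean before centering:", mean(incomes))
print("Mean after centering: ", mean(centered))
```

In the SAS program the mean is typed in as a rounded constant (for example 8740.9655), so the mean of the centered variable is very close to zero rather than exactly zero.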

Output and Result for the Linear Regression Model

(With and Without Extreme Outliers)

Output with Outliers

20160131cincomeoutput1

20160131cincomeoutput2-2

20160131cincomeoutput3

There were 176 countries in this test, including extreme outliers. The result of the linear regression model indicated that income per person (Beta/Regression Coefficient=0.00055, p-value<.0001) was significantly and positively associated with life expectancy.

The r² (r square) of 0.3618 suggests that if we know the income per person, we can predict 36% of the variability we will see in life expectancy.

Output without Two Extreme Outliers

20160131nocincomeoutput1 20160131nocincomeoutput2 20160131nocincomeoutput3

There were 174 countries in this test. Two extreme outliers were removed. The result of the linear regression model indicated that income per person (Beta/Regression Coefficient=0.00059, p-value<.0001) was significantly and positively associated with life expectancy.

The r² (r square) of 0.3824 suggests that if we know the income per person, we can predict 38% of the variability we will see in life expectancy.

The conclusion did not change materially after the two extreme outliers were removed.
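To see how small the difference between the two models is, the reported coefficients and r² values can be compared directly. This is a sketch in Python using only the figures quoted above; the helper name and the 10,000-dollar comparison are illustrative choices, not part of the SAS output.

```python
# Values reported above: with and without the two extreme outliers.
beta_with, beta_without = 0.00055, 0.00059
r2_with, r2_without = 0.3618, 0.3824


def relative_change_pct(old, new):
    """Percent change from old to new."""
    return 100.0 * (new - old) / old


# Relative change in the slope after removing the outliers.
slope_change_pct = relative_change_pct(beta_with, beta_without)

# Absolute change in r squared.
r2_change = r2_without - r2_with

# Predicted difference in life expectancy (years) per 10,000 dollars of income.
years_per_10k_with = beta_with * 10_000
years_per_10k_without = beta_without * 10_000

print(f"Slope change: {slope_change_pct:.1f}%")
print(f"r-squared change: {r2_change:.4f}")
print(f"Years per 10,000 dollars: {years_per_10k_with:.1f} vs {years_per_10k_without:.1f}")
```

Both models point in the same direction with a similar slope and similar explained variability.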

 

Regression Modeling in Practice – Introduction to Regression

Research Question

Is Gross Domestic Product per capita of the countries related to citizens’ average life expectancy at birth, after controlling for potential confounders including sex ratio, government expenditure on health, education, and food supply?

 

Sample

The sample of developed and developing countries and territories from Asia, Africa, America, Europe, the Middle East, and Australia was drawn from Gapminder, a dataset that seeks to increase the use and understanding of statistics about social, economic, and environmental development at the global level. The 190 countries (n=190) were assigned to 5 income groups: the lowest 20% (n=38), 21%-40% (n=38), 41%-60% (n=38), 61%-80% (n=38), and the highest 20% (n=38). In the majority of countries, life expectancy at birth is 70 to 80 years, government expenditure on health is less than US$1,200 per capita, food supply is 2,500 to 3,000 kilocalories per person per day, the primary school completion rate is 90% to 110% (rates over 100% are due to over-aged and under-aged children who enter primary school late or early), and the sex ratio among the population aged 0 to 14 is 101 to 107 (males per 100 females). The data analytic sample for this study included all countries and territories that reported Gross Domestic Product per capita in 2010 and life expectancy at birth in 2011.

 

Procedure

Data from Gapminder are observational, quantitative data generated through data reporting from different sources, including the World Bank, Human Mortality Database, World Population Prospects, publications and files by history professor James C. Riley, Human Lifetable Database, UN Population Division, World Health Organization, and FAO Stat.  Data from those sources were collected through nations’ official sources and non-official life tables produced by researchers. Data were collected on an ongoing basis.  The current analysis uses 2010 Gross Domestic Product, 2011 life expectancy, 2007 food supply, 2010 per capita government expenditure on health, 2015 sex ratio, and 2010 primary school completion rate.

 

Measures

The measure of income per person, the explanatory variable, was drawn from country level data compiled by the World Bank World Development Indicators (http://data.worldbank.org/data-catalog/world-development-indicators) and made available for download through the Gapminder web site (www.gapminder.org). It measured 2010 Gross Domestic Product per capita in constant 2000 US Dollars. For the current analysis, countries were binned into five income levels: the lowest 20% (n=38), 21%-40% (n=38), 41%-60% (n=38), 61%-80% (n=38), and the highest 20% (n=38).

The measure of life expectancy at birth, the response variable, was drawn from country level data compiled by multiple sources, including the Human Mortality Database, World Population Prospects, publications and files by history professor James C. Riley, and the Human Lifetable Database, and made available for download through the Gapminder web site. It measured the average number of years a newborn child would live if current mortality patterns were to stay the same in 2011.  For the current analysis, countries were binned into 5 categories: 40.001-50 years of age, 50.001-60 years, 60.001-70 years, 70.001-80 years, and 80.001-90 years.

The measures of the confounding variables, including sex ratio, government expenditure on health per capita, primary school completion rate, and food supply, were drawn from country level data compiled by UN Population Division, World Health Organization, World Bank, and FAO Stat. Data from those sources were available for download through the Gapminder website. For the current analysis, countries were binned into five categories for each of these confounding variables.

Data Analysis Tools – Testing a Potential Moderator

RESEARCH QUESTION

Does CO2 emission moderate the relationship between income per person and life expectancy from a global perspective? In other words, is a country's income per person related to its citizens’ average life expectancy at each level of CO2 emission?

CO2 emission is selected as a moderating variable because CO2 affects health, which is related to life expectancy. On the other hand, the rich population is believed to produce more CO2. Therefore, this study is to determine whether CO2 emission moderates the relationship between income and life expectancy.

VARIABLES

Quantitative Explanatory Variable: incomeperperson (Income Per Person in US Dollars)

Quantitative Response Variable: lifeexpectancy (Average Number of Years a Newborn Child Would Live)

Categorical Moderating Variable: co2emissions (Cumulative CO2 emission in metric tons since 1751)

Countries are divided into three groups based on levels of CO2 emission:

Group 1 = 0 – 60,000,000 metric tons

Group 2 = 60,000,001 – 1,000,000,000 metric tons

Group 3 = 1,000,000,001 – 334,221,000,000 metric tons
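The grouping rule above can be written as a small function, which makes the handling of boundary values explicit. This is a sketch in Python rather than SAS; the function name is hypothetical, and returning `None` for missing or out-of-range values is an assumption made here for illustration.

```python
def co2_emissions_group(co2_emissions):
    """Assign a country to CO2 emission group 1, 2, or 3.

    Returns None when the value is missing or outside the listed ranges
    (an assumption for this sketch).
    """
    if co2_emissions is None:
        return None
    if 0 <= co2_emissions <= 60_000_000:
        return 1
    if 60_000_000 < co2_emissions <= 1_000_000_000:
        return 2
    if 1_000_000_000 < co2_emissions <= 334_221_000_000:
        return 3
    return None


# Boundary values fall into the lower group, matching the ranges listed above.
for value in (0, 60_000_000, 60_000_001, 1_000_000_000, 334_221_000_000, None):
    print(value, "->", co2_emissions_group(value))
```

Writing the cut points as "less than or equal" on the upper bound and "greater than" on the lower bound guarantees that every value lands in exactly one group.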

SAS PROGRAM FOR TESTING MODERATION IN THE CONTEXT OF PEARSON CORRELATION COEFFICIENT

/* Start the data step */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.gapminder;

/* Assign label names for variables */
LABEL incomeperperson="Income Per Person" /*"Income Per Person – 2010 Gross Domestic Product Per Capita in Constant 2000 US$"*/
lifeexpectancy="Life Expectancy" /*"2011 Average Number of Years a Newborn Child Would Live"*/
co2emissions="2006 cumulative CO2 emission (metric tons) since 1751";

/* Group values of the third variable – moderator */
IF co2emissions LE 60000000 AND co2emissions GE 0 THEN co2emissionsgroup=1;
ELSE IF co2emissions LE 1000000000 AND co2emissions GT 60000000 THEN co2emissionsgroup=2;
ELSE IF co2emissions LE 334221000000 AND co2emissions GT 1000000000 THEN co2emissionsgroup=3;

/* PROC SORT */
PROC SORT; by COUNTRY;
PROC SORT; by co2emissionsgroup;

/* Run Pearson Correlation Coefficient and Test Moderation with the Third Variable*/
PROC CORR; VAR lifeexpectancy incomeperperson; BY co2emissionsgroup;
RUN;

/* Create Scatter Plot */

PROC GPLOT; PLOT lifeexpectancy*incomeperperson; BY co2emissionsgroup;
RUN;

A post hoc test is NOT necessary for the Pearson correlation coefficient; post hoc tests apply only when the explanatory variable is categorical. Income per person (explanatory variable) and life expectancy (response variable) in this test are quantitative variables.


OUTPUT 

20160110 Moderator Result 001 20160110 Moderator Result 003 20160110 Moderator Result 004 20160110 Moderator Result 005 20160110 Moderator Result 006

INTERPRETATION 

For the low CO2 emission group (Group 1), the correlation (r) between income per person and life expectancy is 0.46919 with a significant p-value at 0.0011. The relationship between income per person and life expectancy for the low CO2 emission group is statistically significant. The r reflects that the two variables have a moderate, positive relationship for the low CO2 emission group. The r² (r square) of 0.2201 suggests that if we know the income per person, we can predict 22.01% of the variability we will see in life expectancy for the low CO2 emission group.

For the moderate CO2 emission group (Group 2), the correlation (r) between income per person and life expectancy is 0.48539 with a significant p-value less than 0.0001. The relationship between income per person and life expectancy for the moderate CO2 emission group is statistically significant. The r reflects that the two variables have a moderate, positive relationship for the moderate CO2 emission group. The r² (r square) of 0.2356 suggests that if we know the income per person, we can predict 23.56% of the variability we will see in life expectancy for the moderate CO2 emission group.

For the high CO2 emission group (Group 3), the correlation (r) between income per person and life expectancy is 0.69289 with a significant p-value less than 0.0001. The relationship between income per person and life expectancy for the high CO2 emission group is statistically significant. The r reflects that the two variables have a strong, positive relationship for the high CO2 emission group. The r² (r square) of 0.4801 suggests that if we know the income per person, we can predict 48.01% of the variability we will see in life expectancy for the high CO2 emission group.

Both directions and strengths of the relationships between income per person and life expectancy are similar for the low, moderate, and high CO2 emission groups. Although the high CO2 emission group shows a slightly stronger relationship between income per person and life expectancy, the levels of CO2 emission do not moderate the relationship between income and life expectancy. For all levels of CO2 emission, countries with higher income are associated with longer lives of their citizens.
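The r² values quoted for the three groups are simply the squares of the reported correlation coefficients. A quick check in Python, using only the r values stated above:

```python
# Pearson correlations between income per person and life expectancy,
# as reported for each CO2 emission group.
group_r = {
    "low (Group 1)": 0.46919,
    "moderate (Group 2)": 0.48539,
    "high (Group 3)": 0.69289,
}

# r squared: the share of variability in life expectancy
# associated with income per person within each group.
r_squared = {group: r ** 2 for group, r in group_r.items()}

for group, value in r_squared.items():
    print(f"{group}: r = {group_r[group]:.5f}, r^2 = {value:.4f}")
```

All three correlations are positive, so the direction of the relationship is the same at every level of CO2 emission; only the strength differs.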

Data Analysis Tools – Generating a Correlation Coefficient

Research Question

Is lower income associated with worse health from a global perspective?

SAS PROGRAM FOR THE PEARSON CORRELATION COEFFICIENT

/* Run Pearson Correlation Coefficient */
PROC CORR; VAR alcconsumption breastcancerper100TH HIVrate lifeexpectancy suicideper100TH incomeperperson;
RUN;

/* Create Scatter Plot */
PROC GPLOT; PLOT alcconsumption*incomeperperson;
PROC GPLOT; PLOT breastcancerper100TH*incomeperperson;
PROC GPLOT; PLOT HIVrate*incomeperperson;
PROC GPLOT; PLOT lifeexpectancy*incomeperperson;
PROC GPLOT; PLOT suicideper100TH*incomeperperson;

Syntax for the Pearson Correlation Coefficient is highlighted in yellow in the screenshots.

[Screenshots: 20150103 Correlation Coefficient SAS Code 001 to 016]

OUTPUT FOR THE PEARSON CORRELATION COEFFICIENT

The quantitative explanatory variable that reflects income:

incomeperperson (Income Per Person)

The quantitative response variables that reflect health:

alcconsumption (Alcohol Consumption Per Capita, Age 15+)

breastcancerper100TH (Number of New Cases of Breast Cancer in 100,000 Females)

HIVrate (Percentage of People with HIV, Ages 15-49)

lifeexpectancy (Average Number of Years a Newborn Child Would Live)

suicideper100TH (Number of Suicides per 100,000 People; reflects mental health)

[Images: 20150103 Correlation Coefficient SAS Output table and scatterplots 1 to 5]

INTERPRETATION FOR THE PEARSON CORRELATION COEFFICIENT

Association between Income Per Person and Alcohol Consumption:

The relationship between income and alcohol consumption is statistically significant and the null hypothesis can be rejected because the p-value is less than .0001.  The r of 0.29539 reflects that the two variables have a weak, positive relationship. The r² (r square) of 0.087 suggests that if we know the income per person, we can predict only 8.7% of the variability we will see in alcohol consumption.

Excessive alcohol use can lead to the development of chronic diseases and other serious health problems.  The correlation coefficient shows that people in countries with higher income consume more alcohol, which can harm their health, but the relationship is weak.

Association between Income Per Person and Number of New Cases of Breast Cancer:

The relationship between income and number of new cases of breast cancer is statistically significant and the null hypothesis can be rejected because the p-value is less than .0001.  The r of 0.73140 reflects that the two variables have a strong, positive relationship. The r² (r square) of 0.5349 suggests that if we know the income per person, we can predict 53.49% of the variability we will see in number of new cases of breast cancer.

Many factors, such as lifestyle, age, having children, and hormone replacement therapy, are associated with the risk of breast cancer. The correlation coefficient shows that countries with higher income have a higher number of new cases of breast cancer.

Association between Income Per Person and HIV Rate:

The relationship between income and HIV rate is statistically significant and the null hypothesis can be rejected because the p-value is 0.0167.  The r of -0.19845 reflects that the two variables have a very weak, negative relationship. The r² (r square) of 0.039 suggests that if we know the income per person, we can predict only 3.9% of the variability we will see in HIV rate.

Association between Income Per Person and Life Expectancy:

The relationship between income and life expectancy is statistically significant and the null hypothesis can be rejected because the p-value is less than .0001.  The r of 0.60152 reflects that the two variables have a moderate, positive relationship. The r² (r square) of 0.3618 suggests that if we know the income per person, we can predict 36% of the variability we will see in life expectancy.

NO Association between Income Per Person and Number of Suicides:

The majority of people who die by suicide have a diagnosable mental disorder; therefore, the number of suicides is used as an indicator of mental health in this test. The relationship between income and number of suicides is NOT statistically significant (p-value=0.9302), so the null hypothesis of no association cannot be rejected.  The r of 0.00656 reflects a very weak relationship or NO relationship at all. The r² (r square) of 0.000043 suggests that income per person explains almost 0% of the variability in the number of suicides, so knowing the income per person does not help predict it.

Conclusion

Based on the above correlation coefficients, the positive relationship between income and life expectancy supports the idea that people in countries with lower income have shorter life expectancy. Although HIV rate decreases as income per person increases, the relationship is very weak. The positive relationships between income and alcohol consumption, and between income and number of new cases of breast cancer, do not support the idea that higher income is linked to better health. The number of suicides is not associated with income. In conclusion, there is not enough evidence that people in countries with higher income have better health, or that people in countries with lower income have worse health. However, people in richer countries do live longer.

Data Analysis Tools – Running a Chi-Square Test of Independence

Research Question for the Chi-Square Test:

Is higher income associated with longer life from a global perspective?

Quantitative explanatory and response variables were changed to categorical variables for running the Chi-Square Test.

Null Hypothesis (H0)

There is no relationship between income level and life expectancy. They are independent.

Alternative Hypothesis (Ha)

There is a relationship between income level and life expectancy. They are not independent.

 

SAS Program for the Chi-Square Test

The SAS program was provided as screenshots, with the syntax for the Chi-Square Test highlighted in yellow.

[Images: 20151227 Chi Square SAS Screenshot 001 to 015]

 

Output for the Chi-Square Test

[Images: 20151227 Chi Square SAS Output 001 to 010]

 

Interpretation for the Chi-Square Test

Five Income Groups:

Group 1 = US$0 – US$559 (Lowest 20%)

Group  2 = US$560 – US$1,845 (21%-40%)

Group  3 = US$1,846 – US$4,700 (41%-60%)

Group  4 = US$4,701 – US$13,578 (61%-80%)

Group  5 = US$13,579 – US$105,148 (81%-100%/Highest 20%)

Two Life Expectancy Groups:

Group  1 = 40.001 – 60 years of age

Group  2 = 60.001 – 90 years of age
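The grouping above can be sketched in SAS as follows. This is a hedged illustration rather than the program shown in the screenshots: the data set names `gapminder` and `gapminder_groups` and the variable name `lifeexpectancygroup` are assumptions, and values that fall between two listed cutoffs (for example US$559.50) are assigned to the higher group here, which may differ from the original program.

```sas
/* Sketch: recode the two quantitative variables into the groups listed above. */
/* Data set names and lifeexpectancygroup are hypothetical.                    */
DATA gapminder_groups;
  SET gapminder;

  /* Five income groups (quintile cutoffs from the text) */
  IF incomeperperson = . THEN incomeperpersongroup = .;
  ELSE IF incomeperperson <= 559 THEN incomeperpersongroup = 1;
  ELSE IF incomeperperson <= 1845 THEN incomeperpersongroup = 2;
  ELSE IF incomeperperson <= 4700 THEN incomeperpersongroup = 3;
  ELSE IF incomeperperson <= 13578 THEN incomeperpersongroup = 4;
  ELSE incomeperpersongroup = 5;

  /* Two life expectancy groups, split at 60 years */
  IF lifeexpectancy = . THEN lifeexpectancygroup = .;
  ELSE IF lifeexpectancy <= 60 THEN lifeexpectancygroup = 1;
  ELSE lifeexpectancygroup = 2;
RUN;

/* Chi-Square Test of Independence: response in rows, explanatory in columns */
PROC FREQ DATA=gapminder_groups;
  TABLES lifeexpectancygroup*incomeperpersongroup / CHISQ;
RUN;
```

With the response in the rows and the explanatory variable in the columns, the column percentages in the PROC FREQ output give the share of countries in each income group that fall into each life expectancy group.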

When examining the association between average life expectancy (categorical response variable) and income per person (categorical explanatory variable), a Chi-Square Test of Independence revealed that among 176 countries, citizens of countries with higher income were more likely to live longer than citizens of countries with lower income, Chi-Square (χ²) = 109.1614, 4 degrees of freedom (df), p<0.0001. Per the Chi-Square table, 100% of countries in the highest income group have an average life expectancy of 60.001 to 90 years. 96.97% of countries in Income Group 4 and 75.68% in Income Group 3 also have an average life expectancy of 60.001 to 90 years. On the other hand, 63.89% of countries in Income Group 2 and 100% in the lowest income group have an average life expectancy of 40.001 to 60 years.

The degrees of freedom (df) for a Chi-Square Test of Independence equal (number of levels of the response variable minus 1) multiplied by (number of levels of the explanatory variable minus 1). Because the response variable here has only 2 levels, this reduces to the number of levels of the explanatory variable minus 1. Income per person has 5 levels, so df = (2-1) x (5-1) = 4.

 

Interpretation for the Post Hoc Chi-Square Test results

20151227 Chi Square p value table

Post hoc comparisons of life expectancy (2 categories) across pairs of income levels (5 categories) revealed that countries at higher income levels were more likely to be in the longer life expectancy group.

With 5 income groups there are 10 pairwise comparisons, so the Bonferroni-adjusted significance level is 0.05 / 10 = 0.005. The Post Hoc Chi-Square Test showed the following:

*Income Group 1 (Lowest 20%) is significantly different from Income Group 2, 3, 4 and 5.

*Income Group 2 (21%-40%) is significantly different from Income Group 3, 4 and 5.

*Income Group 3 (41%-60%) is significantly different from Income Group 5.

*Income Group 4 (61%-80%) is NOT significantly different from Income Group 5 (Highest 20%), while Income Group 3 is NOT significantly different from Income Group 4.
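Each post hoc comparison is an ordinary Chi-Square Test restricted to two income groups, judged against the Bonferroni-adjusted threshold. The sketch below shows one of the ten comparisons (Income Group 1 vs. Income Group 2) and the threshold arithmetic. The data set name `gapminder_groups` and the variable name `lifeexpectancygroup` are assumptions; the data set is assumed to already contain both grouped variables.

```sas
/* Sketch: one of the 10 pairwise comparisons (Income Group 1 vs. 2).      */
/* gapminder_groups and lifeexpectancygroup are hypothetical names.        */
PROC FREQ DATA=gapminder_groups;
  WHERE incomeperpersongroup IN (1, 2);   /* keep only the two groups compared */
  TABLES lifeexpectancygroup*incomeperpersongroup / CHISQ;
RUN;

/* Bonferroni-adjusted threshold: 0.05 divided by the number of pairs */
DATA _NULL_;
  n_groups = 5;
  n_pairs = COMB(n_groups, 2);            /* number of distinct pairs of groups */
  adjusted_alpha = 0.05 / n_pairs;
  PUT n_pairs= adjusted_alpha=;           /* written to the SAS log */
RUN;
```

Repeating the PROC FREQ step with each pair in the WHERE statement gives the ten p-values, and a pair is treated as significantly different only when its p-value is below the adjusted threshold.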

The Post Hoc Chi-Square Test demonstrated that countries with higher income per person are significantly more likely to have an average life expectancy at birth of more than 60 years. According to the Chi-Square table, 100% of countries in the highest income group, 96.97% in Income Group 4, and 75.68% in Income Group 3 have an average life expectancy of more than 60 years. However, only 36.11% of countries in Income Group 2 have an average life expectancy of more than 60 years, and no country in the lowest income group does.

On the other hand, 100% of countries in the lowest income group, 63.89% in Income Group 2, and 24.32% in Income Group 3 have an average life expectancy of 60 years or less. Only 3.03% of countries in Income Group 4, and no country in the highest income group, have an average life expectancy of 60 years or less.

Data Analysis Tools – Running an Analysis of Variance

Research Question: Is lower income associated with worse health from a global perspective?

Please click here for my codebook.

Please click here for the entire SAS program.

The screenshot below shows the SAS syntax used to run the post hoc test (Duncan Multiple Range Test) for ANOVA.

20151220 ANOVA SAS Syntax Screenshot

SAS program and output for ANOVA F Tests:

1. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Alcohol Consumption

2. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Breast Cancer Per 100,000 Females

3. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and HIV Rate

4. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Life Expectancy

5. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Number of Suicide

1. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Alcohol Consumption

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: alcconsumption (representing Alcohol Consumption)

There are a total of five income groups:

Group 1 = US$0 – US$559 (Lowest 20%)
Group 2 = US$560 – US$1,845 (21%-40%)
Group 3 = US$1,846 – US$4,700 (41%-60%)
Group 4 = US$4,701 – US$13,578 (61%-80%)
Group 5 = US$13,579 – US$105,148 (81%-100%/Highest 20%)

Syntax:

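/* Run one-way ANOVA of alcohol consumption by income group */
PROC ANOVA;
  CLASS incomeperpersongroup;                /* categorical explanatory variable */
  MODEL alcconsumption=incomeperpersongroup; /* response = explanatory */
  MEANS incomeperpersongroup/DUNCAN;         /* Duncan Multiple Range post hoc test */
RUN;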

SAS Output for ANOVA

20151220 ANOVA incomeNalcohol 1 20151220 ANOVA incomeNalcohol 2 20151220 ANOVA incomeNalcohol 3 20151220 ANOVA incomeNalcohol 4

There are a total of five income groups.

Number of countries in this sample = 179

Null Hypothesis: Income Per Person and Alcohol Consumption are NOT related.

Alternative Hypothesis: Income Per Person and Alcohol Consumption ARE related.

F-value = 8.88

P-value = <.0001

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 4.389

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 4.916

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 7.235

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 8.785

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 9.464

When examining the association between alcohol consumption (quantitative response) and income per person (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among 179 countries in the sample, there is a significant association between income level and alcohol consumption (F = 8.88, P-value < .0001).

The Duncan Multiple Range Test also showed the following:

* The means of Income Group 1 (lowest 20%)  and Income Group 2 (21%-40%) are significantly different from the means of Group 3, 4 and 5.

* The mean of Income Group 3 (41%-60%) is significantly different from the means of Group 1, 2 and 5.

* The mean of Income Group 4 (61%-80%) is significantly different from the means of Group 1 and 2.

* The mean of Income Group 5 (highest 20%) is significantly different from the means of Group 1, 2 and 3.

These results demonstrate that countries with higher income per person consume significantly more alcohol per capita (adults 15+) in a year.

2. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Breast Cancer Per 100,000 Females

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: breastcancerper100TH (representing Number of New Cases of Breast Cancer Per 100,000 Females)

There are a total of five income groups (the same income groups used in the test above).

Syntax:

/* Run one-way ANOVA of breast cancer new cases by income group */
PROC ANOVA;
  CLASS incomeperpersongroup;
  MODEL breastcancerper100TH=incomeperpersongroup;
  MEANS incomeperpersongroup/DUNCAN;
RUN;

SAS Output for ANOVA

20151220 ANOVA incomeNbreastcancer 1 20151220 ANOVA incomeNbreastcancer 2 20151220 ANOVA incomeNbreastcancer 3 20151220 ANOVA incomeNbreastcancer 4

There are a total of five income groups.

Number of countries in this sample = 165

Null Hypothesis: Income Per Person and Number of New Cases of Breast Cancer are NOT related.

Alternative Hypothesis: Income Per Person and Number of New Cases of Breast Cancer ARE related.

F-value = 54.06

P-value = <.0001

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 18.841

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 28.551

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 32.321

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 44.813

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 70.053

When examining the association between the number of new cases of breast cancer per 100,000 females (quantitative response) and income per person (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among 165 countries in the sample, there is a significant association between income level and the number of new cases of breast cancer (F = 54.06, P-value < .0001).

The Duncan Multiple Range Test also showed the following:

* The mean of Income Group 1 (lowest 20%) is significantly different from the means of Group 2, 3, 4 and 5.

* The means of Income Group 2 (21%-40%) and Income Group 3 (41%-60%) are significantly different from the means of Group 1, 4 and 5.

* The mean of Income Group 4 (61%-80%) is significantly different from the means of Group 1, 2, 3 and 5.

* The mean of Income Group 5 (highest 20%) is significantly different from the means of Group 1, 2, 3 and 4.

These results demonstrate that countries with higher income per person have significantly more new cases of breast cancer per 100,000 females in a year.

3. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and HIV Rate

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: hivrate (representing HIV Rate, Ages 15-49)

There are a total of five income groups (the same income groups used in the test above).

Syntax:

/* Run one-way ANOVA of HIV rate by income group */
PROC ANOVA;
  CLASS incomeperpersongroup;
  MODEL HIVrate=incomeperpersongroup;
  MEANS incomeperpersongroup/DUNCAN;
RUN;

SAS Output for ANOVA

20151220 ANOVA incomeNhivrate 1 20151220 ANOVA incomeNhivrate 2 20151220 ANOVA incomeNhivrate 3 20151220 ANOVA incomeNhivrate 4

There are a total of five income groups.

Number of countries in this sample = 145

Null Hypothesis: Income Per Person and HIV Rate are NOT related.

Alternative Hypothesis: Income Per Person and HIV Rate ARE related.

F-value = 3.50

P-value = 0.0093

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 3.830

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 1.716

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 2.626

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 0.606

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 0.322

Judging by the p-value alone (F = 3.50, P-value = 0.0093), the Analysis of Variance (ANOVA) revealed that among 145 countries in the sample, there is a significant association between income level and HIV rate.

However, the Duncan Multiple Range Test showed the following:

* The means of Income Group 2 (21%-40%) and Income Group 3 (41%-60%) are NOT significantly different from the mean of any other income group.

* The mean of Income Group 1 (lowest 20%) is significantly different from the means of Group 4 (61%-80%) and 5 (highest 20%).

* The mean of Income Group 2 (1.716) is lower than the mean of Income Group 3 (2.626), so the group means do not fall steadily as income rises.

These results show that HIV rate does NOT decrease consistently with income per person: only the lowest income group has a significantly higher HIV rate than the two highest income groups.

4. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Life Expectancy

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: lifeexpectancy (representing Average Number of Years a Newborn Child Would Live)

There are a total of five income groups (the same income groups used in the test above).

Syntax:

/* Run one-way ANOVA of life expectancy by income group */
PROC ANOVA;
  CLASS incomeperpersongroup;
  MODEL lifeexpectancy=incomeperpersongroup;
  MEANS incomeperpersongroup/DUNCAN;
RUN;

SAS Output for ANOVA

20151220 ANOVA incomeNlifeexpectancy 1 20151220 ANOVA incomeNlifeexpectancy 2 20151220 ANOVA incomeNlifeexpectancy 3 20151220 ANOVA incomeNlifeexpectancy 4

There are a total of five income groups.

Number of countries in this sample = 176

Null Hypothesis: Income Per Person and Life Expectancy are NOT related.

Alternative Hypothesis: Income Per Person and Life Expectancy ARE related.

F-value = 83.09

P-value = < .0001

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 57.081

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 66.996

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 71.234

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 74.842

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 80.402

When examining the association between life expectancy (quantitative response) and income per person (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among 176 countries in the sample, there is a significant association between income level and life expectancy (F = 83.09, P-value < .0001).

The Duncan Multiple Range Test also showed that the means of all income groups are significantly different from one another.

These results demonstrate that countries with higher income per person have a significantly higher average life expectancy at birth.

5. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Number of Suicide

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: suicideper100TH (representing Number of Suicide Per 100,000 people)

There are a total of five income groups (the same income groups used in the test above).

Syntax:

/* Run one-way ANOVA of number of suicides by income group */
PROC ANOVA;
  CLASS incomeperpersongroup;
  MODEL suicideper100TH=incomeperpersongroup;
  MEANS incomeperpersongroup/DUNCAN;
RUN;

SAS Output for ANOVA

20151220 ANOVA incomeNsuicidenumber 1 20151220 ANOVA incomeNsuicidenumber 2 20151220 ANOVA incomeNsuicidenumber 3 20151220 ANOVA incomeNsuicidenumber 4

There are a total of five income groups.

Number of countries in this sample = 181

Null Hypothesis: Income Per Person and Number of Suicide are NOT related.

Alternative Hypothesis: Income Per Person and Number of Suicide ARE related.

F-value = 0.19

P-value = 0.9456

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 9.670

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 10.436

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 9.375

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 9.404

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 9.453

With a p-value of 0.9456 (F = 0.19), the Analysis of Variance (ANOVA) revealed that among 181 countries in the sample, there is NO significant association between income level and number of suicides. The data do not provide enough evidence to reject the null hypothesis that income level and number of suicides are NOT related.

Consistent with this, the Duncan Multiple Range Test showed that no income group's mean is significantly different from the mean of any other income group.

Data Management and Visualization – Creating graphs

SAS Program

Research Question: Is lower income associated with worse health from a global perspective?

Please click here for my codebook.

Please click here for the entire SAS program.

1. Assign label names for variables and set unknown values to missing values

Wk4 SAS Code HW Screen 1

2. Create secondary variable and groups for the variables

Wk4 SAS Code HW Screen 2

Wk4 SAS Code HW Screen 3

3. Set range of values and interpret responses for tables

Wk4 SAS Code HW Screen 4

Wk4 SAS Code HW Screen 5

Wk4 SAS Code HW Screen 6

Wk4 SAS Code HW Screen 7

Wk4 SAS Code HW Screen 8

Wk4 SAS Code HW Screen 9

Wk4 SAS Code HW Screen 10

4. Create univariate and bivariate bar charts

Wk4 SAS Code HW Screen 11
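Because the program itself is only available as screenshots, here is a minimal sketch of what step 4 might look like. It assumes SAS/GRAPH is available and uses PROC GCHART; the data set name `gapminder_groups` is hypothetical, and the actual program may use different options or a different procedure.

```sas
/* Sketch only: gapminder_groups is a hypothetical data set that already */
/* contains incomeperperson, incomeperpersongroup and alcconsumption.    */
PROC GCHART DATA=gapminder_groups;
  /* Univariate: frequency of countries across income ranges */
  VBAR incomeperperson / TYPE=FREQ;
RUN;

  /* Bivariate: mean alcohol consumption within each income group */
  VBAR incomeperpersongroup / DISCRETE TYPE=MEAN SUMVAR=alcconsumption;
RUN;
QUIT;
```

Swapping the SUMVAR= variable for another response variable (for example `lifeexpectancy`) would produce the corresponding bivariate chart for that outcome.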

Univariate and Bivariate Bar Charts

Please click here for all bar charts.

The univariate graph of income per person by country is unimodal, with its highest peak at US$6,000, which is the mid-point of the interval between US$0 and US$12,000. It is skewed to the right, as there are higher frequencies in the lower income ranges.

Wk4 SAS Bar Chart Alcohol Consumption

The above univariate graph of the estimated average alcohol consumption per capita in liters is unimodal, with its highest peak at 5.0 liters. It is skewed to the right, as there are higher frequencies at lower levels of alcohol consumption.

Wk4 SAS Bar Chart Breast Cancer

The above univariate graph of the number of new breast cancer cases per 100,000 females is trimodal, with its highest peak at 18 cases. The other peaks are at 54 cases and 90 cases. It is skewed to the right, as there are higher frequencies at lower numbers of new breast cancer cases.

Wk4 SAS Bar Chart HIVrate

The above univariate graph of the HIV rate is unimodal, with its highest peak at 1.5%, which is the mid-point of the interval between 0% and 3%. It is skewed to the right, as there are higher frequencies at lower HIV rates.

Wk4 SAS Bar Chart Life Expectancy

The above univariate graph of life expectancy is bimodal, with its highest peak at 76 years, which is the mid-point of the interval between 74 and 78 years. The other peak is at 56 years, which is the mid-point of the interval between 54 and 58 years. It is skewed to the left, as there are higher frequencies at older ages.

Wk4 SAS Bar Chart Suicide Number

The above univariate graph of the number of suicides per 100,000 people is unimodal, with its highest peak at 6 cases, which is the mid-point of the interval between 4 and 8 cases. It is skewed to the right, as there are higher frequencies at lower numbers of cases.

Wk4 SAS Bar Chart Income vs Alcohol Consumption

* The bar chart above shows each country's alcohol consumption per capita by income group. There is a positive relationship between the two variables: an increase in income is associated with an increase in alcohol consumption.

* Income group:
Interval 1: US$0 – US$559 (Lowest 20%)
Interval 2: US$560 – US$1,845 (21%-40%)
Interval 3: US$1,846 – US$4,700 (41%-60%)
Interval 4: US$4,701 – US$13,578 (61%-80%)
Interval 5: US$13,579 – US$105,148 (81%-100%/Highest 20%)

Wk4 SAS Bar Chart Income vs Breast Cancer

* The bar chart above shows each country's number of new breast cancer cases by income group. There is a positive relationship between the two variables: an increase in income is associated with an increase in the number of new breast cancer cases.

Wk4 SAS Bar Chart Income vs HIVrate

* The bar chart above shows each country's HIV rate by income group. The chart does not show a clear relationship between the two variables because it is bimodal. However, the two intervals with the highest income (Interval 4 US$4,701-US$13,578 and Interval 5 US$13,579-US$105,148) have lower HIV rates.

Wk4 SAS Bar Chart Income vs Life Expectancy

* The bar chart above shows each country's life expectancy by income group. There is a positive relationship between the two variables: an increase in income is associated with an increase in life expectancy.

Wk4 SAS Bar Chart Income vs Suicide Number

* The bar chart above shows each country's number of suicides by income group. The number of suicides is examined because it reflects the level of mental health. The chart does not show a clear relationship between the two variables.