Regression Modeling in Practice – Test a Basic Linear Regression Model

Research Question

Is a country's Gross Domestic Product per capita related to its citizens' average life expectancy at birth?

Variables

Quantitative Explanatory Variable: incomeperperson (representing annual income per person in US dollars)

Quantitative Response Variable: lifeexpectancy (representing the average number of years a newborn child would live)

Centered Explanatory Variable: cincomeperperson  (the mean is very close to zero)

Centered Explanatory Variable without Two Extreme Outliers: nocincomeperperson (the mean is very close to zero)

Program

/* Start the data step */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.gapminder;

/* Assign label names for variables */
LABEL incomeperperson="Income Per Person" /*"Income Per Person - 2010 Gross Domestic Product Per Capita in Constant 2000 US$"*/
lifeexpectancy="Life Expectancy" /*"2011 Average Number of Years a Newborn Child Would Live"*/
cincomeperperson="Centered Income Per Person"
nocincomeperperson="Centered Income Per Person without Outliers";

/* set omitted values to missing */
IF incomeperperson=' ' THEN incomeperperson=.;
IF lifeexpectancy=' ' THEN lifeexpectancy=.;

/* create new variable called cincomeperperson to center the explanatory variable (incomeperperson) by subtracting the mean */
cincomeperperson=incomeperperson-8740.9655;

/* create another new variable called nocincomeperperson that centers the explanatory variable (incomeperperson) and excludes two extreme outliers */
nocincomeperperson=incomeperperson-8509.72;
IF country='Equatorial Guinea' THEN nocincomeperperson=.;
ELSE IF country='Luxembourg' THEN nocincomeperperson=.;

PROC SORT; by COUNTRY;

/* Calculate the means of cincomeperperson and nocincomeperperson to check the centering. Means should be zero or very close to zero*/
PROC MEANS; var cincomeperperson;
PROC MEANS; var nocincomeperperson;

Run;

/* Test a linear regression model */
PROC GLM; Model lifeexpectancy=cincomeperperson /solution;
PROC GLM; Model lifeexpectancy=nocincomeperperson /solution;

Run;

Output for Checking the Centering

The quantitative explanatory variable, incomeperperson, was centered by subtracting its mean. The new centered variable is named cincomeperperson, and its mean is very close to zero. The PROC MEANS output is as follows:

[Image: PROC MEANS output for cincomeperperson]

To avoid distorting the regression coefficients, a second centered variable, nocincomeperperson, was created with two extreme outliers removed. Its mean is also very close to zero. The PROC MEANS output is as follows:

[Image: PROC MEANS output for nocincomeperperson]
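The two constants subtracted in the program above (8740.9655 and 8509.72) were typed in by hand. As a hedged alternative, the sketch below lets SAS compute the mean while centering. It assumes the data set new created by the DATA step above; the table names centered and centered_no and the variable names cincome_check and nocincome_check are hypothetical and exist only for this illustration. Note that the second query drops the two outlier countries from the table instead of setting their values to missing.

```sas
/* Sketch only: center incomeperperson on its sample mean without hard-coding the mean. */
/* Assumes the data set NEW from the DATA step above; all new names are hypothetical.   */
PROC SQL;
  /* MEAN() with one argument is a summary function, so PROC SQL remerges */
  /* the overall mean onto every row before the subtraction.              */
  CREATE TABLE centered AS
  SELECT *,
         incomeperperson - MEAN(incomeperperson) AS cincome_check
  FROM new;

  /* Same idea, but the mean is taken over the rows that remain after the WHERE clause. */
  CREATE TABLE centered_no AS
  SELECT *,
         incomeperperson - MEAN(incomeperperson) AS nocincome_check
  FROM new
  WHERE country NOT IN ('Equatorial Guinea', 'Luxembourg');
QUIT;

/* Both means should be zero up to rounding error. */
PROC MEANS DATA=centered N MEAN;
  VAR cincome_check;
RUN;

PROC MEANS DATA=centered_no N MEAN;
  VAR nocincome_check;
RUN;
```

Computing the mean inside the query keeps the centering correct if the underlying data change, at the cost of a remerge note in the SAS log.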

Output and Result for the Linear Regression Model

(With and Without Extreme Outliers)

Output with Outliers

[Images: PROC GLM output for the model with cincomeperperson]

There were 176 countries in this test, including extreme outliers. The result of the linear regression model indicated that income per person (Beta/Regression Coefficient=0.00055, p-value<.0001) was significantly and positively associated with life expectancy.

The r² (r square) of 0.3618 suggests that if we know the income per person, we can predict 36% of the variability we will see in life expectancy.

Output without Two Extreme Outliers

[Images: PROC GLM output for the model with nocincomeperperson]

There were 174 countries in this test. Two extreme outliers were removed. The result of the linear regression model indicated that income per person (Beta/Regression Coefficient=0.00059, p-value<.0001) was significantly and positively associated with life expectancy.

The r² (r square) of 0.3824 suggests that if we know the income per person, we can predict 38% of the variability we will see in life expectancy.

The conclusion did not change materially after the two extreme outliers were removed: the coefficient stayed positive and significant, and r² rose only from 0.3618 to 0.3824.
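To put the slope in more tangible units, the fitted line can be written out and scaled to a US$1,000 difference in income per person. The intercept b_0 is left symbolic here because its value appears only in the output images; with a centered predictor it is the predicted life expectancy for a country at the mean income.

```latex
\begin{aligned}
\widehat{\text{lifeexpectancy}} &= b_0 + b_1 \cdot \text{cincomeperperson}, \qquad b_1 = 0.00055 \\
\Delta\widehat{\text{lifeexpectancy}} &= b_1 \times 1000 = 0.00055 \times 1000 = 0.55 \text{ years (with outliers)} \\
\Delta\widehat{\text{lifeexpectancy}} &= 0.00059 \times 1000 = 0.59 \text{ years (without the two outliers)}
\end{aligned}
```

In other words, under either model a country with US$1,000 more income per person is predicted to have a life expectancy a little over half a year longer.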

 

Regression Modeling in Practice – Introduction to Regression

Research Question

Is Gross Domestic Product per capita of the countries related to citizens’ average life expectancy at birth, after controlling for potential confounders including sex ratio, government expenditure on health, education, and food supply?

 

Sample

The sample of developed and developing countries and territories from Asia, Africa, America, Europe, the Middle East, and Australia was drawn from Gapminder, a dataset that seeks to increase the use and understanding of statistics about social, economic, and environmental development at the global level. The 190 countries (n=190) were assigned to five income groups: the lowest 20% (n=38), 21%-40% (n=38), 41%-60% (n=38), 61%-80% (n=38), and the highest 20% (n=38). In the majority of countries, life expectancy at birth is 70 to 80 years, government expenditure on health is less than US$1,200 per capita, food supply is 2,500 to 3,000 kilocalories per person per day, the primary school completion rate is 90% to 110% (rates above 100% arise from over-aged and under-aged children who enter primary school late or early), and the sex ratio among the population aged 0 to 14 is 101 to 107 males per 100 females. The data analytic sample for this study included all countries and territories that reported Gross Domestic Product per capita for 2010 and life expectancy at birth for 2011.

 

Procedure

Data from Gapminder are observational, quantitative data compiled from different sources, including the World Bank, Human Mortality Database, World Population Prospects, publications and files by history professor James C. Riley, Human Lifetable Database, UN Population Division, World Health Organization, and FAO Stat. Those sources in turn collected data from nations' official sources and from non-official life tables produced by researchers. Data were collected on an ongoing basis. The current analysis uses 2010 Gross Domestic Product, 2011 life expectancy, 2007 food supply, 2010 per capita government expenditure on health, 2015 sex ratio, and 2010 primary school completion rate.

 

Measures

The measure of income per person, the explanatory variable, was drawn from country level data compiled by the World Bank World Development Indicators (http://data.worldbank.org/data-catalog/world-development-indicators) and made available for download through the Gapminder web site (www.gapminder.org). It measured 2010 Gross Domestic Product per capita in constant 2000 US Dollars. For the current analysis, countries were binned into five income levels: the lowest 20% (n=38), 21%-40% (n=38), 41%-60% (n=38), 61%-80% (n=38), and the highest 20% (n=38).

The measure of life expectancy at birth, the response variable, was drawn from country level data compiled by multiple sources, including the Human Mortality Database, World Population Prospects, publications and files by history professor James C. Riley, and the Human Lifetable Database, and made available for download through the Gapminder web site. It measured the average number of years a newborn child would live if 2011 mortality patterns were to stay the same. For the current analysis, countries were binned into 5 categories: 40.001-50 years of age, 50.001-60 years, 60.001-70 years, 70.001-80 years, and 80.001-90 years.

The measures of the confounding variables, including sex ratio, government expenditure on health per capita, primary school completion rate, and food supply, were drawn from country level data compiled by UN Population Division, World Health Organization, World Bank, and FAO Stat. Data from those sources were available for download through the Gapminder website. For the current analysis, countries were binned into five categories for each of these confounding variables.
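The post does not show how the five equal-sized income groups were built. One possible way, sketched here under stated assumptions, is PROC RANK with GROUPS=5. The LIBNAME path and the gapminder data set are taken from the other programs in this series; the output data set ranked and the variable incomegroup are hypothetical names introduced for this illustration.

```sas
/* Sketch only: bin countries into five income groups of (nearly) equal size. */
/* RANKED and incomegroup are hypothetical names, not part of the original.   */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; SET mydata.gapminder;
RUN;

PROC RANK DATA=new GROUPS=5 OUT=ranked;
  VAR incomeperperson;
  RANKS incomegroup;   /* 0 = lowest 20%, 1 = 21%-40%, ..., 4 = highest 20% */
RUN;

/* Check the group sizes; countries with missing income get a missing group. */
PROC FREQ DATA=ranked;
  TABLES incomegroup;
RUN;
```

The frequency table makes it easy to confirm whether each group holds the same number of countries, as reported in the Measures section.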

Data Analysis Tools – Testing a Potential Moderator

RESEARCH QUESTION

Does CO2 emission moderate the relationship between income per person and life expectancy from a global perspective? In other words, is a country's income per person related to its citizens' average life expectancy at each level of CO2 emission?

CO2 emission is selected as a moderating variable because CO2 affects health, which is related to life expectancy. On the other hand, the rich population is believed to produce more CO2. Therefore, this study is to determine whether CO2 emission moderates the relationship between income and life expectancy.

VARIABLES

Quantitative Explanatory Variable: incomeperperson (Income Per Person in US Dollars)

Quantitative Response Variable: lifeexpectancy (Average Number of Years a Newborn Child Would Live)

Categorical Moderating Variable: co2emissions (Cumulative CO2 emission in metric tons since 1751)

Countries are divided into three groups based on levels of CO2 emission:

Group 1 = 0 – 60,000,000 metric tons

Group 2 = 60,000,001 – 1,000,000,000 metric tons

Group 3 = 1,000,000,001 – 334,221,000,000 metric tons

SAS PROGRAM FOR TESTING MODERATION IN THE CONTEXT OF PEARSON CORRELATION COEFFICIENT

/* Start the data step */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.gapminder;

/* Assign label names for variables */
LABEL incomeperperson="Income Per Person" /*"Income Per Person - 2010 Gross Domestic Product Per Capita in Constant 2000 US$"*/
lifeexpectancy="Life Expectancy" /*"2011 Average Number of Years a Newborn Child Would Live"*/
co2emissions="2006 cumulative CO2 emission (metric tons) since 1751";

/* Group values of the third variable – moderator */
IF co2emissions LE 60000000 AND co2emissions GE 0 THEN co2emissionsgroup=1;
ELSE IF co2emissions LE 1000000000 AND co2emissions GT 60000000 THEN co2emissionsgroup=2;
ELSE IF co2emissions LE 334221000000 AND co2emissions GT 1000000000 THEN co2emissionsgroup=3;

/* PROC SORT */
PROC SORT; by COUNTRY;
PROC SORT; by co2emissionsgroup;

/* Run Pearson Correlation Coefficient and Test Moderation with the Third Variable*/
PROC CORR; VAR lifeexpectancy incomeperperson; BY co2emissionsgroup;
RUN;

/* Create Scatter Plot */

PROC GPLOT; PLOT lifeexpectancy*incomeperperson; BY co2emissionsgroup;
RUN;

A post hoc test is NOT necessary for a Pearson correlation coefficient; post hoc tests apply only when a categorical explanatory variable has more than two levels. Income per person (explanatory variable) and life expectancy (response variable) in this test are both quantitative variables.


OUTPUT 

[Images: PROC CORR and PROC GPLOT output by CO2 emission group]

INTERPRETATION 

For the low CO2 emission group (Group 1), the correlation (r) between income per person and life expectancy is 0.46919 with a significant p-value at 0.0011. The relationship between income per person and life expectancy for the low CO2 emission group is statistically significant. The r reflects that the two variables have a moderate, positive relationship for the low CO2 emission group. The r² (r square) of 0.2201 suggests that if we know the income per person, we can predict 22.01% of the variability we will see in life expectancy for the low CO2 emission group.

For the moderate CO2 emission group (Group 2), the correlation (r) between income per person and life expectancy is 0.48539 with a significant p-value less than 0.0001. The relationship between income per person and life expectancy for the moderate CO2 emission group is statistically significant. The r reflects that the two variables have a moderate, positive relationship for the moderate CO2 emission group. The r² (r square) of 0.2356 suggests that if we know the income per person, we can predict 23.56% of the variability we will see in life expectancy for the moderate CO2 emission group.

For the high CO2 emission group (Group 3), the correlation (r) between income per person and life expectancy is 0.69289 with a significant p-value less than 0.0001. The relationship between income per person and life expectancy for the high CO2 emission group is statistically significant. The r reflects that the two variables have a strong, positive relationship for the high CO2 emission group. The r² (r square) of 0.4801 suggests that if we know the income per person, we can predict 48.01% of the variability we will see in life expectancy for the high CO2 emission group.

The direction of the relationship between income per person and life expectancy is the same (positive) for the low, moderate, and high CO2 emission groups, and all three correlations are statistically significant. Although the high CO2 emission group shows a somewhat stronger relationship (r of 0.69 versus 0.47 and 0.49), the level of CO2 emission does not appear to moderate the relationship between income and life expectancy. For all levels of CO2 emission, higher country income is associated with longer life expectancy.
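Comparing three correlations by eye is informal. As a hedged complement, and not the method used in this post, the sketch below shows two standard SAS options: Fisher z confidence limits for each group's correlation, and a regression with an income by group interaction term whose Type III test asks directly whether the slope differs across groups. It assumes the data set new from the DATA step above, already sorted by co2emissionsgroup.

```sas
/* Sketch only: two more formal looks at moderation.                      */
/* Assumes the data set NEW from the program above, sorted by the group. */

/* 1. Confidence limits for r within each CO2 emission group (Fisher z transformation). */
PROC CORR DATA=new FISHER;
  VAR lifeexpectancy incomeperperson;
  BY co2emissionsgroup;
RUN;

/* 2. Interaction model: the incomeperperson*co2emissionsgroup term tests */
/*    whether the income slope differs between CO2 emission groups.       */
PROC GLM DATA=new;
  CLASS co2emissionsgroup;
  MODEL lifeexpectancy = incomeperperson co2emissionsgroup
                         incomeperperson*co2emissionsgroup / SOLUTION;
RUN;
QUIT;
```

Widely overlapping confidence limits and a non-significant interaction term would be consistent with the conclusion above; a significant interaction would suggest that it needs to be revisited.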

Data Analysis Tools – Generating a Correlation Coefficient

Research Question

Is lower income associated with worse health from a global perspective?

SAS PROGRAM FOR THE PEARSON CORRELATION COEFFICIENT

/* Run Pearson Correlation Coefficient */
PROC CORR; VAR alcconsumption breastcancerper100TH HIVrate lifeexpectancy suicideper100TH incomeperperson;
RUN;

/* Create Scatter Plot */
PROC GPLOT; PLOT alcconsumption*incomeperperson;
PROC GPLOT; PLOT breastcancerper100TH*incomeperperson;
PROC GPLOT; PLOT HIVrate*incomeperperson;
PROC GPLOT; PLOT lifeexpectancy*incomeperperson;
PROC GPLOT; PLOT suicideper100TH*incomeperperson;
RUN;


OUTPUT FOR THE PEARSON CORRELATION COEFFICIENT

The quantitative explanatory variable that reflects income:

incomeperperson (Income Per Person)

The quantitative response variables that reflect health:

alcconsumption (Alcohol Consumption Per Capita, Age 15+)

breastcancerper100TH (Number of New Cases of Breast Cancer in 100,000 Females)

HIVrate (Percentage of People with HIV, Ages 15-49)

lifeexpectancy (Average Number of Years a Newborn Child Would Live)

suicideper100TH (Number of Suicides per 100,000 People; reflects mental health)

[Images: PROC CORR output table and five scatter plots, one for each health variable against incomeperperson]

INTERPRETATION FOR THE PEARSON CORRELATION COEFFICIENT

Association between Income Per Person and Alcohol Consumption:

The relationship between income and alcohol consumption is statistically significant and the null hypothesis can be rejected because the p-value is less than .0001.  The r of 0.29539 reflects that the two variables have a weak, positive relationship. The r² (r square) of 0.087 suggests that if we know the income per person, we can predict only 8.7% of the variability we will see in alcohol consumption.

Excessive alcohol use can lead to the development of chronic diseases and other serious health problems. The correlation coefficient shows that people in countries with higher income consume more alcohol, which can harm their health, but the relationship is weak.

Association between Income Per Person and Number of New Cases of Breast Cancer:

The relationship between income and number of new cases of breast cancer is statistically significant and the null hypothesis can be rejected because the p-value is less than .0001.  The r of 0.73140 reflects that the two variables have a strong, positive relationship. The r² (r square) of 0.5349 suggests that if we know the income per person, we can predict 53.49% of the variability we will see in number of new cases of breast cancer.

Many factors, such as lifestyle, age, having children, and hormone replacement therapy, are associated with the risk of breast cancer. The correlation coefficient shows that countries with higher income have a higher number of new cases of breast cancer.

Association between Income Per Person and HIV Rate:

The relationship between income and HIV rate is statistically significant and the null hypothesis can be rejected because the p-value is 0.0167.  The r of -0.19845 reflects that the two variables have a very weak, negative relationship. The r² (r square) of 0.039 suggests that if we know the income per person, we can predict only 3.9% of the variability we will see in HIV rate.

Association between Income Per Person and Life Expectancy:

The relationship between income and life expectancy is statistically significant and the null hypothesis can be rejected because the p-value is less than .0001.  The r of 0.60152 reflects that the two variables have a moderate, positive relationship. The r² (r square) of 0.3618 suggests that if we know the income per person, we can predict 36% of the variability we will see in life expectancy.

NO Association between Income Per Person and Number of Suicides:

The majority of people who commit suicide have a diagnosable mental disorder; therefore, the number of suicides serves as an indicator of mental health in this test. The relationship between income and number of suicides is NOT statistically significant (p-value of 0.9302), so there is no evidence that the two variables are linearly related. The r of 0.00656 reflects a very weak or even NO relationship. The r² (r square) of 0.000043 suggests that knowing the income per person lets us predict almost 0% of the variability in the number of suicides.
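As a quick arithmetic check, each r² quoted above is simply the square of the reported r. The value for life expectancy also matches the r² of 0.3618 from the simple linear regression of life expectancy on income, as expected when there is a single explanatory variable.

```latex
\begin{aligned}
\text{alcconsumption:}       \quad & r^2 = (0.29539)^2 \approx 0.087 \\
\text{breastcancerper100TH:} \quad & r^2 = (0.73140)^2 \approx 0.5349 \\
\text{HIVrate:}              \quad & r^2 = (-0.19845)^2 \approx 0.039 \\
\text{lifeexpectancy:}       \quad & r^2 = (0.60152)^2 \approx 0.3618 \\
\text{suicideper100TH:}      \quad & r^2 = (0.00656)^2 \approx 0.000043
\end{aligned}
```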

Conclusion

Based on the above correlation coefficients, the positive relationship between income and life expectancy supports the idea that people in countries with lower income have shorter life expectancy. Although HIV rate decreases when income per person increases, the relationship is very weak. The positive relationships between income and alcohol consumption and between income and number of new cases of breast cancer do not support the idea that higher income is linked to better health. The number of suicides is not associated with income. In conclusion, there is not enough evidence that people in countries with higher income have better health, or that people in countries with lower income have worse health. But people in richer countries have longer lives.