Data Analysis Tools – Running a Chi-Square Test of Independence

Research Question for the Chi-Square Test:

Is higher income associated with longer life from a global perspective?

Both the quantitative explanatory variable (income per person) and the quantitative response variable (life expectancy) were recoded as categorical variables in order to run the Chi-Square Test.

Null Hypothesis (H0)

There is no relationship between income level and life expectancy. They are independent.

Alternative Hypothesis (Ha)

There is a relationship between income level and life expectancy. They are not independent.

 

SAS Program for the Chi-Square Test

The SAS program is shown in the screenshots below. Syntax for the Chi-Square Test is highlighted in yellow.

[Screenshots: 20151227 Chi Square SAS Screenshot 001 to 015]
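For readers who cannot view the screenshots, the central step of a Chi-Square Test of Independence in SAS can be sketched as a PROC FREQ call with the CHISQ option. This is a minimal sketch and not the exact program from the screenshots: the dataset name grouped and the variable name lifeexpectancygroup are assumptions, while incomeperpersongroup is the income group variable named in the ANOVA write-up.

```sas
/* Sketch only: "grouped" (dataset) and "lifeexpectancygroup" (variable)
   are assumed names, not taken from the original program. */
PROC FREQ DATA=grouped;
  /* response*explanatory: rows are life expectancy groups,
     columns are income groups; CHISQ requests the Chi-Square test */
  TABLES lifeexpectancygroup*incomeperpersongroup / CHISQ;
RUN;
```

With the response variable in the rows and the explanatory variable in the columns, the column percentages in the output give the share of countries in each life expectancy group within each income group.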

 

Output for the Chi-Square Test

[Output images: 20151227 Chi Square SAS Output 001 to 010]

 

Interpretation for the Chi-Square Test

Five Income Groups:

Group 1 = US$0 – US$559 (Lowest 20%)

Group 2 = US$560 – US$1,845 (21%-40%)

Group 3 = US$1,846 – US$4,700 (41%-60%)

Group 4 = US$4,701 – US$13,578 (61%-80%)

Group 5 = US$13,579 – US$105,148 (81%-100%/Highest 20%)

Two Life Expectancy Groups:

Group 1 = 40.001 – 60 years of age

Group 2 = 60.001 – 90 years of age
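The groupings above could be created in a DATA step along the following lines. This is a hedged sketch under stated assumptions: the dataset names gapminder and grouped, the raw variable incomeperperson, and the new variable lifeexpectancygroup are all hypothetical, and values that fall between two listed ranges (for example 559.5) are assigned to the lower group.

```sas
/* Sketch only: gapminder, grouped, incomeperperson and
   lifeexpectancygroup are assumed names. */
DATA grouped;
  SET gapminder;

  /* Five income groups. Missing values are handled first because
     SAS treats a missing numeric value as smaller than any number. */
  IF incomeperperson = . THEN incomeperpersongroup = .;
  ELSE IF incomeperperson < 560   THEN incomeperpersongroup = 1;
  ELSE IF incomeperperson < 1846  THEN incomeperpersongroup = 2;
  ELSE IF incomeperperson < 4701  THEN incomeperpersongroup = 3;
  ELSE IF incomeperperson < 13579 THEN incomeperpersongroup = 4;
  ELSE incomeperpersongroup = 5;

  /* Two life expectancy groups: up to 60 years, and above 60 years */
  IF lifeexpectancy = . THEN lifeexpectancygroup = .;
  ELSE IF lifeexpectancy <= 60 THEN lifeexpectancygroup = 1;
  ELSE lifeexpectancygroup = 2;
RUN;
```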

When examining the association between average life expectancy (categorized response variable) and income per person (categorized explanatory variable), a Chi-Square Test of Independence revealed that, among 176 countries, citizens of countries with higher income were likely to live longer than citizens of countries with lower income, Chi-Square (X2) = 109.1614, 4 degrees of freedom (df), p<0.0001. Per the Chi-Square table, 100% of countries in the highest income group (Income Group 5) have an average life expectancy of 60.001 to 90 years. 96.97% of countries in Income Group 4 and 75.68% in Income Group 3 also have an average life expectancy of 60.001 to 90 years. On the other hand, 63.89% of countries in Income Group 2 and 100% in the lowest income group (Income Group 1) have an average life expectancy of 40.001 to 60 years.

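The degrees of freedom (df) for a Chi-Square Test of Independence equal (number of rows – 1) × (number of columns – 1). Here the response variable (life expectancy) has 2 levels and the explanatory variable (income per person) has 5 levels, so df = (2 – 1) × (5 – 1) = 4. Because the response variable has only 2 levels, this equals the number of levels of the explanatory variable minus 1.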

 

Interpretation for the Post Hoc Chi-Square Test results

20151227 Chi Square p value table

Post hoc comparisons of life expectancy (2 categories) between pairs of income levels (5 categories) revealed that longer life expectancy was more common among countries with higher income levels.

Five income groups give 10 pairwise comparisons, so the Bonferroni-adjusted significance level is 0.05/10 = 0.005. The Post Hoc Chi-Square Test showed the following:

* Income Group 1 (Lowest 20%) is significantly different from Income Groups 2, 3, 4 and 5.

* Income Group 2 (21%-40%) is significantly different from Income Groups 3, 4 and 5.

* Income Group 3 (41%-60%) is significantly different from Income Group 5.

* Income Group 3 is NOT significantly different from Income Group 4 (61%-80%), and Income Group 4 is NOT significantly different from Income Group 5 (Highest 20%).

The Post Hoc Chi-Square Test demonstrated that countries with higher income per person are more likely to have a long life expectancy (an average number of years a newborn child would live of more than 60). According to the Chi-Square table, 100% of countries in the highest income group, 96.97% in Income Group 4, and 75.68% in Income Group 3 have an average life expectancy of more than 60 years. However, only 36.11% of countries in Income Group 2 have an average life expectancy of more than 60 years, and no country in the lowest income group does.

On the other hand, 100% of countries in the lowest income group, 63.89% in Income Group 2, and 24.32% in Income Group 3 have an average life expectancy of 60 years or less. Only 3.03% of countries in Income Group 4, and no country in the highest income group, have an average life expectancy of 60 years or less.
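One way to run a single post hoc comparison is to restrict the Chi-Square Test to two income groups at a time and compare the resulting p-value with the Bonferroni-adjusted level of 0.005. The sketch below shows this for Income Groups 1 and 2. It is an illustration under assumptions, not necessarily the approach used in the original program: the dataset name grouped and the variable lifeexpectancygroup are hypothetical.

```sas
/* Sketch only: "grouped" and "lifeexpectancygroup" are assumed names.
   Repeat with each of the 10 pairs of income groups and treat a
   comparison as significant only if its p-value is below 0.005. */
PROC FREQ DATA=grouped;
  WHERE incomeperpersongroup IN (1, 2);   /* keep only the two groups compared */
  TABLES lifeexpectancygroup*incomeperpersongroup / CHISQ;
RUN;
```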

Data Analysis Tools – Running an Analysis of Variance

Research Question: Is lower income associated with worse health from a global perspective?

Please click here for my codebook.

Please click here for the entire SAS program.

The following is the SAS syntax used to run the post hoc test (Duncan Multiple Range Test) for ANOVA.

20151220 ANOVA SAS Syntax Screenshot

SAS program and output for ANOVA F Tests:

1. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Alcohol Consumption

2. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Breast Cancer Per 100,000 Females

3. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and HIV Rate

4. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Life Expectancy

5. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Number of Suicide
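Because the five tests listed above differ only in the response variable, they could also be run through a small macro. This is an optional sketch rather than the program used for the results below: the macro name run_anova and the dataset name grouped are assumptions, while the variable names are those given in each section.

```sas
/* Sketch only: run_anova and the dataset "grouped" are assumed names. */
%MACRO run_anova(response);
  PROC ANOVA DATA=grouped;
    CLASS incomeperpersongroup;                 /* categorical explanatory variable */
    MODEL &response = incomeperpersongroup;     /* quantitative response variable */
    MEANS incomeperpersongroup / DUNCAN;        /* Duncan Multiple Range post hoc test */
  RUN;
  QUIT;                                         /* PROC ANOVA is interactive; QUIT ends it */
%MEND run_anova;

%run_anova(alcconsumption);
%run_anova(breastcancerper100TH);
%run_anova(HIVrate);
%run_anova(lifeexpectancy);
%run_anova(suicideper100TH);
```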

1. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Alcohol Consumption

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: alcconsumption (representing Alcohol Consumption)

There are a total of five income groups:

Group 1 = US$0 – US$559 (Lowest 20%)
Group 2= US$560 – US$1,845 (21%-40%)
Group 3= US$1,846 – US$4,700 (41%-60%)
Group 4= US$4,701 – US$13,578 (61%-80%)
Group 5= US$13,579 – US$105,148 (81%-100%/Highest 20%)

Syntax:

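/* Run a one-way ANOVA of alcohol consumption by income group */
PROC ANOVA;
CLASS incomeperpersongroup;                /* categorical explanatory variable */
MODEL alcconsumption=incomeperpersongroup; /* quantitative response = explanatory */
MEANS incomeperpersongroup/DUNCAN;         /* Duncan Multiple Range post hoc test */
RUN;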

SAS Output for ANOVA

20151220 ANOVA incomeNalcohol 1 20151220 ANOVA incomeNalcohol 2 20151220 ANOVA incomeNalcohol 3 20151220 ANOVA incomeNalcohol 4

There are a total of five income groups.

Number of countries in this sample = 179

Null Hypothesis: Income Per Person and Alcohol Consumption are NOT related.

Alternative Hypothesis: Income Per Person and Alcohol Consumption ARE related.

F-value = 8.88

P-value = <.0001

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 4.389

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 4.916

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 7.235

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 8.785

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 9.464

When examining the association between alcohol consumption (quantitative response) and income per person (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among 179 countries in the sample, there is a significant association between income level and alcohol consumption (P-value = <.0001).

The Duncan Multiple Range Test also showed the following:

* The means of Income Group 1 (lowest 20%)  and Income Group 2 (21%-40%) are significantly different from the means of Group 3, 4 and 5.

* The mean of Income Group 3 (41%-60%) is significantly different from the means of Group 1, 2 and 5.

* The mean of Income Group 4 (61%-80%) is significantly different from the means of Group 1 and 2.

* The mean of Income Group 5 (highest 20%) is significantly different from the means of Group 1, 2 and 3.

This demonstrates that countries with higher income per person consume significantly more alcohol per capita (adults aged 15+) in a year.

2. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Breast Cancer Per 100,000 Females

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: breastcancerper100TH (representing Number of New Cases of Breast Cancer Per 100,000 Females)

There are a total of five income groups (the same income groups used in the test above).

Syntax:

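/* Run a one-way ANOVA of new breast cancer cases by income group */
PROC ANOVA;
CLASS incomeperpersongroup;                      /* categorical explanatory variable */
MODEL breastcancerper100TH=incomeperpersongroup; /* quantitative response = explanatory */
MEANS incomeperpersongroup/DUNCAN;               /* Duncan Multiple Range post hoc test */
RUN;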

SAS Output for ANOVA

20151220 ANOVA incomeNbreastcancer 1 20151220 ANOVA incomeNbreastcancer 2 20151220 ANOVA incomeNbreastcancer 3 20151220 ANOVA incomeNbreastcancer 4

There are a total of five income groups.

Number of countries in this sample = 165

Null Hypothesis: Income Per Person and Number of New Cases of Breast Cancer are NOT related.

Alternative Hypothesis: Income Per Person and Number of New Cases of Breast Cancer ARE related.

F-value = 54.06

P-value = <.0001

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 18.841

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 28.551

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 32.321

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 44.813

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 70.053

When examining the association between the number of new cases of breast cancer per 100,000 females (quantitative response) and income per person (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among 165 countries in the sample, there is a significant association between income level and the number of new cases of breast cancer (P-value = <.0001).

The Duncan Multiple Range Test also showed the following:

* The mean of Income Group 1 (lowest 20%) is significantly different from the means of Group 2, 3, 4 and 5.

* The means of Income Group 2 (21%-40%) and Income Group 3 (41%-60%) are significantly different from the means of Group 1, 4 and 5.

* The mean of Income Group 4 (61%-80%) is significantly different from the means of Group 1, 2, 3 and 5.

* The mean of Income Group 5 (highest 20%) is significantly different from the means of Group 1, 2, 3 and 4.

This demonstrates that countries with higher income per person have significantly more new cases of breast cancer per 100,000 females in a year.

3. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and HIV Rate

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: hivrate (representing HIV Rate, Ages 15-49)

There are a total of five income groups (the same income groups used in the test above).

Syntax:

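/* Run a one-way ANOVA of HIV rate by income group */
PROC ANOVA;
CLASS incomeperpersongroup;         /* categorical explanatory variable */
MODEL HIVrate=incomeperpersongroup; /* quantitative response = explanatory */
MEANS incomeperpersongroup/DUNCAN;  /* Duncan Multiple Range post hoc test */
RUN;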

SAS Output for ANOVA

20151220 ANOVA incomeNhivrate 1 20151220 ANOVA incomeNhivrate 2 20151220 ANOVA incomeNhivrate 3 20151220 ANOVA incomeNhivrate 4

There are a total of five income groups.

Number of countries in this sample = 145

Null Hypothesis: Income Per Person and HIV Rate are NOT related.

Alternative Hypothesis: Income Per Person and HIV Rate ARE related.

F-value = 3.50

P-value = 0.0093

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 3.830

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 1.716

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 2.626

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 0.606

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 0.322

Judging by the ANOVA p-value alone (0.0093), there is a significant association between income level and HIV rate among the 145 countries in the sample.

However, the Duncan Multiple Range Test showed the following:

* The means of Income Group 2 (21%-40%) and Income Group 3 (41%-60%) are NOT significantly different from the mean of any other income group.

* The mean of Income Group 1 (lowest 20%) is significantly different from the means of Group 4 (61%-80%) and Group 5 (highest 20%).

* The mean of Income Group 3 (2.626) is higher than the mean of Income Group 2 (1.716), so the means do not fall steadily as income rises.

This suggests that HIV rate does not change consistently with income level: only the lowest income group has a significantly higher HIV rate than the two highest income groups.

4. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Life Expectancy

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: lifeexpectancy (representing Average Number of Years a Newborn Child Would Live)

There are a total of five income groups (the same income groups used in the test above).

Syntax:

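/* Run a one-way ANOVA of life expectancy by income group */
PROC ANOVA;
CLASS incomeperpersongroup;                /* categorical explanatory variable */
MODEL lifeexpectancy=incomeperpersongroup; /* quantitative response = explanatory */
MEANS incomeperpersongroup/DUNCAN;         /* Duncan Multiple Range post hoc test */
RUN;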

SAS Output for ANOVA

20151220 ANOVA incomeNlifeexpectancy 1 20151220 ANOVA incomeNlifeexpectancy 2 20151220 ANOVA incomeNlifeexpectancy 3 20151220 ANOVA incomeNlifeexpectancy 4

There are a total of five income groups.

Number of countries in this sample = 176

Null Hypothesis: Income Per Person and Life Expectancy are NOT related.

Alternative Hypothesis: Income Per Person and Life Expectancy ARE related.

F-value = 83.09

P-value = < .0001

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 57.081

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 66.996

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 71.234

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 74.842

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 80.402

When examining the association between life expectancy (quantitative response) and income per person (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among 176 countries in the sample, there is a significant association between income level and life expectancy (P-value = < .0001 ).

The Duncan Multiple Range Test also showed that the means of all income groups are significantly different.

This demonstrates that countries with higher income per person have a significantly higher life expectancy (the average number of years a newborn child would live).

5. ANOVA F Test (Post Hoc Paired Comparisons) for Income Level and Number of Suicide

SAS Program

Categorical Explanatory Variable: incomeperpersongroup (representing Income Group)

Quantitative Response Variable: suicideper100TH (representing Number of Suicide Per 100,000 people)

There are a total of five income groups (the same income groups used in the test above).

Syntax:

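/* Run a one-way ANOVA of the number of suicides by income group */
PROC ANOVA;
CLASS incomeperpersongroup;                 /* categorical explanatory variable */
MODEL suicideper100TH=incomeperpersongroup; /* quantitative response = explanatory */
MEANS incomeperpersongroup/DUNCAN;          /* Duncan Multiple Range post hoc test */
RUN;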

SAS Output for ANOVA

20151220 ANOVA incomeNsuicidenumber 1 20151220 ANOVA incomeNsuicidenumber 2 20151220 ANOVA incomeNsuicidenumber 3 20151220 ANOVA incomeNsuicidenumber 4

There are a total of five income groups.

Number of countries in this sample = 181

Null Hypothesis: Income Per Person and Number of Suicide are NOT related.

Alternative Hypothesis: Income Per Person and Number of Suicide ARE related.

F-value = 0.19

P-value = 0.9456

Means for Each Income Group:

Group 1 US$0 – US$559 (Lowest 20%):

Mean = 9.670

Group 2 US$560 – US$1,845 (21%-40%):

Mean = 10.436

Group 3 US$1,846 – US$4,700 (41%-60%):

Mean  = 9.375

Group 4 US$4,701 – US$13,578 (61%-80%):

Mean = 9.404

Group 5 US$13,579 – US$105,148 (81%-100%/Highest 20%):

Mean = 9.453

The ANOVA p-value of 0.9456 indicates that, among the 181 countries in the sample, there is NO significant association between income level and the number of suicides. The data do not provide enough evidence to reject the null hypothesis that income level and the number of suicides are NOT related.

Consistent with this, the Duncan Multiple Range Test showed that no income group's mean is significantly different from the mean of any other income group.

Data Management and Visualization – Creating graphs

SAS Program

Research Question: Is lower income associated with worse health from a global perspective?

Please click here for my codebook.

Please click here for the entire SAS program. (Please click the following images for larger images)

1. Assign label names for variables and set unknown values to missing values

[Screenshot: Wk4 SAS Code HW Screen 1]

2. Create secondary variable and groups for the variables

[Screenshots: Wk4 SAS Code HW Screen 2 and Screen 3]

3. Set range of values and interpret responses for tables

[Screenshots: Wk4 SAS Code HW Screen 4 to Screen 10]

4. Create univariate and bivariate bar charts

[Screenshot: Wk4 SAS Code HW Screen 11]
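As a rough illustration of step 4, univariate and bivariate bar charts can be produced with PROC GCHART, which requires SAS/GRAPH. This sketch is not the code from the screenshot: the dataset name grouped is an assumption, and alcohol consumption is used only as an example variable.

```sas
/* Sketch only: the dataset name "grouped" is assumed. */

/* Univariate bar chart: percentage of countries in each interval
   of a quantitative variable */
PROC GCHART DATA=grouped;
  VBAR alcconsumption / TYPE=PCT;
RUN;

/* Bivariate bar chart: mean of the quantitative response
   within each income group */
PROC GCHART DATA=grouped;
  VBAR incomeperpersongroup / DISCRETE TYPE=MEAN SUMVAR=alcconsumption;
RUN;
QUIT;
```

The DISCRETE option draws one bar per income group instead of treating the group codes as a continuous range.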

Univariate and Bivariate Bar Charts

Please click here for all bar charts.

The above univariate graph of the income per person by countries is unimodal, with its highest peak at US$6,000 which is the mid-point of the interval between US$0 and US$12,000. It is skewed to the right as there are higher frequencies in the lower income ranges.

Wk4 SAS Bar Chart Alcohol Consumption

The above univariate graph of the estimated average alcohol consumption per capita in liters is unimodal, with its highest peak at 5.0 liters. It is skewed to the right as there are higher frequencies at lower levels of alcohol consumption.

Wk4 SAS Bar Chart Breast Cancer

The above univariate graph of the number of new breast cancer cases per 100,000 females is trimodal, with its highest peak at 18 cases. The other peaks are at 54 cases and 90 cases. It is skewed to the right as there are higher frequencies at lower numbers of new breast cancer cases.

Wk4 SAS Bar Chart HIVrate

The above univariate graph of the HIV rate is unimodal, with its highest peak at 1.5%, which is the mid-point of the interval between 0% and 3%. It is skewed to the right as there are higher frequencies at lower HIV rates.

Wk4 SAS Bar Chart Life Expectancy

The above univariate graph of life expectancy is bimodal, with its highest peak at 76 years, which is the mid-point of the interval between 74 and 78 years. The other peak is at 56 years, which is the mid-point of the interval between 54 and 58 years. It is skewed to the left as there are higher frequencies at older ages.

Wk4 SAS Bar Chart Suicide Number

The above univariate graph of the number of suicides per 100,000 people is unimodal, with its highest peak at 6 cases, which is the mid-point of the interval between 4 and 8 cases. It is skewed to the right as there are higher frequencies at lower numbers of cases.

Wk4 SAS Bar Chart Income vs Alcohol Consumption

* The bar chart above shows the relationship between a country's income per person and its alcohol consumption per capita. There is a positive relationship between the two variables: an increase in income is associated with an increase in alcohol consumption.

* Income group:
Interval 1: US$0 – US$559 (Lowest 20%)
Interval 2: US$560 – US$1,845 (21%-40%)
Interval 3: US$1,846 – US$4,700 (41%-60%)
Interval 4: US$4,701 – US$13,578 (61%-80%)
Interval 5: US$13,579 – US$105,148 (81%-100%/Highest 20%)

Wk4 SAS Bar Chart Income vs Breast Cancer

* The bar chart above shows the relationship between a country's income per person and its number of new breast cancer cases. There is a positive relationship between the two variables: an increase in income is associated with an increase in the number of new breast cancer cases.

Wk4 SAS Bar Chart Income vs HIVrate

* The bar chart above shows the relationship between a country's income per person and its HIV rate. The chart does not show a clear relationship between the two variables because it is bimodal. However, the two intervals with the highest income (Interval 4 US$4,701-US$13,578 and Interval 5 US$13,579-US$105,148) have lower HIV rates.

Wk4 SAS Bar Chart Income vs Life Expectancy

* The bar chart above shows the relationship between a country's income per person and its life expectancy. There is a positive relationship between the two variables: an increase in income is associated with an increase in life expectancy.

Wk4 SAS Bar Chart Income vs Suicide Number

* The bar chart above shows the relationship between a country's income per person and its number of suicides per 100,000 people. The number of suicides is examined because it reflects mental health levels. The chart does not show a clear relationship between the two variables.