A project overview can be found in the README.md document on the main GitHub project page: link
Import the maternal smoking dataset:
CHDS <- read.csv("CHDS.csv")
summary(CHDS)
## bwt gestwks age mnocig
## Min. : 3.300 Min. :29.00 Min. :15.00 Min. : 0.000
## 1st Qu.: 6.800 1st Qu.:39.00 1st Qu.:21.00 1st Qu.: 0.000
## Median : 7.600 Median :40.00 Median :25.00 Median : 0.000
## Mean : 7.516 Mean :39.77 Mean :25.86 Mean : 7.431
## 3rd Qu.: 8.200 3rd Qu.:41.00 3rd Qu.:29.00 3rd Qu.:12.000
## Max. :11.400 Max. :48.00 Max. :42.00 Max. :50.000
## mheight mppwt
## Min. :57.00 Min. : 85.0
## 1st Qu.:63.00 1st Qu.:115.0
## Median :64.00 Median :125.0
## Mean :64.43 Mean :126.9
## 3rd Qu.:66.00 3rd Qu.:135.0
## Max. :71.00 Max. :246.0
hist(CHDS$bwt, main = paste("Histogram of Birth Weight"), xlab = "Birth Weight (lbs)")
boxplot(CHDS$bwt, main = "Box Plot of Birth Weight", ylab = "Birth Weight (lbs)")
hist(CHDS$gestwks, main = paste("Histogram of Gestational Age"), xlab = "Gestational Age (weeks)")
boxplot(CHDS$gestwks, main = "Box Plot of Gestational Age", ylab = "Gestational Age (weeks)")
hist(CHDS$age, main = paste("Histogram of Maternal Age"), xlab = "Maternal Age (years)")
boxplot(CHDS$age, main = "Box Plot of Maternal Age", ylab = "Maternal Age (years)")
hist(CHDS$mnocig, main = paste("Histogram of Maternal Smoking"), xlab = "Maternal Smoking (cigarettes/day)")
boxplot(CHDS$mnocig, main = "Box Plot of Maternal Smoking", ylab = "Maternal Smoking (cigarettes/day)")
hist(CHDS$mheight, main = paste("Histogram of Maternal Height"), xlab = "Maternal Height (in)")
boxplot(CHDS$mheight, main = "Box Plot of Maternal Height", ylab = "Maternal Height (in)")
hist(CHDS$mppwt, main = paste("Histogram of Maternal Weight"), xlab = "Maternal Pre-Partum Weight (lbs)")
boxplot(CHDS$mppwt, main = "Box Plot of Maternal Weight", ylab = "Maternal Pre-Partum Weight (lbs)")
The following R code creates a new variable for body mass index (BMI) from the maternal height (in) and pre-pregnancy weight (lbs) data; the factor 703 converts lb/in² to kg/m²:
CHDS$BMI <- 703 * CHDS$mppwt / ((CHDS$mheight)^2)
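As a quick check on the formula above, the following sketch wraps it in a helper and compares it with the metric definition of BMI (kg/m²). The function names bmi_imperial and bmi_metric and the example values (125 lbs and 64 in, the sample medians) are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper: BMI from weight in pounds and height in inches.
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

# Same quantity from the metric definition, kg / m^2, using the exact
# conversions 1 lb = 0.45359237 kg and 1 in = 0.0254 m.
bmi_metric <- function(weight_lb, height_in) {
  (weight_lb * 0.45359237) / (height_in * 0.0254)^2
}

example_bmi <- bmi_imperial(125, 64)
round(example_bmi, 2)   # 21.45

# The rounded factor 703 changes the result by less than 0.01 BMI units.
abs(example_bmi - bmi_metric(125, 64))
```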
The following R code creates a new variable for BMI category:
CHDS$BMI_cat <-
ifelse(CHDS$BMI < 18.5, 0,
ifelse(CHDS$BMI < 25, 1,
ifelse(CHDS$BMI < 30, 2,
ifelse(CHDS$BMI < 35, 3,
ifelse(CHDS$BMI < 40, 4,
5)))))
Note that 0 corresponds to underweight, 1 to normal weight, 2 to overweight, 3 to class I obese, 4 to class II obese, and 5 to class III obese.
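The same categories can also be produced with cut(), which keeps the labels attached to the codes. This is only a sketch of an alternative, not the coding used in the analysis; the vector bmi holds made-up values chosen to sit on the category boundaries, and the label strings are assumptions.

```r
# Made-up BMI values chosen to hit each category boundary (not CHDS data).
bmi <- c(17.0, 18.5, 24.9, 25.0, 32.0, 37.5, 41.0)

bmi_labels <- c("underweight", "normal", "overweight",
                "obese I", "obese II", "obese III")

# right = FALSE makes the intervals closed on the left, e.g. [18.5, 25),
# which matches the strict "<" comparisons in the nested ifelse() calls.
bmi_factor <- cut(bmi,
                  breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
                  labels = bmi_labels,
                  right = FALSE)

# Recover the 0-5 integer codes described in the note above.
bmi_code <- as.integer(bmi_factor) - 1L
data.frame(bmi, bmi_factor, bmi_code)
```

A labelled factor is treated as categorical by lm(), whereas the numeric 0-5 codes are treated as a single numeric predictor.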
The following R code creates a new variable for smoking category. Mothers are categorized as 0, non-smokers; 1, light smokers (1-9 cigarettes/day); 2, moderate smokers (10-19 cigarettes/day); or 3, heavy smokers (20+ cigarettes/day):
CHDS$SMK_cat <-
ifelse(CHDS$mnocig == 0, 0,
ifelse(CHDS$mnocig < 10, 1,
ifelse(CHDS$mnocig < 20, 2,
3)))
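To check the thresholds in the coding above, the sketch below applies it to made-up boundary values and compares it with findInterval(), a base R alternative. The vector cigs is an assumption for illustration (non-negative whole numbers, as in the summary of mnocig) and is not CHDS data.

```r
# Made-up cigarettes-per-day values on the category boundaries (not CHDS data),
# assumed to be non-negative whole numbers.
cigs <- c(0, 1, 9, 10, 19, 20, 50)

# findInterval() counts how many cut points are <= each value:
# 0 -> 0, 1-9 -> 1, 10-19 -> 2, 20+ -> 3.
smk_code <- findInterval(cigs, c(1, 10, 20))

# Reference coding, copied from the nested ifelse() calls above.
smk_ref <- ifelse(cigs == 0, 0,
           ifelse(cigs < 10, 1,
           ifelse(cigs < 20, 2,
           3)))

all(smk_code == smk_ref)   # TRUE
```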
The following R code creates a scatter plot matrix:
pairs(~bwt+gestwks+age+mnocig+mheight+mppwt+BMI+BMI_cat+SMK_cat, data = CHDS, main = "Scatterplot Matrix")
pwcorr(CHDS)
## bwt gestwks age mnocig
## 1 bwt 1
## 2 gestwks 0.43\n(<0.01) 1
## 3 age 0\n(0.97) 0\n(0.93) 1
## 4 mnocig -0.18\n(<0.01) -0.07\n(0.06) 0.04\n(0.24) 1
## 5 mheight 0.2\n(<0.01) 0.05\n(0.21) 0.02\n(0.65) 0.03\n(0.5)
## 6 mppwt 0.22\n(<0.01) 0.05\n(0.18) 0.12\n(<0.01) -0.03\n(0.5)
## 7 BMI 0.13\n(<0.01) 0.03\n(0.4) 0.12\n(<0.01) -0.05\n(0.21)
## 8 BMI_cat 0.1\n(<0.01) 0.04\n(0.29) 0.03\n(0.42) 0.04\n(0.33)
## 9 SMK_cat -0.21\n(<0.01) -0.07\n(0.05) 0\n(0.95) 0.94\n(<0.01)
## mheight mppwt BMI BMI_cat SMK_cat
## 1
## 2
## 3
## 4
## 5 1
## 6 0.49\n(<0.01) 1
## 7 -0.06\n(0.11) 0.84\n(<0.01) 1
## 8 -0.04\n(0.34) 0.68\n(<0.01) 0.81\n(<0.01) 1
## 9 0.01\n(0.79) -0.07\n(0.07) -0.09\n(0.02) 0.01\n(0.81) 1
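pwcorr is not a base R function and its definition is not shown here. Assuming each cell above is a Pearson correlation with its two-sided p-value in parentheses (the \n being a line break between the two), a single cell can be reproduced with cor.test(). The sketch uses simulated data rather than the CHDS columns, and the formatting rule for the p-value is a guess at the table's convention.

```r
set.seed(1)

# Simulated stand-ins for two columns (not CHDS data).
n <- 200
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)

ct <- cor.test(x, y)       # Pearson correlation by default

r <- unname(ct$estimate)   # correlation coefficient
p <- ct$p.value            # two-sided p-value

# Format one cell in an "r (p)" style similar to the table above.
p_text <- if (p < 0.01) "<0.01" else sprintf("%.2f", p)
cell <- sprintf("%.2f (%s)", r, p_text)
cell
```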
We can perform the Shapiro-Wilk test to test birth weight for normality:
shapiro.test(CHDS$bwt)
##
## Shapiro-Wilk normality test
##
## data: CHDS$bwt
## W = 0.99645, p-value = 0.133
The p-value (0.133) is above 0.05, so the test does not reject normality. A Q-Q plot and boxplot provide a visual check:
qqnorm(CHDS$bwt)
boxplot(CHDS$bwt)
These results are consistent with birth weight being approximately normally distributed. For completeness, we can use the gladder function to compare candidate transformations:
gladder(CHDS$bwt)
None of the transformed distributions looks particularly better than the original distribution.
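The normality checks above can be tried on simulated data to see what each piece returns. This is a minimal sketch: sim_bwt is a simulated sample, not the CHDS birth weights, and the added qqline() reference line is a suggestion rather than part of the original analysis.

```r
set.seed(2)

# Simulated sample (not CHDS data): roughly normal "birth weights" in lbs.
sim_bwt <- rnorm(300, mean = 7.5, sd = 1.1)

sw <- shapiro.test(sim_bwt)
sw$statistic   # W; values close to 1 agree with normality
sw$p.value     # small values are evidence against normality

# Q-Q plot with a reference line; points close to the line support normality.
pdf(NULL)      # null graphics device so the sketch also runs non-interactively
qqnorm(sim_bwt)
qqline(sim_bwt)
invisible(dev.off())
```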
We can perform the Shapiro-Wilk test to test gestational age for normality:
shapiro.test(CHDS$gestwks)
##
## Shapiro-Wilk normality test
##
## data: CHDS$gestwks
## W = 0.93902, p-value = 4.551e-16
gladder(CHDS$gestwks)
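The very small p-value (P < 0.001) is strong evidence against normality, so the Shapiro-Wilk result is not consistent with a normal distribution. We will do a Q-Q plot and boxplot to see how large the departure is:
qqnorm(CHDS$gestwks)
boxplot(CHDS$gestwks)
Because the Shapiro-Wilk test rejects normality, gestational age can at most be treated as approximately normally distributed, and only if the plots show no marked skew or heavy tails.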
The following code evaluates the partial correlation coefficients (the correlation between each pair of variables after adjusting for all the others) for the variables used in the study:
pcor(CHDS)$estimate
## bwt gestwks age mnocig mheight
## bwt 1.000000000 0.42104338 -0.022113163 0.035428811 -0.008273147
## gestwks 0.421043377 1.00000000 0.013115247 -0.020704678 0.040997469
## age -0.022113163 0.01311525 1.000000000 0.116230228 -0.005037463
## mnocig 0.035428811 -0.02070468 0.116230228 1.000000000 0.004806437
## mheight -0.008273147 0.04099747 -0.005037463 0.004806437 1.000000000
## mppwt 0.034889198 -0.04703086 0.008245065 0.000565178 0.992472635
## BMI -0.026128391 0.04181560 0.012905507 0.005292619 -0.985114977
## BMI_cat 0.010126202 0.02693642 -0.117662659 0.010357430 -0.001173643
## SMK_cat -0.098525595 0.02408128 -0.100835398 0.940905720 -0.013587523
## mppwt BMI BMI_cat SMK_cat
## bwt 0.034889198 -0.026128391 0.010126202 -0.09852560
## gestwks -0.047030858 0.041815595 0.026936421 0.02408128
## age 0.008245065 0.012905507 -0.117662659 -0.10083540
## mnocig 0.000565178 0.005292619 0.010357430 0.94090572
## mheight 0.992472635 -0.985114977 -0.001173643 -0.01358752
## mppwt 1.000000000 0.991423771 0.003685339 0.01053122
## BMI 0.991423771 1.000000000 0.101953565 -0.02251821
## BMI_cat 0.003685339 0.101953565 1.000000000 0.03782960
## SMK_cat 0.010531220 -0.022518208 0.037829597 1.00000000
Note from the above output that the magnitudes of the correlations of bwt with mppwt and BMI are almost as large as the magnitude of the correlation of bwt with mnocig. Also worth noting is the magnitude of the correlation between bwt and gestwks, although this is not unexpected, since birth weight is well documented to be associated with gestational age, with a much higher risk of low birth weight among pre-term infants.
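The values above differ from the pairwise table because, assuming pcor here is the partial-correlation function from the ppcor package, each entry is adjusted for all the other columns. The sketch below shows on simulated data (not CHDS) how a partial correlation can be computed in base R in two equivalent ways; the variable names are made up for illustration.

```r
set.seed(3)

# Simulated data (not CHDS): x and y are related only through z.
n <- 500
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)
d <- cbind(x = x, y = y, z = z)

# Method 1: correlate the residuals left after regressing each variable on z.
pc_resid <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Method 2: rescale the inverse of the correlation matrix
# (row/column 1 is x, row/column 2 is y).
P <- solve(cor(d))
pc_inv <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])

c(pairwise = cor(x, y), partial_resid = pc_resid, partial_inv = pc_inv)
```

The pairwise correlation of x and y is sizeable, while the partial correlation given z is close to zero. In the same way, the very large partial correlations among mheight, mppwt and BMI above reflect that BMI is computed directly from the other two.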
summary(lm(CHDS$bwt ~ CHDS$gestwks))
##
## Call:
## lm(formula = CHDS$bwt ~ CHDS$gestwks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1734 -0.6253 0.0266 0.6276 3.3305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.34818 0.80582 -2.914 0.00369 **
## CHDS$gestwks 0.24804 0.02024 12.255 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9891 on 678 degrees of freedom
## Multiple R-squared: 0.1814, Adjusted R-squared: 0.1801
## F-statistic: 150.2 on 1 and 678 DF, p-value: < 2.2e-16
The output above shows that there is very highly statistically significant evidence (P < 0.001) for a linear relationship between bwt and gestwks.
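To make the fitted coefficients concrete, the following sketch plugs the estimates reported above into the regression equation. The chosen gestational ages (36 and 40 weeks) are illustrative assumptions.

```r
# Coefficients copied from the summary output above.
intercept <- -2.34818
slope     <-  0.24804   # lbs per additional week of gestation

# Fitted mean birth weight at 40 weeks (the sample median gestational age).
pred_40 <- intercept + slope * 40
pred_40                 # 7.57342

# Expected difference in birth weight between 36 and 40 weeks.
diff_36_40 <- slope * (40 - 36)
diff_36_40              # 0.99216
```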
summary(lm(CHDS$bwt ~ CHDS$age))
##
## Call:
## lm(formula = CHDS$bwt ~ CHDS$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2181 -0.7170 0.0804 0.6851 3.8811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5097117 0.2029260 37.007 <2e-16 ***
## CHDS$age 0.0002614 0.0076786 0.034 0.973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.093 on 678 degrees of freedom
## Multiple R-squared: 1.709e-06, Adjusted R-squared: -0.001473
## F-statistic: 0.001159 on 1 and 678 DF, p-value: 0.9729
The output above shows there is insufficient evidence (P = 0.9729) to suggest a linear relationship between birth-weight and age.
summary(lm(CHDS$bwt ~ CHDS$mnocig))
##
## Call:
## lm(formula = CHDS$bwt ~ CHDS$mnocig)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3457 -0.7457 -0.0175 0.6976 3.7543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.645663 0.049406 154.751 < 2e-16 ***
## CHDS$mnocig -0.017386 0.003661 -4.749 2.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 678 degrees of freedom
## Multiple R-squared: 0.03219, Adjusted R-squared: 0.03076
## F-statistic: 22.55 on 1 and 678 DF, p-value: 2.502e-06
The output above shows that there is very highly statistically significant evidence (P < 0.001) that there is a linear association between birthweight and number of cigarettes smoked per day by the mother.
summary(lm(CHDS$bwt ~ CHDS$mheight))
##
## Call:
## lm(formula = CHDS$bwt ~ CHDS$mheight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1778 -0.6833 0.0222 0.7113 4.0113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.77576 1.06676 1.665 0.0964 .
## CHDS$mheight 0.08909 0.01654 5.385 9.98e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 678 degrees of freedom
## Multiple R-squared: 0.04102, Adjusted R-squared: 0.03961
## F-statistic: 29 on 1 and 678 DF, p-value: 9.976e-08
The output above shows that there is very highly statistically significant evidence (P < 0.001) that there is a linear association between birthweight and maternal height.
summary(lm(CHDS$bwt ~ CHDS$mppwt))
##
## Call:
## lm(formula = CHDS$bwt ~ CHDS$mppwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4210 -0.6892 0.0107 0.6800 3.8415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.798434 0.293231 19.774 < 2e-16 ***
## CHDS$mppwt 0.013539 0.002288 5.917 5.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.066 on 678 degrees of freedom
## Multiple R-squared: 0.0491, Adjusted R-squared: 0.0477
## F-statistic: 35.01 on 1 and 678 DF, p-value: 5.211e-09
The output above shows that there is very highly statistically significant evidence (P < 0.001) for a linear association between birthweight and maternal pre-pregnancy weight.
summary(lm(CHDS$bwt ~ CHDS$BMI))
##
## Call:
## lm(formula = CHDS$bwt ~ CHDS$BMI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3693 -0.7058 0.0108 0.7167 3.8015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.38694 0.34253 18.647 < 2e-16 ***
## CHDS$BMI 0.05262 0.01584 3.322 0.000941 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.084 on 678 degrees of freedom
## Multiple R-squared: 0.01602, Adjusted R-squared: 0.01457
## F-statistic: 11.04 on 1 and 678 DF, p-value: 0.0009409
The output above shows that there is very highly statistically significant evidence (P < 0.001) for a linear association between birth-weight and maternal pre-pregnancy BMI.
summary(lm(CHDS$bwt ~ CHDS$BMI + CHDS$mnocig))
##
## Call:
## lm(formula = CHDS$bwt ~ CHDS$BMI + CHDS$mnocig)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4842 -0.7097 -0.0094 0.7055 3.6819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.586817 0.340253 19.359 < 2e-16 ***
## CHDS$BMI 0.049132 0.015623 3.145 0.00173 **
## CHDS$mnocig -0.016833 0.003642 -4.622 4.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.068 on 677 degrees of freedom
## Multiple R-squared: 0.04612, Adjusted R-squared: 0.0433
## F-statistic: 16.37 on 2 and 677 DF, p-value: 1.144e-07
Including mnocig in the model still provides significant evidence for the BMI regression coefficient (0.001 < P < 0.01), although the magnitude of the coefficient has decreased slightly, from 0.05262 to 0.049132. The R² value for the model including both BMI and mnocig is higher than when BMI is used alone, increasing from 0.01602 to 0.04612.
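Since R² cannot decrease when a predictor is added, a formal comparison of the two models is a useful complement. The sketch below shows a partial F-test with anova() on simulated data; the data frame sim and its columns are made up for illustration and this test was not part of the original analysis.

```r
set.seed(4)

# Simulated data (not CHDS): two predictors with real effects on the outcome.
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2)

fit_small <- lm(y ~ x1, data = sim)
fit_large <- lm(y ~ x1 + x2, data = sim)

r2_small <- summary(fit_small)$r.squared
r2_large <- summary(fit_large)$r.squared
c(small = r2_small, large = r2_large)

# Partial F-test: does adding x2 reduce the residual sum of squares
# by more than would be expected by chance?
cmp <- anova(fit_small, fit_large)
p_added <- cmp[2, "Pr(>F)"]
p_added
```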
summary(lm(CHDS$bwt ~ CHDS$mppwt + CHDS$mnocig))
##
## Call:
## lm(formula = CHDS$bwt ~ CHDS$mppwt + CHDS$mnocig)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5420 -0.6803 -0.0027 0.6549 3.7172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.958320 0.290739 20.494 < 2e-16 ***
## CHDS$mppwt 0.013265 0.002254 5.885 6.26e-09 ***
## CHDS$mnocig -0.016844 0.003575 -4.712 2.98e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.05 on 677 degrees of freedom
## Multiple R-squared: 0.07929, Adjusted R-squared: 0.07657
## F-statistic: 29.15 on 2 and 677 DF, p-value: 7.171e-13
When mnocig is included in the model for bwt and mppwt, the magnitude of the coefficient for mppwt decreases slightly (from 0.013539 to 0.013265) and the significance level remains about the same. The R² value increases a fair amount, from 0.0491 to 0.07929, suggesting this model is better able to explain the variance in bwt than the model with mppwt alone.
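Because R² always rises when a predictor is added, the adjusted R² values in the outputs above give a fairer comparison. As a sketch, they can be reproduced from the reported R² values with the standard formula; the helper name adj_r2 is an assumption, and n = 680 is inferred from the residual degrees of freedom (678 with one predictor, 677 with two).

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_one <- adj_r2(0.0491,  n = 680, p = 1)   # mppwt only
adj_two <- adj_r2(0.07929, n = 680, p = 2)   # mppwt + mnocig

round(c(adj_one, adj_two), 4)   # 0.0477 0.0766
```

The adjusted value also increases when mnocig is added, so the improvement is not just an artefact of the extra predictor.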
The following .csv file will be used for building the association model and prediction model:
write.csv(CHDS, file = "CHDS2.csv")
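One detail worth knowing when the file is read back: by default write.csv() also writes the row names, which read.csv() then returns as an extra column named X. The sketch below demonstrates this on a small stand-in data frame written to temporary files; whether to pass row.names = FALSE for CHDS2.csv depends on how the later scripts read it, which is not shown here.

```r
# Small stand-in data frame (not CHDS data).
df <- data.frame(bwt = c(7.1, 8.3), gestwks = c(39, 41))

f_default <- tempfile(fileext = ".csv")
f_norows  <- tempfile(fileext = ".csv")

write.csv(df, file = f_default)                     # row names written
write.csv(df, file = f_norows, row.names = FALSE)   # row names omitted

cols_default <- names(read.csv(f_default))
cols_norows  <- names(read.csv(f_norows))

cols_default   # "X" "bwt" "gestwks"
cols_norows    # "bwt" "gestwks"
```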