1 Project overview

A project overview can be found in the README.md document on the main GitHub project page.

2 Import the dataset

Import the maternal smoking dataset:

CHDS <- read.csv("CHDS.csv")

3 Descriptions of individual variables

3.1 Summary statistics for each variable:

summary(CHDS)
##       bwt            gestwks           age            mnocig      
##  Min.   : 3.300   Min.   :29.00   Min.   :15.00   Min.   : 0.000  
##  1st Qu.: 6.800   1st Qu.:39.00   1st Qu.:21.00   1st Qu.: 0.000  
##  Median : 7.600   Median :40.00   Median :25.00   Median : 0.000  
##  Mean   : 7.516   Mean   :39.77   Mean   :25.86   Mean   : 7.431  
##  3rd Qu.: 8.200   3rd Qu.:41.00   3rd Qu.:29.00   3rd Qu.:12.000  
##  Max.   :11.400   Max.   :48.00   Max.   :42.00   Max.   :50.000  
##     mheight          mppwt      
##  Min.   :57.00   Min.   : 85.0  
##  1st Qu.:63.00   1st Qu.:115.0  
##  Median :64.00   Median :125.0  
##  Mean   :64.43   Mean   :126.9  
##  3rd Qu.:66.00   3rd Qu.:135.0  
##  Max.   :71.00   Max.   :246.0

3.2 Birth Weight plots

hist(CHDS$bwt, main = paste("Histogram of Birth Weight"), xlab = "Birth Weight (lbs)")

boxplot(CHDS$bwt, main = "Box Plot of Birth Weight", ylab = "Birth Weight (lbs)")

3.3 Gestational age plots

hist(CHDS$gestwks, main = paste("Histogram of Gestational Age"), xlab = "Gestational Age (weeks)")

boxplot(CHDS$gestwks, main = "Box Plot of Gestational Age", ylab = "Gestational Age (weeks)")

3.4 Maternal Age plots

hist(CHDS$age, main = paste("Histogram of Maternal Age"), xlab = "Maternal Age (years)")

boxplot(CHDS$age, main = "Box Plot of Maternal Age", ylab = "Maternal Age (years)")

3.5 Cigarettes smoked plots

hist(CHDS$mnocig, main = paste("Histogram of Maternal Smoking"), xlab = "Maternal Smoking (cigarettes/day)")

boxplot(CHDS$mnocig, main = "Box Plot of Maternal Smoking", ylab = "Maternal Smoking (cigarettes/day)")

3.6 Maternal height plots

hist(CHDS$mheight, main = paste("Histogram of Maternal Height"), xlab = "Maternal Height (in)")

boxplot(CHDS$mheight, main = "Box Plot of Maternal Height", ylab = "Maternal Height (in)")

3.7 Maternal pre-partum weight plots

hist(CHDS$mppwt, main = paste("Histogram of Maternal Weight"), xlab = "Maternal Pre-Partum Weight (lbs)")

boxplot(CHDS$mppwt, main = "Box Plot of Maternal Weight", ylab = "Maternal Pre-Partum Weight (lbs)")
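The histogram and box plot pairs in sections 3.2 to 3.7 follow one pattern, so they could also be generated in a loop. The sketch below is a minimal illustration on a synthetic two-column stand-in for CHDS; the names `demo` and `axis_labels` are hypothetical and the values are not the real data.

```r
# Synthetic stand-in for CHDS (illustrative values only, not the real data)
set.seed(1)
demo <- data.frame(
  bwt     = rnorm(50, mean = 7.5, sd = 1),
  gestwks = rnorm(50, mean = 40, sd = 2)
)

# Axis label for each column, keyed by column name
axis_labels <- c(
  bwt     = "Birth Weight (lbs)",
  gestwks = "Gestational Age (weeks)"
)

# Draw a histogram and a box plot for every labelled column
for (v in names(axis_labels)) {
  hist(demo[[v]], main = paste("Histogram of", v), xlab = axis_labels[[v]])
  boxplot(demo[[v]], main = paste("Box Plot of", v), ylab = axis_labels[[v]])
}
```

Because the data and the label are both looked up by the same column name, a plot cannot end up showing one variable under another variable's title.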

4 Generation of new variables of interest

4.1 BMI

The following R code creates a new variable for body mass index from the maternal pre-partum weight and height data:

CHDS$BMI <- 703 * CHDS$mppwt / ((CHDS$mheight)^2)
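The factor 703 converts pounds per square inch of height into the usual kg/m^2 units of BMI. As a quick check of the formula, the sketch below applies it to the median weight (125 lbs) and median height (64 in) from section 3.1; `bmi_imperial` is a hypothetical helper name used only for this illustration.

```r
# BMI from imperial units: weight in pounds, height in inches
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

# Median weight and median height from the summary table
bmi_median <- bmi_imperial(125, 64)
round(bmi_median, 2)  # 21.45

# Where 703 comes from: (kg per lb) / (m per in)^2
0.45359237 / 0.0254^2
```

A value of about 21.45 falls in the normal weight range, which is a plausible result for median inputs.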

4.2 BMI category

The following R code creates a new variable for BMI category:

CHDS$BMI_cat <- 
  ifelse(CHDS$BMI < 18.5, 0, 
  ifelse(CHDS$BMI < 25, 1,
  ifelse(CHDS$BMI < 30, 2,
  ifelse(CHDS$BMI < 35, 3,
  ifelse(CHDS$BMI < 40, 4, 
  5)))))

Note that 0 corresponds to underweight, 1 to normal weight, 2 to overweight, 3 to class I obese, 4 to class II obese, and 5 to class III obese.
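The same thresholds can be expressed with `cut()`, which also attaches readable labels. This is an alternative sketch, not the code used in this analysis; the example BMI values and the label strings are made up for illustration.

```r
# Example BMI values, including ones that sit exactly on a threshold
bmi <- c(17.0, 18.5, 24.9, 25.0, 32.0, 37.5, 40.0)

# right = FALSE makes each interval [lower, upper), matching the `<` tests
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
  labels = c("underweight", "normal", "overweight",
             "obese I", "obese II", "obese III"),
  right = FALSE
)

# Integer codes 0-5, as in the nested ifelse() version
as.integer(bmi_cat) - 1
```

A labelled factor is treated as categorical by functions such as `lm()`, whereas the numeric codes 0 to 5 are treated as a continuous scale.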

4.3 Smoking category

The following R code creates a new variable for smoking category. Patients are categorized as 0, non-smokers; 1, light smokers (1-9 cigarettes/day); 2, moderate smokers (10-19 cigarettes/day); or 3, heavy smokers (20+ cigarettes/day):

CHDS$SMK_cat <-
  ifelse(CHDS$mnocig == 0, 0,
  ifelse(CHDS$mnocig < 10, 1,
  ifelse(CHDS$mnocig < 20, 2, 
  3)))
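For whole-number cigarette counts, the same categories can be produced in one call with `findInterval()`. This is a sketch under that assumption: a fractional value such as 0.5 would fall in category 0 here but in category 1 in the nested `ifelse()` version. The example counts are made up.

```r
# Example daily cigarette counts, including the boundary values
cigs <- c(0, 1, 9, 10, 19, 20, 50)

# findInterval() counts how many lower bounds each value has reached:
# 0 -> none, 1-9 -> one, 10-19 -> two, 20+ -> three
smk_cat <- findInterval(cigs, c(1, 10, 20))
smk_cat  # 0 1 1 2 2 3 3
```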

4.4 Birth weight category

5 Scatter plot matrix

The following R code creates a scatter plot matrix:

pairs(~bwt+gestwks+age+mnocig+mheight+mppwt+BMI+BMI_cat+SMK_cat, data = CHDS, main = "Scatterplot Matrix")

6 Pairwise Correlation Matrix

pwcorr(CHDS)
##                      bwt       gestwks           age        mnocig
## 1     bwt              1                                          
## 2 gestwks  0.43\n(<0.01)             1                            
## 3     age      0\n(0.97)     0\n(0.93)             1              
## 4  mnocig -0.18\n(<0.01) -0.07\n(0.06)  0.04\n(0.24)             1
## 5 mheight   0.2\n(<0.01)  0.05\n(0.21)  0.02\n(0.65)   0.03\n(0.5)
## 6   mppwt  0.22\n(<0.01)  0.05\n(0.18) 0.12\n(<0.01)  -0.03\n(0.5)
## 7     BMI  0.13\n(<0.01)   0.03\n(0.4) 0.12\n(<0.01) -0.05\n(0.21)
## 8 BMI_cat   0.1\n(<0.01)  0.04\n(0.29)  0.03\n(0.42)  0.04\n(0.33)
## 9 SMK_cat -0.21\n(<0.01) -0.07\n(0.05)     0\n(0.95) 0.94\n(<0.01)
##         mheight         mppwt           BMI      BMI_cat SMK_cat
## 1                                                               
## 2                                                               
## 3                                                               
## 4                                                               
## 5             1                                                 
## 6 0.49\n(<0.01)             1                                   
## 7 -0.06\n(0.11) 0.84\n(<0.01)             1                     
## 8 -0.04\n(0.34) 0.68\n(<0.01) 0.81\n(<0.01)            1        
## 9  0.01\n(0.79) -0.07\n(0.07) -0.09\n(0.02) 0.01\n(0.81)       1

7 Check normality of response data

7.1 Birth Weight

We can perform the Shapiro-Wilk test to test for normality:

shapiro.test(CHDS$bwt)
## 
##  Shapiro-Wilk normality test
## 
## data:  CHDS$bwt
## W = 0.99645, p-value = 0.133

The p-value (0.133) is above 0.05, so the test does not reject normality. A Q-Q plot and boxplot provide a further check:

qqnorm(CHDS$bwt)

boxplot(CHDS$bwt)

These results suggest that birth weight is approximately normally distributed. For completeness we can check the output of the gladder function:

gladder(CHDS$bwt)

None of the transformed distributions looks particularly better than the original distribution.
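The comparison of transformed distributions can also be made numerically. The sketch below assumes that gladder draws histograms of ladder-of-powers transformations (as the Stata command of that name does) and scores a few such transformations with the Shapiro-Wilk W statistic. It uses a synthetic right-skewed sample, not the CHDS data, and the list name `transforms` is hypothetical.

```r
set.seed(42)
# Synthetic right-skewed sample (log-normal), not the CHDS data
x <- rlnorm(300)

# A few rungs of the ladder of powers
transforms <- list(
  identity = function(v) v,
  sqrt     = sqrt,
  log      = log,
  inverse  = function(v) 1 / v
)

# Shapiro-Wilk W for each transformed sample (closer to 1 is more normal)
w <- sapply(transforms, function(f) shapiro.test(f(x))$statistic[["W"]])
round(w, 3)
names(which.max(w))  # "log" for this log-normal sample
```

For a variable that is already close to normal, such as bwt, the identity row would be expected to score about as well as any transformation.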

7.2 Gestational Age

We can perform the Shapiro-Wilk test to test for normality:

shapiro.test(CHDS$gestwks)
## 
##  Shapiro-Wilk normality test
## 
## data:  CHDS$gestwks
## W = 0.93902, p-value = 4.551e-16
gladder(CHDS$gestwks)

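The very small p-value (4.551e-16) is strong evidence against a normal distribution. A Q-Q plot and boxplot show where the distribution departs from normality:

qqnorm(CHDS$gestwks)

boxplot(CHDS$gestwks)

The Shapiro-Wilk result indicates that gestational age is not normally distributed, although with 680 observations the test can flag even small departures; the Q-Q plot and boxplot indicate how large the departure is.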

8 Preliminary Exploratory analysis

The following code evaluates the partial correlation coefficients (each pair adjusted for all the other variables) for the variables used in the study:

pcor(CHDS)$estimate
##                  bwt     gestwks          age       mnocig      mheight
## bwt      1.000000000  0.42104338 -0.022113163  0.035428811 -0.008273147
## gestwks  0.421043377  1.00000000  0.013115247 -0.020704678  0.040997469
## age     -0.022113163  0.01311525  1.000000000  0.116230228 -0.005037463
## mnocig   0.035428811 -0.02070468  0.116230228  1.000000000  0.004806437
## mheight -0.008273147  0.04099747 -0.005037463  0.004806437  1.000000000
## mppwt    0.034889198 -0.04703086  0.008245065  0.000565178  0.992472635
## BMI     -0.026128391  0.04181560  0.012905507  0.005292619 -0.985114977
## BMI_cat  0.010126202  0.02693642 -0.117662659  0.010357430 -0.001173643
## SMK_cat -0.098525595  0.02408128 -0.100835398  0.940905720 -0.013587523
##                mppwt          BMI      BMI_cat     SMK_cat
## bwt      0.034889198 -0.026128391  0.010126202 -0.09852560
## gestwks -0.047030858  0.041815595  0.026936421  0.02408128
## age      0.008245065  0.012905507 -0.117662659 -0.10083540
## mnocig   0.000565178  0.005292619  0.010357430  0.94090572
## mheight  0.992472635 -0.985114977 -0.001173643 -0.01358752
## mppwt    1.000000000  0.991423771  0.003685339  0.01053122
## BMI      0.991423771  1.000000000  0.101953565 -0.02251821
## BMI_cat  0.003685339  0.101953565  1.000000000  0.03782960
## SMK_cat  0.010531220 -0.022518208  0.037829597  1.00000000

Note from the above output that the magnitudes of the correlations of bwt with mppwt and BMI are almost as large as that of bwt with mnocig. Also worth noting is the magnitude of the correlation between bwt and gestwks, although this is not unexpected: birth weight is well documented to be associated with gestational age, with a much higher risk of low birth weight among pre-term infants.
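These values differ from the pairwise correlations in section 6. Assuming pcor here is a partial-correlation function (for example the one in the ppcor package; the source does not name the package), each entry is the correlation between two variables after the linear effect of all the others has been removed. A minimal sketch of that idea on synthetic data, with hypothetical variables `x`, `y` and `z`:

```r
set.seed(1)
n <- 500
# Synthetic data: x and y are related only through a shared driver z
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)

# Pairwise (marginal) correlation: clearly positive
r_pairwise <- cor(x, y)

# Partial correlation given z: correlate what is left of x and y
# after regressing each on z
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

round(c(pairwise = r_pairwise, partial = r_partial), 2)
```

The same mechanism likely explains the entries near 1 or -1 among mheight, mppwt and BMI in the output above, since BMI is computed directly from the other two.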

8.1 Birth-weight and gestation weeks

summary(lm(CHDS$bwt ~ CHDS$gestwks))
## 
## Call:
## lm(formula = CHDS$bwt ~ CHDS$gestwks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1734 -0.6253  0.0266  0.6276  3.3305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.34818    0.80582  -2.914  0.00369 ** 
## CHDS$gestwks  0.24804    0.02024  12.255  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9891 on 678 degrees of freedom
## Multiple R-squared:  0.1814, Adjusted R-squared:  0.1801 
## F-statistic: 150.2 on 1 and 678 DF,  p-value: < 2.2e-16

The output above shows that there is very highly statistically significant evidence (P < 0.001) for a linear relationship between bwt and gestwks.
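To make the fitted line concrete, the sketch below plugs the estimated coefficients into the regression equation. The coefficients are copied from the output above; `predict_bwt` is a hypothetical helper name.

```r
# Coefficients copied from the summary() output above
intercept <- -2.34818
slope     <-  0.24804

# Fitted birth weight (lbs) at a given gestational age (weeks)
predict_bwt <- function(gestwks) intercept + slope * gestwks

predict_bwt(40)                    # about 7.57 lbs
predict_bwt(41) - predict_bwt(40)  # one extra week adds the slope, about 0.25 lbs
```

The fitted value at 40 weeks is close to the median birth weight of 7.6 lbs reported in section 3.1, where 40 weeks is also the median gestational age.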

8.2 Birth-weight and age

summary(lm(CHDS$bwt ~ CHDS$age))
## 
## Call:
## lm(formula = CHDS$bwt ~ CHDS$age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2181 -0.7170  0.0804  0.6851  3.8811 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.5097117  0.2029260  37.007   <2e-16 ***
## CHDS$age    0.0002614  0.0076786   0.034    0.973    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.093 on 678 degrees of freedom
## Multiple R-squared:  1.709e-06,  Adjusted R-squared:  -0.001473 
## F-statistic: 0.001159 on 1 and 678 DF,  p-value: 0.9729

The output above shows there is insufficient evidence (P = 0.9729) to suggest a linear relationship between birth-weight and age.

8.3 Birth-weight and cigarettes smoked per day

summary(lm(CHDS$bwt ~ CHDS$mnocig))
## 
## Call:
## lm(formula = CHDS$bwt ~ CHDS$mnocig)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3457 -0.7457 -0.0175  0.6976  3.7543 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.645663   0.049406 154.751  < 2e-16 ***
## CHDS$mnocig -0.017386   0.003661  -4.749  2.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 678 degrees of freedom
## Multiple R-squared:  0.03219,    Adjusted R-squared:  0.03076 
## F-statistic: 22.55 on 1 and 678 DF,  p-value: 2.502e-06

The output above shows that there is very highly statistically significant evidence (P < 0.001) that there is a linear association between birthweight and number of cigarettes smoked per day by the mother.

8.4 Birth-weight and maternal height

summary(lm(CHDS$bwt ~ CHDS$mheight))
## 
## Call:
## lm(formula = CHDS$bwt ~ CHDS$mheight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1778 -0.6833  0.0222  0.7113  4.0113 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.77576    1.06676   1.665   0.0964 .  
## CHDS$mheight  0.08909    0.01654   5.385 9.98e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 678 degrees of freedom
## Multiple R-squared:  0.04102,    Adjusted R-squared:  0.03961 
## F-statistic:    29 on 1 and 678 DF,  p-value: 9.976e-08

The output above shows that there is very highly statistically significant evidence (P < 0.001) that there is a linear association between birthweight and maternal height.

8.5 Birth-weight and maternal pre-pregnancy weight

summary(lm(CHDS$bwt ~ CHDS$mppwt))
## 
## Call:
## lm(formula = CHDS$bwt ~ CHDS$mppwt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4210 -0.6892  0.0107  0.6800  3.8415 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.798434   0.293231  19.774  < 2e-16 ***
## CHDS$mppwt  0.013539   0.002288   5.917 5.21e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.066 on 678 degrees of freedom
## Multiple R-squared:  0.0491, Adjusted R-squared:  0.0477 
## F-statistic: 35.01 on 1 and 678 DF,  p-value: 5.211e-09

The output above shows that there is very highly statistically significant evidence (P < 0.001) for a linear association between birthweight and maternal pre-pregnancy weight.

8.6 Birth-weight and maternal pre-pregnancy BMI

summary(lm(CHDS$bwt ~ CHDS$BMI))
## 
## Call:
## lm(formula = CHDS$bwt ~ CHDS$BMI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3693 -0.7058  0.0108  0.7167  3.8015 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.38694    0.34253  18.647  < 2e-16 ***
## CHDS$BMI     0.05262    0.01584   3.322 0.000941 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.084 on 678 degrees of freedom
## Multiple R-squared:  0.01602,    Adjusted R-squared:  0.01457 
## F-statistic: 11.04 on 1 and 678 DF,  p-value: 0.0009409

The output above shows that there is very highly statistically significant evidence (P < 0.001) for a linear association between birth-weight and maternal pre-pregnancy BMI.

8.7 Birth-weight and maternal pre-pregnancy BMI with cigarettes smoked as control

summary(lm(CHDS$bwt ~ CHDS$BMI + CHDS$mnocig))
## 
## Call:
## lm(formula = CHDS$bwt ~ CHDS$BMI + CHDS$mnocig)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4842 -0.7097 -0.0094  0.7055  3.6819 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.586817   0.340253  19.359  < 2e-16 ***
## CHDS$BMI     0.049132   0.015623   3.145  0.00173 ** 
## CHDS$mnocig -0.016833   0.003642  -4.622 4.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.068 on 677 degrees of freedom
## Multiple R-squared:  0.04612,    Adjusted R-squared:  0.0433 
## F-statistic: 16.37 on 2 and 677 DF,  p-value: 1.144e-07

With mnocig included in the model, the BMI regression coefficient remains statistically significant (0.001 < P < 0.01), although its magnitude decreases slightly, from 0.05262 to 0.049132. The R2 value for the model with both BMI and mnocig is higher than for BMI alone, increasing from 0.01602 to 0.04612.

8.8 Birth-weight and maternal pre-pregnancy weight with cigarettes smoked as control

summary(lm(CHDS$bwt ~ CHDS$mppwt + CHDS$mnocig))
## 
## Call:
## lm(formula = CHDS$bwt ~ CHDS$mppwt + CHDS$mnocig)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5420 -0.6803 -0.0027  0.6549  3.7172 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.958320   0.290739  20.494  < 2e-16 ***
## CHDS$mppwt   0.013265   0.002254   5.885 6.26e-09 ***
## CHDS$mnocig -0.016844   0.003575  -4.712 2.98e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.05 on 677 degrees of freedom
## Multiple R-squared:  0.07929,    Adjusted R-squared:  0.07657 
## F-statistic: 29.15 on 2 and 677 DF,  p-value: 7.171e-13

When mnocig is included in the model for bwt and mppwt, the magnitude of the coefficient for mppwt decreases slightly (from 0.013539 to 0.013265) and the significance level remains about the same. The R2 value increases a fair amount from 0.0491 to 0.07929, suggesting this model is better able to explain the variance in bwt compared to the model of mppwt alone.
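Because R2 never decreases when a predictor is added, the adjusted R2 is the fairer comparison between these two models. The sketch below recomputes it from the reported R2 values with the standard formula; `adj_r2` is a hypothetical helper name, and n = 680 is inferred from the 678 residual degrees of freedom in section 8.5.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 680  # 678 residual degrees of freedom + 2 coefficients in section 8.5

adj_mppwt      <- adj_r2(0.0491,  n, p = 1)  # mppwt only
adj_mppwt_cigs <- adj_r2(0.07929, n, p = 2)  # mppwt + mnocig

round(c(adj_mppwt, adj_mppwt_cigs), 4)  # 0.0477 0.0766
```

Both values agree with the adjusted R2 lines in the two outputs, and the improvement survives the penalty for the extra predictor.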

9 Export the modified dataset as a new .csv file

The following .csv file will be used for building the association model and prediction model:

write.csv(CHDS, file = "CHDS2.csv")
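By default, write.csv also writes the row names as an unnamed first column, which read.csv then reads back as an extra column named X. If that column is not wanted downstream, `row.names = FALSE` avoids it. The sketch below shows the difference on a small made-up data frame written to temporary files; `demo`, `path_default` and `path_clean` are hypothetical names.

```r
# Small stand-in data frame (illustrative, not the CHDS data)
demo <- data.frame(bwt = c(7.1, 8.0), mnocig = c(0, 12))

path_default <- tempfile(fileext = ".csv")
path_clean   <- tempfile(fileext = ".csv")

write.csv(demo, file = path_default)                   # row names written
write.csv(demo, file = path_clean, row.names = FALSE)  # row names omitted

names(read.csv(path_default))  # "X" "bwt" "mnocig"
names(read.csv(path_clean))    # "bwt" "mnocig"
```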

10 References:

See literature review section