Import the maternal smoking dataset from the preliminary analysis:
CHDS <- read.csv("CHDS2.csv")
If we let \(Y\) equal birthweight, and \(X\) equal smoking category, the crude model can be conceptualized as:
\[Y = \beta_0 + \beta_1 X + \epsilon\]
Firstly, we will create \(k-1 = 3\) dummy variables to represent the ordinal smoking category data; we can name them \(SMK1\) for category 1, \(SMK2\) for category 3, and \(SMK3\) for category 3. The values of these variables will be assigned to 0 or 1 according to the reference cell coding approach:
If we let:
\[ \begin{split} X_1= \begin{cases} 1 \; \mbox{if in smoking category 1}\\ 0 \; \mbox{otherwise}\\ \end{cases}\\ X_2= \begin{cases} 1 \; \mbox{if in smoking category 2}\\ 0 \; \mbox{otherwise}\\ \end{cases}\\ X_3= \begin{cases} 1 \; \mbox{if in smoking category 3}\\ 0 \; \mbox{otherwise}\\ \end{cases} \end{split} \]
The following R code creates these dummy variables as defined above:
CHDS$SMK1 <- ifelse(CHDS$SMK_cat == 1, 1,0)
CHDS$SMK2 <- ifelse(CHDS$SMK_cat == 2, 1,0)
CHDS$SMK3 <- ifelse(CHDS$SMK_cat == 3, 1,0)
The reformulated crude model can be expressed as:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \epsilon\]
To test if a crude association can be established we can consider \(H_0: \beta_1 = \beta_2 = \beta_3 = 0\), and test it at significance \(\alpha = 0.05\) The following R code creates the crude model and reports summary and ANOVA data:
lm.bwcrude <- lm(bwt ~ SMK1 + SMK2 + SMK3, data = CHDS)
summary(lm.bwcrude)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3, data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4328 -0.7328 -0.0328 0.7214 3.6672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.73281 0.05442 142.091 < 2e-16 ***
## SMK1 -0.35223 0.11797 -2.986 0.00293 **
## SMK2 -0.76008 0.14163 -5.367 1.10e-07 ***
## SMK3 -0.46665 0.10790 -4.325 1.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.062 on 676 degrees of freedom
## Multiple R-squared: 0.0585, Adjusted R-squared: 0.05432
## F-statistic: 14 on 3 and 676 DF, p-value: 7.294e-09
anova(lm.bwcrude)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.2415 1.9864 0.1592
## SMK2 1 24.04 24.0434 21.3074 4.682e-06 ***
## SMK3 1 21.11 21.1076 18.7056 1.755e-05 ***
## Residuals 676 762.80 1.1284
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Although \(\beta_1\) does not meet our pre-specified criteria of \(\alpha = 0.10\), we will include it in the analysis, as it is a predictor of interest in this study. We can see that all other reported p-values are less than the pre-specified sensitivity, and thus a crude association can be established.
To create the full association model we will include all independent variables in the dataset except for maternal weight, as it is already a linear component of BMI, and will create colinearity without adding significantly to the model. Scientific evidence suggest including these variables may improve association The following R code creates the full model and reports summary and ANOVA data:
lm.bwfull <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age, data = CHDS)
anova(lm.bwfull)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5258 0.112465
## SMK2 1 24.04 24.043 27.0938 2.578e-07 ***
## SMK3 1 21.11 21.108 23.7855 1.346e-06 ***
## gestwks 1 132.20 132.200 148.9717 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5602 0.006128 **
## mheight 1 27.30 27.303 30.7672 4.186e-08 ***
## age 1 0.25 0.249 0.2810 0.596210
## Residuals 672 596.34 0.887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.bwfull)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age, data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2737 -0.6128 0.0214 0.5605 3.2898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.658217 1.256767 -6.094 1.86e-09 ***
## SMK1 -0.176868 0.106465 -1.661 0.09712 .
## SMK2 -0.548731 0.126858 -4.326 1.75e-05 ***
## SMK3 -0.405474 0.096109 -4.219 2.79e-05 ***
## gestwks 0.231123 0.019405 11.910 < 2e-16 ***
## BMI 0.045115 0.014110 3.197 0.00145 **
## mheight 0.081685 0.014698 5.558 3.95e-08 ***
## age -0.003536 0.006670 -0.530 0.59621
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.942 on 672 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.2563
## F-statistic: 34.43 on 7 and 672 DF, p-value: < 2.2e-16
We can see that including our four predictors of interest (gestational age, maternal age, maternal height, and maternal BMI) has changed our model parameters, \(\beta_0\) - \(\beta_3\) significantly. Maternal height is not statistically significant, however we will include it as a political variable in the association model.
Let:
\[ \begin{split} X_4 &= \; \mbox{Gestational Age}\\ X_5 &= \; \mbox{Maternal BMI}\\ X_6 &= \; \mbox{Maternal Height}\\ X_7 &= \; \mbox{Maternal Age}\\ \\ \end{split} \]
The full model can be expressed as:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \epsilon\]
For this step we wish to investigate if \(X_i\) (\(i = 4,...,7\)) are an effect-modifiers of the association between smoking and birth weight. We can construct an association model with an interaction term for each of our dummy variables:
For \(i = 4,...,7\) and \(j = 1...3\):
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \beta_{i,j} X_i X_j + \epsilon\]
We wish to test the null hypothesis \(H_0: \beta_{i,j} = 0\) for \(i = 4,...,7\) and \(j = 1...3\). We can construct three linear models to test this hypothesis:
For this step we wish to investigate if gestational age, \(X_4\), is an effect-modifier which changes the association between smoking and birth weight. We will test three models for \(X_1\), \(X_2\), & \(X_3\):
lm.X1gestwks <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + gestwks*as.factor(SMK1), data = CHDS)
summary(lm.X1gestwks)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + gestwks * as.factor(SMK1), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2681 -0.6209 0.0288 0.5673 3.2962
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.494725 1.315598 -5.697 1.83e-08 ***
## SMK1 -1.074384 2.124563 -0.506 0.61324
## SMK2 -0.551152 0.127065 -4.338 1.66e-05 ***
## SMK3 -0.406626 0.096207 -4.227 2.70e-05 ***
## gestwks 0.227582 0.021145 10.763 < 2e-16 ***
## BMI 0.044571 0.014177 3.144 0.00174 **
## mheight 0.081470 0.014716 5.536 4.44e-08 ***
## age -0.003396 0.006683 -0.508 0.61154
## as.factor(SMK1)1 NA NA NA NA
## gestwks:as.factor(SMK1)1 0.022649 0.053547 0.423 0.67245
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9426 on 671 degrees of freedom
## Multiple R-squared: 0.2641, Adjusted R-squared: 0.2554
## F-statistic: 30.11 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X1gestwks)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5228 0.112685
## SMK2 1 24.04 24.043 27.0607 2.622e-07 ***
## SMK3 1 21.11 21.108 23.7564 1.366e-06 ***
## gestwks 1 132.20 132.200 148.7897 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5509 0.006159 **
## mheight 1 27.30 27.303 30.7296 4.266e-08 ***
## age 1 0.25 0.249 0.2807 0.596434
## gestwks:as.factor(SMK1) 1 0.16 0.159 0.1789 0.672446
## Residuals 671 596.18 0.888
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 0.1789 & 0.672446 respectively; thus we can conclude that gestational age is not an effect-measure modifier of light smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{4,1} = 0\).
lm.X2gestwks <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + gestwks*as.factor(SMK2), data = CHDS)
summary(lm.X2gestwks)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + gestwks * as.factor(SMK2), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2735 -0.6128 0.0213 0.5604 3.2894
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.664274 1.269786 -6.036 2.62e-09 ***
## SMK1 -0.176832 0.106549 -1.660 0.0975 .
## SMK2 -0.444201 3.018649 -0.147 0.8831
## SMK3 -0.405435 0.096187 -4.215 2.84e-05 ***
## gestwks 0.231308 0.020141 11.484 < 2e-16 ***
## BMI 0.045087 0.014144 3.188 0.0015 **
## mheight 0.081673 0.014713 5.551 4.09e-08 ***
## age -0.003534 0.006676 -0.529 0.5967
## as.factor(SMK2)1 NA NA NA NA
## gestwks:as.factor(SMK2)1 -0.002652 0.076507 -0.035 0.9724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.2552
## F-statistic: 30.08 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X2gestwks)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5221 0.112733
## SMK2 1 24.04 24.043 27.0536 2.632e-07 ***
## SMK3 1 21.11 21.108 23.7502 1.370e-06 ***
## gestwks 1 132.20 132.200 148.7503 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5489 0.006166 **
## mheight 1 27.30 27.303 30.7215 4.284e-08 ***
## age 1 0.25 0.249 0.2806 0.596483
## gestwks:as.factor(SMK2) 1 0.00 0.001 0.0012 0.972362
## Residuals 671 596.34 0.889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 0.0012 & 0.972362 respectively; thus we can conclude that gestational age is not an effect-measure modifier of moderate smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{4,2} = 0\).
lm.X3gestwks <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + gestwks*as.factor(SMK3), data = CHDS)
summary(lm.X3gestwks)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + gestwks * as.factor(SMK3), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2848 -0.6127 0.0316 0.5625 3.3046
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.402909 1.317431 -5.619 2.81e-08 ***
## SMK1 -0.179144 0.106569 -1.681 0.09322 .
## SMK2 -0.552041 0.127015 -4.346 1.60e-05 ***
## SMK3 -1.628079 1.886550 -0.863 0.38845
## gestwks 0.224559 0.021891 10.258 < 2e-16 ***
## BMI 0.045448 0.014125 3.218 0.00136 **
## mheight 0.081745 0.014705 5.559 3.92e-08 ***
## age -0.003701 0.006678 -0.554 0.57961
## as.factor(SMK3)1 NA NA NA NA
## gestwks:as.factor(SMK3)1 0.030780 0.047434 0.649 0.51662
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9424 on 671 degrees of freedom
## Multiple R-squared: 0.2644, Adjusted R-squared: 0.2556
## F-statistic: 30.15 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X3gestwks)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5237 0.11262
## SMK2 1 24.04 24.043 27.0705 2.610e-07 ***
## SMK3 1 21.11 21.108 23.7650 1.360e-06 ***
## gestwks 1 132.20 132.200 148.8434 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5537 0.00615 **
## mheight 1 27.30 27.303 30.7407 4.243e-08 ***
## age 1 0.25 0.249 0.2808 0.59637
## gestwks:as.factor(SMK3) 1 0.37 0.374 0.4211 0.51662
## Residuals 671 595.97 0.888
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 0.4211 & 0.51662 respectively; thus we can conclude that gestational age is not an effect-measure modifier of heavy smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{4,3} = 0\).
For this step we wish to investigate if BMI, \(X_5\), is an effect-modifier which changes the association between smoking and birth weight. We will test three models for \(X_1\), \(X_2\), & \(X_3\):
lm.X1BMI <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + BMI*as.factor(SMK1), data = CHDS)
summary(lm.X1BMI)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + BMI * as.factor(SMK1), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2819 -0.6153 0.0209 0.5574 3.2872
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.689798 1.263612 -6.086 1.95e-09 ***
## SMK1 0.066478 0.951413 0.070 0.94432
## SMK2 -0.547677 0.127012 -4.312 1.86e-05 ***
## SMK3 -0.404705 0.096223 -4.206 2.95e-05 ***
## gestwks 0.231643 0.019524 11.865 < 2e-16 ***
## BMI 0.046351 0.014913 3.108 0.00196 **
## mheight 0.081446 0.014738 5.526 4.68e-08 ***
## age -0.003564 0.006676 -0.534 0.59366
## as.factor(SMK1)1 NA NA NA NA
## BMI:as.factor(SMK1)1 -0.011742 0.045619 -0.257 0.79696
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.2553
## F-statistic: 30.09 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X1BMI)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5223 0.112715
## SMK2 1 24.04 24.043 27.0562 2.628e-07 ***
## SMK3 1 21.11 21.108 23.7525 1.369e-06 ***
## gestwks 1 132.20 132.200 148.7647 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5497 0.006163 **
## mheight 1 27.30 27.303 30.7245 4.277e-08 ***
## age 1 0.25 0.249 0.2806 0.596465
## BMI:as.factor(SMK1) 1 0.06 0.059 0.0663 0.796955
## Residuals 671 596.28 0.889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 0.0663 & 0.796955 respectively; thus we can conclude that BMI is not an effect-measure modifier of light smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{5,1} = 0\).
lm.X2BMI <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + BMI*as.factor(SMK2), data = CHDS)
summary(lm.X2BMI)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + BMI * as.factor(SMK2), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2625 -0.6188 0.0212 0.5606 3.2894
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.654724 1.257467 -6.087 1.93e-09 ***
## SMK1 -0.178553 0.106572 -1.675 0.09432 .
## SMK2 -1.279164 1.408476 -0.908 0.36410
## SMK3 -0.406082 0.096169 -4.223 2.75e-05 ***
## gestwks 0.231785 0.019457 11.912 < 2e-16 ***
## BMI 0.043468 0.014468 3.005 0.00276 **
## mheight 0.081728 0.014706 5.557 3.95e-08 ***
## age -0.003412 0.006678 -0.511 0.60961
## as.factor(SMK2)1 NA NA NA NA
## BMI:as.factor(SMK2)1 0.034652 0.066547 0.521 0.60274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9425 on 671 degrees of freedom
## Multiple R-squared: 0.2643, Adjusted R-squared: 0.2555
## F-statistic: 30.12 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X2BMI)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5231 0.112660
## SMK2 1 24.04 24.043 27.0644 2.617e-07 ***
## SMK3 1 21.11 21.108 23.7597 1.364e-06 ***
## gestwks 1 132.20 132.200 148.8101 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5520 0.006155 **
## mheight 1 27.30 27.303 30.7338 4.258e-08 ***
## age 1 0.25 0.249 0.2807 0.596409
## BMI:as.factor(SMK2) 1 0.24 0.241 0.2711 0.602736
## Residuals 671 596.10 0.888
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 0.2711 & 0.602736 respectively; thus we can conclude that BMI is not an effect-measure modifier of moderate smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{5,2} = 0\).
lm.X3BMI <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + BMI*as.factor(SMK3), data = CHDS)
summary(lm.X3BMI)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + BMI * as.factor(SMK3), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2447 -0.6129 0.0231 0.5623 3.2885
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.695222 1.273525 -6.042 2.52e-09 ***
## SMK1 -0.175029 0.107006 -1.636 0.10237
## SMK2 -0.547567 0.127105 -4.308 1.89e-05 ***
## SMK3 -0.276780 0.703264 -0.394 0.69403
## gestwks 0.231002 0.019430 11.889 < 2e-16 ***
## BMI 0.046611 0.016276 2.864 0.00432 **
## mheight 0.081830 0.014730 5.556 3.99e-08 ***
## age -0.003542 0.006675 -0.531 0.59588
## as.factor(SMK3)1 NA NA NA NA
## BMI:as.factor(SMK3)1 -0.005997 0.032463 -0.185 0.85350
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.2552
## F-statistic: 30.08 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X3BMI)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5222 0.112724
## SMK2 1 24.04 24.043 27.0549 2.630e-07 ***
## SMK3 1 21.11 21.108 23.7513 1.369e-06 ***
## gestwks 1 132.20 132.200 148.7576 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5493 0.006165 **
## mheight 1 27.30 27.303 30.7230 4.280e-08 ***
## age 1 0.25 0.249 0.2806 0.596474
## BMI:as.factor(SMK3) 1 0.03 0.030 0.0341 0.853495
## Residuals 671 596.31 0.889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 0.0341 & 0.853495 respectively; thus we can conclude that BMI is not an effect-measure modifier of heavy smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{5,3} = 0\).
For this step we wish to investigate if maternal height, \(X_6\), is an effect-modifier which changes the association between smoking and birth weight. We will test three models for \(X_1\), \(X_2\), & \(X_3\):
lm.X1mheight <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + mheight*as.factor(SMK1), data = CHDS)
summary(lm.X1mheight)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + mheight * as.factor(SMK1), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2737 -0.6137 0.0275 0.5575 3.2927
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.750662 1.313783 -5.899 5.78e-09 ***
## SMK1 0.486871 2.729934 0.178 0.85851
## SMK2 -0.547854 0.126998 -4.314 1.85e-05 ***
## SMK3 -0.405854 0.096189 -4.219 2.79e-05 ***
## gestwks 0.231271 0.019429 11.904 < 2e-16 ***
## BMI 0.044981 0.014130 3.183 0.00152 **
## mheight 0.083109 0.015829 5.250 2.04e-07 ***
## age -0.003628 0.006686 -0.543 0.58756
## as.factor(SMK1)1 NA NA NA NA
## mheight:as.factor(SMK1)1 -0.010342 0.042503 -0.243 0.80783
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.2552
## F-statistic: 30.09 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X1mheight)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5223 0.112717
## SMK2 1 24.04 24.043 27.0559 2.629e-07 ***
## SMK3 1 21.11 21.108 23.7522 1.369e-06 ***
## gestwks 1 132.20 132.200 148.7631 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5496 0.006164 **
## mheight 1 27.30 27.303 30.7241 4.278e-08 ***
## age 1 0.25 0.249 0.2806 0.596467
## mheight:as.factor(SMK1) 1 0.05 0.053 0.0592 0.807833
## Residuals 671 596.29 0.889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 0.0592 & 0.807833 respectively; thus we can conclude that maternal height is not an effect-measure modifier of light smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{6,1} = 0\).
lm.X2mheight <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + mheight*as.factor(SMK2), data = CHDS)
summary(lm.X2mheight)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + mheight * as.factor(SMK2), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2745 -0.6077 0.0200 0.5708 3.2840
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.315029 1.294538 -5.651 2.36e-08 ***
## SMK1 -0.178830 0.106463 -1.680 0.09347 .
## SMK2 -3.968583 3.104099 -1.278 0.20152
## SMK3 -0.403904 0.096104 -4.203 2.99e-05 ***
## gestwks 0.231559 0.019406 11.932 < 2e-16 ***
## BMI 0.045089 0.014107 3.196 0.00146 **
## mheight 0.076246 0.015501 4.919 1.10e-06 ***
## age -0.003892 0.006677 -0.583 0.56019
## as.factor(SMK2)1 NA NA NA NA
## mheight:as.factor(SMK2)1 0.053508 0.048527 1.103 0.27058
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9419 on 671 degrees of freedom
## Multiple R-squared: 0.2653, Adjusted R-squared: 0.2565
## F-statistic: 30.28 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X2mheight)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5267 0.11241
## SMK2 1 24.04 24.043 27.1025 2.568e-07 ***
## SMK3 1 21.11 21.108 23.7931 1.341e-06 ***
## gestwks 1 132.20 132.200 149.0195 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5626 0.00612 **
## mheight 1 27.30 27.303 30.7771 4.168e-08 ***
## age 1 0.25 0.249 0.2811 0.59615
## mheight:as.factor(SMK2) 1 1.08 1.079 1.2158 0.27058
## Residuals 671 595.26 0.887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 1.2158 & 0.27058 respectively; thus we can conclude that maternal height is not an effect-measure modifier of moderate smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{6,2} = 0\).
lm.X3mheight <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + mheight*as.factor(SMK3), data = CHDS)
summary(lm.X3mheight)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + mheight * as.factor(SMK3), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2732 -0.6220 0.0071 0.5667 3.3124
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.418121 1.331476 -6.322 4.70e-10 ***
## SMK1 -0.171530 0.106359 -1.613 0.10727
## SMK2 -0.540412 0.126771 -4.263 2.31e-05 ***
## SMK3 3.863973 2.500996 1.545 0.12282
## gestwks 0.231122 0.019378 11.927 < 2e-16 ***
## BMI 0.046369 0.014109 3.287 0.00107 **
## mheight 0.093334 0.016184 5.767 1.23e-08 ***
## age -0.004259 0.006674 -0.638 0.52361
## as.factor(SMK3)1 NA NA NA NA
## mheight:as.factor(SMK3)1 -0.065962 0.038611 -1.708 0.08803 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9407 on 671 degrees of freedom
## Multiple R-squared: 0.2671, Adjusted R-squared: 0.2584
## F-statistic: 30.57 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X3mheight)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5331 0.111955
## SMK2 1 24.04 24.043 27.1712 2.482e-07 ***
## SMK3 1 21.11 21.108 23.8534 1.301e-06 ***
## gestwks 1 132.20 132.200 149.3970 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5818 0.006056 **
## mheight 1 27.30 27.303 30.8550 4.011e-08 ***
## age 1 0.25 0.249 0.2818 0.595686
## mheight:as.factor(SMK3) 1 2.58 2.583 2.9185 0.088033 .
## Residuals 671 593.76 0.885
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 2.9185 & 0.088033 respectively; thus we can conclude that maternal height is not an effect-measure modifier of heavy smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{6,3} = 0\).
For this step we wish to investigate if maternal age, \(X_7\), is an effect-modifier which changes the association between smoking and birth weight. We will test three models for \(X_1\), \(X_2\), & \(X_3\):
lm.X1age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + age*as.factor(SMK1), data = CHDS)
summary(lm.X1age)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + age * as.factor(SMK1), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2957 -0.6116 0.0401 0.5682 3.3162
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.670084 1.256919 -6.102 1.77e-09 ***
## SMK1 -0.596071 0.453477 -1.314 0.18915
## SMK2 -0.548430 0.126867 -4.323 1.77e-05 ***
## SMK3 -0.404847 0.096118 -4.212 2.88e-05 ***
## gestwks 0.231743 0.019418 11.935 < 2e-16 ***
## BMI 0.045660 0.014122 3.233 0.00128 **
## mheight 0.082504 0.014724 5.603 3.07e-08 ***
## age -0.006527 0.007375 -0.885 0.37647
## as.factor(SMK1)1 NA NA NA NA
## age:as.factor(SMK1)1 0.016459 0.017307 0.951 0.34194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9421 on 671 degrees of freedom
## Multiple R-squared: 0.2649, Adjusted R-squared: 0.2562
## F-statistic: 30.23 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X1age)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5255 0.112491
## SMK2 1 24.04 24.043 27.0900 2.584e-07 ***
## SMK3 1 21.11 21.108 23.7821 1.348e-06 ***
## gestwks 1 132.20 132.200 148.9505 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5591 0.006131 **
## mheight 1 27.30 27.303 30.7628 4.197e-08 ***
## age 1 0.25 0.249 0.2810 0.596236
## age:as.factor(SMK1) 1 0.80 0.803 0.9044 0.341945
## Residuals 671 595.54 0.888
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 0.9044 & 0.341945 respectively; thus we can conclude that maternal age is not an effect-measure modifier of light smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{7,1} = 0\).
lm.X2age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + age*as.factor(SMK2), data = CHDS)
summary(lm.X2age)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + age * as.factor(SMK2), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2731 -0.6127 0.0212 0.5605 3.2893
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.6610091 1.2608296 -6.076 2.06e-09 ***
## SMK1 -0.1768501 0.1065457 -1.660 0.09741 .
## SMK2 -0.5301982 0.6026412 -0.880 0.37929
## SMK3 -0.4054981 0.0961838 -4.216 2.83e-05 ***
## gestwks 0.2311227 0.0194197 11.901 < 2e-16 ***
## BMI 0.0450971 0.0141321 3.191 0.00148 **
## mheight 0.0817076 0.0147261 5.548 4.15e-08 ***
## age -0.0034687 0.0070107 -0.495 0.62092
## as.factor(SMK2)1 NA NA NA NA
## age:as.factor(SMK2)1 -0.0007229 0.0229786 -0.031 0.97491
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.2552
## F-statistic: 30.08 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X2age)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5221 0.112733
## SMK2 1 24.04 24.043 27.0536 2.632e-07 ***
## SMK3 1 21.11 21.108 23.7501 1.370e-06 ***
## gestwks 1 132.20 132.200 148.7502 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5489 0.006166 **
## mheight 1 27.30 27.303 30.7215 4.284e-08 ***
## age 1 0.25 0.249 0.2806 0.596483
## age:as.factor(SMK2) 1 0.00 0.001 0.0010 0.974913
## Residuals 671 596.34 0.889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 0.0010 & 0.974913 respectively; thus we can conclude that maternal age is not an effect-measure modifier of moderate smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{7,2} = 0\).
lm.X3age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + age*as.factor(SMK3), data = CHDS)
summary(lm.X3age)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age + age * as.factor(SMK3), data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1009 -0.6190 0.0126 0.5565 3.3463
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.550832 1.255677 -6.013 2.99e-09 ***
## SMK1 -0.180078 0.106277 -1.694 0.09065 .
## SMK2 -0.550620 0.126622 -4.349 1.58e-05 ***
## SMK3 -1.279720 0.473495 -2.703 0.00705 **
## gestwks 0.229677 0.019384 11.849 < 2e-16 ***
## BMI 0.044674 0.014085 3.172 0.00158 **
## mheight 0.083307 0.014695 5.669 2.14e-08 ***
## age -0.009111 0.007285 -1.251 0.21148
## as.factor(SMK3)1 NA NA NA NA
## age:as.factor(SMK3)1 0.033517 0.017777 1.885 0.05980 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9402 on 671 degrees of freedom
## Multiple R-squared: 0.2678, Adjusted R-squared: 0.2591
## F-statistic: 30.68 on 8 and 671 DF, p-value: < 2.2e-16
anova(lm.X3age)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.24 2.241 2.5355 0.111786
## SMK2 1 24.04 24.043 27.1968 2.451e-07 ***
## SMK3 1 21.11 21.108 23.8759 1.286e-06 ***
## gestwks 1 132.20 132.200 149.5381 < 2.2e-16 ***
## BMI 1 6.71 6.709 7.5889 0.006032 **
## mheight 1 27.30 27.303 30.8842 3.954e-08 ***
## age 1 0.25 0.249 0.2821 0.595513
## age:as.factor(SMK3) 1 3.14 3.143 3.5550 0.059799 .
## Residuals 671 593.20 0.884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value and associated p-value are 3.5550 & 0.059799 respectively; thus we can conclude that maternal age is not an effect-measure modifier of heavy smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{7,3} = 0\).
Finding no effect-measure modification, we can move on to analysis of confounding:
For each possible confounding variable (gestational age, \(X_4\); BMI, \(X_5\); maternal height, \(X_6\); and maternal age, \(X_7\)) we can create an adjusted model (i.e. the full model above), and a crude model with the potential confounder in question removed. We can then consider change in estimated coefficients for each smoking category and test if the change is greater than 10%:
\[\Delta \hat{\beta_i} = \frac{\hat{\beta_i}_{\; adjusted} - \hat{\beta_i}_{\; crude}}{\hat{\beta_i}_{\; crude}}\]
The following code creates a ‘crude’ model to compare to by removing gestational age from the model:
lm.full_gestwks <- lm(bwt ~ SMK1 + SMK2 + SMK3 + BMI + mheight + age, data = CHDS)
To compute \(\Delta \hat{\beta_1}\):
100*(summary(lm.bwfull)$coefficients[2,1] - summary(lm.full_gestwks)$coefficients[2,1]) / ( summary(lm.full_gestwks)$coefficients[2,1])
## [1] -32.69897
To compute \(\Delta \hat{\beta_2}\):
100*(summary(lm.bwfull)$coefficients[3,1] - summary(lm.full_gestwks)$coefficients[3,1]) / ( summary(lm.full_gestwks)$coefficients[3,1])
## [1] -17.64811
To compute \(\Delta \hat{\beta_3}\):
100*(summary(lm.bwfull)$coefficients[4,1] - summary(lm.full_gestwks)$coefficients[4,1]) / ( summary(lm.full_gestwks)$coefficients[4,1])
## [1] -13.20764
The above analysis shows that removal of gestational age from the model changes \(\hat{\beta_i}\) (for i = 1, 2, 3) by greater than \(10\%\) in each case, and thus gestational age can be considered a confounder of the smoking-birthweight relation.
The following code creates a ‘crude’ model to compare to by removing BMI from the model:
lm.full_BMI <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + mheight + age, data = CHDS)
To compute \(\Delta \hat{\beta_1}\):
100*(summary(lm.bwfull)$coefficients[2,1] - summary(lm.full_BMI)$coefficients[2,1]) / ( summary(lm.full_BMI)$coefficients[2,1])
## [1] -23.73167
To compute \(\Delta \hat{\beta_2}\):
100*(summary(lm.bwfull)$coefficients[3,1] - summary(lm.full_BMI)$coefficients[3,1]) / ( summary(lm.full_BMI)$coefficients[3,1])
## [1] -6.019314
To compute \(\Delta \hat{\beta_3}\):
100*(summary(lm.bwfull)$coefficients[4,1] - summary(lm.full_BMI)$coefficients[4,1]) / ( summary(lm.full_BMI)$coefficients[4,1])
## [1] -4.644929
The above analysis shows that removal of BMI from the model changes \(\hat{\beta_i}\) (for i = 3 only) by greater than \(10\%\), and thus maternal BMI can be considered a confounder of the heavy-smoking-birthweight relation.
The following code creates a ‘crude’ model to compare to by removing maternal height from the model:
lm.full_mheight <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + age, data = CHDS)
To compute \(\Delta \hat{\beta_1}\):
100*(summary(lm.bwfull)$coefficients[2,1] - summary(lm.full_mheight)$coefficients[2,1]) / ( summary(lm.full_mheight)$coefficients[2,1])
## [1] -16.21234
To compute \(\Delta \hat{\beta_2}\):
100*(summary(lm.bwfull)$coefficients[3,1] - summary(lm.full_mheight)$coefficients[3,1]) / ( summary(lm.full_mheight)$coefficients[3,1])
## [1] -9.131777
To compute \(\Delta \hat{\beta_3}\):
100*(summary(lm.bwfull)$coefficients[4,1] - summary(lm.full_mheight)$coefficients[4,1]) / ( summary(lm.full_mheight)$coefficients[4,1])
## [1] 5.077227
The above analysis shows that removal of maternal height from the model changes \(\hat{\beta_i}\) (for i = 1 only) by greater than \(10\%\), and thus maternal height can be considered a confounder of the light-smoking-birthweight relation.
The following code creates a ‘crude’ model to compare to by removing maternal age from the model:
lm.full_age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight, data = CHDS)
To compute \(\Delta \hat{\beta_1}\):
100*(summary(lm.bwfull)$coefficients[2,1] - summary(lm.full_age)$coefficients[2,1]) / ( summary(lm.full_age)$coefficients[2,1])
## [1] 0.3031967
To compute \(\Delta \hat{\beta_2}\):
100*(summary(lm.bwfull)$coefficients[3,1] - summary(lm.full_age)$coefficients[3,1]) / ( summary(lm.full_age)$coefficients[3,1])
## [1] 0.06838896
To compute \(\Delta \hat{\beta_3}\):
100*(summary(lm.bwfull)$coefficients[4,1] - summary(lm.full_age)$coefficients[4,1]) / ( summary(lm.full_age)$coefficients[4,1])
## [1] -0.2120098
The above analysis shows that removal of maternal height from the model does not change \(\hat{\beta_i}\) for any i by greater than \(10\%\), and thus maternal height cannot be considered a confounder of the smoking-birthweight relation. It can be included in the full model as a political variable.
The following R code extracts the fitted values, residuals, standardized residuals, and studentized residuals; and plots them so that our regression assumptions can be qualitatively assessed:
fitted <- lm.bwfull$fitted.values # extract fitted values
resm <- resid(lm.bwfull) # extract model residuals
standm <- rstandard(lm.bwfull) # extract standardized residuals
studm <- rstudent(lm.bwfull) # extract studentized residuals
#par(mfrow = c(3,1)) # arranges plots
# residual plot:
#plot(fitted, resm, main = "Plot of Model Residuals as a Function of Fitted Values for Birth Weight", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Residual (lbs)")
#abline(0,0, col = "red")
# standardized residual plot:
plot(fitted, standm, main = "Plot of Standardized Model Residuals as a Function of Fitted Values for Birth Weight", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Standardized Residual")
abline(0,0, col = "red")
# studentized residual plot:
plot(fitted, studm, main = "Plot of Studentized Model Residuals as a Function of Fitted Values for Birth Weight", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Studentized Residual")
abline(0,0, col = "red")
The above plots do not show any overt change in variance with increasing \(\hat{Y}\); suggesting that the homoscedasticity assumption is met. Furthermore the residuals do not seem to be asymmetrically distributed, skewed, or multi-modal in any way; suggesting that the normality assumption is likely met. The reference cell coding approach for creating dummy variables for smoking category ensures that the linearity assumption is met.
Due to the large sample size the Shapiro-Wilk test will not provide helpful insights and thus will not be performed. The following R code creates a QQ Plot:
qqnorm(resm, ylab = "Residuals", main = "Q-Q Plot of Residuals")
qqline(resm)
qqnorm(standm, ylab = "Standardized Residuals", main = "Q-Q Plot of Standardized Residuals")
qqline(standm)
qqnorm(studm, ylab = "Studentized Residuals", main = "Q-Q Plot of Studentized Residuals")
qqline(studm)
Q-Q plots suggest that the normality assumption is met. The handfull of points at the very edges of the plot may or may not be outliers, which brings us to our next section:
The following code adds leverages & Cook’s distance to the dataset:
k <- 8 # number of predictors in association model
h_limit <- 2*(1+k)/nrow(CHDS) # recommended leverage threshold
CHDS$h <- hatvalues(lm.bwfull) # Add leverages to the dataset
CHDS$c <- cooks.distance(lm.bwfull) # add Cook's distance to the dataset
The following code checks if any data points have Cook’s distance, \(c\), greater than 1 or if the leverage, \(h\), is greater than the recommended threshold of \(2(k+1)/n\).
CHDS[CHDS$c>1,]
## [1] X bwt gestwks age mnocig mheight mppwt BMI BMI_cat
## [10] SMK_cat SMK1 SMK2 SMK3 h c
## <0 rows> (or 0-length row.names)
CHDS[CHDS$h>h_limit,]
## X bwt gestwks age mnocig mheight mppwt BMI BMI_cat SMK_cat SMK1
## 9 9 3.3 29 32 0 64 142 24.37158 1 0 0
## 60 60 6.8 48 25 0 66 134 21.62580 1 0 0
## 84 84 6.5 34 38 35 61 108 20.40419 1 3 0
## 111 111 5.3 35 41 7 65 125 20.79882 1 1 1
## 166 166 5.3 33 20 7 63 109 19.30637 1 1 1
## 167 167 8.6 37 25 35 66 246 39.70110 4 3 0
## 226 226 8.8 41 36 12 71 173 24.12597 1 2 0
## 273 273 6.1 44 34 12 59 100 20.19535 1 2 0
## 319 319 7.3 37 40 7 68 134 20.37240 1 1 1
## 356 356 8.1 44 30 2 58 150 31.34661 3 1 1
## 400 400 10.0 38 32 0 67 215 33.67008 3 0 0
## 404 404 7.6 39 26 0 66 210 33.89118 3 0 0
## 445 445 7.3 46 22 2 62 164 29.99272 2 1 1
## 505 505 3.9 33 24 0 58 99 20.68876 1 0 0
## 506 506 4.5 38 25 25 66 200 32.27732 3 3 0
## 610 610 9.1 40 20 0 65 210 34.94201 3 0 0
## 655 655 5.4 38 38 17 63 144 25.50567 2 2 0
## SMK2 SMK3 h c
## 9 0 0 0.05645318 0.0342629837
## 60 0 0 0.03050297 0.0388201182
## 84 0 1 0.03206170 0.0042280228
## 111 0 0 0.03066412 0.0051370087
## 166 0 0 0.02956373 0.0008499634
## 167 0 1 0.08864574 0.0155859567
## 226 1 0 0.03546579 0.0031847055
## 273 1 0 0.03480005 0.0114129996
## 319 0 0 0.02666877 0.0002503806
## 356 0 0 0.05108750 0.0006273953
## 400 0 0 0.03867176 0.0235010285
## 404 0 0 0.03706427 0.0019194355
## 445 0 0 0.04871843 0.0255731624
## 505 0 0 0.03261216 0.0134463740
## 506 0 1 0.03712815 0.0500263277
## 610 0 0 0.04532885 0.0034138906
## 655 1 0 0.02684575 0.0071506225
Upon examination of the data reported in the second table, none of the data is implausible, with the exception of a 48 week gestational age. Of the women in a study of ultrasound-based determination of gestational age (n=1867), none had gestation age greater than 46 weeks. The study further suggests that a 48-week gestational age is well above the 0.1th percentile of gestational age estimates based on date of last period.1
We can create a new dataset with this observation (observation 60) removed:
CHDS_60 <- CHDS[-c(60),]
And a new model can be created:
lm.bwfull_60 <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age, data = CHDS_60)
summary(lm.bwfull_60)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age, data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2771 -0.6106 0.0142 0.5561 3.2634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.100887 1.256253 -6.448 2.16e-10 ***
## SMK1 -0.180649 0.105766 -1.708 0.0881 .
## SMK2 -0.550760 0.126018 -4.370 1.44e-05 ***
## SMK3 -0.410902 0.095487 -4.303 1.93e-05 ***
## gestwks 0.241351 0.019546 12.348 < 2e-16 ***
## BMI 0.044959 0.014016 3.208 0.0014 **
## mheight 0.082446 0.014603 5.646 2.43e-08 ***
## age -0.003674 0.006626 -0.554 0.5795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9358 on 671 degrees of freedom
## Multiple R-squared: 0.2743, Adjusted R-squared: 0.2667
## F-statistic: 36.23 on 7 and 671 DF, p-value: < 2.2e-16
Upon comparison to the previous full model we can see that most \(\hat{\beta}_i\) have very similar values, with very slight changes in \(\hat{\beta}_0\) & \(\hat{\beta}_4\).
The variance inflation factor approach will be used to assess collinearity. The following R code calculates the factor for each \(\hat{\beta}_i\):
vif(lm.bwfull_60)
## SMK1 SMK2 SMK3 gestwks BMI mheight age
## 1.116177 1.080577 1.094439 1.012350 1.051675 1.018968 1.016212
mean(vif(lm.bwfull_60))
## [1] 1.055771
All factors are well below 10; indicating that collinearity is likely minimal.
If we retain the definition of the \(X_i\) as above, we can write the regression line as:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \epsilon\]
For the crude model:
confint(lm.bwcrude, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 7.6259530 7.8396639
## SMK1 -0.5838591 -0.1205927
## SMK2 -1.0381668 -0.4819955
## SMK3 -0.6785080 -0.2548011
For the adjusted model:
confint(lm.bwfull_60, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -10.56754782 -5.634226121
## SMK1 -0.38832099 0.027022082
## SMK2 -0.79819783 -0.303322438
## SMK3 -0.59839171 -0.223411760
## gestwks 0.20297239 0.279729815
## BMI 0.01743780 0.072479511
## mheight 0.05377405 0.111118368
## age -0.01668479 0.009337101
The confidence intervals are slightly tighter in the crude model.
For the crude model:
summary(lm.bwcrude)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3, data = CHDS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4328 -0.7328 -0.0328 0.7214 3.6672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.73281 0.05442 142.091 < 2e-16 ***
## SMK1 -0.35223 0.11797 -2.986 0.00293 **
## SMK2 -0.76008 0.14163 -5.367 1.10e-07 ***
## SMK3 -0.46665 0.10790 -4.325 1.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.062 on 676 degrees of freedom
## Multiple R-squared: 0.0585, Adjusted R-squared: 0.05432
## F-statistic: 14 on 3 and 676 DF, p-value: 7.294e-09
For the adjusted model:
summary(lm.bwfull_60)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age, data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2771 -0.6106 0.0142 0.5561 3.2634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.100887 1.256253 -6.448 2.16e-10 ***
## SMK1 -0.180649 0.105766 -1.708 0.0881 .
## SMK2 -0.550760 0.126018 -4.370 1.44e-05 ***
## SMK3 -0.410902 0.095487 -4.303 1.93e-05 ***
## gestwks 0.241351 0.019546 12.348 < 2e-16 ***
## BMI 0.044959 0.014016 3.208 0.0014 **
## mheight 0.082446 0.014603 5.646 2.43e-08 ***
## age -0.003674 0.006626 -0.554 0.5795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9358 on 671 degrees of freedom
## Multiple R-squared: 0.2743, Adjusted R-squared: 0.2667
## F-statistic: 36.23 on 7 and 671 DF, p-value: < 2.2e-16
The p-values are generally better in the crude model.
To create a parsimonious model we will utilize the backward-elimination procedure starting from the full model above.
The summary function reports the coefficients and associated p-values (right column of table) from the full model:
summary(lm.bwfull_60)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age, data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2771 -0.6106 0.0142 0.5561 3.2634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.100887 1.256253 -6.448 2.16e-10 ***
## SMK1 -0.180649 0.105766 -1.708 0.0881 .
## SMK2 -0.550760 0.126018 -4.370 1.44e-05 ***
## SMK3 -0.410902 0.095487 -4.303 1.93e-05 ***
## gestwks 0.241351 0.019546 12.348 < 2e-16 ***
## BMI 0.044959 0.014016 3.208 0.0014 **
## mheight 0.082446 0.014603 5.646 2.43e-08 ***
## age -0.003674 0.006626 -0.554 0.5795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9358 on 671 degrees of freedom
## Multiple R-squared: 0.2743, Adjusted R-squared: 0.2667
## F-statistic: 36.23 on 7 and 671 DF, p-value: < 2.2e-16
We can see that maternal age has the highest p-value (0.5795), and that the p-value is well above \(\alpha = 0.10\); thus it will be removed. We can begin the second iteration with the full model excluding maternal age:
The following R code takes the full model and creates a reduced model by removing gestational age. The summary function then reports the coefficients and associated p-values:
lm.bwfull_60_age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight, data = CHDS_60)
summary(lm.bwfull_60_age)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight,
## data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2487 -0.6078 0.0073 0.5562 3.2309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.16421 1.25041 -6.529 1.30e-10 ***
## SMK1 -0.18009 0.10571 -1.704 0.08890 .
## SMK2 -0.55037 0.12595 -4.370 1.44e-05 ***
## SMK3 -0.41179 0.09542 -4.315 1.83e-05 ***
## gestwks 0.24136 0.01954 12.355 < 2e-16 ***
## BMI 0.04403 0.01391 3.166 0.00162 **
## mheight 0.08226 0.01459 5.638 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9353 on 672 degrees of freedom
## Multiple R-squared: 0.274, Adjusted R-squared: 0.2675
## F-statistic: 42.26 on 6 and 672 DF, p-value: < 2.2e-16
We can see that all p-values are above our threshold of \(\alpha = 0.10\); thus no further variables will be removed. We can construct the parsimonious model:
If we denote gestational age as \(X_4\), BMI as \(X_5\), & maternal height as \(X_6\) we can state the parsimonious model as:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \epsilon\]
The following R code reports summary statistics and ANOVA table for this model:
summary(lm.bwfull_60_age)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight,
## data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2487 -0.6078 0.0073 0.5562 3.2309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.16421 1.25041 -6.529 1.30e-10 ***
## SMK1 -0.18009 0.10571 -1.704 0.08890 .
## SMK2 -0.55037 0.12595 -4.370 1.44e-05 ***
## SMK3 -0.41179 0.09542 -4.315 1.83e-05 ***
## gestwks 0.24136 0.01954 12.355 < 2e-16 ***
## BMI 0.04403 0.01391 3.166 0.00162 **
## mheight 0.08226 0.01459 5.638 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9353 on 672 degrees of freedom
## Multiple R-squared: 0.274, Adjusted R-squared: 0.2675
## F-statistic: 42.26 on 6 and 672 DF, p-value: < 2.2e-16
anova(lm.bwfull_60_age)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## SMK1 1 2.28 2.277 2.6030 0.10713
## SMK2 1 24.16 24.158 27.6159 1.989e-07 ***
## SMK3 1 21.32 21.316 24.3672 1.005e-06 ***
## gestwks 1 139.65 139.650 159.6407 < 2.2e-16 ***
## BMI 1 6.63 6.626 7.5744 0.00608 **
## mheight 1 27.80 27.803 31.7825 2.540e-08 ***
## Residuals 672 587.85 0.875
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following R code extracts the fitted values, residuals, standardized residuals, and studentized residuals; and plots them so that our regression assumptions can be qualitatively assessed:
fitted <- lm.bwfull_60_age$fitted.values # extract fitted values
resm <- resid(lm.bwfull_60_age) # extract model residuals
standm <- rstandard(lm.bwfull_60_age) # extract standardized residuals
studm <- rstudent(lm.bwfull_60_age) # extract studentized residuals
#par(mfrow = c(3,1)) # arranges plots
# residual plot:
#plot(fitted, resm, main = "Plot of Model Residuals as a Function of Fitted Values for Birth Weight", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Residual (lbs)")
#abline(0,0, col = "red")
# standardized residual plot:
plot(fitted, standm, main = "Plot of Standardized Model Residuals as a Function of Fitted Values for Birth Weight - Parsimonious Model", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Standardized Residual")
abline(0,0, col = "red")
# studentized residual plot:
plot(fitted, studm, main = "Plot of Studentized Model Residuals as a Function of Fitted Values for Birth Weight - Parsimonious Model", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Studentized Residual")
abline(0,0, col = "red")
Similarly to the residual analysis for the association model, the above plots do not show any overt change in variance with increasing \(\hat{Y}\); suggesting that the homoscedasticity assumption is met. Furthermore the residuals do not seem to be asymmetrically distributed, skewed, or multi-modal in any way; suggesting that the normality assumption is likely met. The reference cell coding approach for creating dummy variables for smoking category ensures that the linearity assumption is met.
Due to the large sample size the Shapiro-Wilk test will not provide helpful insights and thus will not be performed. The following R code creates a QQ Plot:
qqnorm(resm, ylab = "Residuals", main = "Q-Q Plot of Residuals - Parsimonious Model")
qqline(resm)
qqnorm(standm, ylab = "Standardized Residuals", main = "Q-Q Plot of Standardized Residuals - Parsimonious Model")
qqline(standm)
qqnorm(studm, ylab = "Studentized Residuals", main = "Q-Q Plot of Studentized Residuals - Parsimonious Model")
qqline(studm)
Similarly to the normality analysis for the association model, the Q-Q plots suggest that the normality assumption is met. The handfull of points at the very edges of the plot may or may not be outliers, which brings us to our next section:
The following code adds leverages & Cook’s distance to the dataset:
k <- 7 # number of predictors in prediction model
h_limit <- 2*(1+k)/nrow(CHDS_60) # recommended leverage threshold
CHDS_60$h <- hatvalues(lm.bwfull_60_age) # Add leverages to the dataset
CHDS_60$c <- cooks.distance(lm.bwfull_60_age)# add Cook's distance to the dataset
The following code checks if any data points have Cook’s distance, \(c\), greater than 1 or if the leverage, \(h\), is greater than the recommended threshold of \(2(k+1)/n\).
CHDS[CHDS_60$c>1,]
## [1] X bwt gestwks age mnocig mheight mppwt BMI BMI_cat
## [10] SMK_cat SMK1 SMK2 SMK3 h c
## <0 rows> (or 0-length row.names)
CHDS[CHDS_60$h>h_limit,]
## X bwt gestwks age mnocig mheight mppwt BMI BMI_cat SMK_cat SMK1 SMK2
## 9 9 3.3 29 32 0 64 142 24.37158 1 0 0 0
## 83 83 7.5 43 26 0 70 150 21.52041 1 0 0 0
## 98 98 8.1 40 28 0 70 125 17.93367 0 0 0 0
## 165 165 6.4 39 19 7 64 118 20.25244 1 1 1 0
## 166 166 5.3 33 20 7 63 109 19.30637 1 1 1 0
## 225 225 7.0 40 36 0 59 140 28.27348 2 0 0 0
## 272 272 6.4 37 21 25 64 124 21.28223 1 3 0 0
## 290 290 6.6 43 25 0 67 133 20.82847 1 0 0 0
## 334 334 6.4 41 28 0 62 132 24.14048 1 0 0 0
## 355 355 5.6 40 32 7 63 105 18.59788 1 1 1 0
## 399 399 6.9 39 18 2 61 108 20.40419 1 1 1 0
## 403 403 7.0 40 26 0 66 140 22.59412 1 0 0 0
## 444 444 8.5 39 17 0 64 118 20.25244 1 0 0 0
## 504 504 7.1 40 20 0 61 104 19.64848 1 0 0 0
## 505 505 3.9 33 24 0 58 99 20.68876 1 0 0 0
## 609 609 8.5 42 23 0 66 180 29.04959 2 0 0 0
## 619 619 7.1 41 21 25 65 120 19.96686 1 3 0 0
## 631 631 7.7 40 27 0 64 118 20.25244 1 0 0 0
## 666 666 8.0 39 20 0 66 125 20.17332 1 0 0 0
## SMK3 h c
## 9 0 0.056453181 3.426298e-02
## 83 0 0.013481686 3.699793e-03
## 98 0 0.012951382 1.361513e-05
## 165 0 0.011889995 1.248300e-03
## 166 0 0.029563732 8.499634e-04
## 225 0 0.021790121 9.858406e-04
## 272 1 0.012073188 7.081683e-05
## 290 0 0.008118873 4.669435e-03
## 334 0 0.005796544 1.790811e-03
## 355 0 0.013631991 5.582888e-03
## 399 0 0.014855471 3.022112e-05
## 403 0 0.003346639 3.889625e-04
## 444 0 0.007231896 1.167893e-03
## 504 0 0.008278008 9.644429e-05
## 505 0 0.032612159 1.344637e-02
## 609 0 0.017839103 7.476289e-05
## 619 1 0.010008126 2.892465e-04
## 631 0 0.003367634 2.158779e-06
## 666 0 0.005644973 1.377047e-04
Upon examination of the data reported in the second table, none of the data is implausible.
The variance inflation factor approach will be used to assess collinearity. The following R code calculates the factor for each \(\hat{\beta}_i\):
vif(lm.bwfull_60_age)
## SMK1 SMK2 SMK3 gestwks BMI mheight
## 1.116076 1.080543 1.094130 1.012350 1.036611 1.018422
mean(vif(lm.bwfull_60_age))
## [1] 1.059689
All factors are well below 10; indicating that collinearity is likely minimal.
If we retain the definition of the \(X_i\) as above, we can write the regression line as:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \epsilon\]
For the adjusted association model:
confint(lm.bwfull_60, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -10.56754782 -5.634226121
## SMK1 -0.38832099 0.027022082
## SMK2 -0.79819783 -0.303322438
## SMK3 -0.59839171 -0.223411760
## gestwks 0.20297239 0.279729815
## BMI 0.01743780 0.072479511
## mheight 0.05377405 0.111118368
## age -0.01668479 0.009337101
For the prediction model:
confint(lm.bwfull_60_age, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -10.61938216 -5.70903683
## SMK1 -0.38764413 0.02746481
## SMK2 -0.79767379 -0.30306260
## SMK3 -0.59915690 -0.22442405
## gestwks 0.20300269 0.27972030
## BMI 0.01671973 0.07133750
## mheight 0.05360913 0.11090837
The confidence intervals are quite similar between both models.
For the final association model:
summary(lm.bwfull_60)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight +
## age, data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2771 -0.6106 0.0142 0.5561 3.2634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.100887 1.256253 -6.448 2.16e-10 ***
## SMK1 -0.180649 0.105766 -1.708 0.0881 .
## SMK2 -0.550760 0.126018 -4.370 1.44e-05 ***
## SMK3 -0.410902 0.095487 -4.303 1.93e-05 ***
## gestwks 0.241351 0.019546 12.348 < 2e-16 ***
## BMI 0.044959 0.014016 3.208 0.0014 **
## mheight 0.082446 0.014603 5.646 2.43e-08 ***
## age -0.003674 0.006626 -0.554 0.5795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9358 on 671 degrees of freedom
## Multiple R-squared: 0.2743, Adjusted R-squared: 0.2667
## F-statistic: 36.23 on 7 and 671 DF, p-value: < 2.2e-16
For the prediction model:
summary(lm.bwfull_60_age)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight,
## data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2487 -0.6078 0.0073 0.5562 3.2309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.16421 1.25041 -6.529 1.30e-10 ***
## SMK1 -0.18009 0.10571 -1.704 0.08890 .
## SMK2 -0.55037 0.12595 -4.370 1.44e-05 ***
## SMK3 -0.41179 0.09542 -4.315 1.83e-05 ***
## gestwks 0.24136 0.01954 12.355 < 2e-16 ***
## BMI 0.04403 0.01391 3.166 0.00162 **
## mheight 0.08226 0.01459 5.638 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9353 on 672 degrees of freedom
## Multiple R-squared: 0.274, Adjusted R-squared: 0.2675
## F-statistic: 42.26 on 6 and 672 DF, p-value: < 2.2e-16
The p-values are generally similar between both models.
Based on our preliminary research, the possible relations between the variables in the dataset are depicted below:
DAG of possible mediators by BMI or gestational age
If BMI truly does influence gestational age then analysis for “intertwined” mediation would be required (e.g. propensity matching). To see if this is required, we can test \(H_0: \beta_1' = 0\), at significance \(\alpha = 0.10\), where:
\[gestational \; age = \beta_0' + \beta_1' \times BMI + \beta_2' \times light \; smoking + \beta_3' \times moderate \; smoking + \beta_4' \times heavy \; smoking + \epsilon\]
The following R code creates a crude linear model and performs the hypothesis test:
lm.ga <- lm(gestwks ~ BMI + SMK1 + SMK2 + SMK3, data = CHDS_60)
summary(lm.ga)
##
## Call:
## lm(formula = gestwks ~ BMI + SMK1 + SMK2 + SMK3, data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9440 -0.9178 0.0979 1.0855 6.3526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.65007 0.60440 65.602 <2e-16 ***
## BMI 0.01206 0.02736 0.441 0.6595
## SMK1 -0.36435 0.20776 -1.754 0.0799 .
## SMK2 -0.51005 0.24699 -2.065 0.0393 *
## SMK3 -0.23840 0.18794 -1.269 0.2051
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.846 on 674 degrees of freedom
## Multiple R-squared: 0.01067, Adjusted R-squared: 0.004795
## F-statistic: 1.817 on 4 and 674 DF, p-value: 0.1238
The p-value is well above our pre-specified significance level, \(\alpha = 0.10\); therefore the influence of BMI on gestational age can be ignored for the purposes of this mediation analysis. Thus we can perform two separate mediation analyses for each proposed mediator. The following DAGs depict the hypothesized relations:
Mediation by BMI:
DAG of possible mediaton by BMI
Mediation by Gestational Age:
DAG of possible mediaton by gestational age
To show that mediation of the smoking-gestational weight relation by BMI is plausible we can test if there is a linear relation between smoking and BMI, and between BMI and gestational age. For smoking and BMI:
lm.BMI <- lm(BMI ~ SMK1 + SMK2 + SMK3 + age, data = CHDS_60)
summary(lm.BMI)
##
## Call:
## lm(formula = BMI ~ SMK1 + SMK2 + SMK3 + age, data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.917 -1.617 -0.351 1.069 18.404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.36065 0.48860 41.672 < 2e-16 ***
## SMK1 -1.20267 0.28670 -4.195 3.1e-05 ***
## SMK2 -0.74168 0.34406 -2.156 0.03146 *
## SMK3 -0.46607 0.26212 -1.778 0.07584 .
## age 0.05609 0.01813 3.093 0.00206 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.58 on 674 degrees of freedom
## Multiple R-squared: 0.0432, Adjusted R-squared: 0.03753
## F-statistic: 7.609 on 4 and 674 DF, p-value: 5.347e-06
All smoking categories were found to have a significant linear relation with BMI a pre-specified significance, \(\alpha = 0.10\). We can test BMI and birth weight:
summary(lm.bwfull_60_age)
##
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight,
## data = CHDS_60)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2487 -0.6078 0.0073 0.5562 3.2309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.16421 1.25041 -6.529 1.30e-10 ***
## SMK1 -0.18009 0.10571 -1.704 0.08890 .
## SMK2 -0.55037 0.12595 -4.370 1.44e-05 ***
## SMK3 -0.41179 0.09542 -4.315 1.83e-05 ***
## gestwks 0.24136 0.01954 12.355 < 2e-16 ***
## BMI 0.04403 0.01391 3.166 0.00162 **
## mheight 0.08226 0.01459 5.638 2.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9353 on 672 degrees of freedom
## Multiple R-squared: 0.274, Adjusted R-squared: 0.2675
## F-statistic: 42.26 on 6 and 672 DF, p-value: < 2.2e-16
The p-value for the linear relation of BMI and gestational weight meets our pre-specified significance, \(\alpha = 0.10\).
set.seed(Sys.time())