1 Import Dataset

Import the maternal smoking dataset from the preliminary analysis:

CHDS <- read.csv("CHDS2.csv")

2 Association Model

2.1 Crude Model

If we let \(Y\) equal birthweight, and \(X\) equal smoking category, the crude model can be conceptualized as:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

Firstly, we will create \(k-1 = 3\) dummy variables to represent the ordinal smoking category data; we can name them \(SMK1\) for category 1, \(SMK2\) for category 3, and \(SMK3\) for category 3. The values of these variables will be assigned to 0 or 1 according to the reference cell coding approach:

If we let:

\[ \begin{split} X_1= \begin{cases} 1 \; \mbox{if in smoking category 1}\\ 0 \; \mbox{otherwise}\\ \end{cases}\\ X_2= \begin{cases} 1 \; \mbox{if in smoking category 2}\\ 0 \; \mbox{otherwise}\\ \end{cases}\\ X_3= \begin{cases} 1 \; \mbox{if in smoking category 3}\\ 0 \; \mbox{otherwise}\\ \end{cases} \end{split} \]

The following R code creates these dummy variables as defined above:

CHDS$SMK1 <- ifelse(CHDS$SMK_cat == 1, 1,0)
CHDS$SMK2 <- ifelse(CHDS$SMK_cat == 2, 1,0)
CHDS$SMK3 <- ifelse(CHDS$SMK_cat == 3, 1,0)

The reformulated crude model can be expressed as:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \epsilon\]

To test if a crude association can be established we can consider \(H_0: \beta_1 = \beta_2 = \beta_3 = 0\), and test it at significance \(\alpha = 0.05\) The following R code creates the crude model and reports summary and ANOVA data:

lm.bwcrude <- lm(bwt ~ SMK1 + SMK2 + SMK3, data = CHDS)
summary(lm.bwcrude)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3, data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4328 -0.7328 -0.0328  0.7214  3.6672 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.73281    0.05442 142.091  < 2e-16 ***
## SMK1        -0.35223    0.11797  -2.986  0.00293 ** 
## SMK2        -0.76008    0.14163  -5.367 1.10e-07 ***
## SMK3        -0.46665    0.10790  -4.325 1.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.062 on 676 degrees of freedom
## Multiple R-squared:  0.0585, Adjusted R-squared:  0.05432 
## F-statistic:    14 on 3 and 676 DF,  p-value: 7.294e-09
anova(lm.bwcrude)
## Analysis of Variance Table
## 
## Response: bwt
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## SMK1        1   2.24  2.2415  1.9864    0.1592    
## SMK2        1  24.04 24.0434 21.3074 4.682e-06 ***
## SMK3        1  21.11 21.1076 18.7056 1.755e-05 ***
## Residuals 676 762.80  1.1284                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Although \(\beta_1\) does not meet our pre-specified criteria of \(\alpha = 0.10\), we will include it in the analysis, as it is a predictor of interest in this study. We can see that all other reported p-values are less than the pre-specified sensitivity, and thus a crude association can be established.

2.2 Full Model

To create the full association model we will include all independent variables in the dataset except for maternal weight, as it is already a linear component of BMI, and will create colinearity without adding significantly to the model. Scientific evidence suggest including these variables may improve association The following R code creates the full model and reports summary and ANOVA data:

lm.bwfull <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age, data = CHDS)
anova(lm.bwfull)
## Analysis of Variance Table
## 
## Response: bwt
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1        1   2.24   2.241   2.5258  0.112465    
## SMK2        1  24.04  24.043  27.0938 2.578e-07 ***
## SMK3        1  21.11  21.108  23.7855 1.346e-06 ***
## gestwks     1 132.20 132.200 148.9717 < 2.2e-16 ***
## BMI         1   6.71   6.709   7.5602  0.006128 ** 
## mheight     1  27.30  27.303  30.7672 4.186e-08 ***
## age         1   0.25   0.249   0.2810  0.596210    
## Residuals 672 596.34   0.887                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.bwfull)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age, data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2737 -0.6128  0.0214  0.5605  3.2898 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.658217   1.256767  -6.094 1.86e-09 ***
## SMK1        -0.176868   0.106465  -1.661  0.09712 .  
## SMK2        -0.548731   0.126858  -4.326 1.75e-05 ***
## SMK3        -0.405474   0.096109  -4.219 2.79e-05 ***
## gestwks      0.231123   0.019405  11.910  < 2e-16 ***
## BMI          0.045115   0.014110   3.197  0.00145 ** 
## mheight      0.081685   0.014698   5.558 3.95e-08 ***
## age         -0.003536   0.006670  -0.530  0.59621    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.942 on 672 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.2563 
## F-statistic: 34.43 on 7 and 672 DF,  p-value: < 2.2e-16

We can see that including our four predictors of interest (gestational age, maternal age, maternal height, and maternal BMI) has changed our model parameters, \(\beta_0\) - \(\beta_3\) significantly. Maternal height is not statistically significant, however we will include it as a political variable in the association model.

Let:

\[ \begin{split} X_4 &= \; \mbox{Gestational Age}\\ X_5 &= \; \mbox{Maternal BMI}\\ X_6 &= \; \mbox{Maternal Height}\\ X_7 &= \; \mbox{Maternal Age}\\ \\ \end{split} \]

The full model can be expressed as:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \epsilon\]

2.3 Analysis of Effect Modification

For this step we wish to investigate if \(X_i\) (\(i = 4,...,7\)) are an effect-modifiers of the association between smoking and birth weight. We can construct an association model with an interaction term for each of our dummy variables:

For \(i = 4,...,7\) and \(j = 1...3\):

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \beta_{i,j} X_i X_j + \epsilon\]

We wish to test the null hypothesis \(H_0: \beta_{i,j} = 0\) for \(i = 4,...,7\) and \(j = 1...3\). We can construct three linear models to test this hypothesis:

2.3.1 Gestational Age

For this step we wish to investigate if gestational age, \(X_4\), is an effect-modifier which changes the association between smoking and birth weight. We will test three models for \(X_1\), \(X_2\), & \(X_3\):

2.3.1.1 Light Smoking

lm.X1gestwks <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + gestwks*as.factor(SMK1), data = CHDS)
summary(lm.X1gestwks)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + gestwks * as.factor(SMK1), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2681 -0.6209  0.0288  0.5673  3.2962 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.494725   1.315598  -5.697 1.83e-08 ***
## SMK1                     -1.074384   2.124563  -0.506  0.61324    
## SMK2                     -0.551152   0.127065  -4.338 1.66e-05 ***
## SMK3                     -0.406626   0.096207  -4.227 2.70e-05 ***
## gestwks                   0.227582   0.021145  10.763  < 2e-16 ***
## BMI                       0.044571   0.014177   3.144  0.00174 ** 
## mheight                   0.081470   0.014716   5.536 4.44e-08 ***
## age                      -0.003396   0.006683  -0.508  0.61154    
## as.factor(SMK1)1                NA         NA      NA       NA    
## gestwks:as.factor(SMK1)1  0.022649   0.053547   0.423  0.67245    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9426 on 671 degrees of freedom
## Multiple R-squared:  0.2641, Adjusted R-squared:  0.2554 
## F-statistic: 30.11 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X1gestwks)
## Analysis of Variance Table
## 
## Response: bwt
##                          Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                      1   2.24   2.241   2.5228  0.112685    
## SMK2                      1  24.04  24.043  27.0607 2.622e-07 ***
## SMK3                      1  21.11  21.108  23.7564 1.366e-06 ***
## gestwks                   1 132.20 132.200 148.7897 < 2.2e-16 ***
## BMI                       1   6.71   6.709   7.5509  0.006159 ** 
## mheight                   1  27.30  27.303  30.7296 4.266e-08 ***
## age                       1   0.25   0.249   0.2807  0.596434    
## gestwks:as.factor(SMK1)   1   0.16   0.159   0.1789  0.672446    
## Residuals               671 596.18   0.888                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 0.1789 & 0.672446 respectively; thus we can conclude that gestational age is not an effect-measure modifier of light smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{4,1} = 0\).

2.3.1.2 Moderate Smoking

lm.X2gestwks <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + gestwks*as.factor(SMK2), data = CHDS)
summary(lm.X2gestwks)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + gestwks * as.factor(SMK2), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2735 -0.6128  0.0213  0.5604  3.2894 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.664274   1.269786  -6.036 2.62e-09 ***
## SMK1                     -0.176832   0.106549  -1.660   0.0975 .  
## SMK2                     -0.444201   3.018649  -0.147   0.8831    
## SMK3                     -0.405435   0.096187  -4.215 2.84e-05 ***
## gestwks                   0.231308   0.020141  11.484  < 2e-16 ***
## BMI                       0.045087   0.014144   3.188   0.0015 ** 
## mheight                   0.081673   0.014713   5.551 4.09e-08 ***
## age                      -0.003534   0.006676  -0.529   0.5967    
## as.factor(SMK2)1                NA         NA      NA       NA    
## gestwks:as.factor(SMK2)1 -0.002652   0.076507  -0.035   0.9724    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.2552 
## F-statistic: 30.08 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X2gestwks)
## Analysis of Variance Table
## 
## Response: bwt
##                          Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                      1   2.24   2.241   2.5221  0.112733    
## SMK2                      1  24.04  24.043  27.0536 2.632e-07 ***
## SMK3                      1  21.11  21.108  23.7502 1.370e-06 ***
## gestwks                   1 132.20 132.200 148.7503 < 2.2e-16 ***
## BMI                       1   6.71   6.709   7.5489  0.006166 ** 
## mheight                   1  27.30  27.303  30.7215 4.284e-08 ***
## age                       1   0.25   0.249   0.2806  0.596483    
## gestwks:as.factor(SMK2)   1   0.00   0.001   0.0012  0.972362    
## Residuals               671 596.34   0.889                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 0.0012 & 0.972362 respectively; thus we can conclude that gestational age is not an effect-measure modifier of moderate smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{4,2} = 0\).

2.3.1.3 Heavy Smoking

lm.X3gestwks <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + gestwks*as.factor(SMK3), data = CHDS)
summary(lm.X3gestwks)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + gestwks * as.factor(SMK3), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2848 -0.6127  0.0316  0.5625  3.3046 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.402909   1.317431  -5.619 2.81e-08 ***
## SMK1                     -0.179144   0.106569  -1.681  0.09322 .  
## SMK2                     -0.552041   0.127015  -4.346 1.60e-05 ***
## SMK3                     -1.628079   1.886550  -0.863  0.38845    
## gestwks                   0.224559   0.021891  10.258  < 2e-16 ***
## BMI                       0.045448   0.014125   3.218  0.00136 ** 
## mheight                   0.081745   0.014705   5.559 3.92e-08 ***
## age                      -0.003701   0.006678  -0.554  0.57961    
## as.factor(SMK3)1                NA         NA      NA       NA    
## gestwks:as.factor(SMK3)1  0.030780   0.047434   0.649  0.51662    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9424 on 671 degrees of freedom
## Multiple R-squared:  0.2644, Adjusted R-squared:  0.2556 
## F-statistic: 30.15 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X3gestwks)
## Analysis of Variance Table
## 
## Response: bwt
##                          Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                      1   2.24   2.241   2.5237   0.11262    
## SMK2                      1  24.04  24.043  27.0705 2.610e-07 ***
## SMK3                      1  21.11  21.108  23.7650 1.360e-06 ***
## gestwks                   1 132.20 132.200 148.8434 < 2.2e-16 ***
## BMI                       1   6.71   6.709   7.5537   0.00615 ** 
## mheight                   1  27.30  27.303  30.7407 4.243e-08 ***
## age                       1   0.25   0.249   0.2808   0.59637    
## gestwks:as.factor(SMK3)   1   0.37   0.374   0.4211   0.51662    
## Residuals               671 595.97   0.888                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 0.4211 & 0.51662 respectively; thus we can conclude that gestational age is not an effect-measure modifier of heavy smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{4,3} = 0\).

2.3.2 BMI

For this step we wish to investigate if BMI, \(X_5\), is an effect-modifier which changes the association between smoking and birth weight. We will test three models for \(X_1\), \(X_2\), & \(X_3\):

2.3.2.1 Light Smoking

lm.X1BMI <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + BMI*as.factor(SMK1), data = CHDS)
summary(lm.X1BMI)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + BMI * as.factor(SMK1), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2819 -0.6153  0.0209  0.5574  3.2872 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.689798   1.263612  -6.086 1.95e-09 ***
## SMK1                  0.066478   0.951413   0.070  0.94432    
## SMK2                 -0.547677   0.127012  -4.312 1.86e-05 ***
## SMK3                 -0.404705   0.096223  -4.206 2.95e-05 ***
## gestwks               0.231643   0.019524  11.865  < 2e-16 ***
## BMI                   0.046351   0.014913   3.108  0.00196 ** 
## mheight               0.081446   0.014738   5.526 4.68e-08 ***
## age                  -0.003564   0.006676  -0.534  0.59366    
## as.factor(SMK1)1            NA         NA      NA       NA    
## BMI:as.factor(SMK1)1 -0.011742   0.045619  -0.257  0.79696    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.2553 
## F-statistic: 30.09 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X1BMI)
## Analysis of Variance Table
## 
## Response: bwt
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                  1   2.24   2.241   2.5223  0.112715    
## SMK2                  1  24.04  24.043  27.0562 2.628e-07 ***
## SMK3                  1  21.11  21.108  23.7525 1.369e-06 ***
## gestwks               1 132.20 132.200 148.7647 < 2.2e-16 ***
## BMI                   1   6.71   6.709   7.5497  0.006163 ** 
## mheight               1  27.30  27.303  30.7245 4.277e-08 ***
## age                   1   0.25   0.249   0.2806  0.596465    
## BMI:as.factor(SMK1)   1   0.06   0.059   0.0663  0.796955    
## Residuals           671 596.28   0.889                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 0.0663 & 0.796955 respectively; thus we can conclude that BMI is not an effect-measure modifier of light smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{5,1} = 0\).

2.3.2.2 Moderate Smoking

lm.X2BMI <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + BMI*as.factor(SMK2), data = CHDS)
summary(lm.X2BMI)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + BMI * as.factor(SMK2), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2625 -0.6188  0.0212  0.5606  3.2894 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.654724   1.257467  -6.087 1.93e-09 ***
## SMK1                 -0.178553   0.106572  -1.675  0.09432 .  
## SMK2                 -1.279164   1.408476  -0.908  0.36410    
## SMK3                 -0.406082   0.096169  -4.223 2.75e-05 ***
## gestwks               0.231785   0.019457  11.912  < 2e-16 ***
## BMI                   0.043468   0.014468   3.005  0.00276 ** 
## mheight               0.081728   0.014706   5.557 3.95e-08 ***
## age                  -0.003412   0.006678  -0.511  0.60961    
## as.factor(SMK2)1            NA         NA      NA       NA    
## BMI:as.factor(SMK2)1  0.034652   0.066547   0.521  0.60274    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9425 on 671 degrees of freedom
## Multiple R-squared:  0.2643, Adjusted R-squared:  0.2555 
## F-statistic: 30.12 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X2BMI)
## Analysis of Variance Table
## 
## Response: bwt
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                  1   2.24   2.241   2.5231  0.112660    
## SMK2                  1  24.04  24.043  27.0644 2.617e-07 ***
## SMK3                  1  21.11  21.108  23.7597 1.364e-06 ***
## gestwks               1 132.20 132.200 148.8101 < 2.2e-16 ***
## BMI                   1   6.71   6.709   7.5520  0.006155 ** 
## mheight               1  27.30  27.303  30.7338 4.258e-08 ***
## age                   1   0.25   0.249   0.2807  0.596409    
## BMI:as.factor(SMK2)   1   0.24   0.241   0.2711  0.602736    
## Residuals           671 596.10   0.888                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 0.2711 & 0.602736 respectively; thus we can conclude that BMI is not an effect-measure modifier of moderate smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{5,2} = 0\).

2.3.2.3 Heavy Smoking

lm.X3BMI <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + BMI*as.factor(SMK3), data = CHDS)
summary(lm.X3BMI)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + BMI * as.factor(SMK3), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2447 -0.6129  0.0231  0.5623  3.2885 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.695222   1.273525  -6.042 2.52e-09 ***
## SMK1                 -0.175029   0.107006  -1.636  0.10237    
## SMK2                 -0.547567   0.127105  -4.308 1.89e-05 ***
## SMK3                 -0.276780   0.703264  -0.394  0.69403    
## gestwks               0.231002   0.019430  11.889  < 2e-16 ***
## BMI                   0.046611   0.016276   2.864  0.00432 ** 
## mheight               0.081830   0.014730   5.556 3.99e-08 ***
## age                  -0.003542   0.006675  -0.531  0.59588    
## as.factor(SMK3)1            NA         NA      NA       NA    
## BMI:as.factor(SMK3)1 -0.005997   0.032463  -0.185  0.85350    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.2552 
## F-statistic: 30.08 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X3BMI)
## Analysis of Variance Table
## 
## Response: bwt
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                  1   2.24   2.241   2.5222  0.112724    
## SMK2                  1  24.04  24.043  27.0549 2.630e-07 ***
## SMK3                  1  21.11  21.108  23.7513 1.369e-06 ***
## gestwks               1 132.20 132.200 148.7576 < 2.2e-16 ***
## BMI                   1   6.71   6.709   7.5493  0.006165 ** 
## mheight               1  27.30  27.303  30.7230 4.280e-08 ***
## age                   1   0.25   0.249   0.2806  0.596474    
## BMI:as.factor(SMK3)   1   0.03   0.030   0.0341  0.853495    
## Residuals           671 596.31   0.889                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 0.0341 & 0.853495 respectively; thus we can conclude that BMI is not an effect-measure modifier of heavy smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{5,3} = 0\).

2.3.3 Maternal Height

For this step we wish to investigate if maternal height, \(X_6\), is an effect-modifier which changes the association between smoking and birth weight. We will test three models for \(X_1\), \(X_2\), & \(X_3\):

2.3.3.1 Light Smoking

lm.X1mheight <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + mheight*as.factor(SMK1), data = CHDS)
summary(lm.X1mheight)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + mheight * as.factor(SMK1), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2737 -0.6137  0.0275  0.5575  3.2927 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.750662   1.313783  -5.899 5.78e-09 ***
## SMK1                      0.486871   2.729934   0.178  0.85851    
## SMK2                     -0.547854   0.126998  -4.314 1.85e-05 ***
## SMK3                     -0.405854   0.096189  -4.219 2.79e-05 ***
## gestwks                   0.231271   0.019429  11.904  < 2e-16 ***
## BMI                       0.044981   0.014130   3.183  0.00152 ** 
## mheight                   0.083109   0.015829   5.250 2.04e-07 ***
## age                      -0.003628   0.006686  -0.543  0.58756    
## as.factor(SMK1)1                NA         NA      NA       NA    
## mheight:as.factor(SMK1)1 -0.010342   0.042503  -0.243  0.80783    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.2552 
## F-statistic: 30.09 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X1mheight)
## Analysis of Variance Table
## 
## Response: bwt
##                          Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                      1   2.24   2.241   2.5223  0.112717    
## SMK2                      1  24.04  24.043  27.0559 2.629e-07 ***
## SMK3                      1  21.11  21.108  23.7522 1.369e-06 ***
## gestwks                   1 132.20 132.200 148.7631 < 2.2e-16 ***
## BMI                       1   6.71   6.709   7.5496  0.006164 ** 
## mheight                   1  27.30  27.303  30.7241 4.278e-08 ***
## age                       1   0.25   0.249   0.2806  0.596467    
## mheight:as.factor(SMK1)   1   0.05   0.053   0.0592  0.807833    
## Residuals               671 596.29   0.889                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 0.0592 & 0.807833 respectively; thus we can conclude that maternal height is not an effect-measure modifier of light smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{6,1} = 0\).

2.3.3.2 Moderate Smoking

lm.X2mheight <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + mheight*as.factor(SMK2), data = CHDS)
summary(lm.X2mheight)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + mheight * as.factor(SMK2), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2745 -0.6077  0.0200  0.5708  3.2840 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.315029   1.294538  -5.651 2.36e-08 ***
## SMK1                     -0.178830   0.106463  -1.680  0.09347 .  
## SMK2                     -3.968583   3.104099  -1.278  0.20152    
## SMK3                     -0.403904   0.096104  -4.203 2.99e-05 ***
## gestwks                   0.231559   0.019406  11.932  < 2e-16 ***
## BMI                       0.045089   0.014107   3.196  0.00146 ** 
## mheight                   0.076246   0.015501   4.919 1.10e-06 ***
## age                      -0.003892   0.006677  -0.583  0.56019    
## as.factor(SMK2)1                NA         NA      NA       NA    
## mheight:as.factor(SMK2)1  0.053508   0.048527   1.103  0.27058    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9419 on 671 degrees of freedom
## Multiple R-squared:  0.2653, Adjusted R-squared:  0.2565 
## F-statistic: 30.28 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X2mheight)
## Analysis of Variance Table
## 
## Response: bwt
##                          Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                      1   2.24   2.241   2.5267   0.11241    
## SMK2                      1  24.04  24.043  27.1025 2.568e-07 ***
## SMK3                      1  21.11  21.108  23.7931 1.341e-06 ***
## gestwks                   1 132.20 132.200 149.0195 < 2.2e-16 ***
## BMI                       1   6.71   6.709   7.5626   0.00612 ** 
## mheight                   1  27.30  27.303  30.7771 4.168e-08 ***
## age                       1   0.25   0.249   0.2811   0.59615    
## mheight:as.factor(SMK2)   1   1.08   1.079   1.2158   0.27058    
## Residuals               671 595.26   0.887                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 1.2158 & 0.27058 respectively; thus we can conclude that maternal height is not an effect-measure modifier of moderate smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{6,2} = 0\).

2.3.3.3 Heavy Smoking

lm.X3mheight <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + mheight*as.factor(SMK3), data = CHDS)
summary(lm.X3mheight)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + mheight * as.factor(SMK3), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2732 -0.6220  0.0071  0.5667  3.3124 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -8.418121   1.331476  -6.322 4.70e-10 ***
## SMK1                     -0.171530   0.106359  -1.613  0.10727    
## SMK2                     -0.540412   0.126771  -4.263 2.31e-05 ***
## SMK3                      3.863973   2.500996   1.545  0.12282    
## gestwks                   0.231122   0.019378  11.927  < 2e-16 ***
## BMI                       0.046369   0.014109   3.287  0.00107 ** 
## mheight                   0.093334   0.016184   5.767 1.23e-08 ***
## age                      -0.004259   0.006674  -0.638  0.52361    
## as.factor(SMK3)1                NA         NA      NA       NA    
## mheight:as.factor(SMK3)1 -0.065962   0.038611  -1.708  0.08803 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9407 on 671 degrees of freedom
## Multiple R-squared:  0.2671, Adjusted R-squared:  0.2584 
## F-statistic: 30.57 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X3mheight)
## Analysis of Variance Table
## 
## Response: bwt
##                          Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                      1   2.24   2.241   2.5331  0.111955    
## SMK2                      1  24.04  24.043  27.1712 2.482e-07 ***
## SMK3                      1  21.11  21.108  23.8534 1.301e-06 ***
## gestwks                   1 132.20 132.200 149.3970 < 2.2e-16 ***
## BMI                       1   6.71   6.709   7.5818  0.006056 ** 
## mheight                   1  27.30  27.303  30.8550 4.011e-08 ***
## age                       1   0.25   0.249   0.2818  0.595686    
## mheight:as.factor(SMK3)   1   2.58   2.583   2.9185  0.088033 .  
## Residuals               671 593.76   0.885                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 2.9185 & 0.088033 respectively; thus we can conclude that maternal height is not an effect-measure modifier of heavy smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{6,3} = 0\).

2.3.4 Maternal Age

For this step we wish to investigate if maternal age, \(X_7\), is an effect-modifier which changes the association between smoking and birth weight. We will test three models for \(X_1\), \(X_2\), & \(X_3\):

2.3.4.1 Light Smoking

lm.X1age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + age*as.factor(SMK1), data = CHDS)
summary(lm.X1age)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + age * as.factor(SMK1), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2957 -0.6116  0.0401  0.5682  3.3162 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.670084   1.256919  -6.102 1.77e-09 ***
## SMK1                 -0.596071   0.453477  -1.314  0.18915    
## SMK2                 -0.548430   0.126867  -4.323 1.77e-05 ***
## SMK3                 -0.404847   0.096118  -4.212 2.88e-05 ***
## gestwks               0.231743   0.019418  11.935  < 2e-16 ***
## BMI                   0.045660   0.014122   3.233  0.00128 ** 
## mheight               0.082504   0.014724   5.603 3.07e-08 ***
## age                  -0.006527   0.007375  -0.885  0.37647    
## as.factor(SMK1)1            NA         NA      NA       NA    
## age:as.factor(SMK1)1  0.016459   0.017307   0.951  0.34194    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9421 on 671 degrees of freedom
## Multiple R-squared:  0.2649, Adjusted R-squared:  0.2562 
## F-statistic: 30.23 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X1age)
## Analysis of Variance Table
## 
## Response: bwt
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                  1   2.24   2.241   2.5255  0.112491    
## SMK2                  1  24.04  24.043  27.0900 2.584e-07 ***
## SMK3                  1  21.11  21.108  23.7821 1.348e-06 ***
## gestwks               1 132.20 132.200 148.9505 < 2.2e-16 ***
## BMI                   1   6.71   6.709   7.5591  0.006131 ** 
## mheight               1  27.30  27.303  30.7628 4.197e-08 ***
## age                   1   0.25   0.249   0.2810  0.596236    
## age:as.factor(SMK1)   1   0.80   0.803   0.9044  0.341945    
## Residuals           671 595.54   0.888                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 0.9044 & 0.341945 respectively; thus we can conclude that maternal age is not an effect-measure modifier of light smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{7,1} = 0\).

2.3.4.2 Moderate Smoking

lm.X2age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + age*as.factor(SMK2), data = CHDS)
summary(lm.X2age)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + age * as.factor(SMK2), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2731 -0.6127  0.0212  0.5605  3.2893 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.6610091  1.2608296  -6.076 2.06e-09 ***
## SMK1                 -0.1768501  0.1065457  -1.660  0.09741 .  
## SMK2                 -0.5301982  0.6026412  -0.880  0.37929    
## SMK3                 -0.4054981  0.0961838  -4.216 2.83e-05 ***
## gestwks               0.2311227  0.0194197  11.901  < 2e-16 ***
## BMI                   0.0450971  0.0141321   3.191  0.00148 ** 
## mheight               0.0817076  0.0147261   5.548 4.15e-08 ***
## age                  -0.0034687  0.0070107  -0.495  0.62092    
## as.factor(SMK2)1             NA         NA      NA       NA    
## age:as.factor(SMK2)1 -0.0007229  0.0229786  -0.031  0.97491    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9427 on 671 degrees of freedom
## Multiple R-squared:  0.264,  Adjusted R-squared:  0.2552 
## F-statistic: 30.08 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X2age)
## Analysis of Variance Table
## 
## Response: bwt
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                  1   2.24   2.241   2.5221  0.112733    
## SMK2                  1  24.04  24.043  27.0536 2.632e-07 ***
## SMK3                  1  21.11  21.108  23.7501 1.370e-06 ***
## gestwks               1 132.20 132.200 148.7502 < 2.2e-16 ***
## BMI                   1   6.71   6.709   7.5489  0.006166 ** 
## mheight               1  27.30  27.303  30.7215 4.284e-08 ***
## age                   1   0.25   0.249   0.2806  0.596483    
## age:as.factor(SMK2)   1   0.00   0.001   0.0010  0.974913    
## Residuals           671 596.34   0.889                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 0.0010 & 0.974913 respectively; thus we can conclude that maternal age is not an effect-measure modifier of moderate smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{7,2} = 0\).

2.3.4.3 Heavy Smoking

lm.X3age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age + age*as.factor(SMK3), data = CHDS)
summary(lm.X3age)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age + age * as.factor(SMK3), data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1009 -0.6190  0.0126  0.5565  3.3463 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.550832   1.255677  -6.013 2.99e-09 ***
## SMK1                 -0.180078   0.106277  -1.694  0.09065 .  
## SMK2                 -0.550620   0.126622  -4.349 1.58e-05 ***
## SMK3                 -1.279720   0.473495  -2.703  0.00705 ** 
## gestwks               0.229677   0.019384  11.849  < 2e-16 ***
## BMI                   0.044674   0.014085   3.172  0.00158 ** 
## mheight               0.083307   0.014695   5.669 2.14e-08 ***
## age                  -0.009111   0.007285  -1.251  0.21148    
## as.factor(SMK3)1            NA         NA      NA       NA    
## age:as.factor(SMK3)1  0.033517   0.017777   1.885  0.05980 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9402 on 671 degrees of freedom
## Multiple R-squared:  0.2678, Adjusted R-squared:  0.2591 
## F-statistic: 30.68 on 8 and 671 DF,  p-value: < 2.2e-16
anova(lm.X3age)
## Analysis of Variance Table
## 
## Response: bwt
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1                  1   2.24   2.241   2.5355  0.111786    
## SMK2                  1  24.04  24.043  27.1968 2.451e-07 ***
## SMK3                  1  21.11  21.108  23.8759 1.286e-06 ***
## gestwks               1 132.20 132.200 149.5381 < 2.2e-16 ***
## BMI                   1   6.71   6.709   7.5889  0.006032 ** 
## mheight               1  27.30  27.303  30.8842 3.954e-08 ***
## age                   1   0.25   0.249   0.2821  0.595513    
## age:as.factor(SMK3)   1   3.14   3.143   3.5550  0.059799 .  
## Residuals           671 593.20   0.884                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value and associated p-value are 3.5550 & 0.059799 respectively; thus we can conclude that maternal age is not an effect-measure modifier of heavy smoking, at pre-defined significance \(\alpha = 0.05\); the observed results are not inconsistent with \(\beta_{7,3} = 0\).

2.3.5 Conclusion

Finding no effect-measure modification, we can move on to analysis of confounding:

2.4 Analysis of Confounding

For each possible confounding variable (gestational age, \(X_4\); BMI, \(X_5\); maternal height, \(X_6\); and maternal age, \(X_7\)) we can create an adjusted model (i.e. the full model above), and a crude model with the potential confounder in question removed. We can then consider change in estimated coefficients for each smoking category and test if the change is greater than 10%:

\[\Delta \hat{\beta_i} = \frac{\hat{\beta_i}_{\; adjusted} - \hat{\beta_i}_{\; crude}}{\hat{\beta_i}_{\; crude}}\]

2.4.1 Gestational Age

The following code creates a ‘crude’ model to compare to by removing gestational age from the model:

lm.full_gestwks <- lm(bwt ~ SMK1 + SMK2 + SMK3 + BMI + mheight + age, data = CHDS)

To compute \(\Delta \hat{\beta_1}\):

100*(summary(lm.bwfull)$coefficients[2,1] - summary(lm.full_gestwks)$coefficients[2,1]) / ( summary(lm.full_gestwks)$coefficients[2,1])
## [1] -32.69897

To compute \(\Delta \hat{\beta_2}\):

100*(summary(lm.bwfull)$coefficients[3,1] - summary(lm.full_gestwks)$coefficients[3,1]) / ( summary(lm.full_gestwks)$coefficients[3,1])
## [1] -17.64811

To compute \(\Delta \hat{\beta_3}\):

100*(summary(lm.bwfull)$coefficients[4,1] - summary(lm.full_gestwks)$coefficients[4,1]) / ( summary(lm.full_gestwks)$coefficients[4,1])
## [1] -13.20764

The above analysis shows that removal of gestational age from the model changes \(\hat{\beta_i}\) (for i = 1, 2, 3) by greater than \(10\%\) in each case, and thus gestational age can be considered a confounder of the smoking-birthweight relation.

2.4.2 BMI

The following code creates a ‘crude’ model to compare to by removing BMI from the model:

lm.full_BMI <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + mheight + age, data = CHDS)

To compute \(\Delta \hat{\beta_1}\):

100*(summary(lm.bwfull)$coefficients[2,1] - summary(lm.full_BMI)$coefficients[2,1]) / ( summary(lm.full_BMI)$coefficients[2,1])
## [1] -23.73167

To compute \(\Delta \hat{\beta_2}\):

100*(summary(lm.bwfull)$coefficients[3,1] - summary(lm.full_BMI)$coefficients[3,1]) / ( summary(lm.full_BMI)$coefficients[3,1])
## [1] -6.019314

To compute \(\Delta \hat{\beta_3}\):

100*(summary(lm.bwfull)$coefficients[4,1] - summary(lm.full_BMI)$coefficients[4,1]) / ( summary(lm.full_BMI)$coefficients[4,1])
## [1] -4.644929

The above analysis shows that removal of BMI from the model changes \(\hat{\beta_i}\) (for i = 3 only) by greater than \(10\%\), and thus maternal BMI can be considered a confounder of the heavy-smoking-birthweight relation.

2.4.3 Maternal Height

The following code creates a ‘crude’ model to compare to by removing maternal height from the model:

lm.full_mheight <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + age, data = CHDS)

To compute \(\Delta \hat{\beta_1}\):

100*(summary(lm.bwfull)$coefficients[2,1] - summary(lm.full_mheight)$coefficients[2,1]) / ( summary(lm.full_mheight)$coefficients[2,1])
## [1] -16.21234

To compute \(\Delta \hat{\beta_2}\):

100*(summary(lm.bwfull)$coefficients[3,1] - summary(lm.full_mheight)$coefficients[3,1]) / ( summary(lm.full_mheight)$coefficients[3,1])
## [1] -9.131777

To compute \(\Delta \hat{\beta_3}\):

100*(summary(lm.bwfull)$coefficients[4,1] - summary(lm.full_mheight)$coefficients[4,1]) / ( summary(lm.full_mheight)$coefficients[4,1])
## [1] 5.077227

The above analysis shows that removal of maternal height from the model changes \(\hat{\beta_i}\) (for i = 1 only) by greater than \(10\%\), and thus maternal height can be considered a confounder of the light-smoking-birthweight relation.

2.4.4 Maternal Age

The following code creates a ‘crude’ model to compare to by removing maternal age from the model:

lm.full_age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight, data = CHDS)

To compute \(\Delta \hat{\beta_1}\):

100*(summary(lm.bwfull)$coefficients[2,1] - summary(lm.full_age)$coefficients[2,1]) / ( summary(lm.full_age)$coefficients[2,1])
## [1] 0.3031967

To compute \(\Delta \hat{\beta_2}\):

100*(summary(lm.bwfull)$coefficients[3,1] - summary(lm.full_age)$coefficients[3,1]) / ( summary(lm.full_age)$coefficients[3,1])
## [1] 0.06838896

To compute \(\Delta \hat{\beta_3}\):

100*(summary(lm.bwfull)$coefficients[4,1] - summary(lm.full_age)$coefficients[4,1]) / ( summary(lm.full_age)$coefficients[4,1])
## [1] -0.2120098

The above analysis shows that removal of maternal height from the model does not change \(\hat{\beta_i}\) for any i by greater than \(10\%\), and thus maternal height cannot be considered a confounder of the smoking-birthweight relation. It can be included in the full model as a political variable.

2.5 Full Model Diagnostics

2.5.1 Residual Analysis

The following R code extracts the fitted values, residuals, standardized residuals, and studentized residuals; and plots them so that our regression assumptions can be qualitatively assessed:

fitted <- lm.bwfull$fitted.values # extract fitted values
resm <- resid(lm.bwfull)          # extract model residuals
standm <- rstandard(lm.bwfull)    # extract standardized residuals
studm <- rstudent(lm.bwfull)      # extract studentized residuals

#par(mfrow = c(3,1))                   # arranges plots
# residual plot:
#plot(fitted, resm, main = "Plot of Model Residuals as a Function of Fitted Values for Birth Weight", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Residual (lbs)")
#abline(0,0, col = "red")
# standardized residual plot:
plot(fitted, standm, main = "Plot of Standardized Model Residuals as a Function of Fitted Values for Birth Weight", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Standardized Residual")
abline(0,0, col = "red")

# studentized residual plot:
plot(fitted, studm, main = "Plot of Studentized Model Residuals as a Function of Fitted Values for Birth Weight", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Studentized Residual")
abline(0,0, col = "red")

The above plots do not show any overt change in variance with increasing \(\hat{Y}\); suggesting that the homoscedasticity assumption is met. Furthermore the residuals do not seem to be asymmetrically distributed, skewed, or multi-modal in any way; suggesting that the normality assumption is likely met. The reference cell coding approach for creating dummy variables for smoking category ensures that the linearity assumption is met.

2.5.2 Normality Analysis

Due to the large sample size the Shapiro-Wilk test will not provide helpful insights and thus will not be performed. The following R code creates a QQ Plot:

qqnorm(resm, ylab = "Residuals", main = "Q-Q Plot of Residuals")
qqline(resm)

qqnorm(standm, ylab = "Standardized Residuals", main = "Q-Q Plot of Standardized Residuals")
qqline(standm)

qqnorm(studm, ylab = "Studentized Residuals", main = "Q-Q Plot of Studentized Residuals")
qqline(studm)

Q-Q plots suggest that the normality assumption is met. The handfull of points at the very edges of the plot may or may not be outliers, which brings us to our next section:

2.5.3 Influential Outlier Detection

The following code adds leverages & Cook’s distance to the dataset:

k <- 8                                  # number of predictors in association model
h_limit <- 2*(1+k)/nrow(CHDS)           # recommended leverage threshold
CHDS$h <- hatvalues(lm.bwfull)          # Add leverages to the dataset
CHDS$c <- cooks.distance(lm.bwfull)     # add Cook's distance to the dataset

The following code checks if any data points have Cook’s distance, \(c\), greater than 1 or if the leverage, \(h\), is greater than the recommended threshold of \(2(k+1)/n\).

CHDS[CHDS$c>1,]
##  [1] X       bwt     gestwks age     mnocig  mheight mppwt   BMI     BMI_cat
## [10] SMK_cat SMK1    SMK2    SMK3    h       c      
## <0 rows> (or 0-length row.names)
CHDS[CHDS$h>h_limit,]
##       X  bwt gestwks age mnocig mheight mppwt      BMI BMI_cat SMK_cat SMK1
## 9     9  3.3      29  32      0      64   142 24.37158       1       0    0
## 60   60  6.8      48  25      0      66   134 21.62580       1       0    0
## 84   84  6.5      34  38     35      61   108 20.40419       1       3    0
## 111 111  5.3      35  41      7      65   125 20.79882       1       1    1
## 166 166  5.3      33  20      7      63   109 19.30637       1       1    1
## 167 167  8.6      37  25     35      66   246 39.70110       4       3    0
## 226 226  8.8      41  36     12      71   173 24.12597       1       2    0
## 273 273  6.1      44  34     12      59   100 20.19535       1       2    0
## 319 319  7.3      37  40      7      68   134 20.37240       1       1    1
## 356 356  8.1      44  30      2      58   150 31.34661       3       1    1
## 400 400 10.0      38  32      0      67   215 33.67008       3       0    0
## 404 404  7.6      39  26      0      66   210 33.89118       3       0    0
## 445 445  7.3      46  22      2      62   164 29.99272       2       1    1
## 505 505  3.9      33  24      0      58    99 20.68876       1       0    0
## 506 506  4.5      38  25     25      66   200 32.27732       3       3    0
## 610 610  9.1      40  20      0      65   210 34.94201       3       0    0
## 655 655  5.4      38  38     17      63   144 25.50567       2       2    0
##     SMK2 SMK3          h            c
## 9      0    0 0.05645318 0.0342629837
## 60     0    0 0.03050297 0.0388201182
## 84     0    1 0.03206170 0.0042280228
## 111    0    0 0.03066412 0.0051370087
## 166    0    0 0.02956373 0.0008499634
## 167    0    1 0.08864574 0.0155859567
## 226    1    0 0.03546579 0.0031847055
## 273    1    0 0.03480005 0.0114129996
## 319    0    0 0.02666877 0.0002503806
## 356    0    0 0.05108750 0.0006273953
## 400    0    0 0.03867176 0.0235010285
## 404    0    0 0.03706427 0.0019194355
## 445    0    0 0.04871843 0.0255731624
## 505    0    0 0.03261216 0.0134463740
## 506    0    1 0.03712815 0.0500263277
## 610    0    0 0.04532885 0.0034138906
## 655    1    0 0.02684575 0.0071506225

Upon examination of the data reported in the second table, none of the data is implausible, with the exception of a 48 week gestational age. Of the women in a study of ultrasound-based determination of gestational age (n=1867), none had gestation age greater than 46 weeks. The study further suggests that a 48-week gestational age is well above the 0.1th percentile of gestational age estimates based on date of last period.1

We can create a new dataset with this observation (observation 60) removed:

CHDS_60 <- CHDS[-c(60),]

And a new model can be created:

lm.bwfull_60 <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + age, data = CHDS_60)
summary(lm.bwfull_60)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age, data = CHDS_60)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2771 -0.6106  0.0142  0.5561  3.2634 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.100887   1.256253  -6.448 2.16e-10 ***
## SMK1        -0.180649   0.105766  -1.708   0.0881 .  
## SMK2        -0.550760   0.126018  -4.370 1.44e-05 ***
## SMK3        -0.410902   0.095487  -4.303 1.93e-05 ***
## gestwks      0.241351   0.019546  12.348  < 2e-16 ***
## BMI          0.044959   0.014016   3.208   0.0014 ** 
## mheight      0.082446   0.014603   5.646 2.43e-08 ***
## age         -0.003674   0.006626  -0.554   0.5795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9358 on 671 degrees of freedom
## Multiple R-squared:  0.2743, Adjusted R-squared:  0.2667 
## F-statistic: 36.23 on 7 and 671 DF,  p-value: < 2.2e-16

Upon comparison to the previous full model we can see that most \(\hat{\beta}_i\) have very similar values, with very slight changes in \(\hat{\beta}_0\) & \(\hat{\beta}_4\).

2.5.4 Collinearity Assessment

The variance inflation factor approach will be used to assess collinearity. The following R code calculates the factor for each \(\hat{\beta}_i\):

vif(lm.bwfull_60)
##     SMK1     SMK2     SMK3  gestwks      BMI  mheight      age 
## 1.116177 1.080577 1.094439 1.012350 1.051675 1.018968 1.016212
mean(vif(lm.bwfull_60))
## [1] 1.055771

All factors are well below 10; indicating that collinearity is likely minimal.

2.6 Statistical Inference

2.6.1 Regression line:

If we retain the definition of the \(X_i\) as above, we can write the regression line as:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \epsilon\]

2.6.2 Confidence Intervals

For the crude model:

confint(lm.bwcrude, level = 0.95)
##                  2.5 %     97.5 %
## (Intercept)  7.6259530  7.8396639
## SMK1        -0.5838591 -0.1205927
## SMK2        -1.0381668 -0.4819955
## SMK3        -0.6785080 -0.2548011

For the adjusted model:

confint(lm.bwfull_60, level = 0.95)
##                    2.5 %       97.5 %
## (Intercept) -10.56754782 -5.634226121
## SMK1         -0.38832099  0.027022082
## SMK2         -0.79819783 -0.303322438
## SMK3         -0.59839171 -0.223411760
## gestwks       0.20297239  0.279729815
## BMI           0.01743780  0.072479511
## mheight       0.05377405  0.111118368
## age          -0.01668479  0.009337101

The confidence intervals are slightly tighter in the crude model.

2.6.3 P-values

For the crude model:

summary(lm.bwcrude)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3, data = CHDS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4328 -0.7328 -0.0328  0.7214  3.6672 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.73281    0.05442 142.091  < 2e-16 ***
## SMK1        -0.35223    0.11797  -2.986  0.00293 ** 
## SMK2        -0.76008    0.14163  -5.367 1.10e-07 ***
## SMK3        -0.46665    0.10790  -4.325 1.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.062 on 676 degrees of freedom
## Multiple R-squared:  0.0585, Adjusted R-squared:  0.05432 
## F-statistic:    14 on 3 and 676 DF,  p-value: 7.294e-09

For the adjusted model:

summary(lm.bwfull_60)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age, data = CHDS_60)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2771 -0.6106  0.0142  0.5561  3.2634 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.100887   1.256253  -6.448 2.16e-10 ***
## SMK1        -0.180649   0.105766  -1.708   0.0881 .  
## SMK2        -0.550760   0.126018  -4.370 1.44e-05 ***
## SMK3        -0.410902   0.095487  -4.303 1.93e-05 ***
## gestwks      0.241351   0.019546  12.348  < 2e-16 ***
## BMI          0.044959   0.014016   3.208   0.0014 ** 
## mheight      0.082446   0.014603   5.646 2.43e-08 ***
## age         -0.003674   0.006626  -0.554   0.5795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9358 on 671 degrees of freedom
## Multiple R-squared:  0.2743, Adjusted R-squared:  0.2667 
## F-statistic: 36.23 on 7 and 671 DF,  p-value: < 2.2e-16

The p-values are generally better in the crude model.

3 Prediction Model Building by Backward-Elimination

To create a parsimonious model we will utilize the backward-elimination procedure starting from the full model above.

3.1 First Iteration

The summary function reports the coefficients and associated p-values (right column of table) from the full model:

summary(lm.bwfull_60)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age, data = CHDS_60)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2771 -0.6106  0.0142  0.5561  3.2634 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.100887   1.256253  -6.448 2.16e-10 ***
## SMK1        -0.180649   0.105766  -1.708   0.0881 .  
## SMK2        -0.550760   0.126018  -4.370 1.44e-05 ***
## SMK3        -0.410902   0.095487  -4.303 1.93e-05 ***
## gestwks      0.241351   0.019546  12.348  < 2e-16 ***
## BMI          0.044959   0.014016   3.208   0.0014 ** 
## mheight      0.082446   0.014603   5.646 2.43e-08 ***
## age         -0.003674   0.006626  -0.554   0.5795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9358 on 671 degrees of freedom
## Multiple R-squared:  0.2743, Adjusted R-squared:  0.2667 
## F-statistic: 36.23 on 7 and 671 DF,  p-value: < 2.2e-16

We can see that maternal age has the highest p-value (0.5795), and that the p-value is well above \(\alpha = 0.10\); thus it will be removed. We can begin the second iteration with the full model excluding maternal age:

3.2 Second Iteration

The following R code takes the full model and creates a reduced model by removing gestational age. The summary function then reports the coefficients and associated p-values:

lm.bwfull_60_age <- lm(bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight, data = CHDS_60)
summary(lm.bwfull_60_age)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight, 
##     data = CHDS_60)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2487 -0.6078  0.0073  0.5562  3.2309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.16421    1.25041  -6.529 1.30e-10 ***
## SMK1        -0.18009    0.10571  -1.704  0.08890 .  
## SMK2        -0.55037    0.12595  -4.370 1.44e-05 ***
## SMK3        -0.41179    0.09542  -4.315 1.83e-05 ***
## gestwks      0.24136    0.01954  12.355  < 2e-16 ***
## BMI          0.04403    0.01391   3.166  0.00162 ** 
## mheight      0.08226    0.01459   5.638 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9353 on 672 degrees of freedom
## Multiple R-squared:  0.274,  Adjusted R-squared:  0.2675 
## F-statistic: 42.26 on 6 and 672 DF,  p-value: < 2.2e-16

We can see that all p-values are above our threshold of \(\alpha = 0.10\); thus no further variables will be removed. We can construct the parsimonious model:

3.3 Variable Selection

If we denote gestational age as \(X_4\), BMI as \(X_5\), & maternal height as \(X_6\) we can state the parsimonious model as:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \epsilon\]

The following R code reports summary statistics and ANOVA table for this model:

summary(lm.bwfull_60_age)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight, 
##     data = CHDS_60)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2487 -0.6078  0.0073  0.5562  3.2309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.16421    1.25041  -6.529 1.30e-10 ***
## SMK1        -0.18009    0.10571  -1.704  0.08890 .  
## SMK2        -0.55037    0.12595  -4.370 1.44e-05 ***
## SMK3        -0.41179    0.09542  -4.315 1.83e-05 ***
## gestwks      0.24136    0.01954  12.355  < 2e-16 ***
## BMI          0.04403    0.01391   3.166  0.00162 ** 
## mheight      0.08226    0.01459   5.638 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9353 on 672 degrees of freedom
## Multiple R-squared:  0.274,  Adjusted R-squared:  0.2675 
## F-statistic: 42.26 on 6 and 672 DF,  p-value: < 2.2e-16
anova(lm.bwfull_60_age)
## Analysis of Variance Table
## 
## Response: bwt
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## SMK1        1   2.28   2.277   2.6030   0.10713    
## SMK2        1  24.16  24.158  27.6159 1.989e-07 ***
## SMK3        1  21.32  21.316  24.3672 1.005e-06 ***
## gestwks     1 139.65 139.650 159.6407 < 2.2e-16 ***
## BMI         1   6.63   6.626   7.5744   0.00608 ** 
## mheight     1  27.80  27.803  31.7825 2.540e-08 ***
## Residuals 672 587.85   0.875                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4 Model Diagnostics

3.4.1 Residual Analysis

The following R code extracts the fitted values, residuals, standardized residuals, and studentized residuals; and plots them so that our regression assumptions can be qualitatively assessed:

fitted <- lm.bwfull_60_age$fitted.values # extract fitted values
resm <- resid(lm.bwfull_60_age)          # extract model residuals
standm <- rstandard(lm.bwfull_60_age)    # extract standardized residuals
studm <- rstudent(lm.bwfull_60_age)      # extract studentized residuals

#par(mfrow = c(3,1))                   # arranges plots
# residual plot:
#plot(fitted, resm, main = "Plot of Model Residuals as a Function of Fitted Values for Birth Weight", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Residual (lbs)")
#abline(0,0, col = "red")
# standardized residual plot:
plot(fitted, standm, main = "Plot of Standardized Model Residuals as a Function of Fitted Values for Birth Weight - Parsimonious Model", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Standardized Residual")
abline(0,0, col = "red")

# studentized residual plot:
plot(fitted, studm, main = "Plot of Studentized Model Residuals as a Function of Fitted Values for Birth Weight - Parsimonious Model", xlab = "Fitted Birth Weight Value (lbs)", ylab = "Studentized Residual")
abline(0,0, col = "red")

Similarly to the residual analysis for the association model, the above plots do not show any overt change in variance with increasing \(\hat{Y}\); suggesting that the homoscedasticity assumption is met. Furthermore the residuals do not seem to be asymmetrically distributed, skewed, or multi-modal in any way; suggesting that the normality assumption is likely met. The reference cell coding approach for creating dummy variables for smoking category ensures that the linearity assumption is met.

3.4.2 Normality Analysis

Due to the large sample size the Shapiro-Wilk test will not provide helpful insights and thus will not be performed. The following R code creates a QQ Plot:

qqnorm(resm, ylab = "Residuals", main = "Q-Q Plot of Residuals - Parsimonious Model")
qqline(resm)

qqnorm(standm, ylab = "Standardized Residuals", main = "Q-Q Plot of Standardized Residuals - Parsimonious Model")
qqline(standm)

qqnorm(studm, ylab = "Studentized Residuals", main = "Q-Q Plot of Studentized Residuals - Parsimonious Model")
qqline(studm)

Similarly to the normality analysis for the association model, the Q-Q plots suggest that the normality assumption is met. The handfull of points at the very edges of the plot may or may not be outliers, which brings us to our next section:

3.4.3 Influential Outlier Detection

The following code adds leverages & Cook’s distance to the dataset:

k <- 7                                       # number of predictors in prediction model
h_limit <- 2*(1+k)/nrow(CHDS_60)             # recommended leverage threshold
CHDS_60$h <- hatvalues(lm.bwfull_60_age)     # Add leverages to the dataset
CHDS_60$c <- cooks.distance(lm.bwfull_60_age)# add Cook's distance to the dataset

The following code checks if any data points have Cook’s distance, \(c\), greater than 1 or if the leverage, \(h\), is greater than the recommended threshold of \(2(k+1)/n\).

CHDS[CHDS_60$c>1,]
##  [1] X       bwt     gestwks age     mnocig  mheight mppwt   BMI     BMI_cat
## [10] SMK_cat SMK1    SMK2    SMK3    h       c      
## <0 rows> (or 0-length row.names)
CHDS[CHDS_60$h>h_limit,]
##       X bwt gestwks age mnocig mheight mppwt      BMI BMI_cat SMK_cat SMK1 SMK2
## 9     9 3.3      29  32      0      64   142 24.37158       1       0    0    0
## 83   83 7.5      43  26      0      70   150 21.52041       1       0    0    0
## 98   98 8.1      40  28      0      70   125 17.93367       0       0    0    0
## 165 165 6.4      39  19      7      64   118 20.25244       1       1    1    0
## 166 166 5.3      33  20      7      63   109 19.30637       1       1    1    0
## 225 225 7.0      40  36      0      59   140 28.27348       2       0    0    0
## 272 272 6.4      37  21     25      64   124 21.28223       1       3    0    0
## 290 290 6.6      43  25      0      67   133 20.82847       1       0    0    0
## 334 334 6.4      41  28      0      62   132 24.14048       1       0    0    0
## 355 355 5.6      40  32      7      63   105 18.59788       1       1    1    0
## 399 399 6.9      39  18      2      61   108 20.40419       1       1    1    0
## 403 403 7.0      40  26      0      66   140 22.59412       1       0    0    0
## 444 444 8.5      39  17      0      64   118 20.25244       1       0    0    0
## 504 504 7.1      40  20      0      61   104 19.64848       1       0    0    0
## 505 505 3.9      33  24      0      58    99 20.68876       1       0    0    0
## 609 609 8.5      42  23      0      66   180 29.04959       2       0    0    0
## 619 619 7.1      41  21     25      65   120 19.96686       1       3    0    0
## 631 631 7.7      40  27      0      64   118 20.25244       1       0    0    0
## 666 666 8.0      39  20      0      66   125 20.17332       1       0    0    0
##     SMK3           h            c
## 9      0 0.056453181 3.426298e-02
## 83     0 0.013481686 3.699793e-03
## 98     0 0.012951382 1.361513e-05
## 165    0 0.011889995 1.248300e-03
## 166    0 0.029563732 8.499634e-04
## 225    0 0.021790121 9.858406e-04
## 272    1 0.012073188 7.081683e-05
## 290    0 0.008118873 4.669435e-03
## 334    0 0.005796544 1.790811e-03
## 355    0 0.013631991 5.582888e-03
## 399    0 0.014855471 3.022112e-05
## 403    0 0.003346639 3.889625e-04
## 444    0 0.007231896 1.167893e-03
## 504    0 0.008278008 9.644429e-05
## 505    0 0.032612159 1.344637e-02
## 609    0 0.017839103 7.476289e-05
## 619    1 0.010008126 2.892465e-04
## 631    0 0.003367634 2.158779e-06
## 666    0 0.005644973 1.377047e-04

Upon examination of the data reported in the second table, none of the data is implausible.

3.4.4 Collinearity Assessment

The variance inflation factor approach will be used to assess collinearity. The following R code calculates the factor for each \(\hat{\beta}_i\):

vif(lm.bwfull_60_age)
##     SMK1     SMK2     SMK3  gestwks      BMI  mheight 
## 1.116076 1.080543 1.094130 1.012350 1.036611 1.018422
mean(vif(lm.bwfull_60_age))
## [1] 1.059689

All factors are well below 10; indicating that collinearity is likely minimal.

3.5 Statistical Inference

3.5.1 Regression line:

If we retain the definition of the \(X_i\) as above, we can write the regression line as:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \epsilon\]

3.5.2 Confidence Intervals

For the adjusted association model:

confint(lm.bwfull_60, level = 0.95)
##                    2.5 %       97.5 %
## (Intercept) -10.56754782 -5.634226121
## SMK1         -0.38832099  0.027022082
## SMK2         -0.79819783 -0.303322438
## SMK3         -0.59839171 -0.223411760
## gestwks       0.20297239  0.279729815
## BMI           0.01743780  0.072479511
## mheight       0.05377405  0.111118368
## age          -0.01668479  0.009337101

For the prediction model:

confint(lm.bwfull_60_age, level = 0.95)
##                    2.5 %      97.5 %
## (Intercept) -10.61938216 -5.70903683
## SMK1         -0.38764413  0.02746481
## SMK2         -0.79767379 -0.30306260
## SMK3         -0.59915690 -0.22442405
## gestwks       0.20300269  0.27972030
## BMI           0.01671973  0.07133750
## mheight       0.05360913  0.11090837

The confidence intervals are quite similar between both models.

3.5.3 P-values

For the final association model:

summary(lm.bwfull_60)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight + 
##     age, data = CHDS_60)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2771 -0.6106  0.0142  0.5561  3.2634 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.100887   1.256253  -6.448 2.16e-10 ***
## SMK1        -0.180649   0.105766  -1.708   0.0881 .  
## SMK2        -0.550760   0.126018  -4.370 1.44e-05 ***
## SMK3        -0.410902   0.095487  -4.303 1.93e-05 ***
## gestwks      0.241351   0.019546  12.348  < 2e-16 ***
## BMI          0.044959   0.014016   3.208   0.0014 ** 
## mheight      0.082446   0.014603   5.646 2.43e-08 ***
## age         -0.003674   0.006626  -0.554   0.5795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9358 on 671 degrees of freedom
## Multiple R-squared:  0.2743, Adjusted R-squared:  0.2667 
## F-statistic: 36.23 on 7 and 671 DF,  p-value: < 2.2e-16

For the prediction model:

summary(lm.bwfull_60_age)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight, 
##     data = CHDS_60)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2487 -0.6078  0.0073  0.5562  3.2309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.16421    1.25041  -6.529 1.30e-10 ***
## SMK1        -0.18009    0.10571  -1.704  0.08890 .  
## SMK2        -0.55037    0.12595  -4.370 1.44e-05 ***
## SMK3        -0.41179    0.09542  -4.315 1.83e-05 ***
## gestwks      0.24136    0.01954  12.355  < 2e-16 ***
## BMI          0.04403    0.01391   3.166  0.00162 ** 
## mheight      0.08226    0.01459   5.638 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9353 on 672 degrees of freedom
## Multiple R-squared:  0.274,  Adjusted R-squared:  0.2675 
## F-statistic: 42.26 on 6 and 672 DF,  p-value: < 2.2e-16

The p-values are generally similar between both models.

4 Mediation analysis

Based on our preliminary research, the possible relations between the variables in the dataset are depicted below:

DAG of possible mediators by BMI or gestational age

If BMI truly does influence gestational age then analysis for “intertwined” mediation would be required (e.g. propensity matching). To see if this is required, we can test \(H_0: \beta_1' = 0\), at significance \(\alpha = 0.10\), where:

\[gestational \; age = \beta_0' + \beta_1' \times BMI + \beta_2' \times light \; smoking + \beta_3' \times moderate \; smoking + \beta_4' \times heavy \; smoking + \epsilon\]

The following R code creates a crude linear model and performs the hypothesis test:

lm.ga <- lm(gestwks ~ BMI + SMK1 + SMK2 + SMK3, data = CHDS_60)
summary(lm.ga)
## 
## Call:
## lm(formula = gestwks ~ BMI + SMK1 + SMK2 + SMK3, data = CHDS_60)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9440  -0.9178   0.0979   1.0855   6.3526 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.65007    0.60440  65.602   <2e-16 ***
## BMI          0.01206    0.02736   0.441   0.6595    
## SMK1        -0.36435    0.20776  -1.754   0.0799 .  
## SMK2        -0.51005    0.24699  -2.065   0.0393 *  
## SMK3        -0.23840    0.18794  -1.269   0.2051    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.846 on 674 degrees of freedom
## Multiple R-squared:  0.01067,    Adjusted R-squared:  0.004795 
## F-statistic: 1.817 on 4 and 674 DF,  p-value: 0.1238

The p-value is well above our pre-specified significance level, \(\alpha = 0.10\); therefore the influence of BMI on gestational age can be ignored for the purposes of this mediation analysis. Thus we can perform two separate mediation analyses for each proposed mediator. The following DAGs depict the hypothesized relations:

Mediation by BMI:

DAG of possible mediaton by BMI

Mediation by Gestational Age:

DAG of possible mediaton by gestational age

4.1 BMI

4.1.1 Preliminary Work

To show that mediation of the smoking-gestational weight relation by BMI is plausible we can test if there is a linear relation between smoking and BMI, and between BMI and gestational age. For smoking and BMI:

lm.BMI <- lm(BMI ~ SMK1 + SMK2 + SMK3 + age, data = CHDS_60)
summary(lm.BMI)
## 
## Call:
## lm(formula = BMI ~ SMK1 + SMK2 + SMK3 + age, data = CHDS_60)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.917 -1.617 -0.351  1.069 18.404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.36065    0.48860  41.672  < 2e-16 ***
## SMK1        -1.20267    0.28670  -4.195  3.1e-05 ***
## SMK2        -0.74168    0.34406  -2.156  0.03146 *  
## SMK3        -0.46607    0.26212  -1.778  0.07584 .  
## age          0.05609    0.01813   3.093  0.00206 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.58 on 674 degrees of freedom
## Multiple R-squared:  0.0432, Adjusted R-squared:  0.03753 
## F-statistic: 7.609 on 4 and 674 DF,  p-value: 5.347e-06

All smoking categories were found to have a significant linear relation with BMI a pre-specified significance, \(\alpha = 0.10\). We can test BMI and birth weight:

summary(lm.bwfull_60_age)
## 
## Call:
## lm(formula = bwt ~ SMK1 + SMK2 + SMK3 + gestwks + BMI + mheight, 
##     data = CHDS_60)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2487 -0.6078  0.0073  0.5562  3.2309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.16421    1.25041  -6.529 1.30e-10 ***
## SMK1        -0.18009    0.10571  -1.704  0.08890 .  
## SMK2        -0.55037    0.12595  -4.370 1.44e-05 ***
## SMK3        -0.41179    0.09542  -4.315 1.83e-05 ***
## gestwks      0.24136    0.01954  12.355  < 2e-16 ***
## BMI          0.04403    0.01391   3.166  0.00162 ** 
## mheight      0.08226    0.01459   5.638 2.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9353 on 672 degrees of freedom
## Multiple R-squared:  0.274,  Adjusted R-squared:  0.2675 
## F-statistic: 42.26 on 6 and 672 DF,  p-value: < 2.2e-16

The p-value for the linear relation of BMI and gestational weight meets our pre-specified significance, \(\alpha = 0.10\).

4.1.2 2

set.seed(Sys.time())

4.2 Gestational Age

5 References

  1. Hoffman CS, Messer LC, Mendola P, Savitz DA, Herring AH, Hartmann KE. Comparison of gestational age at birth based on last menstrual period and ultrasound during the first trimester. Paediatr Perinat Epidemiol. 2008 Nov;22(6):587-96. doi: 10.1111/j.1365-3016.2008.00965.x. PMID: 19000297.