Bayesian Logistic Regression for Diabetes Risk (NHANES 2013–2014)

Capstone Report

Authors

Namita Mishra

Autumn Wilcox

Published

September 28, 2025

Slides: slides.html

Introduction

Diabetes mellitus (DM) is a major public health challenge, and identifying key risk factors—such as obesity, age, race, and gender—is essential for prevention and targeted intervention. Logistic regression is widely used to estimate associations between such risk factors and binary outcomes, including the presence or absence of diabetes. However, classical maximum likelihood estimation (MLE) can yield unstable results in small samples or in the presence of missing data, quasi-separation, or complete separation. Moreover, healthcare data (e.g., DNA sequences, imaging, patient-reported outcomes, electronic health records, and longitudinal health measurements) are often complex, making standard analytical approaches insufficient (Zeger et al. 2020).
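The instability under complete separation can be seen on a very small example. The sketch below uses made-up data (not NHANES) in which a single predictor splits the outcome perfectly; the variable names are illustrative assumptions.

```r
# Toy data (made up for illustration): y is 0 whenever x < 0 and 1 otherwise,
# so x separates the two outcome classes completely.
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)

# glm() warns when fitted probabilities reach 0 or 1; the warnings are
# captured here instead of being printed.
fit_warnings <- character(0)
fit_sep <- withCallingHandlers(
  glm(y ~ x, family = binomial()),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# The slope has no finite maximum likelihood estimate, so the reported
# coefficient and its standard error are very large.
coef(summary(fit_sep))
fit_warnings
```

The fit still returns numbers, which is why separation is easy to miss: the symptom is an implausibly large coefficient with an even larger standard error.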

Bayesian hierarchical models, implemented via Markov Chain Monte Carlo (MCMC), provide a framework that integrates prior knowledge and accounts for hierarchical data structures. These models have been successfully applied in predicting patient health status across diseases such as pneumonia, prostate cancer, and mental disorders (Zeger et al. 2020). Compared with frequentist approaches, Bayesian inference quantifies uncertainty directly through the posterior distribution and accommodates complex covariate structures, although parametric Bayesian models remain limited by their distributional assumptions.

Recent work has extended Bayesian methods to disease diagnostics. For example, Bayesian inference has been used to evaluate diagnostic test data from the National Health and Nutrition Examination Survey (NHANES), comparing parametric and nonparametric approaches. These methods provide posterior probabilities that improve disease classification, especially where conventional dichotomous thresholds fail to capture heterogeneity in populations (Chatzimichail and Hatjimihail 2023). Similarly, Bayesian clinical reasoning models have been applied to cardiovascular risk prediction, emulating clinician decision-making by incorporating demographic, metabolic, and conventional risk factors (Liu et al. 2013).

Bayesian regression also addresses methodological challenges such as missing data. Multiple imputation combined with Bayesian modeling has been applied in clinical research settings (Austin et al. 2021). The validity of the resulting estimates depends on the missingness mechanism: standard multiple imputation assumes data are missing completely at random (MCAR) or missing at random (MAR), whereas data missing not at random (MNAR) typically call for additional sensitivity analyses. These applications highlight the versatility of Bayesian approaches in healthcare, particularly when traditional models are undermined by data limitations.
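Estimates from multiply imputed datasets are commonly combined with Rubin's rules. The sketch below shows the pooling arithmetic only; the five estimates and variances are made-up numbers chosen for illustration, not results from this report.

```r
# Made-up log-odds-ratio estimates and squared standard errors from m = 5
# imputed datasets (illustrative values only).
est <- c(0.42, 0.38, 0.45, 0.40, 0.35)
se2 <- c(0.010, 0.012, 0.011, 0.009, 0.013)
m <- length(est)

pooled_est  <- mean(est)  # pooled point estimate
within_var  <- mean(se2)  # average within-imputation variance
between_var <- var(est)   # between-imputation variance
total_var   <- within_var + (1 + 1 / m) * between_var

c(estimate = pooled_est, se = sqrt(total_var))
```

The between-imputation term is what carries the extra uncertainty due to the missing values; analyzing a single imputed dataset would omit it.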

Data & Preparation

Source: NHANES (CDC) 2013–2014.
Files: BMX_H (body measures), DEMO_H (demographics), DIQ_H (diabetes questionnaire).
Variables:
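- Predictors/Covariates: BMDBMIC (BMI category; NHANES reports it for children and adolescents aged 2–19, so it is NA for adults), RIDAGEYR (age in years), RIAGENDR (sex), RIDRETH1 (race/ethnicity).
- Survey design: WTMEC2YR (examination weight), SDMVPSU (primary sampling unit), SDMVSTRA (stratum).
- Provisional outcome candidate: DIQ240 (whether the respondent has a usual doctor for diabetes; a diabetes-related marker, not a diagnosis).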
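The three files share the respondent identifier SEQN, so they can be combined with a keyed join. The sketch below is not the repository's R/data_prep.R (which is not shown here); it uses tiny made-up stand-ins for the three tables, and the choice of a full join is an assumption.

```r
# Made-up stand-ins for the three NHANES tables; only SEQN and a few
# columns are included, and the values are illustrative, not real records.
demo <- data.frame(SEQN = c(1, 2, 3), RIDAGEYR = c(69, 9, 54))
bmx  <- data.frame(SEQN = c(1, 2, 3), BMDBMIC = c(NA, 2, NA))
diq  <- data.frame(SEQN = c(1, 3), DIQ240 = c(1, 2))

# Join on the respondent identifier; all = TRUE keeps respondents that are
# absent from one of the files (an assumption about the desired join type).
merged_toy <- Reduce(
  function(a, b) merge(a, b, by = "SEQN", all = TRUE),
  list(demo, bmx, diq)
)

merged_toy
```

Respondents missing from one file end up with NA in that file's columns, which is one way NA values like those in the preview table can arise.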

Code

# Load packages for this report
library(tidyverse)
library(knitr)

# Build merged dataset if missing (uses R/data_prep.R from the repo)
if (!file.exists("data/merged_2013_2014.rds")) {
  source("R/data_prep.R")
}

# Load merged NHANES data created by R/data_prep.R
merged_data <- readRDS("data/merged_2013_2014.rds")

# Quick peek
knitr::kable(head(merged_data))

SEQN	BMDBMIC	RIDAGEYR	RIAGENDR	RIDRETH1	SDMVPSU	SDMVSTRA	WTMEC2YR	DIQ240
73557	NA	69	Male	Non-Hispanic Black	1	112	13481.04	Yes
73558	NA	54	Male	Non-Hispanic White	1	108	24471.77	Yes
73559	NA	72	Male	Non-Hispanic White	1	109	57193.29	No
73560	Normal weight	9	Male	Non-Hispanic White	2	109	55766.51	NA
73561	NA	73	Female	Non-Hispanic White	2	116	65541.87	NA
73562	NA	56	Male	Mexican American	1	111	25344.99	NA

Basic Exploration

Code

# Safe tabulations (keep NA visible)
table(merged_data$BMDBMIC, useNA = "ifany")


  Underweight Normal weight    Overweight         Obese          <NA> 
          132          2167           595           629          6290

Code

table(merged_data$DIQ240, useNA = "ifany")


 Yes   No <NA> 
 553  169 9091
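The two tabulations above can be turned into missingness percentages with a little arithmetic. The sketch below re-enters the printed counts by hand rather than reading the data file; the object names are illustrative.

```r
# Counts copied from the two tables above.
bmi_counts <- c(Underweight = 132, `Normal weight` = 2167,
                Overweight = 595, Obese = 629, Missing = 6290)
diq_counts <- c(Yes = 553, No = 169, Missing = 9091)

# Both tables should describe the same number of rows.
stopifnot(sum(bmi_counts) == sum(diq_counts))

# Share of rows with no recorded value, as a percentage.
pct_missing <- c(
  BMDBMIC = 100 * bmi_counts[["Missing"]] / sum(bmi_counts),
  DIQ240  = 100 * diq_counts[["Missing"]] / sum(diq_counts)
)
round(pct_missing, 1)
```

Both tables sum to 9,813 rows, with roughly 64% of BMDBMIC values and 93% of DIQ240 values missing, so a complete-case analysis on these two variables would keep only a small fraction of the data.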

Code

# Age distribution
ggplot(merged_data, aes(x = RIDAGEYR)) +
  geom_histogram(binwidth = 5, boundary = 0, closed = "left") +
  labs(title = "Age distribution (NHANES 2013–2014)",
       x = "Age (years)", y = "Count") +
  theme_minimal()

Code

# BMI category counts (codes as-is)
merged_data %>%
  mutate(BMDBMIC = factor(BMDBMIC, exclude = NULL)) %>%
  count(BMDBMIC) %>%
  ggplot(aes(x = BMDBMIC, y = n)) +
  geom_col() +
  labs(title = "Counts by BMDBMIC (BMI category code)",
       x = "BMDBMIC (code; NA common for adults)", y = "Count") +
  theme_minimal()

Survey Design

Code

# Survey design setup (weights/strata/PSU)
library(survey)
nhanes_design <- svydesign(
  id = ~SDMVPSU,
  strata = ~SDMVSTRA,
  weights = ~WTMEC2YR,
  nest = TRUE,
  data = merged_data
)

# Example: weighted mean age
svymean(~RIDAGEYR, nhanes_design, na.rm = TRUE)

           mean     SE
RIDAGEYR 37.504 0.4412
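The point estimate above is a weight-normalized mean, in which each respondent counts in proportion to the sampling weight. The sketch below reproduces that calculation on four made-up respondents. It does not reproduce the standard error, which also depends on the strata and PSUs.

```r
# Made-up ages and sampling weights for four respondents (illustrative only).
age    <- c(9, 54, 69, 72)
weight <- c(55000, 24000, 13000, 57000)

# Design-weighted mean: each age counts in proportion to its weight.
weighted_mean_age <- sum(weight * age) / sum(weight)

# Base R's weighted.mean() computes the same quantity.
stopifnot(isTRUE(all.equal(weighted_mean_age, weighted.mean(age, weight))))

c(unweighted = mean(age), weighted = weighted_mean_age)
```

The weighted and unweighted means differ whenever the weights are correlated with the variable, which is why the design object is used even for simple descriptive summaries.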

Methods

Baseline: Frequentist logistic regression (MLE).
Main: Bayesian logistic regression with weakly informative priors, which stabilize coefficient estimates and yield credible intervals that reflect the information actually available.
Missingness: Prefer multiple imputation (or Bayesian models with missing data mechanisms) over listwise deletion, which would discard most of the 9,813 merged records and could leave sparse cells that produce separation.
Sensitivity: We will run prior sensitivity checks.
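The stabilizing effect of a weakly informative prior can be illustrated without MCMC by finding the posterior mode directly. The sketch below is a simplification, not the sampling that rstanarm performs: it uses made-up, completely separated data and assumes independent normal priors with scales 5 (intercept) and 2.5 (slope), matching the scales in this report's modeling skeleton.

```r
# Toy, completely separated data (made up): the MLE slope is not finite.
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)

# Negative log posterior for logistic regression with independent normal
# priors: intercept ~ N(0, 5^2), slope ~ N(0, 2.5^2).
neg_log_post <- function(beta) {
  eta <- beta[1] + beta[2] * x
  # Bernoulli log-likelihood: sum of y * eta - log(1 + exp(eta))
  log_lik <- sum(y * eta - log1p(exp(eta)))
  log_prior <- dnorm(beta[1], 0, 5, log = TRUE) +
    dnorm(beta[2], 0, 2.5, log = TRUE)
  -(log_lik + log_prior)
}

# Posterior mode (MAP estimate); a point summary, not a full posterior.
map_fit <- optim(c(0, 0), neg_log_post, method = "BFGS")
map_fit$par
```

Because the prior penalizes very large coefficients, the mode stays finite even though the likelihood alone keeps increasing as the slope grows.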

Modeling (placeholders)

Code

# Example skeletons (commented until outcome is finalized)

# library(rstanarm)  # or library(brms)

# Frequentist baseline:
# (DIQ240 is left out of the predictors: it is the provisional outcome
#  candidate and is NA for 9,091 of 9,813 rows.)
# fit_mle <- glm(outcome ~ BMDBMIC + RIDAGEYR + RIAGENDR + RIDRETH1,
#                data = merged_data, family = binomial())

# Bayesian (weakly-informative priors):
# fit_bayes <- rstanarm::stan_glm(
#   outcome ~ BMDBMIC + RIDAGEYR + RIAGENDR + RIDRETH1,
#   data = merged_data, family = binomial(),
#   prior = rstanarm::normal(0, 2.5),
#   prior_intercept = rstanarm::normal(0, 5),
#   chains = 4, iter = 2000, seed = 123
# )

# Next steps:
# - finalize outcome (prefer DIQ010), run MI if needed, then fit both models
# - compare odds ratios/posteriors, AUC, calibration, and survey-weighted variants
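Written out, the commented Bayesian skeleton above corresponds roughly to the model below. The notation is an assumption introduced here: $y_i$ is the binary outcome for respondent $i$, $\mathbf{x}_i$ is the vector of $K$ dummy-coded and continuous predictors, and the priors are taken at face value, without any rescaling. rstanarm may apply its intercept prior after centering the predictors, so the exact parameterization should be checked against its documentation.

```latex
\begin{aligned}
y_i \mid p_i &\sim \operatorname{Bernoulli}(p_i), \\
\operatorname{logit}(p_i) &= \alpha + \mathbf{x}_i^{\top}\boldsymbol{\beta}, \\
\alpha &\sim \mathcal{N}(0,\, 5^2), \\
\beta_j &\sim \mathcal{N}(0,\, 2.5^2), \qquad j = 1, \dots, K.
\end{aligned}
```

On the odds-ratio scale, a prior standard deviation of 2.5 on a log-odds coefficient still allows very large effects, which is why such priors are described as weakly informative.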

Results (to be populated)

Posterior summaries and credible intervals for effects.
Predictive performance vs. MLE baseline.
Sensitivity to priors; effect of handling missingness vs. deletion.
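For the planned comparison of predictive performance, AUC can be computed from ranks without extra packages. The helper below is a hypothetical sketch (the name `auc_rank` is not from the repository), it ignores survey weights, and the six predictions are made up.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher predicted score than a randomly chosen negative case
# (ties count as one half).
auc_rank <- function(score, label) {
  stopifnot(length(score) == length(label), all(label %in% c(0, 1)))
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(score)  # average ranks handle ties
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predictions for six cases (illustrative only).
label <- c(0, 0, 1, 0, 1, 1)
score <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.90)
auc_rank(score, label)
```

In this toy example, 6 of the 9 positive-negative pairs are ordered correctly, so the AUC is 6/9. The same function could be applied to predicted probabilities from both the MLE and Bayesian fits once an outcome is finalized.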

Discussion & Conclusion

Bayesian logistic regression is well suited to survey data with missingness and potential separation because weakly informative priors keep coefficient estimates finite and credible intervals convey the remaining uncertainty directly.
Next: finalize the outcome (DIQ010), run multiple imputation, fit survey-aware models, and present results with clear figures and tables.

References

Abdullah, M., R. Hassan, and M. Mustafa. 2022. “A Review on Bayesian Deep Learning in Healthcare: Applications and Challenges.” IEEE Access 10: 36538–62. https://doi.org/10.1109/ACCESS.2022.3157141.

Austin, P. C., I. R. White, D. S. Lee, and S. van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.

Baldwin, S. A., and M. J. Larson. 2017. “An Introduction to Using Bayesian Linear Regression with Clinical Data.” Behaviour Research and Therapy 98: 58–75. https://doi.org/10.1016/j.brat.2017.05.014.

Chatzimichail, T., and A. T. Hatjimihail. 2023. “A Bayesian Inference-Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19): 3135. https://doi.org/10.3390/diagnostics13193135.

Klauenberg, K., G. Wübbeler, B. Mickan, P. Harris, and C. Elster. 2015. “A Tutorial on Bayesian Normal Linear Regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.

Kruschke, J. K., and T. M. Liddell. 2017. “Bayesian Data Analysis for Newcomers.” Psychonomic Bulletin & Review 25 (1): 155–77. https://doi.org/10.3758/s13423-017-1272-1.

Leeuw, C. de, and I. Klugkist. 2012. “Augmenting Data with Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.

Liu, Y. M., S. L. S. Chen, A. M. F. Yen, and H. H. Chen. 2013. “Individual Risk Prediction Model for Incident Cardiovascular Disease: A Bayesian Clinical Reasoning Approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/j.ijcard.2012.05.016.

Van de Schoot, R., S. Depaoli, R. King, B. Kramer, K. Märtens, M. G. Tadesse, M. Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nature Reviews Methods Primers 1: 1–26. https://doi.org/10.1038/s43586-020-00001-2.

Zeger, S. L., Z. Wu, Y. Coley, A. T. Fojo, B. Carter, K. O’Brien, P. Zandi, et al. 2020. “Using a Bayesian Approach to Predict Patients’ Health and Response to Treatment.” Johns Hopkins Biostatistics Working Paper Series, Working Paper 272. https://biostats.bepress.com/jhubiostat/paper272.
