Diabetes mellitus (DM) is a major public health challenge, and identifying key risk factors—such as obesity, age, race, and gender—is essential for prevention and targeted intervention. Logistic regression is widely used to estimate associations between such risk factors and binary outcomes, including the presence or absence of diabetes. However, classical maximum likelihood estimation (MLE) can yield unstable results in small samples or in the presence of missing data, quasi-separation, or complete separation. Moreover, healthcare data (e.g., DNA sequences, imaging, patient-reported outcomes, electronic health records, and longitudinal health measurements) are often complex, making standard analytical approaches insufficient (Zeger et al. 2020).
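The instability of MLE under complete separation can be seen in a minimal base-R sketch. The toy values below are invented for illustration and are not NHANES data; the names `toy`, `fit_sep`, and `warnings_seen` are hypothetical.

```r
# Toy data with complete separation: every x <= 4 has y = 0, every x >= 5 has y = 1
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# glm() still returns a fit, but it emits warnings; collect them instead of printing
warnings_seen <- character(0)
fit_sep <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial()),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# The slope drifts toward infinity and its standard error becomes very large
slope    <- unname(coef(fit_sep)["x"])
slope_se <- unname(sqrt(diag(vcov(fit_sep)))["x"])

print(warnings_seen)
cat("slope:", slope, " standard error:", slope_se, "\n")
```

A very large slope with an even larger standard error, together with the warnings, is the usual signature of separation. A proper prior on the coefficients keeps the estimate finite in this situation.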
Bayesian hierarchical models, typically fitted with Markov chain Monte Carlo (MCMC), provide a framework that integrates prior knowledge and accounts for hierarchical data structures. These models have been successfully applied in predicting patient health status across diseases such as pneumonia, prostate cancer, and mental disorders (Zeger et al. 2020). Compared to frequentist approaches, Bayesian inference naturally quantifies uncertainty and accommodates complex covariate structures, though parametric Bayesian models remain limited by their distributional assumptions.
Recent work has extended Bayesian methods to disease diagnostics. For example, Bayesian inference has been used to evaluate diagnostic test data from the National Health and Nutrition Examination Survey (NHANES), comparing parametric and nonparametric approaches. These methods provide posterior probabilities that improve disease classification, especially where conventional dichotomous thresholds fail to capture heterogeneity in populations (Chatzimichail and Hatjimihail 2023). Similarly, Bayesian clinical reasoning models have been applied to cardiovascular risk prediction, emulating clinician decision-making by incorporating demographic, metabolic, and conventional risk factors (Liu et al. 2013).
Bayesian regression also addresses methodological challenges such as missing data. Multiple imputation combined with Bayesian modeling has been applied in clinical research settings, where the validity of the resulting estimates depends on the missingness mechanism: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Standard multiple imputation assumes MAR, and data suspected to be MNAR call for sensitivity analyses (Austin et al. 2021). These applications highlight the versatility of Bayesian approaches in healthcare, particularly when traditional models are undermined by data limitations.
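To make the missingness mechanisms concrete, here is a small simulation sketch in base R. All variables (`age`, `bmi`) and parameter values are simulated assumptions, not NHANES quantities. It compares complete-case means when values are removed completely at random versus with a probability that depends on an observed covariate.

```r
set.seed(2014)
n   <- 20000
age <- runif(n, min = 20, max = 80)
bmi <- 22 + 0.1 * age + rnorm(n, sd = 3)   # hypothetical relationship

# MCAR: every value has the same 30% chance of being missing
miss_mcar <- rbinom(n, 1, 0.3) == 1

# MAR: the chance of being missing rises with age, which is always observed
miss_mar <- rbinom(n, 1, plogis((age - 50) / 10)) == 1

mean_full <- mean(bmi)
mean_mcar <- mean(bmi[!miss_mcar])   # complete cases under MCAR
mean_mar  <- mean(bmi[!miss_mar])    # complete cases under MAR

# Conditioning on age recovers the bmi-age relationship from the MAR complete cases
fit_mar <- lm(bmi ~ age, subset = !miss_mar)

cat("full mean:", round(mean_full, 2),
    " MCAR complete-case mean:", round(mean_mcar, 2),
    " MAR complete-case mean:", round(mean_mar, 2), "\n")
cat("age slope from MAR complete cases:", round(coef(fit_mar)[["age"]], 3), "\n")
```

Under MCAR the complete-case mean stays close to the full-data mean, while under MAR it is pulled toward the younger, lower-BMI respondents. Modeling the covariate that drives missingness is what lets imputation or regression adjustment correct for this.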
Related Work
The broader Bayesian literature emphasizes the importance of prior specification, model checking, and variable selection. Van de Schoot et al. highlight the role of informative, weakly informative, and diffuse priors, noting that prior elicitation can draw from experts, data-based approaches, or maximum likelihood estimates (van de Schoot et al. 2021). Priors not only regularize estimates but also improve performance in small samples. Tutorials demonstrate how packages such as brms and blavaan in R, combined with MCMC, allow estimation of posterior distributions and facilitate empirical Bayesian analysis (Klauenberg et al. 2015).
In meta-analytic settings, Bayesian hierarchical regression has been used to augment data with results from prior studies. This approach incorporates both exchangeable and unexchangeable predictors, enabling explicit testing of heterogeneity across studies (Leeuw and Klugkist 2012). Such hierarchical modeling strengthens inference in small-sample or multi-study contexts where frequentist regression falls short.
Applications extend beyond traditional regression. Baldwin and Larson illustrated Bayesian regression with EEG and anxiety data, showing how priors and posterior distributions yield richer probabilistic interpretations than frequentist results (Baldwin and Larson 2017). Kruschke and Liddell framed Bayesian reasoning as intuitive, reflecting how individuals update beliefs with new evidence (Kruschke and Liddell 2017). Abdullah et al. reviewed Bayesian deep learning in healthcare, highlighting its role in uncertainty quantification for tasks such as medical imaging and disease classification (Abdullah, Hassan, and Mustafa 2022).
Together, these works underscore the adaptability of Bayesian methods across domains, while also noting challenges including computational demands, subjective prior specification, and the need for careful convergence diagnostics. This literature directly motivates our project, which applies Bayesian logistic regression to NHANES survey data to explore predictors of diabetes outcomes while addressing quasi-separation, missingness, and small effective sample size.
Our question: Using NHANES 2013–2014, what is the association between key predictors (BMI category, age, sex, race/ethnicity) and a diabetes-related outcome, and does a Bayesian approach yield more stable inference than a frequentist baseline under missingness and potential separation?
Outcome candidate for now: DIQ240 (usual diabetes doctor; a diabetes-related marker, not a diagnosis).
# Load packages for this report
library(tidyverse)
library(knitr)

# Build merged dataset if missing (uses R/data_prep.R from the repo)
if (!file.exists("data/merged_2013_2014.rds")) {
  source("R/data_prep.R")
}

# Load merged NHANES data created by R/data_prep.R
merged_data <- readRDS("data/merged_2013_2014.rds")

# Quick peek
knitr::kable(head(merged_data))
| SEQN | BMDBMIC | RIDAGEYR | RIAGENDR | RIDRETH1 | SDMVPSU | SDMVSTRA | WTMEC2YR | DIQ240 |
|---|---|---|---|---|---|---|---|---|
| 73557 | NA | 69 | Male | Non-Hispanic Black | 1 | 112 | 13481.04 | Yes |
| 73558 | NA | 54 | Male | Non-Hispanic White | 1 | 108 | 24471.77 | Yes |
| 73559 | NA | 72 | Male | Non-Hispanic White | 1 | 109 | 57193.29 | No |
| 73560 | Normal weight | 9 | Male | Non-Hispanic White | 2 | 109 | 55766.51 | NA |
| 73561 | NA | 73 | Female | Non-Hispanic White | 2 | 116 | 65541.87 | NA |
| 73562 | NA | 56 | Male | Mexican American | 1 | 111 | 25344.99 | NA |
Basic Exploration
# Safe tabulations (keep NA visible)
table(merged_data$BMDBMIC, useNA = "ifany")
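The same idea extends to a per-variable missingness summary. The sketch below rebuilds four columns of the six preview rows shown earlier, so that it runs without the merged file; `peek` and `missing_summary` are hypothetical names, and the full `merged_data` would be substituted in practice.

```r
# Six preview rows from the merged data (subset of columns)
peek <- data.frame(
  SEQN     = 73557:73562,
  BMDBMIC  = c(NA, NA, NA, "Normal weight", NA, NA),
  RIDAGEYR = c(69, 54, 72, 9, 73, 56),
  DIQ240   = c("Yes", "Yes", "No", NA, NA, NA)
)

# Count and share of missing values per column
na_count <- colSums(is.na(peek))
missing_summary <- data.frame(
  variable      = names(na_count),
  n_missing     = unname(na_count),
  share_missing = round(unname(colMeans(is.na(peek))), 2)
)
print(missing_summary)

# Joint pattern of the BMI category and the outcome candidate, NA kept visible
print(table(BMDBMIC = peek$BMDBMIC, DIQ240 = peek$DIQ240, useNA = "ifany"))

# Rows that would survive listwise deletion
n_complete <- sum(complete.cases(peek))
cat("complete rows:", n_complete, "of", nrow(peek), "\n")
```

In these six rows no record has both BMDBMIC and DIQ240 observed, so listwise deletion would discard all of them. Whether that pattern holds in the full data needs to be checked before modeling.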
Main: Bayesian logistic regression with weakly-informative priors for stability and honest uncertainty intervals.
Missingness: Prefer multiple imputation (or Bayesian models with missing data mechanisms) over listwise deletion to retain ~9,800 observations and avoid separation artifacts. We will run prior sensitivity checks.
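One way to write down the planned model is sketched below. It assumes a binary outcome indicator $y_i$ and a predictor vector $x_i$ built from the BMI category, age, sex, and race/ethnicity, and it takes the prior scales 2.5 and 5 from the placeholder modeling code in this report, read as standard deviations. The notation is an assumption for illustration, the outcome is not yet finalized, and any internal centering or rescaling of predictors by the fitting software is ignored here.

```latex
\begin{aligned}
y_i \mid p_i &\sim \operatorname{Bernoulli}(p_i), \qquad i = 1, \dots, n, \\
\operatorname{logit}(p_i) &= \log\frac{p_i}{1 - p_i} = \alpha + x_i^{\top}\beta, \\
\alpha &\sim \mathcal{N}(0,\, 5^2), \\
\beta_j &\sim \mathcal{N}(0,\, 2.5^2), \qquad j = 1, \dots, k .
\end{aligned}
```

On the odds-ratio scale, $\exp(\beta_j)$ is the multiplicative change in the odds of the outcome per unit change in the $j$-th predictor. The priors shrink these coefficients toward zero, which is what keeps them finite under separation.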
Modeling (placeholders)
# Example skeletons (commented until outcome is finalized)
# library(rstanarm)  # or library(brms)

# Frequentist baseline:
# fit_mle <- glm(outcome ~ BMDBMIC + RIDAGEYR + RIAGENDR + RIDRETH1 + DIQ240,
#                data = merged_data, family = binomial())

# Bayesian (weakly-informative priors):
# fit_bayes <- rstanarm::stan_glm(
#   outcome ~ BMDBMIC + RIDAGEYR + RIAGENDR + RIDRETH1 + DIQ240,
#   data = merged_data, family = binomial(),
#   prior = rstanarm::normal(0, 2.5),
#   prior_intercept = rstanarm::normal(0, 5),
#   chains = 4, iter = 2000, seed = 123
# )

# Next steps:
# - finalize outcome (prefer DIQ010), run MI if needed, then fit both models
# - compare odds ratios/posteriors, AUC, calibration, and survey-weighted variants
Results (to be populated)
Posterior summaries and credible intervals for effects.
Predictive performance vs. MLE baseline.
Sensitivity to priors; effect of handling missingness vs. deletion.
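As a placeholder for how the frequentist baseline could be summarized, the sketch below fits a logistic regression to simulated data and reports odds ratios with Wald intervals and an in-sample AUC computed from ranks. The data, coefficients, and names (`sim`, `or_table`, `auc`) are invented assumptions and say nothing about the NHANES results.

```r
set.seed(123)
n      <- 500
age    <- runif(n, 20, 80)
female <- rbinom(n, 1, 0.5)
y      <- rbinom(n, 1, plogis(-4 + 0.05 * age + 0.3 * female))  # hypothetical truth
sim    <- data.frame(y, age, female)

# Frequentist baseline fit
fit <- glm(y ~ age + female, data = sim, family = binomial())

# Odds ratios with Wald 95% confidence intervals
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(or_table, 3))

# In-sample AUC via the rank (Mann-Whitney) formula
p_hat <- fitted(fit)
r     <- rank(p_hat)
n1    <- sum(sim$y == 1)
n0    <- sum(sim$y == 0)
auc   <- (sum(r[sim$y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
cat("in-sample AUC:", round(auc, 3), "\n")
```

The same two summaries, computed from posterior draws instead of Wald intervals, would give the Bayesian side of the comparison. An in-sample AUC is optimistic, so a held-out or cross-validated version would be preferable for the final report.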
Discussion & Conclusion
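Bayesian logistic regression is expected to suit survey data with missingness and potential separation: weakly informative priors keep coefficient estimates finite and stable, and posterior intervals give directly interpretable uncertainty. These expectations remain to be confirmed once the results above are populated.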
Next: finalize the outcome (DIQ010), run MI, fit survey-aware models, and present results with clear figures and tables.
References
Abdullah, M., R. Hassan, and M. Mustafa. 2022. “A Review on Bayesian Deep Learning in Healthcare: Applications and Challenges.” IEEE Access 10: 36538–62. https://doi.org/10.1109/ACCESS.2022.3157141.
Austin, P. C., I. R. White, D. S. Lee, and S. van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Baldwin, S. A., and M. J. Larson. 2017. “An Introduction to Using Bayesian Linear Regression with Clinical Data.” Behaviour Research and Therapy 98: 58–75. https://doi.org/10.1016/j.brat.2017.05.014.
Chatzimichail, T., and A. T. Hatjimihail. 2023. “A Bayesian Inference-Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19): 3135. https://doi.org/10.3390/diagnostics13193135.
Klauenberg, K., G. Wübbeler, B. Mickan, P. Harris, and C. Elster. 2015. “A Tutorial on Bayesian Normal Linear Regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Kruschke, J. K., and T. M. Liddell. 2017. “Bayesian Data Analysis for Newcomers.” Psychonomic Bulletin & Review 25 (1): 155–77. https://doi.org/10.3758/s13423-017-1272-1.
Leeuw, C. de, and I. Klugkist. 2012. “Augmenting Data with Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Y. M., S. L. S. Chen, A. M. F. Yen, and H. H. Chen. 2013. “Individual Risk Prediction Model for Incident Cardiovascular Disease: A Bayesian Clinical Reasoning Approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/j.ijcard.2012.05.016.
van de Schoot, R., S. Depaoli, R. King, B. Kramer, K. Märtens, M. G. Tadesse, M. Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nature Reviews Methods Primers 1: 1–26. https://doi.org/10.1038/s43586-020-00001-2.
Zeger, S. L., Z. Wu, Y. Coley, A. T. Fojo, B. Carter, K. O’Brien, P. Zandi, et al. 2020. “Using a Bayesian Approach to Predict Patients’ Health and Response to Treatment.” 272. Johns Hopkins Biostatistics Working Paper Series. https://biostats.bepress.com/jhubiostat/paper272.