Articles | Open Access | https://doi.org/10.37547/ijmsphr/Volume07Issue02-03

Population-Level Oral Disease Surveillance Using Large Language Models on Clinical and Public Health Data

Han Thi Ngoc Phan, Dentist, Pham Hung Dental Center MTV Company Limited, Pham Hung Street, Binh Chanh District, Ho Chi Minh City, Vietnam
Trang Quynh Nguyen, Dentist, Pham Hung Dental Clinic, Ho Chi Minh City, Vietnam
Uyen Nguyen, VIVA Group, 50 Tran Khac Chan, Tan Dinh Ward, District 1, Ho Chi Minh City, Vietnam

Abstract

Population-level oral disease surveillance is critical for guiding public health interventions, yet traditional systems relying solely on structured data often fail to capture contextual, behavioral, and access-to-care determinants embedded in unstructured clinical narratives. In this study, we developed a hybrid large language model (LLM) framework that integrates structured epidemiological features with embeddings derived from examination notes and survey text to improve the detection and monitoring of dental caries, periodontal disease, and tooth loss. Using the publicly available NHANES Oral Health Dataset, we compared the performance of traditional machine learning models, text-only LLM models, and our proposed hybrid approach. The hybrid model consistently outperformed all baselines, achieving higher accuracy, precision, recall, F1-score, and calibration, while maintaining equitable performance across demographic and socioeconomic subgroups. Explainability analyses revealed that combining structured and unstructured features captured clinically meaningful patterns, including behavioral risk factors and care access barriers. Our findings suggest that hybrid LLM-based surveillance can enhance real-time population-level monitoring, identify high-risk communities, and inform preventive strategies within the U.S. public healthcare system, offering a scalable, interpretable, and equitable approach to oral health monitoring.
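
The fusion step described in the abstract, joining structured epidemiological features with a text embedding into a single input vector, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the feature choices, the example note, and the hashed bag-of-words embedding (a stand-in for a real LLM embedding) are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def toy_text_embedding(text: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for an LLM embedding: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Deterministic bucket index (the built-in hash() is salted per process).
        bucket = sum(ord(ch) for ch in token) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def fuse(structured: Sequence[float], text: str, dim: int = 8) -> List[float]:
    """Concatenate structured features with the text embedding into one vector."""
    return list(structured) + toy_text_embedding(text, dim)


# Hypothetical record: [age in decades, sugary drinks per day, has dental insurance]
structured_features = [4.5, 2.0, 0.0]
# Hypothetical examination note, invented for this sketch.
note = "Patient reports cost as barrier to care; visible plaque on molars"

hybrid_vector = fuse(structured_features, note)
print(len(hybrid_vector))  # 3 structured values + 8 embedding dimensions
```

In a real pipeline the toy embedding would presumably be replaced by vectors from a pretrained language model, and the fused vector would be passed to a downstream classifier; the abstract does not specify either component.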

Keywords

Oral disease surveillance, large language models, hybrid modeling, population health, dental caries, periodontal disease, NHANES, public health informatics

How to Cite

Phan, H. T. N., Nguyen, T. Q., & Nguyen, U. (2026). Population-Level Oral Disease Surveillance Using Large Language Models on Clinical and Public Health Data. International Journal of Medical Science and Public Health Research, 7(02), 18–28. https://doi.org/10.37547/ijmsphr/Volume07Issue02-03