This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
A high proportion of health care services are persistently utilized by a small subpopulation of patients. To improve clinical outcomes while reducing costs and utilization, population health management programs often provide targeted interventions to patients who may become persistent high users/utilizers (PHUs). Enhanced prediction and management of PHUs can improve health care system efficiencies and improve the overall quality of patient care.
The aim of this study was to detect key classes of diseases and medications among the study population and to assess the predictive value of these classes in identifying PHUs.
This study was a retrospective analysis of insurance claims data of patients from the Johns Hopkins Health Care system. We defined a PHU as a patient incurring health care costs in the top 20% of all patients’ costs for 4 consecutive 6-month periods. We used 2013 claims data to predict PHU status in 2014-2015. We applied latent class analysis (LCA), an unsupervised clustering approach, to identify patient subgroups with similar diagnostic and medication patterns to differentiate variations in health care utilization across PHUs. Logistic regression models were then built to predict PHUs in the full population and in select subpopulations. Predictors included LCA membership probabilities, demographic covariates, and health utilization covariates. The predictive power of the regression models was assessed and compared using standard metrics.
We identified 164,221 patients with continuous enrollment between 2013 and 2015. The mean study population age was 19.7 years, 55.9% were women, 3.3% had ≥1 hospitalization, and 19.1% had 10+ outpatient visits in 2013. A total of 8359 (5.09%) patients were identified as PHUs in both 2014 and 2015. The LCA performed optimally when assigning patients to four probability disease/medication classes. Given the feedback provided by clinical experts, we further divided the population into four diagnostic groups for sensitivity analysis: acute upper respiratory infection (URI) (n=53,232; 4.6% PHUs), mental health (n=34,456; 12.8% PHUs), otitis media (n=24,992; 4.5% PHUs), and musculoskeletal (n=24,799; 15.5% PHUs). For the regression models predicting PHUs in the full population, the F1-score classification metric was lower using a parsimonious model that included LCA categories (F1=38.62%) compared to that of a complex risk stratification model with a full set of predictors (F1=48.20%). However, the LCA-enabled simple models were comparable to the complex model when predicting PHUs in the mental health and musculoskeletal subpopulations (F1-scores of 48.69% and 48.15%, respectively). F1-scores were lower than that of the complex model when the LCA-enabled models were limited to the otitis media and acute URI subpopulations (45.77% and 43.05%, respectively).
Our study illustrates the value of LCA in identifying subgroups of patients with similar patterns of diagnoses and medications. Our results show that LCA-derived classes can simplify predictive models of PHUs without compromising predictive accuracy. Future studies should investigate the value of LCA-derived classes for predicting PHUs in other health care settings.
A small segment of the patient population utilizes a high volume of health care services [
Population health programs are often managed by insurers and health care providers [
Persistent high users/utilizers (PHUs) are patients who have a high utilization rate over an extended period (eg, a patient whose annual costs are in the top 20% of all patients’ costs over 4 consecutive 6-month periods) [
PHUs constitute a small percentage of the patient population [
To address the difficulties of identifying common patterns of comorbidities among PHUs, in this study, we implemented an unsupervised clustering methodology, latent class analysis (LCA) [
The overall goal of our study was to identify subpopulations of PHUs where changes in care delivery could reduce the risk of high utilization. Our analysis aimed to automate the extraction of common probabilistic patterns of comorbidities and medications for PHUs, and then use such information to improve the prediction of PHUs among the study population as well as specific diagnostic subpopulations.
We defined a PHU as an individual whose medical charges remained in the top 20% of the highest health care costs for 4 consecutive 6-month periods (ie, total of 2 years after the base period) [
We performed a retrospective analysis of the Johns Hopkins Health Care (JHHC) insurance claims data captured between 2013 and 2015. JHHC provides health insurance to a variety of enrollees, including Medicaid and employer-based members. JHHC enrollees can also seek care outside of the Johns Hopkins health system. We applied the Johns Hopkins Adjusted Clinical Groups (ACG) software to the claims data to generate additional health care utilization variables consistent with previous PHU analyses [
Our initial sample population included 207,421 patients with at least one JHHC claims record in 2013 and at least 2 years of continuous JHHC enrollment between 2013 and 2015 (
To explore the sensitivity of our approach, we further divided the study population into four distinct diagnostic-driven subpopulations. These subpopulations were chosen based on the frequency of the underlying EDC data and were validated by two clinicians. The clinicians reviewed the combination of EDCs and asserted their practical use in clinical settings. These subpopulations were identified as: (1) otitis media (n=24,992 patients), (2) mental health (n=34,456), (3) musculoskeletal signs and symptoms (n=24,799), and (4) acute upper respiratory infection (URI; n=53,232).
Selection process of the study population. JHHC: Johns Hopkins Health Care; EDC: expanded diagnostic cluster.
The full study population and each subpopulation contained several predictor variables and the outcome variable. Predictors (ie, independent variables) included demographics, EDCs, RxMGs, and other health utilization variables (eg, hospitalization, care coordination) generated by the ACG system. Many of these predictors, including all EDCs and RxMGs, are categorical variables [
The outcome of interest, a binary variable, was whether or not a patient became a PHU after the base year (ie, being in the top 20% of the highest health care costs over 4 consecutive 6-month periods from 2014 to 2015). The outcome variable was calculated separately in the full population and in each of the diagnostic subpopulations (eg, a patient might be considered a PHU in a subpopulation but not in the full population).
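The outcome definition above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the input layout (one list of four half-year costs per patient), and the example costs are all hypothetical, and ties at the cutoff are kept, which can flag slightly more than 20% of patients in a period.

```python
def top_share_cutoff(costs, top_share=0.20):
    """Smallest cost that still falls within the top `top_share` of `costs`."""
    ranked = sorted(costs, reverse=True)
    n_top = max(1, int(len(ranked) * top_share))
    return ranked[n_top - 1]


def flag_phus(costs_by_patient, top_share=0.20):
    """Return IDs of patients whose cost is in the top share in every period."""
    n_periods = len(next(iter(costs_by_patient.values())))
    # One cutoff per period, computed across all patients in the population.
    cutoffs = [
        top_share_cutoff([c[p] for c in costs_by_patient.values()], top_share)
        for p in range(n_periods)
    ]
    return {
        pid
        for pid, costs in costs_by_patient.items()
        if all(cost >= cutoff for cost, cutoff in zip(costs, cutoffs))
    }


# Hypothetical costs for four consecutive 6-month periods.
example = {
    "p1": [900, 950, 870, 990],  # high in every period
    "p2": [300, 800, 820, 700],  # high in periods 2-4 only
    "p3": [850, 60, 40, 30],     # high in period 1 only
}
# Seven low-cost patients, p4 to p10.
example.update({f"p{i}": [10 * i, 12 * i, 9 * i, 11 * i] for i in range(4, 11)})

phus = flag_phus(example)
```

In this toy population of 10 patients, only the patient who stays in the top 20% in all four periods is flagged; the two patients with episodic high costs are not. Because the cutoffs are computed within whatever population is passed in, running the same function on a subpopulation can change a patient's status, as noted above.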
LCA was performed on the full study population and on each subpopulation separately to identify “phenotypes” (ie, classes) of disease subtypes [
The main parameters generated by LCA are the probabilities of latent class membership for each individual (eg, each patient in the mental health subpopulation; n=34,456) and the class-specific probabilities of observing each binary variable (eg, tobacco use EDC among mental health patients). These probabilities distinguish LCA from binning techniques: rather than being placed in a single bin, each individual (eg, patient) is assigned a probability of belonging to each unobserved/latent class (eg, representing a specific pattern of comorbidities) based on well-established statistical theory [
LCA creates latent classes that minimize the variance across individuals within each class while maximizing the variance between individuals in different classes. Moreover, LCA is a person-centered approach, does not make distributional assumptions, and works well with categorical data, making it particularly applicable to subtype identification of patients using diagnostic data such as EDCs [
LCA models with a varied number of latent classes (2 to 6 classes) were constructed using EDC, RxMG, and selected patient-level resource utilization variables. For both the full population and the select subpopulations, 4-class models were chosen because they provided the right balance between optimal model fit and interpretability of the classes. Although models with more classes (eg, 5- and 6-class models) might fit the data slightly better, the interpretation of the classes becomes less clear, and often classes may differ only across a few variables. In other words, the gain in fit is not sufficient to overcome the decline in interpretability that comes from adding too many classes to the model. Additionally, LCA models with more than 6 classes did not improve the standard fit metrics, explained a very small proportion of patients, and had limited mathematical convergence, and were therefore not considered in this study.
LCA fit was measured using G^{2}, Akaike information criterion (AIC), and Bayesian information criterion (BIC) metrics; lower values of G^{2}, AIC, and BIC imply a better fit [
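For reference, the usual textbook forms of these fit statistics are sketched below. The notation is introduced here as an assumption and is not taken from the study: f_w and \hat{f}_w are the observed and model-expected counts of response pattern w, \hat{L} is the maximized likelihood, n is the number of patients, K is the number of latent classes, J is the number of binary indicators, and p is the number of free parameters.

```latex
G^{2} = 2 \sum_{w} f_{w} \ln\!\left(\frac{f_{w}}{\hat{f}_{w}}\right), \qquad
\mathrm{AIC} = -2 \ln \hat{L} + 2p, \qquad
\mathrm{BIC} = -2 \ln \hat{L} + p \ln n, \qquad
p = (K - 1) + K J
```

Software may report AIC and BIC on a shifted scale (for example, computed from G^{2} instead of -2 ln \hat{L}). For a fixed data set this changes the values by a constant and leaves the ranking of candidate models unchanged.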
LCA does not bin each individual into a class but rather calculates the probability that an individual’s characteristics most closely match those of the other individuals in each class. Classes are constructed to maximize similarity of individuals’ characteristics within a class and dissimilarity of individuals across classes. For example, in this study, the LCA methodology generated four different class probabilities for each patient representing the similarity of the patient’s comorbidities (ie, mix of EDCs and RxMGs) to comorbidities of patients in each LCA-derived class of the entire study population.
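To make the idea of class membership probabilities concrete, the sketch below applies Bayes' rule to one patient once the class prevalences and item-response probabilities are known. It assumes binary indicators that are independent within a class (the standard LCA assumption) and item-response probabilities strictly between 0 and 1. The function name, the three indicators, and all numeric values are hypothetical; estimating the parameters themselves (typically by expectation-maximization) is not shown.

```python
import math


def membership_probabilities(indicators, class_priors, item_probs):
    """Posterior P(class | indicators) for one patient via Bayes' rule.

    indicators   : list of 0/1 flags, one per binary item (eg, an EDC or RxMG)
    class_priors : prevalence of each latent class (sums to 1)
    item_probs   : item_probs[k][j] = P(item j = 1 | class k), each in (0, 1)
    """
    log_joint = []
    for prior, probs in zip(class_priors, item_probs):
        log_p = math.log(prior)
        for y, rho in zip(indicators, probs):
            log_p += math.log(rho if y else 1.0 - rho)
        log_joint.append(log_p)
    # Normalize in log space to avoid underflow with many indicators.
    peak = max(log_joint)
    weights = [math.exp(v - peak) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 4-class model with 3 binary indicators.
priors = [0.4, 0.3, 0.2, 0.1]
item_probs = [
    [0.05, 0.05, 0.05],  # uniformly low probabilities
    [0.30, 0.30, 0.30],  # uniformly moderate probabilities
    [0.80, 0.80, 0.80],  # uniformly high probabilities
    [0.90, 0.10, 0.10],  # high on the first indicator only
]
posterior = membership_probabilities([1, 1, 1], priors, item_probs)
```

For a patient with all three indicators present, most of the posterior mass falls on the class with uniformly high item-response probabilities, but the other classes keep nonzero probabilities. Those four numbers per patient, rather than a single class label, are what the regression step uses.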
Once the classes were constructed via LCA and health utilization characteristics of the classes were graphically compared, we trained logistic regression models to predict PHUs in both the full population and in each subpopulation using the following variables: (1-3) latent class membership probabilities for 3 of the 4 classes (the class with the lowest chronic EDC/RxMG probabilities was chosen to be the reference class); (4) gender (male; reference=female); (5-9) race (Black, Asian, Hispanic, other, missing; reference=White); (10) medical and pharmacy coverage in 2013; (11) Medicaid eligibility; (12) number of acute care inpatient days; (13) number of acute care inpatient stays; (14) presence of frailty conditions; and (15-16) likely or possibly experiencing care coordination issues (yes/no). Variables 12 to 16 were generated by the ACG system [
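One way to write the regression described above is sketched below. The symbols are assumptions introduced for illustration: \pi_{ik} is patient i's membership probability for non-reference class k, x_{im} stands for the remaining demographic and utilization covariates, and \beta and \gamma are coefficients.

```latex
\operatorname{logit} \Pr(\mathrm{PHU}_{i} = 1)
  = \beta_{0} + \sum_{k=1}^{3} \beta_{k}\, \pi_{ik} + \sum_{m} \gamma_{m}\, x_{im},
\qquad \mathrm{OR}_{k} = e^{\beta_{k}}
```

Because the four membership probabilities sum to 1 for each patient, only three can enter the model, and the omitted class acts as the reference. Under this specification, OR_k compares a patient who belongs entirely to class k with one who belongs entirely to the reference class, holding the other covariates fixed.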
We also used the ACG system’s internal risk stratification functions (ie, embedded models) to predict PHU status in the full population [
All analyses, including the descriptive analysis of the full population and all subpopulations, were performed in R (v3.5.1). We used R’s basic packages for the LCA clustering [
Descriptive statistics for the full population are summarized in
Characteristics of the study populations.
Characteristic  Overall study population (N=164,221)  Non-PHU^{a} population (n=155,862)  PHU population (n=8359)
Age group (years), n (%)
0-17  100,811 (61.4)  99,352 (63.7)  1459 (17.5)
18-64  62,396 (38.0)  55,666 (35.7)  6730 (80.5)
65+  1014 (0.6)  844 (0.5)  170 (2.0)
Age (years), mean (SD)  19.79 (17.43)  18.79 (16.82)  38.51 (18.01)
Male, n (%)  72,418 (44.1)  69,683 (44.7)  2735 (32.7)
Race, n (%)
White  41,219 (25.1)  38,762 (24.9)  2457 (29.4)
Black  53,872 (32.8)  50,993 (32.7)  2879 (34.4)
Other^{b}  149 (0.1)  143 (0.1)  6 (0.1)
Missing^{c}  68,981 (42.0)  65,964 (42.3)  3017 (36.1)
Hospitalizations in 2013, n (%)
0  158,763 (96.7)  151,971 (97.5)  6792 (81.3)
1-5  5366 (3.3)  3866 (2.5)  1500 (17.9)
6-10  74 (<0.1)  20 (<0.1)  54 (0.6)
11+  18 (<0.1)  5 (<0.1)  13 (0.2)
Outpatient visits in 2013, n (%)
0  3690 (2.2)  3663 (2.4)  27 (0.3)
1-5  95,372 (58.1)  94,138 (60.4)  1234 (14.8)
6-10  33,745 (20.5)  32,317 (20.7)  1428 (17.1)
11+  31,414 (19.1)  25,744 (16.5)  5670 (67.8)
^{a}PHU: persistent high users.
^{b}“Other” describes members of known race/ethnicity not equal to Asian, Hispanic, White, or Black.
^{c}“Missing” describes members with empty values for race.
LCA models with 2 to 6 classes were trained using the full population to identify the optimal number of classes. The fit statistics for these models were then calculated and compared for the full population (
A model with the lowest AIC tends to be more complex if it is not the same as the model with the lowest BIC [
The LCA models were run with 178 different EDCs and RxMGs on the full population and with the same EDCs/RxMGs on the diagnostic subpopulations, excluding the EDCs used to define the subpopulations. Examining all EDCs/RxMGs in our 4class LCA models, excluding the EDCs used to define our subpopulation, led us to very similar descriptions of each class. A caveat to this observation is that many EDCs/RxMGs had very low or very high probabilities of being observed in all classes and hence were not useful for distinguishing among classes.
Each LCA class contained item-response probabilities for each of the EDC/RxMG codes; however, for only a few of the EDC/RxMG codes, the probability was ≥0.4 in every class.
The selected subtype characteristics from the LCA and fractions of patients assigned to each subtype were also explored for each of the four diagnostic subpopulations (
Only a handful of EDCs clearly distinguished the four classes in each LCA model (full population and the diagnostic subpopulations). In the full population and in most of the diagnostic subpopulations, three of these classes were associated with uniformly high, moderate, or low probabilities of the EDCs. The remaining class was characterized primarily by a high likelihood of minor infections, pain, or respiratory diagnoses (
Model fit statistics for latent class analysis models with 2 to 6 classes (N=164,221).
Model  G^{2}^{a}  AIC^{b}  BIC^{c} 
2-class model  5,487,702  9,113,315  9,116,888
3-class model  5,213,964  8,839,935  8,845,300
4-class model  5,088,223  8,714,552  8,721,708
5-class model  4,934,192  8,560,878  8,569,826
6-class model  4,874,634  8,501,679  8,512,419
^{a}G^{2}: likelihood ratio/deviance statistic.
^{b}AIC: Akaike information criterion.
^{c}BIC: Bayesian information criterion.
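As a rough consistency check on the table above, the sketch below estimates how many free parameters each additional class adds. It assumes that AIC and BIC each equal G^{2} plus a penalty term (2 per parameter for AIC, ln n per parameter for BIC) plus a constant that does not depend on the number of classes. That assumption, the variable names, and the helper function are introduced here for illustration; the article does not state how its software computes these values.

```python
import math

N_PATIENTS = 164_221
# Fit statistics transcribed from the table above (2- to 6-class models).
G2 = [5_487_702, 5_213_964, 5_088_223, 4_934_192, 4_874_634]
AIC = [9_113_315, 8_839_935, 8_714_552, 8_560_878, 8_501_679]
BIC = [9_116_888, 8_845_300, 8_721_708, 8_569_826, 8_512_419]


def implied_parameters_per_class(g2, criterion, penalty_per_parameter):
    """Parameters added by each extra class, inferred from the penalty term."""
    # criterion - G2 leaves the penalty plus a constant; differencing
    # consecutive models removes the constant.
    offsets = [c - g for c, g in zip(criterion, g2)]
    return [(b - a) / penalty_per_parameter for a, b in zip(offsets, offsets[1:])]


from_aic = implied_parameters_per_class(G2, AIC, 2.0)
from_bic = implied_parameters_per_class(G2, BIC, math.log(N_PATIENTS))
```

Both lists come out at about 179 parameters per added class. That is what one would expect if each class carries 178 item-response probabilities (one per binary EDC/RxMG indicator) plus one class prevalence, although this reading is an inference from the reported numbers and not a statement made in the article.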
Latent class item-response probabilities for the full population (N=164,221).
Latent class item-response probabilities for the otitis media subpopulation (n=24,992).
Latent class item-response probabilities for the mental health subpopulation (n=34,456).
Latent class item-response probabilities for the musculoskeletal subpopulation (n=24,799).
Latent class item-response probabilities for the acute upper respiratory infection subpopulation (n=53,232).
Logistic regression models were developed for the full population and for each subpopulation to predict PHUs from latent class membership probabilities along with demographic and health utilization characteristics of each patient. These models were trained on a randomly selected sample of 80% of the patients in the full population/subpopulation and were evaluated on a test data set with the other 20% of patients. Classification metrics for each of these models (
The LCA-enabled regression model for the full population had a modestly lower F1-score than the ACG model (38.6 vs 48.2); however, the LCA-enabled model had fewer predictors (16 variables) than the ACG model (≥300 variables). The F1-scores of the LCA-enabled regression models in the subpopulations were comparable to the F1-score of the complex ACG model in predicting PHUs in the full population (ie, F1-scores ranging from 43.0 to 48.7 vs 48.2). Since the specificity, sensitivity, PPV, and F1-score were calculated for specific thresholds, only one estimate was calculated for each of those metrics (ie, the 95% CI was not applicable).
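The threshold-based evaluation described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes patients are flagged when their predicted probability exceeds the score at a chosen percentile (nearest-rank method), and the function names and toy scores are hypothetical.

```python
def percentile_cutoff(scores, percentile):
    """Score at an integer percentile (0-100), nearest-rank method."""
    ranked = sorted(scores)
    # Integer ceiling of percentile * n / 100, avoiding floating-point error.
    rank = max(1, (percentile * len(ranked) + 99) // 100)
    return ranked[rank - 1]


def ppv_and_sensitivity(scores, labels, percentile=95):
    """Flag scores above the percentile cutoff and compare with true labels."""
    cutoff = percentile_cutoff(scores, percentile)
    flagged = [s > cutoff for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return cutoff, ppv, sensitivity


# Toy example: 20 predicted probabilities and 2 true PHUs.
scores = [i / 20 for i in range(1, 21)]
labels = [0] * 20
labels[9] = 1   # a PHU with a middling score (missed)
labels[19] = 1  # a PHU with the highest score (caught)

cutoff, ppv, sensitivity = ppv_and_sensitivity(scores, labels, percentile=90)
```

Because a single cutoff yields a single confusion matrix, each metric is a point estimate at that threshold, which matches the remark above that only one estimate was calculated per metric.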
Comparing classification metrics for predicting persistent high user/utilizer (PHU) status.
Metric  Full study population (N=164,221), ACG^{c}  Full study population (N=164,221), LCA-LRM^{d}  Otitis media (n=24,992), LCA-LRM  Mental health (n=34,456), LCA-LRM  MSK^{a} (n=24,799), LCA-LRM  Acute URI^{b} (n=53,232), LCA-LRM
PPV^{e} (%)  48.60  38.53  44.40  39.91  42.74  41.28
Sensitivity (%)  47.90  38.72  47.23  62.43  55.14  44.99
F1-score (%)  48.20  38.62  45.77  48.69  48.15  43.05
Percentile (threshold)  95th (0.33)  95th (0.33)  95th (0.18)  80th (0.25)  95th (0.53)  95th (0.23)
PHUs (%)  5.1  5.1  4.5  12.8  15.5  4.6
^{a}MSK: musculoskeletal.
^{b}URI: upper respiratory infection.
^{c}ACG: Adjusted Clinical Groups; latent class analysis results not included in the model.
^{d}LCA-LRM: latent class analysis-logistic regression model; latent class probabilities included as predictors in the model.
^{e}PPV: positive predictive value.
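The F1-scores in the table can be recomputed from the reported PPV and sensitivity, since the F1-score is their harmonic mean. The short check below uses the values transcribed from the table; the dictionary keys and variable names are chosen here for illustration.

```python
def f1_score(ppv, sensitivity):
    """Harmonic mean of PPV (precision) and sensitivity (recall)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# (PPV %, sensitivity %, reported F1-score %) from the table above.
reported = {
    "ACG, full population": (48.60, 47.90, 48.20),
    "LCA-LRM, full population": (38.53, 38.72, 38.62),
    "LCA-LRM, otitis media": (44.40, 47.23, 45.77),
    "LCA-LRM, mental health": (39.91, 62.43, 48.69),
    "LCA-LRM, MSK": (42.74, 55.14, 48.15),
    "LCA-LRM, acute URI": (41.28, 44.99, 43.05),
}

recomputed = {name: f1_score(p, s) for name, (p, s, _) in reported.items()}
max_gap = max(abs(recomputed[name] - f1) for name, (_, _, f1) in reported.items())
```

The recomputed values agree with the reported F1-scores to within rounding. The comparison also shows why the mental health model reaches a similar F1-score to the ACG model despite a lower PPV: its sensitivity is considerably higher.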
Odds ratios (ORs) of the LCA-enabled regression models predicting PHUs in the full population and in each of the diagnostic subpopulations were calculated separately (
PHUs are defined as the patient population who stay in the highest deciles of health care costs and/or utilization for multiple years [
Our study demonstrated the use of nontraditional statistical clustering methods such as LCA to facilitate the automated development of diagnostic and medication probability classes that can be effectively used in traditional logistic regression models to predict PHUs, without the need for complex predictive models. Two of our study findings specifically support the use of LCA in predicting PHUs. First, the F1-score of the LCA-enabled logistic regression was comparable to that of the complex predictive model despite using only a fraction of the predictor variables (16 vs ≥300 variables). Second, the ORs of the LCA-derived classes were much higher (ranging from 22 to 135) than those of the other variables (ranging from 0.4 to 3.0) used in the logistic regressions. Therefore, LCA can be an efficient (ie, the unsupervised process requires minimal manual effort), effective (ie, high ORs in the predictive models), and usable (ie, avoiding complex predictive models) method for predicting PHUs in different settings.
The mix of LCA classes may differ among PHUs of different health systems. For example, our study population of 164,221 patients included 130,711 members enrolled in a special Medicaid insurance plan (ie, Johns Hopkins Priority Partners) targeting mothers and children. Thus, as 79.6% of the study population were enrolled in this Medicaid program, the average age of the full population was close to 20 years. Consequently, the most common EDCs for three of the four diagnostic subpopulations included pediatric conditions such as ear problems [
A few prior studies have explored the use of LCA and other classifying techniques to improve the prediction of PHUs. One study focused on older and middle-aged US patients and grouped them using the Medical Expenditure Panel Survey data set to explore high to moderate utilization rates [
Health care providers increasingly use risk stratification tools to manage their patient populations. However, providers often do not have access to insurance claims data and use local EHRs to risk stratify patients and predict PHUs [
Our study has several limitations. First, the results of our LCA approach, and the improvement of the PHU prediction, may not generalize to other populations (eg, older adults, Medicare), settings (eg, inpatient only), or data sources (eg, EHRs). Future research should explore the use of LCA in new populations and settings using alternate data sources. Second, our specific definition for PHU (ie, percentile of cost and time period) may not fit all populations. The risk stratification research community should offer a harmonized definition of PHU so that various research findings on PHUs can be compared effectively to establish generalizable evidence. Third, results of the logistic regression should be interpreted with caution as race and ethnicity are likely to be closely linked to differences in health care coverage and quality rather than being directly related to PHU [
A small percentage of patients use most of the health care services continuously over extended periods. We used LCA, an unsupervised clustering approach, to automate the process of extracting classes of comorbidity and medication probabilities for individual patients that can be effectively used in predicting PHUs. The latent classes highlight broad differences in health care utilization patterns among groups of people, while also providing a way to condense critical information into a smaller set of variables to simplify the PHU prediction model and improve its interpretability. From a care management perspective, the LCA and PHU prediction models provide care managers with insights on specific resource utilization variables that are strongly associated with PHU. Future studies should investigate the value of LCA-derived classes for predicting PHUs in other health care settings with potentially different underlying populations.
Descriptive statistics for the otitis media subpopulation (n=24,992).
Descriptive statistics for the mental health subpopulation (n=34,456).
Descriptive statistics for the musculoskeletal subpopulation (n=24,799).
Descriptive statistics for the acute upper respiratory infection (URI) subpopulation (n=53,232).
Model fit statistics for latent class analysis (LCA) in diagnostic subpopulations.
Odds ratios of predictors in the LCA-enabled logistic regression model predicting persistent high users/utilizers (PHUs) in the full population (N=164,221).
Logistic regression odds ratios for the otitis media subpopulation.
Logistic regression odds ratios for the mental health subpopulation.
Logistic regression odds ratios for the musculoskeletal subpopulation.
Logistic regression odds ratios for the acute upper respiratory infection subpopulation.
ACG: Adjusted Clinical Groups
AIC: Akaike information criterion
BIC: Bayesian information criterion
CONSORT: Consolidated Standards of Reporting Trials
ED: emergency department
EDC: expanded diagnostic cluster
EHR: electronic health record
JHHC: Johns Hopkins Health Care
LCA: latent class analysis
OR: odds ratio
PHU: persistent high user/utilizer
PPV: positive predictive value
RxMG: prescription-defined morbidity groups
URI: upper respiratory infection
We acknowledge the contributions of Sheri Maxim, Jonathan Thornhill, Jason Lee, Hong Kan, and Thomas Richards to this project. This project was funded by the Johns Hopkins APL’s National Health Mission Area (NHMA) Independent Research and Development (IRAD) program.
HK and MM codirected the research project. RR and SH analyzed the data. HC provided analytical insight and calculated claims costs. HK, MM, HB, and JW reviewed and interpreted the results. HK, RR, and MM drafted the manuscript. All authors reviewed and contributed to the final manuscript. HK prepared the manuscript for submission.
None declared.