Development and internal validation of an algorithm for estimating mortality in patients encountered by physician-staffed helicopter emergency medical services

Background
Severity of illness scoring systems are used in intensive care units to enable the calculation of adjusted outcomes for audit and benchmarking purposes. Similar tools are lacking for pre-hospital emergency medicine. Therefore, using a national helicopter emergency medical services database, we developed and internally validated a mortality prediction algorithm.

Methods
We conducted a multicentre retrospective observational register-based cohort study based on the patients treated by five physician-staffed Finnish helicopter emergency medical service units between 2012 and 2019. Only patients aged 16 and over treated by physician-staffed units were included. We analysed the relationship between 30-day mortality and physiological, patient-related and circumstantial variables. The data were imputed using multiple imputations employing chained equations. We used multivariate logistic regression to estimate the variable effects and performed derivation of multiple multivariable models with different combinations of variables. The models were combined into an algorithm to allow a risk estimation tool that accounts for missing variables. Internal validation was assessed by calculating the optimism of each performance estimate using the von Hippel method with four imputed sets.

Results
After exclusions, 30 186 patients were included in the analysis. 8611 (29%) patients died within the first 30 days after the incident. Eleven predictor variables (systolic blood pressure, heart rate, oxygen saturation, Glasgow Coma Scale, sex, age, emergency medical services vehicle type [helicopter vs ground unit], whether the mission was located in a medical facility or nursing home, cardiac rhythm [asystole, pulseless electrical activity, ventricular fibrillation, ventricular tachycardia vs others], time from emergency call to physician arrival and patient category) were included. Adjusted for optimism after internal validation, the algorithm had an area under the receiver operating characteristic curve of 0.921 (95% CI 0.918 to 0.924), Brier score of 0.097, calibration intercept of 0.000 (95% CI -0.040 to 0.040) and slope of 1.000 (95% CI 0.977 to 1.023).

Conclusions
Based on 11 demographic, mission-specific, and physiologic variables, we developed and internally validated a novel severity of illness algorithm for use with patients encountered by physician-staffed helicopter emergency medical services, which may help in future quality improvement.

Supplementary Information
The online version contains supplementary material available at 10.1186/s13049-024-01208-y.


Background
Since the release of the Acute Physiology and Chronic Health Evaluation (APACHE) score in 1981 [1], several prognostic scoring systems have been developed to assess the severity of disease in critically ill patients treated in the intensive care unit (ICU) [2,3]. Risk scores have also been developed for other purposes, such as assessing the severity of injury or of a given disease, facilitating triage decisions and indicating the need for interventions [2]. ICU risk scores may be used to detect and quantify organ failure and to provide a statistical estimation of outcomes for quality improvement, audit and benchmarking purposes [3][4][5]. The APACHE score, Simplified Acute Physiology Score (SAPS) and Mortality Prediction Model (MPM) are examples of the latter [1,3,6,7].
Care of critically ill patients is often initiated in pre-hospital settings, and in certain patient populations this care is paramount for patient outcomes [8][9][10][11][12][13]. Even so, the risk stratification tools used in the pre-hospital setting are mostly limited to disease-specific risk scores and early warning scores (EWS) used for triage decision making, identifying critical illness and assessing the level-of-care requirements for the receiving centre [14][15][16][17]. Different EWS have shown varying value in predicting short-term adverse outcomes in pre-hospital settings, with decreasing predictive ability during longer follow-up [13,15,17,18]. We currently lack a uniform mortality risk model for the wide range of critically ill pre-hospital patients attended by physician-staffed units that could allow for the estimation of standardised mortality ratios (SMR) in benchmarking and for risk stratification in epidemiological studies. Using a national helicopter emergency medical services (HEMS) database, we developed and internally validated a uniform risk algorithm for predicting mortality in patients treated by physician-staffed HEMS (P-HEMS) units based on essential physiological variables and additional factors independent of treatment.

Study setting
To develop a mortality model, we conducted a multicentre retrospective observational register-based cohort study of patients encountered by the national Finnish helicopter emergency medical services (FinnHEMS) between January 2012 and September 2019. The FinnHEMS organisation is publicly funded and comprises six operational units, of which five are physician-staffed and one is staffed only by paramedics. The physician-staffed units operate within the catchment areas of the five Finnish university hospitals (see Additional file 1). The paramedic-staffed unit operates solely in the sparsely populated district of Lapland. The service areas cover most of the population of Finland [19]. The fleet includes Airbus 135 and 145 helicopters, as well as rapid response ground vehicles that are used in short-range missions and whenever the weather conditions are not suitable for aviation.
Finnish P-HEMS units mainly encounter critically ill or injured patients who require pre-hospital critical care. The P-HEMS units are dispatched based on uniform predefined criteria by the national emergency response centre agency. Ambulance crews can also request P-HEMS response. The major categories for P-HEMS dispatch include major trauma, cardiac arrest, and impaired level of consciousness. The physician can cancel or deny the mission if the patient is not considered able to benefit from the care provided by the P-HEMS based on the information provided by the dispatcher or ambulance crew. The HEMS physicians are mostly experienced anaesthesiologists. Secondary transfers are rare, but the units can be dispatched to medical facilities or nursing homes for primary missions. The characteristics of the Finnish HEMS, including the relatively low utilization of helicopter transportation of the patients, have been recently described [20].
We report our findings in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [21]. Ethical approval was not required for this study, as it was retrospective in nature, exclusively utilizing non-original register-based data that were neither generated nor collected specifically for this research and involved no interventions or direct contact with study participants.
The philosophical underpinnings of this research are based on addressing the current gaps in risk assessment tools for pre-hospital critical care. This research paradigm stresses the importance of evidence-based medicine and using predictive analytics to improve pre-hospital care delivery. The theoretical framework builds upon established risk scoring systems used in critical care settings and expands their application to the pre-hospital environment. By adhering to transparent reporting standards, we aim to ensure the robustness and applicability of the developed risk algorithm.

Participants and study outcome
We included patients aged 16 years and over who were assessed by P-HEMS units. Patients treated by the paramedic-staffed unit operating in Lapland were not included due to differences in staffing and dispatch criteria [20]. Patients from the autonomous region of Åland were excluded as the local health care system functions separately from that of the mainland. No other eligibility criteria were applied (Fig. 1). Our main outcome was overall mortality within 30 days of encountering the P-HEMS unit. We chose this in preference to mortality over a shorter follow-up because we consider longer-term survival to be of greater importance for both the individual and society.

Data collection
The research material was derived from the FinnHEMS database (FHDB), covering every HEMS mission in Finland during the study period. Since its adoption for nationwide use in 2012, the operational and clinical data from every FinnHEMS mission have been registered and stored in the FHDB in accordance with generally accepted guidelines and registry templates for pre-hospital data collection [20,22,23,24]. The FHDB contains records with a total of more than 170 variables (see Additional file 2). The data are manually entered into the database by a member of the FinnHEMS unit that attended the mission. Obviously abnormal values are rejected by the FHDB registration system; however, erroneous values that fall within the plausible range of each measure are not detected. For physiological parameters, only the first measurements after the HEMS arrival were included.
Population registry data provided by the Finnish Digital and Population Data Services Agency were used to obtain information about the main outcome and verify the age and sex of the patient. A personal identity code offered by the Digital and Population Data Services Agency links a nationwide population registry with healthcare software systems.

Candidate predictor variables
From the FHDB, 14 candidate variables were selected for analysis based on the consensus of the authors: patient age, number of patients in a single mission, patient sex, Glasgow Coma Scale (GCS), HEMS vehicle type (helicopter or ground unit), cardiac rhythm, respiratory rate, systolic blood pressure, oxygen saturation, heart rate, patient category, time from emergency call to HEMS arrival, time from emergency call to arrival of the first emergency medical services (EMS) unit and whether the mission was located in a medical facility or nursing home. The consensus was reached by employing a collaborative decision-making approach among the authors, who independently selected potential variables from a comprehensive list of variables. Subsequently, the selections were compiled, and similarities and differences were assessed collectively. Through iterative discussion and consensus-building, agreement was reached.

Missing data
We dealt with missing data by using multiple imputation by chained equations [25]. This method estimates missing data over multiple iterations to create complete datasets for analysis. We performed 10 iterations to generate 30 complete datasets. The differences between the patients with and without missing data were used to identify further variables to be included in the imputation, along with the candidate predictor variables and the primary outcome. In two cases, clearly erroneous data were observed; these were treated as missing data.

Model development and predictor effects
We analysed how patient characteristics relate to the outcome using the Mann-Whitney U test. Then, we used multivariate logistic regression to estimate the effects of the studied predictors on the outcome in each of the 30 imputed datasets. To combine the results from the imputed datasets, we used Rubin's rules, a commonly used set of formulas for combining results from multiply imputed data [25]. With this method, we obtained the final pooled estimate for the effects of the predictors.
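The pooling step can be illustrated with a minimal sketch of Rubin's rules for a single regression coefficient. The function name `pool_rubin` and the example numbers are hypothetical and are not taken from the study; the sketch only shows the standard formulas (pooled estimate as the mean, total variance as within-imputation variance plus an inflated between-imputation variance).

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: coefficient estimate from each imputed dataset
    variances: squared standard error of that coefficient in each dataset
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)          # pooled point estimate
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical log-odds estimates from three imputed datasets
estimate, std_error = pool_rubin([0.5, 0.7, 0.6], [0.04, 0.04, 0.04])
print(round(estimate, 3), round(std_error, 3))
```

In the study the same pooling would be applied to each coefficient across the 30 imputed datasets; three datasets are used here only to keep the example short.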
To avoid the excessive influence of extreme values, we winsorized all continuous variables (except for the GCS) at 1% at both ends of the distribution: any values below the lower limit or above the upper limit were set to the limit itself. Additionally, we used restricted cubic splines to allow continuous predictor variables to relate to the outcome in a non-linear way. We used three knots for GCS and four for the remaining continuous variables. To assess the statistical significance of individual variable effects on the outcome, we used the Wald test.
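As an illustration of winsorization, the following sketch clips a list of values to its 1st and 99th percentiles. The nearest-rank percentile definition and the function name `winsorize` are assumptions made for this example; the text does not state which percentile definition or software routine was used.

```python
import math


def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Clip values to the given lower and upper percentiles.

    Percentiles use the nearest-rank definition (an assumption of this sketch).
    Values outside the limits are set to the limit itself.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)

    def percentile(p):
        rank = max(1, math.ceil(p * n / 100))  # nearest-rank position, 1-based
        return ordered[rank - 1]

    low, high = percentile(lower_pct), percentile(upper_pct)
    return [min(max(v, low), high) for v in values]


# Hypothetical heart rates 1..200: the extremes are pulled in to the limits
clipped = winsorize(list(range(1, 201)))
print(min(clipped), max(clipped))
```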

Algorithm formation
We aimed to develop a tool that could be used with actual pre-hospital data, in which missing values are frequent. To allow its use with incomplete data, we created not only a single prediction model but also multiple additional models with different combinations of the same predictor variables used in the original model. These predictor combinations were designed so that each additional model excluded one or more of the candidate predictor variables with the most missing data.
All these models were then combined into an algorithm, the Critical HEMS Algorithm for Mortality Prediction (CHAMP). The algorithm allows for the case-by-case selection of a tailored model for each individual based on the available variables. The CHAMP algorithm automatically selects the model with the most available variables for each patient. All models were built in the same manner as the original model with no missing variables (referred to as the full model later in the text).

Assessing the performance
The discriminative abilities of both the models and the algorithm were investigated using the area under the receiver operating characteristic (AUROC) curve. Calibration was evaluated by fitting a calibration curve and calculating the slopes and intercepts for the predicted probability of the outcome. A slope of one and an intercept of zero would suggest ideal calibration. Overall performance was assessed using the Brier score, a metric of prediction accuracy that encompasses both discrimination and calibration. It ranges from zero to one, with zero indicating perfect accuracy. In addition, the Hosmer-Lemeshow test was used to test the calibration of the algorithm; non-significant values imply a good fit.
For individual models, all performance estimates were calculated in the imputed sets and pooled using Rubin's rules, whereas the performance of the CHAMP was calculated for the original population to illustrate a more authentic user experience. We used a generalised additive model (GAM) risk plot, a receiver operating characteristic plot and a risk decile plot to visualise the performance and calibration of the algorithm.
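The two headline metrics can be sketched in a few lines. This is a minimal illustration on made-up data, not the analysis code of the study; the function names `brier_score` and `auroc` are hypothetical. The AUROC is computed directly from its definition as the probability that a randomly chosen patient who died received a higher predicted risk than a randomly chosen survivor, with ties counted as one half.

```python
def brier_score(outcomes, predicted):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predicted)) / len(outcomes)


def auroc(outcomes, predicted):
    """Area under the ROC curve via pairwise comparison of deaths and survivors.

    Quadratic in the number of patients, which is fine for a small illustration.
    """
    deaths = [p for y, p in zip(outcomes, predicted) if y == 1]
    survivors = [p for y, p in zip(outcomes, predicted) if y == 0]
    if not deaths or not survivors:
        raise ValueError("both outcome classes are required")
    wins = sum((d > s) + 0.5 * (d == s) for d in deaths for s in survivors)
    return wins / (len(deaths) * len(survivors))


# Hypothetical outcomes (1 = died within 30 days) and predicted risks
outcomes = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.6, 0.7]
print(brier_score(outcomes, predicted), auroc(outcomes, predicted))
```

Calibration slope and intercept are not shown here because they require refitting a logistic model on the predicted log-odds.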

Sensitivity analysis and internal validation
As the studied population included a notable proportion of cardiac arrest patients, a specific subgroup known to have high mortality [26], we performed a planned sensitivity analysis excluding patients with cardiac arrest as the primary dispatch code to assess the robustness of the results.
Internal validation was performed by calculating the optimism of each performance estimate using the von Hippel method with four imputed sets each containing 250 bootstrapped samples.
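The general idea of an optimism correction can be sketched as follows. This is a generic bootstrap optimism calculation under stated assumptions, not a reproduction of the von Hippel procedure for combining bootstrapping with multiple imputation; the function name `optimism_corrected` and all numbers are hypothetical.

```python
from statistics import mean


def optimism_corrected(apparent, boot_apparent, boot_test):
    """Subtract the average bootstrap optimism from the apparent performance.

    apparent:      performance of the final model on the data used to fit it
    boot_apparent: performance of each bootstrap model on its own bootstrap sample
    boot_test:     performance of each bootstrap model on the original data
    """
    if not boot_apparent or len(boot_apparent) != len(boot_test):
        raise ValueError("bootstrap lists must be non-empty and of equal length")
    optimism = mean([a - t for a, t in zip(boot_apparent, boot_test)])
    return apparent - optimism


# Hypothetical AUROC values from two bootstrap replicates
print(round(optimism_corrected(0.925, [0.93, 0.92], [0.92, 0.92]), 3))
```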

Study population
During the study period, 36 633 patients were encountered by the HEMS. After exclusions, 30 186 patients were included in the final analysis (Fig. 1). The median patient age was 60 [IQR 39 to 73] years, and 65% of the patients were male. The 30-day mortality rate was 30% (n=8611). A total of 11 971 (40%) of the patients included in the final analysis had missing data for at least one of the studied predictors or the outcome (see Additional file 3: Table S1). The study cohort is described in detail according to the occurrence of the main outcome in Table 1.

Model development
The variables were first screened for suitability for modelling, and those with too much missing data, too rare occurrences, or too few deaths per category were dropped. Of the 14 initially selected candidate variables, 11 predictors (systolic blood pressure, heart rate, oxygen saturation, GCS, sex, age, HEMS vehicle type, whether the mission was located in a medical facility or nursing home, cardiac rhythm, time from emergency call to HEMS arrival and patient category) were included in the full model (Fig. 2). We assessed the significance of their effect on the outcome with the Wald test (Table 2). The odds ratios for the selected categorical variables are listed in Table 3. We allowed nonlinear effects by using restricted cubic splines for continuous variables (Fig. 3).
As we identified a notable amount of missing data for some variables (see Additional file 3: Table S2), we created 31 additional variations of the model to allow case-by-case exclusion of different combinations of the five variables with the most missing data: systolic blood pressure, heart rate, oxygen saturation, GCS and cardiac rhythm (see Additional file 4). They are designed to be applied whenever missing data for these variables are encountered. Combined, the 32 models form the CHAMP algorithm (Fig. 4). The CHAMP algorithm chooses the most suitable model for each patient depending on the available predictor variables. For example, if data for heart rate were missing, the algorithm would use the model that does not utilise heart rate as a predictor variable. The CHAMP algorithm can be accessed and the 30-day mortality estimate calculated using calculator software designed for this purpose [32].
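The selection logic described above can be sketched as follows. The variable names, the dictionary-based patient record and the decision to raise an error when one of the six remaining predictors is missing are all assumptions made for illustration; the published calculator may handle these details differently, and the fitted coefficients are not reproduced here.

```python
from itertools import combinations

# Predictors that every model variation uses (hypothetical identifiers)
CORE = ("sex", "age", "vehicle_type", "facility_or_nursing_home",
        "time_to_hems_arrival", "patient_category")
# The five predictors with the most missing data, each of which may be excluded
OPTIONAL = ("systolic_bp", "heart_rate", "oxygen_saturation", "gcs",
            "cardiac_rhythm")


def enumerate_models():
    """List the predictor set of every model: core plus any subset of optional."""
    return [frozenset(CORE) | frozenset(subset)
            for k in range(len(OPTIONAL) + 1)
            for subset in combinations(OPTIONAL, k)]


def select_model(patient):
    """Return the predictor set using every optional variable that is recorded."""
    missing_core = [name for name in CORE if patient.get(name) is None]
    if missing_core:
        # Assumption of this sketch: no model variation omits a core predictor
        raise ValueError(f"missing core predictors: {missing_core}")
    available = frozenset(n for n in OPTIONAL if patient.get(n) is not None)
    return frozenset(CORE) | available


# Hypothetical patient with no recorded heart rate
patient = {"sex": "M", "age": 67, "vehicle_type": "helicopter",
           "facility_or_nursing_home": False, "time_to_hems_arrival": 25,
           "patient_category": "trauma", "systolic_bp": 95,
           "oxygen_saturation": 91, "gcs": 9, "cardiac_rhythm": "sinus"}
chosen = select_model(patient)
print(len(enumerate_models()), sorted(chosen))
```

Enumerating all subsets of five optional predictors gives 2^5 = 32 predictor sets, which matches the full model plus the 31 additional variations.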

Model performance, sensitivity analysis and internal validation
All the performance measures and optimism corrections based on internal validation for the CHAMP algorithm are presented in Table 4 and Fig. 5. For individual models, we observed AUROCs ranging from 0.868 to 0.927 and Brier scores from 0.125 to 0.093, depending on the excluded variables. Calibration intercepts were between -0.003 and 0.000, and slopes between 0.996 and 0.999 (see Additional file 5). The results of the sensitivity analysis excluding cardiac arrest patients differed from those of the primary analysis (Table 4).

Key findings
We analysed data from 30 186 patients encountered by P-HEMS units, revealing a 30-day mortality rate of 30%. Notably, a substantial proportion of patients had missing data for predictor variables, as is often the case with pre-hospital data. After selecting and evaluating predictor variables, we developed a total of 32 prediction models. These models were then combined to form the Critical HEMS Algorithm for Mortality Prediction (CHAMP). With CHAMP, 30-day mortality in patients encountered by P-HEMS can be estimated using 11 easily obtainable variables.
For the full model with all 11 variables, the analysis revealed that the cardiac rhythms VF, VT, asystole and PEA indicated higher mortality risk. Mission location and time to HEMS arrival initially showed an association with mortality risk, but these diminished in multivariate analysis. Type of HEMS vehicle and patient sex demonstrated weaker associations with mortality. Patient categories exhibited varying associations, with cardiac arrest and stroke indicating the highest mortality risk. Mortality increased with age, extreme systolic blood pressure values, and decreasing heart rate, oxygen saturation, and Glasgow Coma Scale (GCS) scores.
Following internal validation, we observed a promising preliminary performance with excellent discrimination and calibration. The sensitivity analysis without cardiac arrest patients revealed that the model exhibited slight variation but still performed acceptably. If our algorithm is externally validated, it can be used to calculate SMR in the patient population encountered by P-HEMS and possibly other EMS units and would offer a mortality estimation of patients based on initial assessment independent of pre-hospital interventions. To improve the algorithm's accessibility, we developed calculator software that can be accessed online [32].
The model proposed by Seymour et al. [33] was further validated externally in the Dutch EMS system, achieving an AUROC of 0.74 [35]. In contrast to the original study and the previous external validation, the Dutch cohort included P-HEMS missions, although these covered only a small proportion (0.7%, n=22) of all patient encounters. We believe that the model proposed by Seymour et al. could be used in parallel with ours, as it serves to predict the need for intensive care, whereas our model focuses on mortality. As Seymour et al. pointed out, their model is meant as a triage tool to be applied at the scene and needs to be simple. As our algorithm is not intended to be calculated at the scene, simplicity was not our priority, which allowed us to create a more complex model while still using obtainable variables.
The studied variables comply with the reporting policies agreed upon within the HEMS and EMS communities [22][23][24]36]. The algorithm's ability to vary according to different combinations of missing data enables its use in the statistically challenging pre-hospital field, where imperfections in data collection and availability are unavoidable. Due to the very nature of pre-hospital critical emergency medicine, certain physiological measures will not be achievable in every mission, even with best practices. For example, cardiac arrest patients, who constitute a major patient population for most HEMS teams, present without vital functions, and some physiological parameters, such as oxygen saturation, are therefore unmeasurable. In addition, pre-hospital settings often involve dynamic and unpredictable situations, and data collection may not always be feasible or prioritized amid the urgency of patient care. P-HEMS teams frequently operate with limited resources. Technological issues such as device malfunctions, connectivity problems, or user interface difficulties can contribute to missing data.

Implications
It is important to recognise that CHAMP is not intended to provide prognostication for individual patients but rather to characterise a group of patients. For epidemiological research, it may be used, for example, to risk stratify a population of interest or to match the baseline characteristics of a control arm to those of an intervention arm. SMR is the ratio of observed to predicted mortality. Predicted mortality, in turn, can be estimated with CHAMP. Using SMR as a performance measure enables benchmarking, quality assurance and prioritising targets for improvement.
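The SMR calculation can be sketched in a few lines, assuming that a predicted 30-day mortality risk is available for every patient in the group. The function name `standardised_mortality_ratio` and the numbers are hypothetical; the predicted number of deaths is taken as the sum of the individual predicted risks.

```python
def standardised_mortality_ratio(died, predicted_risk):
    """Observed deaths divided by the sum of predicted risks for a patient group.

    died:           1 if the patient died within 30 days, otherwise 0
    predicted_risk: predicted 30-day mortality probability for each patient
    """
    if len(died) != len(predicted_risk):
        raise ValueError("inputs must have the same length")
    expected = sum(predicted_risk)
    if expected <= 0:
        raise ValueError("predicted mortality must be positive")
    return sum(died) / expected


# Hypothetical unit: two observed deaths against two predicted deaths
print(standardised_mortality_ratio([1, 0, 0, 1], [0.8, 0.1, 0.2, 0.9]))
```

An SMR above one indicates more deaths than predicted for the group, and an SMR below one indicates fewer.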
Alongside external validation, another focus of future research should be the CHAMP algorithm's conformity to changing registration policies and adaptation to future innovations as new clinical predictors and measurement methods are identified and adopted for pre-hospital critical emergency medicine.

Strengths and limitations
We note several study strengths. The FHDB is large and includes data collected since 2012; data are collected systematically from multiple units. The HEMS units contributing to the database serve the whole of Finland and are an integral part of the national publicly funded healthcare system. The study sample included every P-HEMS mission in Finland during the study period.
Our study has some limitations. P-HEMS missions constitute only a small proportion of all pre-hospital patient encounters. Although the CHAMP algorithm is designed to be used for patients treated by P-HEMS, some selection bias is possible, since the criteria for P-HEMS activation in Finland may vary from those in different health care systems.
A sensitivity analysis without cardiac arrest patients showed inconsistency in the results, most distinctly with respect to discrimination performance, suggesting that the applicability of the algorithm might be limited in settings with a divergent incidence of cardiac arrest.However, cardiac arrest patients form a substantial proportion of the patients treated by most P-HEMS [37][38][39].
We identified a high proportion of missing data and excluded variables with more than one-third of the data missing. To allow multiple imputation for the remaining variables, we assumed that the data were missing at random, which may be debated, but this bias may be reduced as the algorithm selects a model that accounts for some of the missing variables. The data were collected and entered manually into the database, which may have resulted in erroneous measurement and registration in addition to problems with inter-rater reliability. Nevertheless, the reliability of the FHDB has recently been evaluated and found to be acceptable for data registration [40].

Conclusions
Based on a comprehensive and systematically gathered database, we developed and internally validated a novel prediction algorithm for 30-day mortality prediction in patients encountered by a P-HEMS unit. The algorithm combines 32 prediction models using 11 easily obtainable variables: systolic blood pressure, heart rate, oxygen saturation, GCS, sex, age, HEMS vehicle type, whether the mission was located in a medical facility or nursing home, cardiac rhythm, time to HEMS arrival and patient category according to dispatch code. If the current algorithm in time proves successful in external validation, it could be a useful research and quality assurance tool.

Fig. 1
Fig. 1 Study cohort selection process. HEMS, helicopter emergency medical services. *Missions including patients located in the autonomous region of Åland

Fig. 4
Fig. 4 Development of the CHAMP. The model variations include six to eleven predictors per model, with one or more of the five predictors with the most missing data excluded. CHAMP, Critical HEMS Algorithm for Mortality Prediction

Fig. 5
Fig. 5 Calibration of the CHAMP (Critical HEMS Algorithm for Mortality Prediction) algorithm (solid line) with 95% confidence interval (shaded grey area) in the original non-imputed population. The line was fitted using a generalised additive model (GAM). The dashed line represents ideal calibration

Table 1
Study population characteristics
Data are median [IQR] or n (%). AFib Atrial fibrillation, AFlut Atrial flutter, AV Atrioventricular, EMS Emergency medical services, GCS Glasgow Coma Scale, HEMS Helicopter emergency medical services, PEA Pulseless electrical activity, SVES Supraventricular extrasystole, VES Ventricular extrasystole, VF Ventricular fibrillation, VT Ventricular tachycardia

Table 2
Univariate and multivariate Wald statistics for predictor variables
HEMS Helicopter emergency medical services, PEA Pulseless electrical activity, VF Ventricular fibrillation, VT Ventricular tachycardia. a Rapid response vehicle + other. b Other than stroke

Table 3
Univariate and multivariate odds ratios for selected categorical predictor variables for the full model
HEMS Helicopter emergency medical services, PEA Pulseless electrical activity, VF Ventricular fibrillation, VT Ventricular tachycardia. a Rapid response vehicle + other

Fig. 3
Fig. 3 The association between 30-day mortality and selected continuous predictor variables. GCS, Glasgow Coma Scale; HEMS, helicopter emergency medical services

Interpretation
Seymour et al. studied the ability of pre-hospital factors easily obtainable at the scene to predict development of critical illness, defined as severe sepsis, delivery of mechanical ventilation or death at any point during hospitalization [33]. A development cohort consisted of patients encountered in Washington, USA, by either basic or advanced life support trained EMS and included neither physician-staffed ground EMS nor HEMS units, thus differing from our study setting. Patients with trauma or cardiac arrest were excluded, both of whom form a substantial proportion of patients treated by many HEMS systems. Based on their findings, a score was created to calculate the risk for critical illness, including patient sex, age, respiratory rate, oxygen saturation, systolic blood pressure, heart rate, GCS and nursing home location as predictors, many of which we found to have predictive value in our study. Seymour et al. reported a promising performance in internal validation with an AUROC of 0.77 and a Brier score of 0.04. However, the model's applicability to HEMS systems may be of limited value, for the reasons discussed. The model's discrimination was assessed by Kievlan et al. in a 2016 external validation study that reported an AUROC of 0.73 [34].

Table 4
Performance and internal validation of the CHAMP algorithm along with the results of a sensitivity analysis without cardiac arrest patients