We identified ten different scoring systems assessing the risk of in-hospital mortality or admission to an ICU in acutely admitted patients. None of these systems complied with the criteria for the highest levels of scientific evidence, but all seemed reasonably scientifically sound and could perhaps be used in a MAU. Most of the scoring systems rely primarily on vital signs to stratify the patients. The SCS and HOTEL scores use some subjective data (e.g. dyspnoea). The ALT and RLD use biochemical analyses and therefore cannot be calculated on presentation of the patient but must await the results of the blood tests. Data for calculating the other eight scores are easily obtained at presentation (except perhaps the ECG needed for calculation of SCS and HOTEL), and the score can be calculated at this early point in time. The WPS, EWS, TTS, SCS, GS, REMS and RAPS use an aggregate weighted score in which increasing abnormality of the variables results in an increased score (e.g. a respiratory rate ≤ 19 scores 0, 20-21 scores 1 and ≥ 22 scores 2). The RLD and ALT use a mathematical formula to calculate the risk (e.g. -10.192 + (-0.013 * gender) + (5.717 * mode of admission) + (0.018 * urea) etc.). The HOTEL score simply adds one point for each criterion that falls outside the defined interval (e.g. systolic blood pressure < 100).
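The three calculation styles described above can be sketched in code. This is an illustrative sketch only: the respiratory-rate thresholds and the regression coefficients follow the examples quoted in the text, while the remaining cut-offs and the use of the logistic function to turn the linear predictor into a probability are our own assumptions, not the full published scores.

```python
import math

def aggregate_weighted_score(respiratory_rate):
    """Aggregate weighted scoring (WPS/EWS-style): increasing abnormality
    gives a higher score. Thresholds follow the example in the text:
    RR <= 19 -> 0, 20-21 -> 1, >= 22 -> 2."""
    if respiratory_rate <= 19:
        return 0
    elif respiratory_rate <= 21:
        return 1
    return 2

def regression_risk(gender, mode_of_admission, urea):
    """RLD/ALT-style risk from a regression formula. The coefficients are
    the fragment quoted in the text; mapping the linear predictor to a
    probability with the logistic function is an assumption on our part."""
    logit = (-10.192
             + (-0.013 * gender)
             + (5.717 * mode_of_admission)
             + (0.018 * urea))
    return 1.0 / (1.0 + math.exp(-logit))

def hotel_style_score(systolic_bp, stands_unaided):
    """HOTEL-style counting: one point per criterion outside its defined
    interval. Only two of the criteria are sketched here."""
    points = 0
    if systolic_bp < 100:       # example cut-off quoted in the text
        points += 1
    if not stands_unaided:      # subjective criterion mentioned in the text
        points += 1
    return points
```

The three functions make the practical difference visible: the first and third need only bedside observations, whereas the second cannot return a value until the urea result is available.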
The ability of a scoring system to separate patients at increased risk of the specified outcome (e.g. mortality) from those without is expressed by its discriminatory power. However, the articles describing RAPS, RLD and EWS do not report this measure. The other seven scoring systems all have an AUROC above 0.657, indicating at least fair discriminatory power. Both the HOTEL score and the SCS reach impressive AUROCs during both development and validation.
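The AUROC figures reported above can be computed directly from scores and outcomes; a minimal sketch of the rank-based (Mann-Whitney) estimate, using invented example data rather than data from any of the cited studies:

```python
def auroc(scores, outcomes):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen patient with the outcome (1) scores higher than a
    randomly chosen patient without it (0). Ties count as half."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for six patients (outcome 1 = died, 0 = survived).
auroc([5, 3, 4, 1, 2, 0], [1, 1, 0, 0, 0, 0])  # -> 0.875
```

An AUROC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which is why values above roughly 0.65, as in the seven reporting studies, are conventionally read as at least fair.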
Calibration, i.e. the agreement between predicted and observed outcomes across patients stratified into subgroups, was not reported systematically. In fact, only four articles (REMS, HOTEL, WPS and, in part, ALT) presented data on this subject. For REMS the calibration was poor, but it was reported as satisfactory to good in the other studies.
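The stratified comparison described above can be sketched as a simple table of mean predicted risk against observed event rate per risk group; this is a simplified Hosmer-Lemeshow-style illustration with invented data, not the procedure used in the cited articles:

```python
def calibration_table(predicted, observed, n_groups=4):
    """Sort patients by predicted risk, split them into equal-sized groups,
    and compare the mean predicted risk with the observed outcome rate in
    each group. Good calibration: the two columns agree in every group."""
    pairs = sorted(zip(predicted, observed))
    size = len(pairs) // n_groups
    table = []
    for g in range(n_groups):
        chunk = pairs[g * size:(g + 1) * size]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(o for _, o in chunk) / len(chunk)
        table.append((round(mean_pred, 3), round(obs_rate, 3)))
    return table
```

A poorly calibrated score, as reported for REMS, would show systematic gaps between the two columns even though the ranking (and hence the AUROC) might still be good.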
A developed scoring system can only be used once it has been validated (i.e. applied to a new cohort of patients); otherwise, the discriminatory power and calibration may be falsely elevated. There are several ways to validate a scoring system, but external validation (i.e. at another location than where the system was developed) in a separate cohort is preferable. However, only three of the systems were validated externally (REMS, RAPS and TTS), and one scoring system was not even validated locally (GS). As described by McGinn et al., scoring systems can be categorized into levels of evidence according to their method of validation. The only scoring systems in this study to reach the highest level of evidence (level 2) were the "Track and Trigger" systems, which are also used in the activation of medical emergency teams. All other systems reached level 3, except the GS, which only reached level 4 as it has merely been derived, not validated.
Most of the parameters used to calculate the scores are straightforward, and the calculation does not seem complicated, perhaps with the exception of RLD and ALT, which use complicated formulas derived from regression analyses. However, as none of the systems presents reliability data, it is unknown what level of inter-observer reliability is achieved. In some of the scoring systems, a few parameters carry a risk of increased inter-observer variability, e.g. whether the patient has dyspnoea (SCS) or is able to stand unaided (SCS and HOTEL), and perhaps the respiratory rate (EWS, TTS, WPS, SCS, RAPS and REMS).
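The missing inter-observer reliability could be quantified with a standard chance-corrected agreement statistic such as Cohen's kappa; a minimal sketch for a binary item (e.g. "dyspnoea present?"), using invented ratings since the cited studies report no such data:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters on a binary item: observed agreement
    corrected for the agreement expected by chance alone.
    1.0 = perfect agreement, 0.0 = no better than chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n      # proportion of 'yes' ratings, rater A
    p_b = sum(rater_b) / n      # proportion of 'yes' ratings, rater B
    chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - chance) / (1 - chance)
```

Reporting such a statistic for the subjective items would make it possible to judge whether, say, "able to stand unaided" is scored consistently across observers.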
The main question, however, is whether we have any use for scoring systems in today's world of medicine. Scoring systems may be capable of predicting mortality and ICU admission, but does this have any clinical importance? Indeed, one could argue that scoring systems add little information to the clinical judgment made by all doctors on their first encounter with a patient. An example of this is the SUPPORT trial, which showed that providing physicians with objective outcome predictions did not significantly change their attitudes and behavior[22, 23] when treating their patients. To clarify this, we need studies comparing the clinical assessment of patients with the combined effect of clinical assessment and the use of scoring systems. This has rarely been done in the Emergency Department, but we know from the critical care environment that physicians are good at predicting mortality by clinical assessment alone, and that scoring systems can support their judgment[1, 24–26]. Even if it is eventually proven that scoring systems improve the assessment of mortality, one could argue that the introduction of a scoring system in itself forces the clinician to reflect on the patient's risk, and that this accounts for the entire effect. On the other hand, scoring systems may identify at-risk patients who would otherwise be overlooked by the medical staff and thereby improve their treatment, and this alone could justify their existence.
Moreover, most scoring systems are developed for use on groups of patients and not on an individual level. This fact is often overlooked by our inexperienced colleagues, and the score is applied directly to the individual patient, which carries a risk of misclassification and thus of directing therapy in, perhaps, the wrong direction. As no impact analysis has been performed for any of the scoring systems presented in this paper, we do not know whether their implementation would affect clinical therapy. If scoring systems are to become a routine part of our clinical work, much more research is needed. For now, therefore, none of these systems can be used on an individual level.