Real time self-rating of decision certainty by clinicians: a systematic review
Abstract
Background
We sought to establish to what extent decision certainty has been measured in real time and whether high or low levels of certainty correlate with clinical outcomes.
Methods
Our pre-specified study protocol is published on PROSPERO, CRD42019128112. We identified prospective studies from Medline, Embase and PsycINFO up to February 2019 that measured real time self-rating of the certainty of a medical decision by a clinician.
Findings
Nine studies were included and all were generally at high risk of bias. Only one study assessed long-term clinical outcomes: patients rated with high diagnostic uncertainty for heart failure had longer length of stay, increased mortality and higher readmission rates at 1 year than those rated with diagnostic certainty. One other study demonstrated the danger of extreme diagnostic confidence: 7% of cases (24/341) rated as having either 0% or 100% likelihood of heart failure were misdiagnosed.
Conclusions
The literature on real time self-rated certainty of clinician decisions is sparse and only relates to diagnostic decisions. Further prospective research with a view to generating hypotheses for testable interventions that can better calibrate clinician certainty with accuracy of decision making could be valuable in reducing diagnostic error and improving outcomes.
Introduction
Good decision making represents a key step in providing high-quality healthcare to patients.1 However, clinicians typically have to make decisions with incomplete data in an environment beset with uncertainty.2 A 2015 Institute of Medicine report stated ‘nearly all patients will experience a diagnostic error in their lifetime, sometimes with devastating consequences’.3 Diagnostic calibration is the process by which a clinician’s confidence in the accuracy of their diagnosis aligns with their actual accuracy.4–6 This alignment of confidence and accuracy is applicable to decision making in every sphere of medicine, provided that both confidence and accuracy are measured. Such alignment matters because it ensures that neither under- nor over-confidence adversely affects decision making.
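As a minimal illustration of the calibration idea described above, the sketch below compares mean stated confidence with observed accuracy across a set of decisions. The ratings are invented for illustration, and `calibration_gap` is a hypothetical helper name rather than a measure taken from any of the cited studies.

```python
from statistics import mean

def calibration_gap(confidences, correct):
    """Return mean stated confidence minus observed accuracy (both on a 0-1 scale).

    A positive gap suggests over-confidence; a negative gap suggests under-confidence.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("expected two non-empty sequences of equal length")
    accuracy = mean([1.0 if was_correct else 0.0 for was_correct in correct])
    return mean(confidences) - accuracy

# Invented ratings for five diagnostic decisions (not data from any study).
confidences = [0.9, 0.8, 0.95, 0.7, 0.85]
correct = [True, False, True, False, True]

gap = calibration_gap(confidences, correct)
print(f"calibration gap: {gap:+.2f}")
```

In this invented example the mean confidence (0.84) exceeds the accuracy (0.60), so the gap is positive, which would be read as over-confidence.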
Unfortunately, the explicit measurement of confidence or certainty for a given medical decision is rare and generally suppressed; being uncertain can lead to criticism and feelings of vulnerability as well as possible medico-legal ramifications.2,7 There is also an implicit bias of conflating high confidence with high competence. Indeed, a recent essay highlighted that speaking with confidence or assertiveness is sometimes prioritised in medical school, perpetuating over-confidence ahead of humility and appreciation of diagnostic uncertainty.8
Most studies which examine the relationship between self-rated confidence and outcomes such as diagnostic accuracy tend to administer retrospective questionnaires that suffer from recall bias or assess confidence in structured vignettes away from the ‘real-life’ clinical situations.5,9–11 While vignettes may be a reasonable proxy for a clinician’s performance in real life, they cannot tell us about the association between decision certainty and outcomes such as length of stay, readmission and mortality. There have been several recent systematic reviews on uncertainty with regard to how we define or measure it, the tolerance of uncertainty and its impact on staff.12–15 However, there have been no reviews to date focusing specifically on measuring decision certainty in real time during actual clinical practice. We therefore sought to establish to what extent decision certainty has been measured in real time and whether high or low levels of certainty correlate with clinical outcomes.
Methods
This review was prepared according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement and registered in the online PROSPERO database (CRD42019128112) prior to data extraction.16
Review question
Our primary question was how many published studies have measured the self-rated certainty of a real time medical decision by a clinician (of any grade) in real clinical practice. Secondary questions for the studies identified by the primary question were: what scale certainty was measured on, how the certainty ratings were distributed and whether the certainty level was associated with any effect on clinical processes or outcomes.
Data sources and searches
We performed a comprehensive electronic database search using Medical Subject Headings (MeSH) and free-text terms for various forms of the following three terms: decision making (anywhere in the full study record), certainty (in the title) and clinician (in the title). These three terms (and their synonyms) were combined using ‘AND’ so that all three needed to be present for a study to be retrieved. The exact search strategy is listed in supplementary material S1. The following electronic databases were searched from inception to 01 February 2019: MEDLINE, Embase and PsycINFO.
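The structure of the search described above can be sketched as follows. The synonym lists and the bracketed field labels are illustrative assumptions, not the actual terms or database syntax (the exact strategy is in supplementary material S1). Joining synonyms within a concept using OR is assumed here as the usual convention; the text itself only states that the three concepts were combined with AND.

```python
def or_block(terms, field):
    """Join the synonyms for one concept with OR and attach an illustrative field label."""
    joined = " OR ".join(f'"{term}"' for term in terms)
    return f"({joined})[{field}]"

# Illustrative synonyms only; these are assumptions, not the published search terms.
decision_terms = ["decision making", "clinical decision"]
certainty_terms = ["certainty", "uncertainty", "confidence"]
clinician_terms = ["clinician", "physician", "doctor"]

# All three concept blocks must be present for a record to be retrieved.
query = " AND ".join([
    or_block(decision_terms, "any field"),  # anywhere in the full study record
    or_block(certainty_terms, "title"),     # title only
    or_block(clinician_terms, "title"),     # title only
])
print(query)
```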
Eligibility criteria
Publications were selected for review if they satisfied the following inclusion criteria: published in English, and a prospective study that included real time self-rating of the certainty of a medical decision by a clinician (of any grade). Real time was defined as occurring at the time of documentation of the clinical encounter. Synonyms for certainty such as confidence were also accepted. Exclusion criteria were articles in languages other than English, non-study records (eg letters, editorials, case reports, reviews) and studies using vignettes or questionnaires (even if the vignette was of a real patient scenario). Any study in which a clinician made a prediction but did not additionally self-rate their certainty in that prediction was also excluded (eg a doctor predicting the likelihood of sepsis in a patient as 70% but not explicitly rating their certainty in that prediction).
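The eligibility criteria above can be read as a simple screening rule. The sketch below is one possible encoding, assuming a hypothetical `Record` structure whose field names are invented for illustration; in the review itself, screening was done by two authors reading the records, not by code.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """Hypothetical summary of one screened publication (field names are assumptions)."""
    language: str
    record_type: str                      # eg "study", "letter", "editorial", "case report", "review"
    prospective: bool
    uses_vignette_or_questionnaire: bool
    real_time_self_rating: bool           # certainty self-rated when the encounter was documented

# Non-study record types named in the exclusion criteria.
EXCLUDED_TYPES = {"letter", "editorial", "case report", "review"}

def is_eligible(record: Record) -> bool:
    """Apply the inclusion and exclusion criteria stated in the text."""
    return (
        record.language == "English"
        and record.record_type not in EXCLUDED_TYPES
        and record.prospective
        and not record.uses_vignette_or_questionnaire
        and record.real_time_self_rating
    )

# A prediction without an accompanying certainty self-rating is excluded.
prediction_only = Record("English", "study", True, False, real_time_self_rating=False)
print(is_eligible(prediction_only))
```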
Data extraction and quality assessment
Two authors (Myura Nagendran (MN), Yang Chen (YC)) independently screened titles and abstracts for potentially eligible studies. After removal of clearly irrelevant records, full text reports were reviewed for eligibility. Final decisions on inclusion were made by consensus between MN and YC. Study risk of bias was assessed by MN and YC using the Critical Appraisal Skills Programme (CASP) checklist for cohort studies or the Cochrane Risk of Bias tool for randomised trials.17,18 Discrepancies were resolved by discussion between MN and YC to arrive at a consensus decision. No formal quantitative synthesis was planned.
Results
The electronic search was performed on 01 February 2019. Flow of records is detailed in Fig 1. After screening, 17 full-text articles were assessed of which eight were excluded immediately. One full-text article could not be retrieved despite numerous efforts (the journal ceased publication in 1999) but we included the abstract that is still available.19 Therefore, nine studies remained for inclusion after full-text review.19–27 Overall risk of bias was high in all studies, including the only randomised trial we included (see supplementary material S2).
Study characteristics
Study characteristics appear in Table 1. Effective sample size ranged from 68 to 1,996 (median 470, interquartile range (IQR) 276 to 1,538). The decision to which a certainty self-rating was applied related to diagnosis in every study. In four studies, the certainty scale applied to a specific diagnosis (heart failure21,23,26 and pneumonia25).
Certainty scales
Every study except one had a scale ranging from uncertain to certain (the exception ranged from high certainty of not heart failure to high certainty of heart failure).21 Two studies used a four-part qualitative scale (either certain, rather certain, rather uncertain or uncertain; and certain, probable, suspected or unknown).22,27 One study provided a numerical adjunct to a three-part confidence rating on the likelihood of diagnosis (high >50%, medium 20–50% or low <20%).25 Two studies used a one to 10 scale,20,24 one used a one to seven scale21 and three studies allowed a self-rating anywhere from 0–100.19,23,26 Certainty was dichotomised from the original scale prior to analysis in three studies.22,23,26
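To make the three-part scale with its numerical adjunct concrete, the sketch below maps a percentage likelihood to the corresponding rating using the cut-offs quoted above (high >50%, medium 20–50%, low <20%). The function name `three_part_rating` is a hypothetical label, and the study collected the rating from clinicians directly rather than deriving it in this way.

```python
def three_part_rating(likelihood_pct: float) -> str:
    """Map a diagnostic likelihood (0-100%) to the three-part rating quoted in the text."""
    if not 0 <= likelihood_pct <= 100:
        raise ValueError("likelihood must be between 0 and 100")
    if likelihood_pct > 50:
        return "high"
    if likelihood_pct >= 20:
        return "medium"   # 20-50% inclusive
    return "low"          # below 20%

for value in (10, 20, 50, 75):
    print(value, three_part_rating(value))
```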
General findings
Study findings appear in Table 2. Extreme certainty ratings were reported in two studies.23,26 McCullough et al found that clinicians rated the likelihood of acute heart failure as 0% in 232 patients (19 of whom were eventually diagnosed with heart failure) and 100% in 109 patients (5 of whom were eventually diagnosed without heart failure).26 Green et al reported that over a quarter of all patients (26%) with dyspnoea were rated as having 0% likelihood of acute heart failure.23 The difference between certainty ratings of junior trainees and senior doctors was examined by only one study;20 the acute abdominal pain (AAP) study group found that diagnostic accuracy between the two groups was comparable at 44 and 43% respectively, but trainees were less certain than seniors for urgent vs non-urgent diagnosis (trainee median 7 (IQR 6 to 8) versus senior median 8 (IQR 7 to 9)). Both groups recommended imaging in 77% of patients despite seniors being more certain about their diagnoses.
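The error rate among extreme ratings can be reproduced from the counts quoted above for McCullough et al. The short calculation below is only a check of that arithmetic. The separate percentages for the 0% and 100% groups are derived here from the reported counts and are not figures stated in the source.

```python
# Counts reported by McCullough et al, as quoted in the text above.
rated_0 = 232          # patients rated 0% likelihood of acute heart failure
rated_0_wrong = 19     # of those, eventually diagnosed with heart failure
rated_100 = 109        # patients rated 100% likelihood of acute heart failure
rated_100_wrong = 5    # of those, eventually diagnosed without heart failure

extreme_total = rated_0 + rated_100              # 341
extreme_wrong = rated_0_wrong + rated_100_wrong  # 24

def pct(numerator, denominator):
    """Express a proportion as a percentage."""
    return 100 * numerator / denominator

print(f"0% ratings in error:   {pct(rated_0_wrong, rated_0):.1f}%")          # 8.2%
print(f"100% ratings in error: {pct(rated_100_wrong, rated_100):.1f}%")      # 4.6%
print(f"all extreme ratings:   {pct(extreme_wrong, extreme_total):.1f}%")    # 7.0%
```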
Buntinx et al studied chest pain in general practice with a follow-up period occurring between 2 weeks and 2 months after the index clinical encounter.27 The proportion of diagnoses rated certain or probable increased from 74 to 88% with the benefit of follow-up information. The only randomised trial we identified focused on assessing the value of providing brain natriuretic peptide (BNP) measurements to clinicians in the intervention group once they had rated their certainty in a diagnosis of heart failure.21 The provision of BNP reduced diagnostic uncertainty and improved diagnostic accuracy.
Associations between certainty self-rating and process or outcome measures
Green et al found that patients who had diagnoses rated with certainty had shorter median length of stay than those rated with uncertainty (5.4 vs 6.6 days, p=0.02).23 Age-adjusted Cox proportional hazards analyses showed uncertainty to be an independent predictor of death (hazard ratio (HR) 1.9, 95% confidence interval (CI) 1.0–2.3, p=0.05) as well as death or rehospitalisation (HR 2.2, 95% CI 1.7–2.5, p=0.01) by 1 year. In contrast, Lave et al found that certainty rating was not associated with vital status at discharge but did account for variation in all measures of resource utilisation assessed except adjusted pharmacy charges.19 Bruyninckx et al found that in general practice, the proportion of urgent referrals for chest pain was similar between certain (14%) and uncertain (18%) ratings.22 The main difference observed was in decisions not to refer (occurring 65% of the time when certain vs 29% of the time when uncertain) and non-urgent referrals (occurring 20% of the time when certain or 54% of the time when uncertain).
Discussion
This is, to our knowledge, the first review of studies measuring the real time self-rated certainty of decisions made by clinicians and there are several prominent findings. First, all studies were generally at high risk of bias making applicability to contemporary clinical practice difficult. Second, the decision to which a certainty self-rating was applied related to diagnosis in every case and was often not the main focus of the study. Third, certainty tended to be measured by a quantitative rather than qualitative scale though visualisation of the distribution of certainty ratings was only present in two of nine studies. Fourth, extreme certainty ratings (0 or 100% likelihood of heart failure) were reported in two studies, one of which found a 7% diagnostic error rate among these cases. Fifth, very few studies reported clinical outcomes, such as length of stay, readmission or long-term mortality, but in the one study that did, there was an association between uncertainty and worse outcomes.
Comparison with the literature
Comparing our findings with the existing non-real time literature requires caution. Traditionally, this literature has taken the form of vignette studies that provide experimental rigour. For example, Meyer et al investigated diagnostic calibration (the relationship between diagnostic accuracy and the confidence in that accuracy).5 They were able to adjust both the difficulty of the diagnostic case and the stage of evolution of the case (ie early with little information or later on with more test results). This ability to adjust is a major advantage of vignette studies. They reported that diagnostic calibration was worse with more difficult cases (accuracy dropped as expected but confidence stayed at a reasonably similar level). They also noted that higher confidence was associated with decreased requests for additional diagnostic tests. However, one study in our review reported that imaging requests were similar between junior and senior doctors despite increased confidence recorded by the seniors.20 This difference may exemplify the problems inherent in comparing vignette studies to real clinical practice.
More recently, a vignette study from Lawton et al found that more experienced emergency medicine doctors were more tolerant of uncertainty and that this explained about a quarter of the relationship between experience and lower risk aversion.10 However, the authors stated that they could not conclude from their design that less risk-averse strategies improve patient safety. A major weakness of vignette studies is that by their nature they cannot provide information on real patient outcomes.
Studies using questionnaires or surveys can address this issue by categorising a cohort of clinicians into various groups based on their responses and then looking at their outcomes. For example, Baldwin et al measured risk aversion and diagnostic uncertainty among 46 doctors by administering a structured questionnaire and then looked to see if the results were associated with their decisions to admit infants with bronchiolitis.9 When adjusted for severity of illness, physicians’ risk attitudes were not associated with admission rates. However, data from our review suggest that certainty was associated with outcomes such as length of stay and readmission.23 One reason for differences between questionnaire and real time studies may be that the former take an aggregate level overview while the latter present more granular data on each individual decision. A clinician rating their personality as risk averse and afraid of uncertainty may nevertheless have a spectrum of certainty self-ratings depending on the particulars of an individual case and other factors such as time of day, presence of senior support and volume of diagnostic results available. Such granularity cannot be captured in a questionnaire study.
Implications for clinicians and researchers
Our findings have several important implications. First, all of the studies we identified focused on diagnosis. While diagnostic error has become a major target for patient safety efforts, there are many other decisions within healthcare that also have a large impact on patient outcomes such as treatment choice and discharge decisions. These could also make valuable targets for future research in this area.28
For clinicians, both trainees and senior doctors, there may be educational value in seeing how their self-rated certainty relates to longer-term outcomes and indeed how they compare to their peers. Future studies could couple this information into an audit and feedback system that facilitates reflective practice.29 The mere act of quantifying a certainty rating might form a cognitive brake and thus an important debiasing strategy that mitigates avoidable medical error occurring from under- or over-confidence.30–33
For researchers, there is very weak existing evidence to suggest a relationship between certainty self-ratings and outcomes such as mortality, readmission and length of stay. At the very least, this may imply that uncertainty could act as a surveillance metric for patients at higher risk of an adverse outcome during their admission (or even after discharge) and therefore a potential target for quality improvement efforts. It is possible that improvements in clinical processes such as reduced waste from over-investigation or better patient flow to appropriate environments could arise as a result of highlighting patients where decisions are made with high or low certainty. However, more prospective work focusing on a broader medical context and identifying the many factors that influence clinician certainty will be required before interventions can be tested.
Limitations
Our findings should be considered in light of several limitations. First, although comprehensive, our search may nonetheless have missed some potentially includable studies. As with any review, a degree of pragmatism is required to balance the identification of studies with logistical constraints. This is especially important in this review as we were effectively looking for medical decisions in any sphere of clinical practice and therefore the search terms could not realistically have been any broader than they were. Second, we elected a priori not to perform quantitative synthesis as we anticipated that there would be extensive heterogeneity. While this makes sense from a methodological point of view, it nonetheless means that we cannot, from this review, give quantitative estimates of the benefits of measuring decision certainty.
Conclusions
The literature on self-rated real time certainty of clinician decisions focuses exclusively on diagnosis and is notably sparse. All studies were generally at high risk of bias making applicability to contemporary clinical practice difficult. The next challenge will be to generate hypotheses for testable interventions that can better calibrate clinician certainty with accuracy of decision making. This could be valuable in reducing diagnostic error and improving outcomes.
Funding
There was no specific funding for this study. MN and YC are supported by National Institute for Health Research (NIHR) academic clinical fellowships. Anthony C Gordon (ACG) is funded by a UK NIHR research professor award (RP-2015-06-018). MN and ACG are both supported by the NIHR Imperial Biomedical Research Centre.
Supplementary material
Additional supplementary material may be found in the online version of this article at www.clinmed.rcpjournal.org:
S1 – Search strategy.
S2 – Risk of bias.
© 2019 Royal College of Physicians
References
3. McGlynn EA
4. Meyer A
5. Meyer AND
6. Meyer AND
8. Treadway N
10. Lawton R
11. Yee LM
12. Alam R
13. Bhise V
15. Strout TD
16. Moher D
17. CASP
18. Higgins JPT
19. Lave JR
20. Acute Abdominal Pain Study Group
26. McCullough PA
28. Prescott HC
29. Patel S
30. Croskerry P
31. Croskerry P