A general critical appraisal tool: An evaluation of construct validity

https://doi.org/10.1016/j.ijnurstu.2011.06.004

Abstract

Background

Many critical appraisal tools (CATs) exist for which there is little or no information on the development of the CAT, evaluation of its validity, or testing of its reliability. The proposed CAT was developed based on a number of other CATs, general research methods theory, and reporting guidelines, but requires further study to determine its effectiveness.

Objectives

To establish a scoring system and to evaluate the construct validity of the proposed critical appraisal tool before undertaking reliability testing.

Methods

Data obtained from this exploratory study were combined with information on the design of the proposed CAT to evaluate construct validity using the Standards for educational and psychological testing, which set out five types of evidence: test content, response process, internal structure, relations to other variables, and consequences of testing. To obtain data for internal structure and relations to other variables, the proposed CAT was analysed against five alternative CATs. A random sample of 10 papers from each of six research designs across the range of health-related research was selected, giving a total sample size of 60 papers.

Results

In all research designs, the proposed CAT had significant (p < 0.05, two-tailed) weak to moderate positive correlations (Kendall's τ 0.33–0.55) with the alternative CATs, except in the Preamble category. There were significant moderate to strong positive correlations in the quasi-experimental (τ 0.70–1.00), descriptive/exploratory/observational (τ 0.72–1.00), qualitative (τ 0.74–0.81), and systematic review (τ 0.62–0.82) designs and to a lesser extent in the true experimental (τ 0.68–0.70) design. There were no significant correlations in the single system research designs.

Conclusions

Based on the results obtained, the theory on which the proposed CAT was designed, and the objective of the proposed CAT, there was enough evidence to show that inferences made from scores obtained from the proposed CAT should be sound.

Introduction

The purpose of a critical appraisal tool (CAT) is to assist readers to rate research papers based on the research methods used and the conclusions drawn. CATs are primarily used in systematic reviews, but they are also useful in literature reviews and in journal clubs, or anywhere a reader wishes to remain objective regarding a research paper. However, there are a number of well documented problems with many existing CATs. These problems include:

  • 1.

    Tools that are limited in the research designs they can evaluate. In many cases only one research design can be evaluated by a particular CAT and different papers with different research designs cannot be directly compared (Crowe and Sheppard, 2010, Deeks et al., 2003);

  • 2.

    Tools lacking the depth to comprehensively assess papers being analysed so that not all aspects of the research undertaken are appraised (Deeks et al., 2003, Moyer and Finney, 2005);

  • 3.

    Tools that use inappropriate scoring systems, such as simplistic summary scores, in such a way that defects in studies may be hidden (Armijo Olivo et al., 2008, Deeks et al., 2003, Jüni et al., 1999); and,

  • 4.

    Tools developed with little regard for basic research techniques so that there is limited or no validity or reliability data. Therefore, these tools cannot claim to validly and reliably assess the research appraised (Bialocerkowski et al., 2004, Crowe and Sheppard, 2010, Maher et al., 2003).

A recent review of CAT design suggested a new structure for a CAT to overcome problems one and two (Crowe and Sheppard, 2010). The structure was based on a qualitative analysis of 44 CATs where information on the design of the CATs was available. The analysis used was the constant comparative method where each item from one CAT was compared with items from other CATs, so that distinct categories of items were created. A combination of general research methods theory and standards for the reporting of research, such as CONSORT (Moher et al., 2001), PRISMA (Liberati et al., 2009) and COREQ (Tong et al., 2007), were also used. The 44 CATs that were analysed could be used to appraise different research designs, so the proposed CAT could potentially be used across all those research designs. This process culminated in: a list of eight categories (Preamble, Introduction, Design, Sampling, Data collection, Ethical matters, Results, and Discussion), where each category included similar information and there was no overlap between categories; 22 items; and a large number of descriptors for each item upon which a research paper could be assessed.
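For illustration only, the sketch below shows one way this structure could be represented in code. It is not the authors' tool: the eight category names are taken from the review, while the item and descriptor text is not reproduced in this paper, so those entries are left as placeholders.

```python
# A minimal sketch of the structure described above: eight non-overlapping
# categories that together contain the proposed CAT's 22 items. Item and
# descriptor wording is not reproduced here, so the lists are placeholders.
proposed_cat_structure = {
    "Preamble": [],
    "Introduction": [],
    "Design": [],
    "Sampling": [],
    "Data collection": [],
    "Ethical matters": [],
    "Results": [],
    "Discussion": [],
}

# Each item would, in turn, carry descriptors against which a paper is assessed.
assert len(proposed_cat_structure) == 8
```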

This paper builds on the review by Crowe and Sheppard (2010) by tackling the two outstanding details of the proposed CAT yet to be resolved: the scoring system (problem 3); and, validity and reliability testing (problem 4). Validity and scoring are the subject of this paper and they were considered independently from reliability because validity and scoring are heavily intertwined, and validity is “the most fundamental consideration in developing and evaluating” a tool (American Educational Research Association et al., 1999, p. 9). Without proper and thorough validity testing, it is irrelevant whether a tool has reliability (American Educational Research Association et al., 1999, pp. 9–11). In other words, if all raters independently agree on a score a paper should receive (reliability) this is immaterial if the score does not accurately reflect what is being measured (validity). Therefore, validation of the proposed CAT was required before reliability, which is the subject of a separate paper (Crowe et al., 2011), could be examined.

However, before exploring the methods used in this research, the exact nature of validity needs to be explained. This is necessary to counteract the persistent belief that: (1) the definition of validity is, ‘does the test measure what it is supposed to measure?’; and, (2) that validity consists or is the sum of many different types of validity, e.g. content, criterion, construct, face, divergent, convergent, predictive, concurrent (Streiner and Norman, 2008, pp. 249–252).

To overcome shortfalls in validity theory, the definition of validity was expanded to four types between 1952 and 1954 by the American Psychological Association (APA), the American Educational Research Association (AERA), and the National Council on Measurements Used in Education (NCMUE) (American Psychological Association et al., 1954). The four types of validity identified were content, predictive, concurrent, and construct validity. However, predictive and concurrent validity were, even at that time, normally considered to be part of the same type of validity, called criterion-related validity (Cronbach and Meehl, 1955). This view of validity led directly to the ‘three Cs’ view of validity that is often used today, which consists of three separate types of validity (content, criterion, and construct) (Streiner and Norman, 2008, pp. 249–252).

Almost as soon as the four-type model of validity was introduced, some authors doubted the conclusions made. In 1955, Cronbach and Meehl stated that content validity could be considered part of construct validity (Cronbach and Meehl, 1955). In 1957, Loevinger went even further, stating that predictive, concurrent, and content validity were ad hoc hypotheses of no scientific importance. This meant that only construct validity was worthy of study, even if the other validities existed (Loevinger, 1957).

Further research into validity throughout the 1960s and 1970s led to the establishment of a unified theory of validity in the mid-1980s which stated that construct validity was the only validity (American Educational Research Association et al., 1999, pp. 9–11; Messick, 1995). This unitary view maintained that validity refers to “the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests”, where test refers to an evaluation method (American Educational Research Association et al., 1999, p. 9). In other words, validity referred to the interpretation of scores based on:

  • 1.

    The theory upon which the test was built, and

  • 2.

    The evidence for how the scores can be interpreted, and

  • 3.

    The stated context for test use.

Therefore, it cannot be claimed that a test is valid. All that can be said is that under the assumptions around which a test was built, the evidence shows that scores can be interpreted in a certain way. If any of the theoretical, evidential, or contextual aspects of the test change, then validity must be re-examined and interpretation of the score may also change. In short, validity is an ongoing process where evidence for how test scores can be interpreted is required each time a test is administered (American Educational Research Association et al., 1999, p. 11; Messick, 1995, Strauss and Smith, 2009).

The unified approach to validity was formalised in the Standards for educational and psychological testing in 1985 and further refined in 1999 (American Educational Research Association et al., 1985, American Educational Research Association et al., 1999). Given the poor track record in CAT validity (Crowe and Sheppard, 2010), the Standards were used in validity testing the proposed CAT because: (1) they are a clear guide to validity; and, (2) they can be applied to any evaluation method and, therefore, can be applied to a CAT which is a method for evaluating research (American Educational Research Association et al., 1999, p. 3).

Evaluating construct validity is considered a mixture of reasoned argument, theoretical foundations, and empirical evidence which together support the credibility of score interpretation (Messick, 1995, Strauss and Smith, 2009, Streiner and Norman, 2008, pp. 249–252). To evaluate construct validity, five types of evidence are gathered: Test content, Internal structure, Response process, Relations to other variables, and Consequences of testing (American Educational Research Association et al., 1999, pp. 11–17). These types of evidence are described below in relation to CATs.

Test content explores the specification of the construct, analysis of test content against the construct (e.g. themes, words, formats, questions, procedures, guidelines), and threats to construct validity. Analysis is a mixture of logic, empirical evidence, and expert judgement (American Educational Research Association et al., 1999, pp. 11–12).

The major threats to construct validity are construct underrepresentation and construct-irrelevant variance, either or both of which may be present within a test (American Educational Research Association et al., 1999, p. 10; Messick, 1995). Construct underrepresentation is when a test is too narrowly focused and fails to include important aspects of a construct. In a CAT, this means that certain aspects of research evaluation may be omitted and the resulting score is not representative of the breadth of research methods used. This is an argument that could be used against the Jadad scale, a CAT commonly used to appraise true experimental research designs, which has only three criteria against which to appraise a research paper (Jadad et al., 1996).

Construct-irrelevant variance is when a test is too broad and includes items that are not relevant to the construct being measured (American Educational Research Association et al., 1999, p. 10; Messick, 1995). In a CAT, this can mean three things: (1) the tool includes items which over-represent one aspect of research design (e.g. multiple references to blinding); (2) the tool includes items that favour one research design over another (e.g. true experimental designs over any other designs); or, (3) the tool includes items that are not related to appraising research. In cases 1 and 2, a positive or negative bias is introduced toward one or more aspects of research design while in case 3, the items should be removed from the test.

In designing a CAT, there is a tension between ensuring that items representative of the construct are included in the test and ensuring that no item gives one or more aspects of research design an advantage or disadvantage (Messick, 1995). However, what constitutes construct underrepresentation or construct-irrelevant variance is open to interpretation, though methods such as expert consensus can be used to reduce subjectivity. For example, some authors argue that only items referring to how a research design was implemented (also known as internal research validity) should be included in critical appraisal, while others argue that critical appraisal is a far broader undertaking and should include ethics and the suitability of a research design to the questions being asked (Dixon-Woods et al., 2006, Petticrew, 2001).

Internal structure concerns whether the relationships between test items reflect the construct on which score interpretations are based. For example, if a construct is unidimensional, then the test items must be homogeneous within the dimension posited, and interpretation of the score must be based on the assumed unidimensionality of the construct. Internal structure also asks whether items in the test function differently for different subgroups of test takers. Analysis is theoretical and empirical (American Educational Research Association et al., 1999, p. 13).

Whether a construct is unidimensional or not, current construct validity theory favours measuring homogeneous constructs together with a single score (American Educational Research Association et al., 1999, p. 13; Strauss and Smith, 2009). The advantage of this approach is that changes to the score reflect changes in measurement of the underlying construct. Previously, scores could represent multiple (heterogeneous) constructs and changes in a score could not be attributed to any particular construct. However, multiple homogeneous constructs may be added together where there is enough theoretical and empirical evidence to show that a single score will enhance understanding without impairing precision of score interpretation (Strauss and Smith, 2009).

The internal structure of research in relation to CATs can be envisioned in four ways:

  • 1.

    Too complex to be reduced to numbers.

  • 2.

    A unidimensional construct that can be allocated a single score.

  • 3.

    Multiple constructs that should be scored for each individual construct without summing the scores.

  • 4.

    Multiple constructs where summing individual construct scores into a single score does not affect the precision of the scores.

The first point has an air of intellectual laziness: it supposes that anything complex is too difficult to be understood in simple terms, which flies in the face of scientific reason. Previous studies have shown that a single score for research has many problems, such as hiding weak sections of the research, and should not be encouraged (Crowe and Sheppard, 2010, Deeks et al., 2003). Therefore, points 3 and 4 were explored below as methods for scoring research.
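To make the distinction between points 3 and 4 concrete, the sketch below reports a per-category profile and, separately, a summed total. The category names are those of the proposed CAT; the numeric values and the act of summing are illustrative assumptions only and do not represent the tool's actual scoring rules.

```python
# A minimal sketch of the two scoring options, using made-up category scores.
category_scores = {
    "Preamble": 4, "Introduction": 5, "Design": 3, "Sampling": 4,
    "Data collection": 4, "Ethical matters": 5, "Results": 3, "Discussion": 4,
}

# Point 3: report the profile of category scores so weak sections remain visible.
for category, score in category_scores.items():
    print(f"{category}: {score}")

# Point 4: a single summed score may be added only where there is theoretical and
# empirical evidence that it does not impair the precision of score interpretation.
print("Total:", sum(category_scores.values()))
```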

Response process ensures that there is a fit between the processes used by a test taker to deliver a response and the construct being tested, e.g. this is why test takers in mathematics are asked to show how they arrived at an answer and not just provide the final result. The response process also includes whether test scores can be interpreted in the same way across subgroups of test takers. Analysis is theoretical and empirical (American Educational Research Association et al., 1999, pp. 12–13).

A CAT, therefore, should give readers the ability to record where they found evidence for different aspects of the research and why they considered it evidence for or against the score given to the research. By collecting information about the reader, such as their research experience, evidence can be gathered about potential differences between subgroups of readers.

Relations to other variables examines test scores against scores from other tests of the same or related constructs, criteria the test is meant to predict, and measures other than test scores that are hypothesised to be related to the construct in question. In terms of ‘traditional’ validity this encompasses such evidence as convergent, discriminant, predictive, and concurrent validity and validity generalisation. However, it must be remembered that these validities are part of the unitary concept of construct validity and not different types of validity (American Educational Research Association et al., 1999, pp. 13–16). Analysis is primarily empirical. For a new CAT, scores should be tested against existing CATs that have validity and reliability data available and that are reported to test the same or similar constructs as the new CAT.

Consequences of testing examines test scores in relation to the intended and unintended consequences of score interpretation. Intended consequences of score interpretation occur when a benefit can be realised. However, that benefit must have the possibility of being realised and must not be over-stated, i.e. the claims must be backed by empirical evidence. Unintended consequences of score interpretation may occur when there are threats to construct validity (i.e. construct underrepresentation and construct-irrelevant variance) or when the test scores are misinterpreted or misused. It is generally up to the test developers (the authors of the test) and test users (administrators of the test) to ensure that misinterpretation and misuse do not occur (American Educational Research Association et al., 1999, pp. 16–17).

A CAT should not over-state its usefulness in appraising a paper and should not be used outside the research methods it was designed to appraise. For example, the scores obtained from using the Jadad scale (Jadad et al., 1996) for any research except health related true experimental research could not be considered valid until it has been subject to appropriate validity testing. Furthermore, most CATs are designed to be used for a particular project and it would, generally, be inappropriate to use the scores obtained from a CAT in one project for another project because the contexts would be different. There are exceptions, such as the PEDro scale (Maher et al., 2003), where a CAT has undergone extensive validity and reliability testing for this purpose.

This paper aims to evaluate the validity of the scores collected from the proposed CAT using the approach outlined above. Test content and internal structure evidence were primarily collected in a review of CATs (Crowe and Sheppard, 2010). Response processes and relations to other variables evidence were largely collected in this study, as outlined in the Methods and Results sections. The Discussion section combines arguments from the previous study, the results in this paper, and statements regarding consequences of testing to form a theoretical, logical, and empirical argument for the validity of the scores obtained by the proposed CAT.


Methods

To meet the aims of the research regarding the scoring system for the proposed CAT and its relations to other CATs, these steps were undertaken:

  • 1.

    Develop the scoring system and a user guide for the proposed CAT

  • 2.

    Pre-test the proposed CAT and user guide, and make amendments where necessary

  • 3.

    Compare the scores achieved by the proposed CAT against the scores achieved by an alternative CAT or CATs, where the alternative CAT or CATs must have validity and reliability data available.
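As a concrete illustration of step 3, the sketch below correlates scores from the proposed CAT with scores from an alternative CAT across a set of papers using Kendall's τ, the statistic reported in the Results. The score vectors, the use of total scores, and the library call are assumptions made for the example; they are not the study's data or analysis code.

```python
# A minimal sketch, under assumed data, of comparing the proposed CAT against an
# alternative CAT using Kendall's tau (SciPy reports a two-tailed p-value).
from scipy.stats import kendalltau

# Hypothetical scores for ten papers appraised with both tools.
proposed_cat_scores = [32, 28, 41, 35, 22, 38, 30, 27, 36, 25]
alternative_cat_scores = [6, 5, 9, 7, 4, 8, 6, 5, 8, 4]

tau, p_value = kendalltau(proposed_cat_scores, alternative_cat_scores)
print(f"Kendall's tau = {tau:.2f}, two-tailed p = {p_value:.3f}")
```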

Results

Of the papers randomly selected, a total of 36 were rejected because they did not have the required research design. The rejected papers break down as: six from true experimental; four from quasi-experimental; 12 from single system; one from descriptive, exploratory, and observational; three from qualitative; and 10 from systematic review. A full list of the papers used in pre-testing and the main study is available as Supplementary material 1.

Discussion

The discussion is based on the evaluation of construct validity as outlined. All information was based on the results obtained from a previous study of CAT design (Crowe and Sheppard, 2010) and the results of this study. It must be remembered that evaluation of construct validity is an ongoing process. This discussion represents a preliminary evaluation of construct validity based on existing data. Further evaluation of construct validity will occur as more data are gathered in future studies.

Conclusion

The benefits of the proposed CAT are that it is relatively simple to implement, can be used in all research designs in health, and the scores obtained can then be directly compared. Other tools which are said to have this capacity have not undergone a validity or reliability testing process.

This preliminary step in the process of validity testing will continue into subsequent studies. Meanwhile, based on the aims of this study and a previous study (Crowe and Sheppard, 2010), the proposed CAT

References (32)

  • J.W. Creswell. Research Design: Qualitative, Quantitative, and Mixed Methods Approaches (2008).
  • L.J. Cronbach et al. Construct validity in psychological tests. Psychological Bulletin (1955).
  • M. Crowe et al. A review of critical appraisal tools show they lack rigour: alternative tool structure is proposed. Journal of Clinical Epidemiology (2010).
  • M. Crowe, L. Sheppard, A. Campbell. Reliability analysis for a proposed critical appraisal tool demonstrated... (2011).
  • J.J. Deeks et al. Evaluating non-randomised intervention studies. Health Technology Assessment (2003).
  • M. Dixon-Woods et al. How can systematic reviews incorporate qualitative research? A critical perspective. Qualitative Research (2006).