
Description

Research studies often compare variables, conditions, times, and/or groups of participants to evaluate relationships between variables or differences between groups or times. For example, if researchers are interested in knowing whether an intervention produces change in the desired direction, they will want to know whether the change is due to chance (statistical significance) or possibly due to the intervention. In this case, researchers could use a pre and post measurement of the same participants on the condition being treated, or they could compare a group of individuals who receive the intervention to a group that does not receive the intervention. Researchers could also compare two groups of individuals who receive different interventions. The rigor of the research design helps control for other factors that might account for the changes (e.g., time, conditions, group differences in other factors, etc.).

To prepare for this Discussion, consider the concept of statistical significance.

Post

  • Explanation of how the difference between statistical significance and the true importance (clinical significance) of the relationship between variables, or the degree of difference between groups, affects your practice decision making.
  • Include an explanation of what statistical significance means.
  • Include an example from a quantitative study that found statistically significant differences.
  • Discuss whether the results of the study would—or should—influence your practice as a social worker.

Please use the resources to support your answer.

Use APA format and three peer-reviewed references.

Use subheadings for your answer.

Reference

Bauer, S., Lambert, M. J., & Nielsen, S. L. (2004). Clinical significance methods: A comparison of statistical techniques. Journal of Personality Assessment, 82(1), 60–70. https://doi.org/10.1207/s15327752jpa8201_11

Yegidis, B. L., Weinbach, R. W., & Myers, L. L. (2018). Research methods for social workers (8th ed.). New York, NY: Pearson. Chapter 13, “Analyzing Data” (pp. 295–297, “The Data in Perspective”)

JOURNAL OF PERSONALITY ASSESSMENT, 82(1), 60–70
Copyright © 2004, Lawrence Erlbaum Associates, Inc.
Clinical Significance Methods:
A Comparison of Statistical Techniques
Stephanie Bauer
Center for Psychotherapy Research
Stuttgart, Germany
Michael J. Lambert
Department of Psychology
Brigham Young University
Steven Lars Nielsen
Counseling and Career Center
Brigham Young University
Clinically significant change refers to meaningful change in individual patient functioning during psychotherapy. Following the operational definition of clinically significant change offered
by Jacobson, Follette, and Revenstorf (1984), several alternatives have been proposed because
they were thought to be either more accurate or more sensitive to detecting meaningful change.
In this study, we compared five methods using a sample of 386 outpatients who underwent
treatment in routine clinical practice. Differences were found between methods, suggesting
that the statistical method used to calculate clinical significance has an effect on estimates of
meaningful change. The Jacobson method (Jacobson & Truax, 1991) provided a moderate estimate of treatment effects and was recommended for use in outcome studies and research on
clinically significant change, but future research is needed to validate this statistical method.
Clinically significant change refers to changes in patient
functioning that are meaningful for individuals who undergo
psychosocial or medical interventions. This concept has considerable value in research aimed at classifying each individual patient’s status with regard to normative functioning. In
this regard, it allows researchers to focus on the functioning
of each patient rather than on group averages and statistical
significance of between-group comparisons. Research using
operationalizations of clinical significance has been especially useful in estimating dose-response relationships (e.g.,
Anderson & Lambert, 2001) and in outcome management
systems that use it as a marker for recovery and deterioration
(Lambert et al., 2001). In addition, it has been used to estimate the relative value of empirically supported therapies as
examined in clinical trials versus routine practice (Hansen,
Lambert, & Forman, 2002).
In all these uses, the degree of change in the individual is
of primary interest. Such a focus is not only thought to be of
scientific importance but it leads to narrowing the gap between clinical research and clinical practice. Thus, the concept and its operationalization have generated considerable
interest. Its introduction by Jacobson, Follette, and
Revenstorf (1984) was regarded as an important advance in
methodology (Lambert, Shapiro, & Bergin, 1986), and it has
become an expected statistic in published outcome studies by
some journal editors. The topic of clinically significant
change has generated considerable attention in special journal sections devoted to the topic (e.g., Jacobson, 1988; Kendall, 1999; Kendall, Marrs-Garcia, Nath, & Sheldrick, 1999;
Tingey, Lambert, Burlingame, & Hansen, 1996b).
The original proposal (Jacobson et al., 1984) with minor
modifications (Jacobson & Truax, 1991) suggested a
two-step criterion for clinically significant change. First, a
cutoff point for a measure of psychological functioning is established that is conceptualized as a cutoff between two populations: a patient/dysfunctional population, and a
nonpatient/functional population. To this end, Jacobson and
Truax identified three reasonable cutoffs for consideration.
The first, Cutoff A, was defined as the point 2 SDs beyond the range of the pretherapy mean (Cutoff A = M_clinical − 2 SD_clinical). Cutoff A has high sensitivity; that is, an outcome score below this score is very unlikely to belong to the patient population, although it is not possible to draw conclusions about “recovery” without information on a functional comparison group. The second, Cutoff B, was defined as the point 2 SDs within a recognized functional mean (Cutoff B = M_nonclinical + 2 SD_nonclinical) and should be calculated if only nonpatient data are available. Cutoff B has high specificity. It is not difficult for most clients to attain because most dysfunctional and functional distributions overlap. The third, Cutoff C, was a weighted midpoint between the means of a functional and dysfunctional population (Cutoff C = [(SD_clinical × M_nonclinical) + (SD_nonclinical × M_clinical)] / (SD_clinical + SD_nonclinical)). When both patient/dysfunctional and
nonpatient/functional data sets are available, and there is
overlap between the two distributions, C represents the best
choice for a cutoff point. Compared to A and B, it is the least
arbitrary score, as it is based on the relative probability of a
particular score ending up in one population as opposed to
another (Jacobson, Roberts, Berns, & McGlinchey, 1999). In
contrast, A and B would be chosen if not enough information for the calculation of C is available. For example, if adequate norms are lacking, A must be used because neither B
nor C can be calculated.
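The three cutoffs can be made concrete with a short Python sketch (the function names are mine, not the authors'); the sample values are those of the OQ-45 manual cited later in this article (clinical M = 79.8, SD = 25.3; nonclinical M = 48.7, SD = 20.2):

```python
def cutoff_a(m_clin, sd_clin):
    # 2 SDs beyond the pretherapy (clinical) mean, in the functional direction
    return m_clin - 2 * sd_clin

def cutoff_b(m_nonclin, sd_nonclin):
    # 2 SDs within the recognized functional (nonclinical) mean
    return m_nonclin + 2 * sd_nonclin

def cutoff_c(m_clin, sd_clin, m_nonclin, sd_nonclin):
    # weighted midpoint between the two population means
    return (sd_clin * m_nonclin + sd_nonclin * m_clin) / (sd_clin + sd_nonclin)

print(round(cutoff_a(79.8, 25.3), 1))           # 29.2
print(round(cutoff_b(48.7, 20.2), 1))           # 89.1
print(round(cutoff_c(79.8, 25.3, 48.7, 20.2)))  # 63, the manual's cutoff
```

Note how Cutoff C falls between the two means and shifts whenever either sample's mean or SD changes, which is exactly the sample-dependence discussed above.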
The second step of the Jacobson–Truax (JT; 1991) method
is to determine whether a client’s change from pretest to
posttest is reliable rather than simply an artifact of measurement error. To assess this, Jacobson et al. (1984) proposed a
reliable change index (RCI) that each participant has to pass
to demonstrate that his or her change is not simply due to
chance. RCIs are derived from the psychometric qualities of
the outcome measure used to estimate change: The formula
divides the difference between pretreatment and
posttreatment scores by a variation of the standard error of
measurement (SE). In the discussion of how to calculate reliable change most accurately, a frequent topic concerns the
question of which reliability estimate should be used to calculate the SE. Most studies use either test–retest coefficients
or internal consistency estimates. Martinovich, Saunders,
and Howard (1996) gave a detailed description of the advantages and disadvantages of both alternatives. In the end, they
recommended, especially for clinical populations, the use of
a measure of internal consistency rather than test–retest reliability. The problem with test–retest reliabilities is that in
clinical samples they are deflated by real individual differences in change, and there is no doubt that these changes occur even during very short periods of time and without the
patients being in therapy during that period (Howard,
Lueger, Maling, & Martinovich, 1993). Therefore, researchers often use test–retest reliability scores from nonpatient
samples. Agreeing with the explanations of Martinovich et
al. (1996), Tingey et al. (1996b) came to the conclusion that
the use of internal consistency would be the better way to calculate rates of reliable change. This makes especially good
sense because outcome scales typically attempt to measure
characteristics that should change over time rather than personality traits that are, by definition, stable over time.
Based on the two criteria (cutoff and RCI), the JT method
classifies individuals as Recovered (i.e., passed both cutoff
and RCI criteria), Improved (i.e., passed RCI criterion but
not the cutoff), Unchanged (i.e., passed neither criterion), or
Deteriorated (i.e., passed RCI criterion but worsened).
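The two-step logic above can be sketched as follows. This is an illustrative implementation, not the authors' code; it assumes the JT form S_diff = sqrt(2 × SE²) with SE = SD × sqrt(1 − r), and that lower scores indicate better functioning (as on the OQ-45 used later in this study):

```python
import math

def reliable_change_index(pre, post, sd_pre, reliability):
    # JT reliable change index: the pre-post difference divided by the
    # standard error of the difference, S_diff = sqrt(2 * SE**2),
    # where SE = SD_pre * sqrt(1 - reliability).
    se = sd_pre * math.sqrt(1 - reliability)
    s_diff = math.sqrt(2 * se ** 2)
    return (post - pre) / s_diff

def jt_classify(pre, post, sd_pre, reliability, cutoff):
    # Lower scores = better functioning, so an RCI at or below -1.96
    # is reliable improvement; at or above +1.96 is reliable worsening.
    rci = reliable_change_index(pre, post, sd_pre, reliability)
    if rci <= -1.96:
        return "Recovered" if post <= cutoff else "Improved"
    if rci >= 1.96:
        return "Deteriorated"
    return "Unchanged"

# Example with the OQ-45 figures reported later in this article
print(jt_classify(pre=85, post=55, sd_pre=22.33, reliability=0.93, cutoff=63))
# prints "Recovered"
```

A 30-point drop far exceeds measurement error, and the posttest score of 55 sits below the functional cutoff, so both JT criteria are passed.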
Jacobson and colleagues (Jacobson et al., 1999) are aware
that several difficulties are related to the approach of clinical
significance. For instance, the cutoff point always depends
on the specific samples used in a particular study as long as
there are no carefully collected norms for both dysfunctional
and normal populations. Without normative information, one
cannot evaluate if the sample included in an actual study is
representative or not. Furthermore, without any norms for
normal populations, Cutoff C cannot be calculated. Instead,
one has to use Cutoff A, which results in varying estimates of
clinical significance because means and standard deviations
vary from study to study. In contrast, if distribution information on both normative samples is available, and all participants score in the dysfunctional range at the beginning of
treatment, and the distributions do not overlap, it would be
senseless to use a cutoff score: In that case, a change from
dysfunctional to functional range would always be reliable
and never due to measurement error (Jacobson et al., 1999).
Another problem was addressed by Tingey, Lambert,
Burlingame, and Hansen (1996a). Tingey et al.’s main point
of criticism is that Jacobson et al. (1984) did not provide an
operationalization of a comparative social standard. As a
consequence, it is not possible to identify and use relevant
normative samples across studies. Additionally, the social
validation methodology is restricted by the use of only one
dysfunctional and one functional sample. Finally, no procedure to determine the distinctness of samples is available.
Due to these limitations, Tingey et al. proposed interesting
extensions for assessing clinical significance, focusing on
the derivation of relevant social standards. Even if these suggestions have been further extended and their value has been
acknowledged (Martinovich et al., 1996), there have been
hardly any studies really using them.
The “traditional” JT method for assessing clinically meaningful change is among the most frequently reported by researchers. In a review of outcome studies that reported
clinically significant change published in the Journal of Consulting and Clinical Psychology, Ogles, Lunnen, and
Bonesteel (2001) noted that the clinical significance method
originally proposed by Jacobson et al. (1984) was used in
35% of studies that employed some form of clinical significance calculation. No other method came close in terms of
frequency of use. Because of Jacobson and his colleagues’
(Jacobson et al., 1984) original suggestions, a general consensus on a conceptual definition of clinical significance has
developed: The status of a patient is characterized as clinically significantly changed when the client’s level of measured functioning is located in the nonfunctional range at the
beginning of treatment and in the functional range at the end
of treatment, if that change is statistically reliable. From a
mathematical perspective, there are multiple ways to realize
this definition. In this article, we compare the JT statistical
approach to four alternative methods.
The JT method to calculate reliable changes has been
challenged by authors who believe that alternative methods
may be superior. As described previously, the RCI investigates the statistical significance of differences between pretreatment and posttreatment scores for each individual
person. Criticism of the original RCI conceptualization was
formulated by Hsu (1989) and Speer (1992) and focused on
the phenomenon of regression toward the mean that was not
taken into consideration by Jacobson and colleagues (Jacobson et al., 1984). Hsu (1989) introduced the
Gulliksen–Lord–Novick (GLN) approach that includes the
assumption that posttest–pretest regression effects are relevant to the interpretation of posttest scores. The same
limitation was noted by Speer (1992), who suggested calculating confidence intervals around the pretreatment scores (± 2 SDs) and evaluating the posttreatment scores
in relation to this interval. This approach is known as the Edwards–Nunnally (EN; Speer, 1992) method. The most recent
approach to assess clinical significance was presented by
Hageman and Arrindell (HA; 1999a). In contrast to all other
methods, Hageman and Arrindell (1999a) argued that the
rates of reliable or clinically significant change of a particular sample should not be calculated by summing up the results of the individual participants of that group. This would
result in an underestimation of the true rates of change. As a
consequence, Hageman and Arrindell (1999b) suggested different analyses for the individual versus the group rates of
change. In addition, Hageman and Arrindell’s (1999b) metrics attempt to correct for regression to the mean in providing
a closer approximation of the underlying true scores. The
postulated enhancement in precision (Hageman & Arrindell,
1999b) was questioned by McGlinchey and Jacobson (1999)
who consider the HA approach as “too complex for its value,
which has yet to be demonstrated” (p. 1216). Another aspect
is formulated by Speer (1992) who generally criticizes the
use of two-wave designs to assess change in psychotherapy
research. Speer (1992) recommended the use of growth
curve modeling (e.g., hierarchical linear modeling [HLM];
Bryk & Raudenbush, 1992) for the study of change. Besides
the parameter estimation based on multiwave data, the advantages of HLM are the use of the empirical Bayes estimation and the possibility of handling missing data (Bryk &
Raudenbush, 1992; Speer, 1992). Therefore, HLM is supposed to calculate clinically significant change more precisely than the other (two-wave) methods. Speer (1992)
argued that if using only pretreatment and posttreatment
scores, one should concentrate on the comparability of
change rate data among different studies and therefore use
the traditional RCI instead of creating new methods.
In addition to theoretical articles on the accuracy of different approaches, two research studies have been conducted to
evaluate the degree to which the various methods are redun-
dant or provide different estimates of clinically significant
change for the individual patient.
Speer and Greenbaum (1995) were the first to compare
Jacobson’s (Jacobson & Truax, 1991) method with other
approaches. Speer and Greenbaum compared the classification of patients based on the RCI (but not the functional/dysfunctional cutoff) of five methods within a
sample of 73 outpatients who were diagnosed with a range
of disorders and assessed with a self-report scale of
well-being. Speer and Greenbaum used the RCI of the original Jacobson method as summarized by Jacobson and
Truax (1991). They chose RCIs from three alternative
methods: EN (Speer, 1992), Hsu–Linn–Lord (HLL; Speer
& Greenbaum, 1995), and the Nunnally–Kotsch (NK;
Nunnally & Kotsch, 1983). These methods, like the original, classify individuals based on pretreatment and
posttreatment change scores. They differ from the JT
method in that they use residualized pretreatment (or difference) scores rather than raw change scores. This presumably increases precision because of increased reliability
(e.g., Rogosa, Brandt, & Zimowski, 1982). Speer and
Greenbaum also included in their comparison a method
based on HLM that has the presumed advantage of using
scores for each patient from more than just two time points.
Speer and Greenbaum (1995) found relatively high rates
of agreement between methods (from 77.7% to 81.2%) with
the exception of the HLL method, which provided lower
rates of recovery and higher rates of deterioration. The HLM
method provided the highest improvement rates. Speer and
Greenbaum recommended this approach for routine use.
Following this study, McGlinchey, Atkins, and Jacobson
(2002) attempted a replication of Speer and Greenbaum
(1995) by comparing five methods of estimating clinical
significance. McGlinchey et al. included three methods
used by Speer and Greenbaum (JT, EN, HLM) and replaced the HLL method with the related GLN approach.
Additionally, McGlinchey et al. added a fifth approach,
HA, that was proposed by Hageman and Arrindell (1999b).
McGlinchey and Jacobson (1999), having previously compared the HA procedures with the Jacobson procedures in
couple therapy, concluded that the methods were essentially equivalent. Hageman and Arrindell (1999a), however,
criticized these findings and the way in which reliability estimates were selected.
McGlinchey et al. (2002) used outcome data from 128 patients diagnosed with major depressive disorder who participated in one of three cognitive behavioral therapy treatments
and whose progress was followed up to 2 years after treatment. Because depressive symptoms were used to exclude
potential participants, only those with high depression as
measured by the Beck Depression Inventory (BDI; Beck et
al., 1961) were included in the study. This analysis used the
HA method in place of the NK method used by Speer and
Greenbaum (1995). This was done because both methods
take into account both pretest and posttest reliabilities in determining reliable change, and the HA method was the newer
approach. We followed the same procedure as McGlinchey
et al. by including the HA method and excluding the NK
method and by replacing the HLL with the GLN approach.
This study partially replicated the work of Speer and
Greenbaum (1995) and McGlinchey et al. (2002) by examining five methods of calculating clinically significant
change. It differs from both studies by using the Outcome
Questionnaire (OQ–45) as the measure of patient functioning. Like the Speer and Greenbaum analysis, this study was
based on a general outpatient sample with multiple diagnoses rather than a single disturbance (as was done by
McGlinchey et al., 2002). Like McGlinchey et al., we examined clinically significant change rather than limiting the analysis to estimating reliable change, as was done by
Speer and Greenbaum. We used a larger sample than either
of the previous studies, increasing the likelihood of finding
existing differences, but shared the goals of both studies by
attempting to examine differences in estimates of clinically
significant change that are a function of calculation methods for examining the consequences of using one method
instead of another.
METHOD
Participants
In this study, we used data from 386 clients who had sought
treatment at a university-based outpatient clinic. Clients
ranged in age from 18 to 54 years (M = 22.88, SD = 3.54)
and were 66% female, 86% White, 4.8% Latino/Latina, and
9.2% other or mixed ethnicity. Clients were diagnosed by
their treating clinician without the benefit of research-based
diagnostic evaluations. At intake, 74.6% of clients were diagnosed by their treating clinician with a Diagnostic and
Statistical Manual of Mental Disorders (4th ed.; American
Psychiatric Association, 1994) disorder, whereas 25.4%
had their diagnosis deferred and never had a formal diagnosis entered into the database. Those receiving a diagnosis
had a mood disorder (29.2%), adjustment disorder (12.4%),
anxiety disorder (10.1%), or eating disorder (7.0%).
Thirty-five percent of clients had a V-code diagnosis,
whereas the remainder (6.3%) received a variety of other
diagnoses. Ninety percent of clients had 10 or fewer sessions, with a mean dosage of 8 sessions (range = 2 to 27).
The clients in this study started treatment less disturbed (M
= 68.35, SD = 22.34) than those in routine outpatient care
(M = 80.98, SD = 24.84) but had scores similar to those reported in other university counseling centers (Lambert,
Hansen, et al., 1996).
Therapists were 48 university counseling center staff consisting of 27 doctoral level psychologists and 21 doctoral
trainees, including interns. Therapists had a variety of treatment orientations, most subscribing to an integration of two
63
or more systems. Licensed therapists did not use manually
guided treatments nor were their sessions recorded. They averaged about 14 years of post-doctoral experience. Trainees
were supervised on a weekly basis. The most common orientations were cognitive behavioral (50%) and psychodynamic/interpersonal (20%). Psychotherapy was offered free to university student clients, with length of treatment based largely on client needs and preferences.
Outcome Measure
Psychological dysfunction was assessed using the OQ–45
(Lambert, Hansen, et al., 1996), which provided both the
measure of weekly change and the criterion measure for classification of patients into outcome groups (Recovered, Improved, No Change, or Deteriorated). The OQ–45 was designed to measure patient progress in therapy by repeated
administration during the course of treatment and at termination. In this study, it was administered to clients immediately
prior to their first appointment. Each week the clients had a
scheduled visit, they came 5 to 10 min early and completed
the questionnaire again. The OQ–45 provides a total score
based on all 45 items and three subscale scores: subjective
discomfort (intrapsychic functioning), interpersonal relationships, and social role performance. Only the OQ–45 total
score, which provides a global assessment of patient functioning, was used in our study.
The OQ–45 has been reported to have adequate reliability
and validity across a number of settings, including both clinical and normative populations. Research has indicated that
the OQ–45 is a psychometrically sound instrument with adequate test–retest reliability at a 3-week interval (r = .84;
Lambert, Burlingame, et al., 1996), and excellent internal
consistency (Cronbach’s α = .93; Lambert, Hansen, et al.,
1996). The OQ–45 has also been demonstrated to have acceptable concurrent validity coefficients ranging from .55 to
.88 (all significant at p < .01) with the Symptom Checklist–90–Revised (Derogatis, 1977), BDI, Zung Self-Rating
Depression Scale (Zung, 1965), Taylor Manifest Anxiety
Scale (Taylor, 1953), State–Trait Anxiety Inventory
(Spielberger, 1983), Inventory of Interpersonal Problems
(Horowitz et al., 1988), and the Social Adjustment Scale
(Weissman & Bothwell, 1976). Furthermore, the OQ–45 has
been shown to be sensitive to change in patients over short
time periods while remaining stable in untreated individuals
(Vermeersch, Lambert, & Burlingame, 2000).
Using formulas developed by Jacobson and Truax (1991),
clinical and normative data for the OQ–45 were analyzed by
Lambert, Hansen, et al. (1996) to provide cutoff scores for
the RCI and movement from dysfunctional to functional status. Patients who change in a positive or negative direction
by at least 14 points are regarded as having made reliable
change. This degree of change exceeds measurement error
based on the reliability of the OQ–45 and is one of the two
criteria posited by Jacobson and Truax as indicating clinically meaningful change. The cutoff on the OQ–45 for demarking the point at which a person’s score is more likely to
come from the dysfunctional population than a functional
population has been estimated to be 64. When a patient’s
score is 63 or lower, their functioning is considered more
similar to nonpatients than patients at that point in time.1
Passing this cutoff (from dysfunctional to functional) is the
second criterion posited by Jacobson and colleagues (Jacobson & Truax, 1991) as an indicator of clinically significant
change. Patients who show reliable change and pass the cutoff are considered Recovered, whereas those who only show
reliable change are considered Improved. Support for the validity of the OQ–45’s reliable change and clinical significance cutoff scores has been reported by Lunnen and Ogles
(1998) and Beckstead et al. (2003).
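The classification rules just described, a 14-point change as the reliable change criterion and a score of 63 or lower as the functional range, can be sketched in a few lines (an illustrative helper; the function name is mine):

```python
def classify_oq45(pre, post):
    # Hypothetical helper using the thresholds reported for the OQ-45:
    # a 14-point change is reliable, and 63 or lower is the functional range.
    RELIABLE_CHANGE = 14
    FUNCTIONAL_CUTOFF = 63
    improvement = pre - post  # lower OQ-45 scores mean better functioning
    if improvement >= RELIABLE_CHANGE:
        return "Recovered" if post <= FUNCTIONAL_CUTOFF else "Improved"
    if improvement <= -RELIABLE_CHANGE:
        return "Deteriorated"
    return "No Change"

print(classify_oq45(82, 58))  # prints "Recovered": a 24-point gain, past the cutoff
```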
Procedure
OQ–45s from 386 clients who had pretherapy and
posttherapy scores plus at least one additional measurement
were used to compare five methods for estimating clinically
significant change: three approaches used by Speer and
Greenbaum (1995)—the JT and EN, and the HLM approach—as well as the GLN and the HA approach as used by
McGlinchey et al. (2002). Computational formulae and details for each method are described in the Appendix. To give a
clearer picture of the single methods and to illustrate the calculation procedures, the Appendix includes an example of
calculations based on data from a single patient. These calculations show in detail how the RCI is estimated with the four
methods using pretreatment and posttreatment scores.
All of the methods assume continuous data. Four of the
methods rely exclusively on the use of pretest and posttest
scores, whereas the HLM method used pretest, posttest, and
all available OQ–45 data points in between.
The average OQ–45 pretest score was 68.35 (SD = 22.33).
The mean difference score was 10.9 (SD = 19.18), indicating average gains following treatment that were comparable to gains made by clients seen in similar settings that also have low treatment dosage (Hansen et al., 2002). The corresponding effect size (Cohen’s d) for the pretest–posttest change was .48, a modest effect.

1 The “original” cutoff (63 points) was calculated using the samples reported in the OQ–45 manual, that is, a clinical sample with a mean score of 79.8 (SD = 25.3) and a nonclinical sample with a mean score of 48.7 (SD = 20.2). In this study, we used two other samples in which the mean was 68.35 (SD = 22.33) for the clinical and 49.04 (SD = 17.3) for the nonclinical sample. The nonclinical data are comparable to those used in the OQ–45 manual, with similar impairment. However, the patients in our study were less impaired than the clinical sample in the manual. As a consequence of this lower impairment of the patient sample, Cutoff C needed modification and resulted in a lower cutoff score. Cutoff C is the weighted midpoint between the means of a functional and dysfunctional population and therefore always “reacts” directly to the samples used in a specific study. The calculation of C = 57 was done as follows: Cutoff C = [(SD_clinical × M_nonclinical) + (SD_nonclinical × M_clinical)] / (SD_clinical + SD_nonclinical) = (22.33 × 49.04 + 17.3 × 68.35) / (22.33 + 17.3) = 57.4.
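As a quick sanity check, the footnote’s Cutoff C arithmetic and the reported effect size can be reproduced in a few lines of Python (an illustrative sketch; variable names are mine):

```python
sd_clin, m_clin = 22.33, 68.35       # clinical (pretest) sample
sd_nonclin, m_nonclin = 17.3, 49.04  # nonclinical comparison sample

# Cutoff C: weighted midpoint between the two sample means
cutoff_c = (sd_clin * m_nonclin + sd_nonclin * m_clin) / (sd_clin + sd_nonclin)
# ≈ 57.47; reported as 57.4 and applied as a cutoff of 57

# Cohen's d: mean pre-post difference score over the pretest SD
cohens_d = 10.9 / sd_clin  # ≈ .49 with these rounded inputs; reported as .48
```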
Lambert, Hansen, et al. (1996) reported a reliability coefficient for the OQ–45 of .93 (Cronbach’s alpha) based on
normative data from several patient and nonpatient samples.
For reasons explained previously, this value was used for all
clinical significance calculations. The resulting SE amounts
to 5.9 (computation formula is included in the Appendix).
Just to illustrate the difference, the score was also calculated
using test–retest reliability (.84). The corresponding SE is
8.9. Using this much higher error would lead to more patients
being classified as Unchanged.
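Both SE values follow from the standard formula SE = SD × sqrt(1 − r); a quick check (variable names are mine):

```python
import math

sd_pre = 22.33  # pretest SD of the clinical sample

se_internal = sd_pre * math.sqrt(1 - 0.93)  # alpha = .93 -> SE ~ 5.9
se_retest = sd_pre * math.sqrt(1 - 0.84)    # test-retest r = .84 -> SE ~ 8.9
```

Because the SE enters the denominator of the RCI, the larger test–retest-based error shrinks every RCI and pushes more clients below the 1.96 threshold, hence more Unchanged classifications.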
Following the procedures of McGlinchey et al. (2002),
Cutoff C was used for calculation of clinical significance.
This cutoff is based on information about both the functional
and dysfunctional samples. The HA method calculated a “C
True” by multiplying the means by their reliability data,
which yielded a cutoff of 57 for that method.
GLN method. In synthesizing the work of Lord and
Novick (1968) and Linn and Slinde (1977), Hsu (1989, 1995,
1996) was the first researcher to provide an alternative
method to Jacobson et al. (1984), referred to as the GLN
method. Hsu (1989) suggested that the RCI index of the Jacobson et al. (1984) method did not take into account the phenomenon of regression to the mean that might considerably
influence outcome classifications. Regression to the mean
implies that more extreme scores, because of imperfect reliability, will naturally tend to become less extreme over repeated assessments. The GLN method attempts to control for
this potential confound by including a hypothesized population mean (and standard deviation) toward which scores
would regress. These are suggested to be the scores of a “relevant group,” for example, the group from which the participants of the study were selected (Hsu, 1999). If there is not
such a population or their scores are not available, Hsu
(1999) advocated using the pretreatment scores of the participants. This was done in our study.
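The regression-to-the-mean phenomenon the GLN method targets is easy to demonstrate by simulation: with imperfect reliability, the most extreme time-1 scorers drift toward the mean at time 2 even when no true change occurs. A self-contained sketch (not part of the study):

```python
import random

random.seed(0)
N = 20000
TRUE_SD, ERROR_SD = 10.0, 7.0  # added error makes the measure imperfectly reliable

true_scores = [random.gauss(50, TRUE_SD) for _ in range(N)]
time1 = [t + random.gauss(0, ERROR_SD) for t in true_scores]
time2 = [t + random.gauss(0, ERROR_SD) for t in true_scores]  # no true change

# the 5% most extreme scorers at time 1 ...
top = sorted(zip(time1, time2), reverse=True)[: N // 20]
mean_t1 = sum(t1 for t1, _ in top) / len(top)
mean_t2 = sum(t2 for _, t2 in top) / len(top)
# ... score noticeably closer to the population mean of 50 at time 2
```

An uncorrected pre-post comparison would count this drift as improvement, which is exactly the confound Hsu's adjustment is meant to remove.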
EN method. The EN method was presented by Speer
(1992). This method synthesizes the work of Edwards,
Yarvis, Mueller, Zingale, and Wagman (1978) and Nunnally
(1967, 1975) who advocated formulation of confidence intervals for calculating prechange to postchange rates. The EN
method establishes reliable change by observing a participant’s posttest score relative to an established confidence interval around the estimated true pretreatment score of the individual. Speer concluded that, like the GLN method, the EN
approach would be an improvement on the original clinical
significance method by minimizing the influence of regression to the mean in the calculation of improvement rates. Furthermore, the ease of presentation offered by confidence intervals is an additional benefit of this method.

TABLE 1
Rates (Percentages and Frequencies) of Reliable Change Across Five Methods of Calculating Reliable Change Using the Total Sample

| Approach | Deteriorated % (n) | Unchanged % (n) | Improved % (n) |
| --- | --- | --- | --- |
| Jacobson–Truax | 6.5 (25) | 58.5 (226) | 35.0 (135) |
| Gulliksen–Lord–Novick | 5.2 (20) | 59.3 (229) | 35.5 (137) |
| Edwards–Nunnally | 11.9 (46) | 42.0 (162) | 46.1 (178) |
| Hageman–Arrindell | 8.8 (34) | 66.6 (257) | 24.6 (95) |
| HLM | 4.9 (19) | 65.3 (252) | 29.8 (115) |

Note. N = 386. HLM = hierarchical linear modeling.
HA method. This is one of the most recent clinical significance methods, developed by Hageman and Arrindell
(1999b). Drawing on Cronbach and Gleser’s (1959) use of
the phi coefficient as a measure of discrimination, the HA
method involves the most significant revisions to the JT
method. Among its distinguishing features, the HA method
differentially analyzes clinically meaningful change at the
individual level (i.e., participant to participant) and at the
group level (i.e., obtaining proportions of participants in the
sample who have reliably changed and passed the cutoff
point). The RCI (RC_INDIV) of the method is determined by incorporating both pretest and posttest reliabilities in its calculations, purporting to enhance precision further. In addition,
the HA method is the first to modify the cutoff criterion, applying the same corrections for regression to the mean to the
cutoff as are used in the RC_INDIV.
HLM. Speer and Greenbaum (1995) recently advocated
a multiwave data approach using growth curve modeling
(e.g., HLM; Bryk & Raudenbush, 1992). One of the advantages of a multiwave approach is that it uses more than two
data points per individual; by doing so, it reflects the change
that occurs between pretest and posttest assessments more
precisely. Besides the parameter estimation based on
multiwave data, further advantages of HLM are the use of
empirical Bayes estimates, which are weighted estimates that
combine information from the individual and the sample as a
whole, as well as the capability of handling missing data.
Following calculations of the clinical significance of
change for each method, differences between methods were
analyzed through the use of nonparametric statistics.
RESULTS
Table 1 presents the overall rates of classifying the client’s reliable change for the five clinical methods for the total sample of 386 clients. The EN method classified the fewest clients as unchanged (42%) and the greatest percentage of clients as improved (46%) and deteriorated (12%). The HA and HLM methods classified the largest number of clients as unchanged (67% and 65%, respectively) and the smallest number as improved. There was little difference between the JT and GLN methods in c

TABLE 2
Paired Comparisons Between Reliable Change Classifications

| Approach | JT | GLN | EN | HA | HLM |
| --- | --- | --- | --- | --- | --- |
| JT | — | .92 | .71 | .76 | .78 |
| GLN | ns | — | .70 | .72 | .80 |
| EN | p < .01 | ns | — | .59 | .59 |
| HA | p < .001 | p < .001 | p < .001 | — | .67 |
| HLM | p < .05 | p < .001 | p < .001 | p < .001 | — |

Note. Significance levels for paired comparisons (Wilcoxon test) between the reliable change classifications of the five approaches are below the diagonal; kappa coefficients for the agreement between methods are above the diagonal. JT = Jacobson–Truax; GLN = Gulliksen–Lord–Novick; EN = Edwards–Nunnally; HA = Hageman–Arrindell; HLM = hierarchical linear modeling.
