Establishment of a Reliable Change Index for the GAD-7
Bischoff, Anderson, Heafner, & Tambling

Aim: It is increasingly important for mental healthcare providers and researchers to reliably assess client change, particularly with common presenting problems such as anxiety. The current study addresses this need by establishing a Reliable Change Index of 6 points for the GAD-7. Method: The sample comprised 116 online community participants recruited through Amazon’s Mechanical Turk (MTurk) and archival data for 332 clinical participants. Participants completed the GAD-7 and the MDI in 2 rounds. Using previously established cutoff scores and Jacobson and Truax’s (1991) method, we establish a Reliable Change Index which, when applied to 2 administrations of the GAD-7, indicates whether a client has experienced meaningful change. Results: For the GAD-7, the mean score for the clinical sample was 10.57. For the community sample at Time 1, the mean score was 4.14. A Pearson’s correlation was computed to assess the 14-28-day test-retest reliability of the GAD-7, r(110) = .87, indicating good test-retest reliability. Conclusion: Applying the RCI equation yielded an RCI of 5.59, which for practical use would be rounded to 6.

It is important to both clinicians and researchers to measure the effectiveness of mental health treatment.
The communities they serve benefit from solid research and effective implementation of findings.
Furthermore, third-party payers, such as insurance companies, are invested in quantifiably measuring client change as a result of mental health interventions. Focus on appropriate mental health measurements is increasing as knowledge and services expand all over the world (e.g. instruments being translated and validated in multiple languages; Carvalho, Marques, Ferreira, & Lima, 2016; Dias, Silva, Maroco, & Campos, 2015).
One method of measuring the effects of treatment is to administer questionnaires to clients to first assess their baseline state on some construct, and then to re-administer the questionnaire at a later time point to ascertain if there has been any change. Many instruments have been validated through assessment of their psychometric properties (Losoi et al., 2013; Marques et al., 2013; Pimenta, Leal, & Maroco, 2012) and comparison with other established measurements (Barry, Folkard, & Ayliffe, 2014). One common construct of interest to clients, clinicians, and third parties is anxiety, which is one of the most frequently presented problems that clients report when they seek therapy (Heafner, Silva, Tambling, & Anderson, 2016). Not only is anxiety a common presenting problem, it is a risk factor for a variety of physical ailments including cardiovascular disease (Player & Peterson, 2011), one of the costliest public health concerns. Furthermore, anxiety can occur and has been measured in a variety of settings (e.g. at the dentist; Campos, Presoto, Martins, Domingos, & Maroco, 2013).
In the mental health field, the presence and severity of anxiety symptoms are frequently measured by the Generalized Anxiety Disorder-7 scale (GAD-7; Spitzer, Kroenke, Williams, & Löwe, 2006). This is a self-report questionnaire designed to screen for severity of symptoms associated with generalized anxiety disorder (Rutter & Brown, 2017). The GAD-7 correlates highly with other measures of anxiety and it is also used in detecting the presence of many specific anxiety disorders. The GAD-7 was developed as a brief, self-report measure of anxiety, through assessment of symptoms of anxiety (Spitzer et al., 2006). For more information regarding the development, norming, and psychometric testing of the GAD-7, see Spitzer et al. (2006). We conducted a search on PsycINFO for published peer-reviewed articles that cited the GAD-7 since it was published in 2006.
Research on the effectiveness of treatments that target anxiety generally utilizes a pre/post-test design, which can indicate statistically significant differences between a clinical sample at two time points. When large sample sizes are available, tests of statistical significance are often used in mental health research to evaluate whether or not treatments are associated with client change. Statistical significance measures how likely it is that any differences in outcome between treatment and control groups are real and not due to chance (Leung, 2001). Even though this is an indispensable approach to researching the effectiveness of treatment, it has some limitations. For example, Cohen (1994) demonstrated that given a large enough sample, any difference can be statistically significant even if it lacks real-world significance. Furthermore, statistical significance does not indicate whether differences that occur are meaningful. Kendall, Marrs-Garcia, Nath, and Sheldrick (1999) referred to this quality as the "convincingness of the amount of change linked to treatment" (p. 295; emphasis in original). It is important to establish that change is meaningful and not due to error in the measurement so that clinicians, researchers, and clients themselves can objectively corroborate the subjective experiences of change that occur within clinical treatments.
To assess whether or not changes are meaningful, clinicians are beginning to evaluate results with an eye for clinical significance (Kazdin, 1999). Clinical significance measures how large treatment effects are in clinical practice (Leung, 2001). Whereas the methods described above are useful for research purposes with large sample sizes, it is equally important for clinicians to have a useful tool that they can use to evaluate change on an individual, client-by-client basis. Establishing clinical significance (i.e. that a change within an individual client from Time 1 to Time 2 is real and not due to chance) is important for clinicians who want to determine and demonstrate quantitatively measurable change. Clients may also benefit from such knowledge. For example, two studies have shown that clients who received information regarding their measurement results reported higher levels of self-esteem and hope, and reported fewer symptoms compared to those who did not receive such information (Finn & Tonsager, 1992; Newman & Greenway, 1997). Kazdin (1999) also suggested clinical significance is important at the societal level, particularly in regard to issues of managed care, reimbursement, and accountability. Thus, understanding and establishing clinical significance of widely used measures may be beneficial at multiple levels of the mental healthcare system.
There are various methods for establishing meaningful clinical change (Bauer, Lambert, & Nielsen, 2004; Ferrer & Pardo, 2014; Wise, 2004). Jacobson and Truax (1991) developed what is perhaps the preferred method (Ferrer & Pardo, 2014; Wise, 2004). This involves establishing a reliable change index (RCI), which is a minimum difference between a Time 1 score and Time 2 score. This method requires that two conditions be met in order to establish this type of change. The first criterion is the cutoff score, which refers to the lowest or highest possible score for an individual to fall within a particular category. Any changes in score must move across the cutoff score, for example going from moderate to severe anxiety, in order to be considered meaningful or clinically significant change. For this study, established cutoff scores will be used. Cutoff scores will be further described in the Method section below.
The second criterion is determining that the change is statistically reliable (Jacobson & Truax, 1991). Classical test theory holds that an observed score on a measure is a combination of the true score and measurement error. In order to establish confidence that changes in scores across time represent real changes in an individual's anxiety and are not due to measurement error, a reliable change index (RCI) must be established.
The RCI helps establish that the change is not due to chance or error, but rather to real change. The following equation represents a reliable change index:

$$\mathrm{RCI} = \frac{x_2 - x_1}{S_{\mathrm{diff}}} \tag{1}$$

In this equation, $x_2 - x_1$ represents an individual's change between administrations of the instrument, and $S_{\mathrm{diff}}$, the standard error (SE) of the difference between the two scores, is defined in the following equations:

$$S_{\mathrm{diff}} = \sqrt{2\,(S_E)^2} \tag{2}$$

$$S_E = s_1 \sqrt{1 - r_{xx}} \tag{3}$$

$S_{\mathrm{diff}}$ accounts for the variation in reliability of the test instrument, and reflects the standard deviation of the clinical population at intake ($s_1$) and the test-retest reliability ($r_{xx}$) of the instrument in a non-clinical sample.
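As an illustration of the Jacobson and Truax (1991) computation described above, the following minimal Python sketch derives the standard error of measurement, the standard error of the difference, and the RCI for one client. The function names and the client and instrument values are hypothetical and chosen only for illustration; they are not taken from the study data.

```python
import math


def standard_error(s1: float, r_xx: float) -> float:
    """Standard error of measurement: S_E = s1 * sqrt(1 - r_xx)."""
    return s1 * math.sqrt(1.0 - r_xx)


def s_diff(s1: float, r_xx: float) -> float:
    """Standard error of the difference between two scores: sqrt(2 * S_E^2)."""
    return math.sqrt(2.0 * standard_error(s1, r_xx) ** 2)


def rci(x1: float, x2: float, s1: float, r_xx: float) -> float:
    """Reliable change index: (x2 - x1) / S_diff."""
    return (x2 - x1) / s_diff(s1, r_xx)


# Hypothetical values (not from the study): a client who scores 16 at Time 1
# and 8 at Time 2 on an instrument with s1 = 5.0 and r_xx = .80.
value = rci(x1=16, x2=8, s1=5.0, r_xx=0.80)

# An absolute RCI above 1.96 is unlikely (p < .05) to reflect measurement error alone.
print(f"RCI = {value:.2f}; reliable change: {abs(value) > 1.96}")
```

A negative value indicates a drop in score, which on a symptom measure corresponds to improvement.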
To evaluate these formulas, it is necessary to collect data from two different samples. One is a clinical sample (in order to calculate the standard deviation for a clinical population) and the other is a community sample (in order to establish test-retest reliability in a non-clinical population). One previous study has reported an RCI for the GAD-7 (Gyani, Shafran, Layard, & Clark, 2013), but the measure of reliability these researchers used was Cronbach's alpha rather than the test-retest reliability. Using the test-retest reliability provides a more accurate indicator of the instrument's reliability over time than the alpha, which has been shown to provide an overestimate of reliability and therefore an unacceptable rate of false positives (Ferrer & Pardo, 2014).
The purpose of this study is to provide clinicians and researchers with information to determine whether clients have made clinically significant change in anxiety as measured by the GAD-7. Using previously established cutoff scores and Jacobson and Truax's (1991) method, we will establish a Reliable Change Index which, when applied to two administrations of the GAD-7, will indicate if a client has experienced meaningful change.
The current research is modeled after a previous paper which established the Reliable Change Index for the Revised Dyadic Adjustment Scale (Anderson et al., 2014).

Method

Clinical Sample
All study procedures were approved by the [redacted] Institutional Review Board, and informed consent was obtained from all individual participants included in the study. This study utilized archival data from clients seen for at least one therapy session at a university clinic in the Northeastern United States between 2008 and 2012. A total of 829 cases began therapy from March 2010 to September 2014. Clients were included in the study if they scored lower than the cutoff of 25 (i.e. in the "distressed" range) on a measure of general functioning, the

Community Sample
A sample was drawn from the general population in order to determine test-retest reliability of the GAD-7.
A non-clinical sample must be drawn in order to test this property because presumably clients would be undergoing change while receiving therapy, which would impact test-retest reliability scores. Participants were recruited through Amazon Mechanical Turk (MTurk). MTurk is an online market where tasks (in this case a survey) are posted for respondents to complete for a specified rate. Typical tasks may include choosing between potential photographs for an advertisement, writing product descriptions, or filling out surveys, and pay rates generally range from $5 to $7 an hour for survey work. The site is hosted by Amazon.com and all respondents have an Amazon account through which all transactions are handled. As such, each respondent is completely anonymous to the researcher who posts the "job".
MTurk workers interested in participating viewed the job description, which included the eligibility criteria that participants must be at least 18 years old, currently residing in the United States, and not currently seeing a therapist. This final criterion was selected so that the test-retest reliability estimate would be as precise as possible.
In order to further determine if this sample was distinct from the clinical sample in terms of mental health distress, we also assessed these participants' level of depression using the Major Depression Inventory (MDI; Olsen, Jensen, Noerholm, Martiny, & Bech, 2003

Measures

Generalized Anxiety Disorder-7 Scale (GAD-7)
The GAD-7 is a seven-item self-report measure of the severity of the symptoms of generalized anxiety disorder (Spitzer et al., 2006). Instructions ask: "Over the last 2 weeks, how often have you been bothered by the following problems?". Example items include: "Feeling nervous, anxious, or on edge" and "Being so restless that it's hard to sit still." Respondents answer on a 0 to 3 scale from "Not at all" to "Nearly every day". Scores range from 0 to 21 with higher scores indicating greater levels of anxiety. Scores of 0 to 4 indicate minimal anxiety; 5 to 9 mild anxiety; 10 to 14 moderate anxiety; and 15 to 21 severe anxiety. In other words, the scores of 5, 10, and 15 are the cutoff scores for mild, moderate, and severe anxiety, respectively (Spitzer et al., 2006).
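The scoring rules above can be sketched in a few lines of Python. The function names are hypothetical; the item range (0 to 3), the total range (0 to 21), and the cutoffs of 5, 10, and 15 are those stated in the text.

```python
def gad7_total(item_scores: list[int]) -> int:
    """Sum the seven item responses, each scored 0 ("Not at all") to 3 ("Nearly every day")."""
    if len(item_scores) != 7 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("GAD-7 requires exactly seven responses scored 0-3")
    return sum(item_scores)


def gad7_severity(total: int) -> str:
    """Map a total score (0-21) to its severity band using cutoffs 5, 10, and 15."""
    if not 0 <= total <= 21:
        raise ValueError("GAD-7 total must be between 0 and 21")
    if total >= 15:
        return "severe"
    if total >= 10:
        return "moderate"
    if total >= 5:
        return "mild"
    return "minimal"


# Hypothetical response pattern, for illustration only.
total = gad7_total([2, 1, 2, 1, 2, 1, 2])
print(total, gad7_severity(total))
```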
The scale has been found to have good internal reliability (Cronbach's alpha = .90; Spitzer et al., 2006). For the present study, alpha scores were also good (.89 for clinical sample; .94 for community sample at Time 1; .94 for community sample at Time 2).
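For readers unfamiliar with the internal-consistency statistic reported above, the following sketch computes Cronbach's alpha from an item-response matrix using the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name and the response data are hypothetical; the example does not reproduce the study's values.

```python
from statistics import variance


def cronbach_alpha(responses: list[list[int]]) -> float:
    """Cronbach's alpha; responses[i][j] is respondent i's score on item j."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)


# Hypothetical 0-3 responses from five respondents on seven items.
responses = [
    [0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 2, 1, 1, 1],
    [2, 2, 3, 2, 2, 2, 3],
    [3, 3, 2, 3, 3, 3, 3],
    [1, 0, 1, 1, 0, 1, 1],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Alpha summarizes consistency among items within a single administration, which is a different property from the stability of scores across two administrations.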

Major Depression Inventory (MDI)
The MDI is a 10-item self-report measure of the level and severity of depression (Olsen et al., 2003). Instructions provide the prompt, "How much of the time…" and respondents answer on a 0 to 5 scale from "At no time" to "All the time".  (Cuijpers, Dekker, Noteboom, Smits, & Peen, 2007). This scale has been found to have good internal reliability (Cronbach's alpha = .90; Olsen et al., 2003). For the present study, alpha scores were also good (.87 for clinical sample; .95 for community sample at Time 1; .95 for community sample at Time 2).

Results
For the GAD-7, the mean score for the clinical sample was 10.57 (range from 0 to 21, SD = 5.6). For the community sample at Time 1, the mean score was 4.14 (range from 0 to 21, SD = 4.96). For the MDI, the mean score for the clinical sample was 23.71 (range from 0 to 50, SD = 10.28). For the community sample at Time 1, the mean score was 10.07 (range from 0 to 46, SD = 11.53). As expected, mean scores were lower for the community sample on both measures and are in the lowest category of distress. These sample characteristics lend confidence to our assumption that the community sample is indeed "nonclinical".
The means and standard deviations for both the clinical and community samples (at Time 1) are represented in Table 1.
A Pearson's correlation was computed to assess the 14-28-day test-retest reliability of the GAD-7, r(110) = .87, indicating good test-retest reliability. Jacobson and Truax's (1991) method for determining reliable change was used to determine the amount of change in score from pretest to posttest that would be statistically significant at the p = .05 level for the GAD-7. We used the community sample's test-retest reliability estimate ($r_{xx}$ = .87), as well as the standard deviation of the clinical sample at intake ($s_1$ = 5.60) as inputs for Equation 3. These values are presented in Table 1.
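The arithmetic can be retraced with the two reported inputs. The sketch below assumes the Jacobson and Truax (1991) formulas (standard error of measurement, then standard error of the difference) and the conventional two-tailed critical value of 1.96 for p = .05; the variable names are illustrative.

```python
import math

S1 = 5.60      # SD of the clinical sample at intake (reported above)
R_XX = 0.87    # test-retest reliability in the community sample (reported above)
Z_CRIT = 1.96  # assumed two-tailed critical value for p = .05

s_e = S1 * math.sqrt(1.0 - R_XX)     # standard error of measurement
s_diff = math.sqrt(2.0 * s_e ** 2)   # standard error of the difference
min_change = Z_CRIT * s_diff         # smallest raw-score change that is reliable
practical = math.ceil(min_change)    # GAD-7 scores are whole numbers

print(f"S_E = {s_e:.2f}, S_diff = {s_diff:.2f}, minimum reliable change = {min_change:.2f}")
print(f"practical threshold = {practical} points")
```

The result is approximately 5.6 points, in line with the 5.59 reported in the abstract, and rounds up to the practical threshold of 6 points.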

Discussion
When evaluating client change it is important to be able to determine that measured change is real and not due to measurement error. Further, it is important to determine if change in single cases is clinically significant.
To accomplish these goals, it is necessary to know how much change in score on a given instrument is a reliable change. This study established a reliable change index for a commonly used measure of anxiety. The Jacobson and Truax (1991) method was followed, with the RCI calculated using the standard deviation of a clinical sample and the test-retest reliability of a community sample. Results indicated that an individual whose score on the GAD-7 moves across the cutoff of 10 and changes by six or more points from the first to the most recent administration can be classified as experiencing clinically significant change. Changes of at least six points that do not cross a cutoff can still be classified as either reliable improvement or deterioration (depending on the direction); however, they are not considered clinically significant if the client does not cross over from one severity classification to another. This finding is important because it offers a way in which clinicians and researchers can judge the importance of changes in scores on the GAD-7.
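The decision rule described above can be expressed as a short Python sketch. The function name and labels are hypothetical, and the sketch assumes that the cutoff of 10 is the only boundary considered when judging whether a score has crossed a cutoff; a clinician applying the other severity cutoffs (5 and 15) would need to adapt it.

```python
RCI_POINTS = 6  # rounded reliable change threshold reported in this study
CUTOFF = 10     # GAD-7 cutoff for moderate anxiety (assumed to be the only boundary here)


def classify_change(time1: int, time2: int) -> str:
    """Classify change between two GAD-7 administrations."""
    change = time2 - time1
    if abs(change) < RCI_POINTS:
        return "no reliable change"
    # Lower scores mean less anxiety, so a negative change is improvement.
    direction = "improvement" if change < 0 else "deterioration"
    crossed = (time1 >= CUTOFF) != (time2 >= CUTOFF)
    if crossed:
        return f"clinically significant {direction}"
    return f"reliable {direction}"


# Hypothetical score pairs, for illustration only.
for t1, t2 in [(14, 7), (20, 13), (12, 9), (4, 12)]:
    print(t1, t2, classify_change(t1, t2))
```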

Limitations
Though this study seeks to establish a standardized reliable change index, it should be noted that multiple choices impact the findings. First, the time frame that is selected for the test-retest reliability affects the RCI, because any variation in this interval impacts the test-retest reliability estimate. For this study, a two-to-four-week time period was given for participants to complete the retest. This time frame was chosen to reflect the common interval between administrations of the GAD-7 in routine clinical practice. Though participants completed the second survey an average of 15.8 days after the first survey, the variability in the exact time frame in which the retest was completed and the selection of this particular frame impact the test-retest reliability results.
For example, the lower the test-retest reliability correlation, the less precise the measure is across time.
Consequently, the RCI would increase.
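To make this relationship concrete, the sketch below holds the clinical standard deviation at the study's value of 5.60 and varies the reliability. Only .87 is the study's estimate; the other reliability values are hypothetical, and the function name and the 1.96 critical value are assumptions of this illustration.

```python
import math


def min_reliable_change(s1: float, r_xx: float, z: float = 1.96) -> float:
    """Smallest raw-score change exceeding measurement error: z * s1 * sqrt(2 * (1 - r_xx))."""
    return z * s1 * math.sqrt(2.0 * (1.0 - r_xx))


S1 = 5.60  # SD of the clinical sample at intake, as reported in the Results

# 0.87 is the study's estimate; the remaining reliabilities are hypothetical.
for r_xx in (0.95, 0.87, 0.80, 0.70):
    threshold = min_reliable_change(S1, r_xx)
    print(f"r_xx = {r_xx:.2f} -> minimum reliable change = {threshold:.2f}")
```

As reliability falls, the required change grows, so a less stable retest interval would yield a more conservative threshold.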
The unique composition of the clinical sample also impacted results, and features of the sample should be taken into consideration when interpreting results. Participant clients were seen for a minimum of one session at a university clinic. The sample represents clients with a variety of presenting problems, participating in different treatment modalities, including individual, couple, and family therapy, in a treatment-as-usual condition.
While this diversity in the sample increases the external validity of the results of this research, the variability in sample also inflates the standard deviation of the GAD-7, which in turn influences the RCI. In order to preserve the variability in presenting problems and treatment modalities while minimizing the effect on standard deviation, we limited the clinical sample to those individuals who scored in the distressed range on a measure of general functioning. Given the target population for the measure is distressed individuals in a clinical setting, we believe this was a warranted inclusion criterion. In sum, while efforts were undertaken to minimize threats to the reliability of the RCI, the sample diversity was maintained in an effort to enhance external validity.
We compare our results to those of Gyani et al. (2013), who reported a reliable change index for the GAD-7 of 3.53 by using an estimate of internal consistency rather than test-retest reliability (α = .90 compared to our $r_{xx}$ = .87) and a standard deviation of a clinical sample that included only clients who scored above a cutoff for either anxiety or depression (SD = 4.41 compared to our SD = 5.6). Although both inputs (i.e. the reliability of the measure and the standard deviation) affect the reliable change index, concerns about overestimating reliability by using a measure of internal consistency rather than the test-retest reliability may be less warranted than previously thought (Ferrer & Pardo, 2014). On the other hand, a future direction for research may be to more closely examine the impact of selection criteria for clinical samples when they are used to calculate an RCI.
Another limitation of this study concerns the generalizability of the findings. As the clients in the clinical sample were drawn from a university training facility, clients who seek services at this type of facility may differ from those who seek services at community-based agencies or private practices. Additionally, sampling persons on MTurk versus other potential community members could prove to be a limitation to this study as MTurk may attract a sample that is not representative of the overall population. The two samples are different in several ways, including age, education, income level, and therapy involvement. Sample respondents also provided data via two different mechanisms. While not anticipated to be statistically relevant, it is possible that differences between the samples in some way impacted results. Furthermore, both the clinical and community samples were predominantly White. We recognize that the samples used in this study may limit the generalizability of the findings.
The researchers attempted to obtain a clinical sample that was as close to a 'treatment as usual' condition as possible. Respondents represent a variety of treatment modalities, presenting problems, and personal demographic conditions. While this strategy increased external validity by more closely matching conditions in community clinics, there were threats to internal validity posed by the heterogeneous nature of the sample.
Further study is needed, with both homogeneous and heterogeneous samples, to establish the stability of the RCI.
Finally, it must be noted that the method we chose for establishing clinically significant change (Jacobson & Truax, 1991) itself is not 100% reliable. Despite offering an acceptably low rate of false positives (Ferrer & Pardo, 2014), individual change scores on the GAD-7 should still be interpreted with caution and involve taking clinical expertise and client self-report into account (Bornstein, 2017; Kazdin, 1999).

Implications
The GAD-7 is a widely used assessment for establishing diagnoses during the intake process. The establishment of a reliable change index broadens the utility of this measure so that it can be reliably used over time. Cutoff scores and reliable change indices are helpful to clinicians who want to examine their clients' progress in a manner that is in accordance with results-based accountability standards. These standards are useful to clinicians who want to establish the effectiveness of their treatment with various clients. Clinicians could anchor these findings in their practice by administering the GAD-7 at regular intervals and monitoring any changes. Administering measures repeatedly like this typically would come at a high cost to practicing clinicians, as instruments cost money and/or may be difficult to obtain. However, the GAD-7 is free and easily accessible online. By using the RCI with the GAD-7, clinicians have an empirically supported measurement to assess client progress, improve treatment process, satisfy the demands of managed care, and thereby receive approval for continuing treatment.
The standards are also useful to researchers who want to explore effectiveness across models or conditions of clinical treatment. Though there are many ways to measure client change, reliable change indices can provide a degree of confidence that the reported change is not due to error in the instrument. Researchers could also anchor the current findings in their work by testing associations between clinically significant change and aspects of therapy that purport to create change. Furthermore, previous findings from research that used the GAD-7 could be enhanced, corrected, or re-verified by implementing use of the RCI developed by this study.
This study also adds strength to and promotes the need for establishing RCIs for other mental health and relationship assessment devices.
Finally, it is our ultimate aim that clients and the community will benefit the most from the establishment of reliable change indices. First, clients can benefit directly by understanding that a measurement tool validates their degree of change. It can also empower them to continue progressing towards a healthier self. They also benefit indirectly through the clinicians who are able to provide better services to them. Implementation of a measurement that indicates whether or not meaningful change has occurred may more quickly improve client problems. This type of care will likely lead to positive feelings among clients that can then lead to more healthy behaviours and productivity in their community. Furthermore, if a community of health workers (e.g. physicians, therapists, and psychiatrists) shares a similar understanding of meaningful change on the GAD-7 and applies it consistently, then greater collaborative treatment can occur, providing clients the best possible outcome.

Funding
This study was funded by the Department of Human Development and Family Studies at the University of Connecticut.