Volume 68 Number 3
Federal Probation

Assessing the Inter-rater Agreement of the Level of Service Inventory Revised

Christopher T. Lowenkamp, Ph.D., University of Cincinnati
Alexander M. Holsinger, Ph.D., University of Missouri-Kansas City
Lori Brusman-Lovins, M.S.W., University of Cincinnati
Edward J. Latessa, Ph.D., University of Cincinnati


Introduction

Risk assessment has a long history in corrections. In reviewing the types of assessment practices available, Bonta (1996) identifies three generations of risk assessments. Each of these assessment processes possesses advantages and disadvantages. For example, the first generation of risk assessments, also known as quasi-clinical or subjective assessments, allows for deviation from the assessment protocol when necessary, but has proven to be lacking in predictive accuracy (Bonta, Law, & Hanson, 1998; Hanson & Bussière, 1998; Mossman, 1994). Second generation assessments are objective and empirically based, but often focus on criminal history and a host of atheoretical (and static) factors. While the second generation of risk assessments has been fairly accurate in regard to prediction and easy to score, very little can be garnered from this second generation that leads to the development of a meaningful intervention plan (Andrews & Bonta, 2003). The third generation of risk assessments is also objective and empirically based. What makes these assessments more useful for case planning is their dynamic measurement of risk factors and the quality and breadth of information collected. But this advantage is also a disadvantage. Measuring dynamic risk factors and scoring a detailed and comprehensive risk assessment require specialized knowledge of the assessment process and the items contained therein.

The advent of the third generation of risk assessments has provided much hope for correctional interventions. Such assessments can be used not only to identify high-risk offenders, but to determine what factors exist in an individual's life that cause him or her to be high risk (Lowenkamp & Latessa, 2002). Such a determination provides meaningful targets for interventions. If these targets are properly addressed, reductions in risk, and subsequently in the likelihood of recidivism, should follow. In the aggregate, these instruments can assist correctional agencies in increasing public safety. However, as noted earlier, completing these assessments requires training and considerably more care than completing other types of assessment. The LSI-R, the focus of the current research, requires knowledge of psychological testing in general and specialized knowledge of how to score the risk assessment itself (Andrews & Bonta, 2001). Included in the LSI-R are 10 criminogenic domains (criminal history, education/employment, financial, family relationships, accommodations, leisure and recreation, companions, substance use, emotional health, and attitudes/orientations). Additional reviews of other risk assessments, and of staff ability to complete them and integrate the information into daily decision-making activities, also lead to the recommendation of staff training and development (Andrews & Bonta, 2003).

One concern, among others, that makes training necessary is the reliability of the completed assessment (for a more complete description of potential threats to the utility of risk assessments, see Bonta, Bogue, Crowley and Motiuk, 2001). With any interview-based, dynamic offender assessment process, reliability in scoring is essential. Often the issue of individual subjectivity is raised due to the ways in which information is gathered, and the scoring criteria that accompany third generation risk/need assessment instruments. One of the major advantages of instruments such as the LSI-R is the potential for standardization in classification and assessment. In other words, when conducted properly, use of the LSI-R may help reduce bias in decision making, create a logical classification strategy, and offer information that can be used to create detailed, dynamic case planning. In light of the weight of the decisions that can be informed through using the LSI-R, inter-rater reliability becomes a critical issue.

To date, much of the research involving the LSI-R has focused on the predictive validity of the tool (Andrews, 1982; Andrews and Robinson, 1984; Bonta and Andrews, 1993). The LSI-R, through the provision of a composite additive score, should offer a valid scale where high scores are associated with a high probability of recidivism. Conversely, low scores on the LSI-R should represent a low probability of recidivism. The validity of the LSI-R has been shown in a variety of correctional settings and with a variety of offender sub-groups (Lowenkamp and Latessa, 2002; Lowenkamp, Holsinger, and Latessa, 2001). More research is needed, however, regarding the reliability of the LSI-R scores across various raters.

Some studies have indirectly examined the inter-rater reliability of the LSI-R. One study utilized a Self Report Inventory that was derived from the LSI-R itself. The Self Report Inventory was designed to gather from offenders themselves measures similar to those on the LSI-R. The Self Report Inventory did demonstrate inter-rater reliability with the LSI-R, which would indicate congruence between the information gathered by correctional professionals and the information provided by offenders themselves (Motiuk, Motiuk, and Bonta, 1992). While these results are encouraging, they do not demonstrate reliability across the group of professionals conducting the LSI-R, but rather reliability between the offender being assessed and the professional conducting the assessment. In addition, other research has demonstrated the reliability of the LSI-R compared to the reliability of other risk assessment tools such as the PCL-R (Gendreau, Goggin, and Smith, 2002). While this type of investigation is clearly a necessary part of the correctional research landscape, the question regarding LSI-R scores across individual raters is left largely unanswered. In fact, much of the information about the reliability of the tool takes the form of measures of internal consistency (such as Cronbach's alpha coefficient) for the 10 subscales present in the tool, as well as for the tool as a whole (all 54 items together).

One way to increase inter-rater reliability within the LSI-R (as well as overall quality) may be through staff training (Flores, Lowenkamp, Holsinger, and Latessa, 2004). The important effect of training has been demonstrated in other venues as well, involving offender assessments other than the LSI-R (Baird and Prestine, 1988). In order to further test the specific inter-rater reliability of the composite LSI-R score, research that uses a common example across a group of LSI-R raters is necessary. The current research utilizes a sample of correctional professionals, all of whom were formally trained in the use and implementation of the LSI-R. As part of the formal training, a common example was presented in the form of a vignette containing current (dynamic) narrative information about a particular offender. After the training was complete, the participants were asked to utilize the common vignette to score the 54 items on the LSI-R. The ratings that resulted were used to test the inter-rater reliability of the LSI-R for a sample of trained correctional professionals. The results of the analyses are presented below.

There are several different methods for assessing an instrument's reliability. The focus in this research is inter-rater agreement, or the extent to which independent raters converge in their scoring of the same offender. To assess inter-rater reliability, 167 training participants independently completed an assessment of an offender vignette at the conclusion of a three-day training on the principles of offender classification and the use of the LSI-R in particular.

Methodology

Participants

The participants in this study are 167 correctional practitioners from a large Western state. While data on the individual participants was not collected, participants included males and females, individuals of various races, and those working with offenders in the community and in institutions.

Procedures

The participants took part in a three-day training required of all correctional staff. The training covered the intent and scoring criteria for each of the 54 items on the LSI-R. At the end of the training, the participants were given an exam that included a vignette describing an offender. This vignette covered all areas represented on the LSI-R. Participants were instructed to complete the exam and score an LSI-R based on the vignette independently. A facilitator was present during the completion of the exam. The scores from the LSI-R scoring forms were entered into a database for analyses.

Analyses

Since there was only one assessment for each of the 167 raters, traditional tests of inter-rater reliability are not possible. However, investigating the percentage of agreement for each item, descriptives for the total score, and agreement for overall classification is possible. To assess the reliability of the LSI-R, the percentage of raters who scored each LSI-R item in each possible way was calculated. Marks for each item were coded into one of four categories: the item was marked as a risk factor, the item was marked as not a risk factor, the item was circled, or the item was left blank. Average agreement percentages were calculated for each section and for the entire instrument. Descriptive statistics were calculated on the overall score and the final classification based on that score.
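
The per-item calculation described above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual procedure or software: the category labels, the function name item_agreement, and the example ratings are all hypothetical, and treating "agreement" as the share of raters who gave the most common mark is an assumption about how the percentages were derived.

```python
from collections import Counter

# Hypothetical labels for the four ways an item could be marked.
CATEGORIES = ("risk", "non_risk", "circled", "blank")


def item_agreement(marks):
    """Return (most common mark, percent of raters giving that mark)."""
    if not marks:
        raise ValueError("at least one rating is required")
    counts = Counter(marks)
    category, n = counts.most_common(1)[0]
    return category, 100.0 * n / len(marks)


# Made-up marks from 10 raters on a single item (not study data).
example_marks = ["risk"] * 8 + ["non_risk", "circled"]
modal_mark, percent = item_agreement(example_marks)
print(modal_mark, percent)  # risk 80.0
```

Applying the same function to each of the 54 items would yield the kind of by-item agreement percentages reported in the Results.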

Results

Table 1 presents by-item results for the first four sections of the LSI-R (Criminal History, Education/Employment, Financial, and Family/Marital). The percentage of the respondents who scored each item in a particular way is presented. There were four possibilities for each item: 1) marking an item as a risk factor, 2) marking an item as a non-risk factor, 3) circling an item, indicating that not enough information was present to assess it either way, and 4) leaving the item blank. For the Criminal History section, the agreement was very high. For nine of the ten items, agreement ranged from 86 percent to 100 percent. The lowest agreement occurred for "ever punished for institutional misconduct," where 55 percent of the sample scored the item as a non-risk factor. All the items within the Education and Employment section had a percentage agreement ranging between 95 percent and 99 percent. The Financial section, with only two items, had fairly low agreement by comparison, with 66 percent and 57 percent of the sample in agreement. Three of the four items in the Family and Marital section had very high agreement, ranging between 94 percent and 100 percent. The last item in this section had an agreement rate of 51 percent.

Table 2 presents by-item results for the remaining six sections of the LSI-R (Accommodations, Leisure/Recreation, Companions, Alcohol/Drug, Emotional/Personal, and Attitudes and Orientations). The three items in the Accommodations section had very high rates of agreement (96 to 99 percent). Similarly, the Leisure and Recreation section also showed high rates of agreement for the two items in the section (90 and 98 percent). Likewise, the five items in the Companions section had high rates of agreement (89 to 100 percent). The first seven (of nine) items in the Alcohol and Drug section had very high rates of agreement amongst the raters (92 to 100 percent). However, the last two items had agreement rates of lesser magnitude (56 and 72 percent). Four of the five items in the Emotional and Personal section had high rates of agreement, ranging from 89 to 99 percent. One item had a moderate rate of agreement, at 65 percent. Finally, the Attitudes and Orientations section showed high rates of agreement as well, with the four items ranging in agreement from 82 to 98 percent. Overall, the agreement rate for the 54 items taken as a whole was very high for the sample.

Table 3 presents the average agreement rates for each subsection. Nine of the 10 subsections had average agreement rates of 85 percent or above (the Accommodations section with three items had the highest average agreement rate at nearly 98 percent). The Financial section, however, had the lowest average agreement rate, at 61.5 percent.
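
As a worked check of how a section average follows from the item-level figures, the sketch below averages the two Financial item agreement rates reported earlier (66 and 57 percent). The function name section_average is hypothetical, and the use of a simple unweighted mean across items is an assumption, although it does reproduce the 61.5 percent figure.

```python
def section_average(item_agreements):
    """Unweighted mean of the item-level agreement percentages in a section."""
    if not item_agreements:
        raise ValueError("a section must contain at least one item")
    return sum(item_agreements) / len(item_agreements)


# Agreement rates for the two Financial items, as reported in the text.
financial_items = [66, 57]
print(section_average(financial_items))  # 61.5
```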

Table 4 presents the portions of the sample that placed the offender in each category of risk (using Multi-Health Systems' prescribed cut-off scores). A large majority of the subjects in the sample (86 percent) assessed the offender in the vignette as having a composite score that placed the offender in the Medium/High category of risk. These results are particularly important when considering the importance of classifying offenders objectively, and allowing agencies to incorporate the Risk principle of correctional classification and intervention. Regardless of slight differences that may have occurred among raters on a handful of the items, the average rates of agreement were acceptable to very high for each subsection, and a very high proportion of the sample assessed the offender as being at the same level of risk.

Discussion

The goals of the current research were fairly modest. After being trained on the use of the LSI-R, practitioners were tested as to whether or not they agreed on the scoring of the 54 items present on the assessment. However modest these goals were, determining whether or not practitioners can reliably score the LSI-R assessment is an important issue. The LSI-R and the information it gleans can be used to inform several decision points throughout the processing of offenders. In addition, both the Risk and Need principles can be met via the use of the LSI-R. In light of these aspects of offender assessment and classification, and the fact that the LSI-R is an example of a proprietary tool that requires agencies to commit resources, inter-rater reliability becomes even more important. Based on the results presented above, properly trained practitioners do exhibit high levels of agreement across virtually all the items in the 54-point scale. Even considering that a small number of items had moderate rates of agreement, the overall average agreement rates for all 10 subsections were acceptable to very high. Where agreement rates were lower, they should in most cases be interpreted in light of the fact that the subjects had attempted to conduct the assessment only two prior times. As such, with continued practice and quality assurance checks, rater agreement should only increase over time. This assumption is supported by Flores et al. (2004), who found that the amount of time an agency uses the instrument and the implementation of formalized training on its use produce significant increases in the predictive validity of the tool.

Some limitations were inherent in the current research. For example, the LSI-R process, when conducted in the field, requires practitioners to gather their own data via one-on-one interviews with offenders and the consideration of multiple sources of collateral information. The subjects in the current study were given a tailor-made vignette that represented the information that should have been gathered had they been involved in a real-life assessment process. A true test of inter-rater reliability using the LSI-R would require pairs (or more) of subjects to gather their own information independently from the same source, after which the assessment would be scored accordingly. Doing this would also allow for the calculation of inferential statistics designed to more explicitly test rater agreement. Nonetheless, the research presented above represents a descriptive analysis that attempts to contribute to the knowledge base pertaining to inter-rater reliability with practitioners who have been trained in the use of the LSI-R.
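
To illustrate the kind of statistic that a paired-rater design would make available, the sketch below computes Cohen's kappa, a chance-corrected agreement coefficient for two raters scoring the same cases. The article does not name a specific statistic, so the choice of kappa is an assumption, and the function name and the ratings shown are hypothetical rather than study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Made-up item scores (1 = risk factor, 0 = not) from two raters.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_2 = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.6
```

In this made-up example the raters agree on 8 of 10 items, while 50 percent agreement would be expected by chance, giving a kappa of about 0.6.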