Classical test theory procedures (e.g., simple summation of responses) have limitations that cause some information to be lost when the test score is calculated. One limitation is that an ordinal scale is treated as an interval scale (e.g., a 5-point Likert scale: 1 = Strongly disagree, 2 = Disagree, 3 = Neither agree nor disagree, 4 = Agree, 5 = Strongly agree), even though the verbal descriptions may imply unequal distances between the response categories. This makes simple summation of responses questionable. At most, we can transform the scale values so that the relationships between the variables become approximately linear, thereby approximating an interval scale. This can be done with optimal scaling methods. Alternatively, in latent variable theory, factor analysis attempts to approximate an interval measure by introducing a continuous latent response variable. In our study, we used computer simulation to investigate how the number of response categories, the length of the test, the sample size, and the pattern of distances between response categories influence the reliability and validity of the test under three scaling methods: simple summation of responses, factor analysis for categorical variables, and optimal scaling. The Likert-type items had 4, 5, or 7 response categories; the tests had 10, 20, or 30 items; the samples had 100, 200, or 500 simulated participants; and the distances between response categories followed one of three patterns (equal, moderately asymmetric, or strongly asymmetric). For each of the 81 conditions (3 × 3 × 3 × 3), we generated a response matrix and calculated final test scores with each of the three methods. We then calculated internal consistency and criterion validity for each of the final test scores. In general, factor analysis gave the best results in terms of both reliability and validity, followed by optimal scaling and then simple summation of responses.
All four experimental factors had statistically significant effects. Based on the results, we offer recommendations for scoring items that use ordinal or Likert-type scales.
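The simulation design described above can be illustrated with a minimal Python sketch. It is only an approximation under stated assumptions, not the procedure used in the study: the threshold function, the skew exponents standing in for the three distance patterns, the common loading of 0.7, Cronbach's alpha as the internal consistency index, and the latent trait as the validity criterion are all hypothetical choices. Only the simple summation method is shown; factor analysis for categorical variables and optimal scaling are omitted.

```python
import itertools
import numpy as np

# Design factors as listed in the abstract (3 x 3 x 3 x 3 = 81 conditions).
N_CATEGORIES = (4, 5, 7)
N_ITEMS = (10, 20, 30)
SAMPLE_SIZES = (100, 200, 500)
SPACINGS = ("equal", "moderate", "strong")


def thresholds(n_categories, spacing):
    """Hypothetical cut points on the latent response scale."""
    # Assumption: a power warp stands in for the asymmetry of category distances.
    power = {"equal": 1.0, "moderate": 1.5, "strong": 2.5}[spacing]
    # Interior points in (0, 1), warped, then mapped onto [-2, 2].
    p = np.arange(1, n_categories) / n_categories
    return 4.0 * p ** power - 2.0


def simulate_responses(n_persons, n_items, n_categories, spacing, rng, loading=0.7):
    """Draw a latent trait and discretize continuous latent responses."""
    theta = rng.standard_normal(n_persons)
    noise = rng.standard_normal((n_persons, n_items))
    latent = loading * theta[:, None] + np.sqrt(1.0 - loading ** 2) * noise
    cuts = thresholds(n_categories, spacing)
    # np.digitize returns 0..len(cuts); shift so categories run 1..n_categories.
    responses = np.digitize(latent, cuts) + 1
    return theta, responses


def cronbach_alpha(responses):
    """Cronbach's alpha for a persons-by-items response matrix."""
    k = responses.shape[1]
    item_var = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_var / total_var)


conditions = list(itertools.product(N_CATEGORIES, N_ITEMS, SAMPLE_SIZES, SPACINGS))
rng = np.random.default_rng(0)
results = []
for n_cat, n_items, n_persons, spacing in conditions:
    theta, resp = simulate_responses(n_persons, n_items, n_cat, spacing, rng)
    sum_score = resp.sum(axis=1)  # simple summation of responses
    alpha = cronbach_alpha(resp)  # internal consistency
    # Assumption: the simulated latent trait serves as the criterion.
    validity = np.corrcoef(sum_score, theta)[0, 1]
    results.append((n_cat, n_items, n_persons, spacing, alpha, validity))
```

Each entry of `results` holds one condition's settings together with the alpha and validity estimates for the sum score. A fuller replication would repeat each condition many times and add the other two scoring methods.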