Abstract
Background: Multivariate statistical process control (MVSPC) based on mixed-type data (MTD) is a very recent and little-known field. We review the possibilities for MVSPC with MTD (i.e., when some variables are numeric and some categorical, which is common in health care). The main approaches to this problem are dimensionality reduction yielding numeric dimensions, nonparametric methods, machine learning, and measuring distances between MTD points (Gower distance, Euclidean distance). Our research focused on the last of these, together with the Hotelling T2 statistic (which is the basis for MVSPC for numeric data).
Methods: We compared ten methods for MVSPC: local and global Euclidean distance, local and global Gower distance, standard T2, T2 using Gower distance with or without bootstrap, T2 using Gower distance with bootstrap based on principal component analysis (PCA), and permutational implementations of T2 using Gower distance and of global Gower distance. We wrote an R function that performs one iteration of the simulations for each method and calculates the observed type I error (test size) and sensitivity (proportion of correctly identified simulated out-of-control cases) from the inference on the simulated cases (which were either in-control or out-of-control). In the second part of our research, we tested the methods on a dataset of 100 patients after amputation who received a permanent transtibial prosthesis at the University Rehabilitation Institute in Ljubljana in 2014.
Results: In general, observed type I error was not problematic, except with the T2 method (too high when numeric variables were asymmetrically distributed, and too low otherwise). The Gower distance method performed better with a larger number of categorical variables and with asymmetrically distributed numeric variables. In the majority of the simulations (i.e., except when the deviations of the out-of-control cases were very large), sensitivity turned out to be low. In terms of observed type I error, several methods proved to be adequate: T2 using Gower distance with bootstrap based on PCA, permutational global Gower distance, T2 using Gower distance with bootstrap, global Gower distance and global Euclidean distance. When testing the methods on the real data, local Euclidean distance and local Gower distance were in the highest agreement with the actual in- or out-of-control status. The latter also had the lowest observed type I error rate.
Conclusion: Our findings can lead to improvement of multivariate analyses in the field of health care quality, where mixed data are often encountered. The proposed methods were applied to a dataset from the field of health care provision and proved to be useful in practice.