This master's thesis addresses the problem of estimating the risk factors of firms from accounting and firmographic data, with an emphasis on the statistical methods used for that estimation.
Credit risk has received a great deal of interest in recent decades, and methods for estimating it have shifted from subjective models based on expert opinion to advanced quantitative methods. This shift is partly due to the growing amount of information available and the ease of access to information and software. The initial models were often based on data pooled over long periods of time and worked on the principle of classifying companies into two distinct groups according to their characteristics at a single point in time. In recent years, these models have been overshadowed by discrete-time hazard models, which have become the state of the art in predicting risk. Discrete-time hazard models on panel data are estimated using logistic regression, and they take into account the volatile nature of companies' financial structure and other characteristics. Furthermore, they are less subject to sample bias, which makes the estimates of both risk and risk factors more accurate and more representative of the study population. Although these models are often recommended and used, their performance is most often reported only by AUC, a measure of the discrimination power of classification models, without exploring the additional possibilities that such models offer.
In order to study the differences between popular methods, a discrete-time hazard model without time-varying covariates and a logistic regression model were developed in addition to the main discrete-time hazard model with time-varying covariates. The findings of all models show that the profitability, liquidity and indebtedness of companies with operating difficulties differ from those of companies without operating difficulties, and that non-financial variables complement the information given by the financial variables. Using the last available data to estimate the logistic regression model leads to the best discrimination power, but due to sampling bias, the model significantly overestimates the actual probability of a firm having operating difficulties. Because the incidence of events and the effects of risk factors differ from period to period, pooling data from several years caused the logistic regression estimates of risk factors to deviate from the observed effects in some periods. The comparison of the discrete-time hazard models shows that, for the model to be effective, it must take into account the most recent data on the firm's characteristics, as the informativeness and relevance of the data diminish with time. Although all three models are valid and relevant, the discrete-time hazard model with time-varying covariates uses more information and is better at describing the risk factors and at capturing firms with operating difficulties in all periods. The model can be further improved by including random effects, but the effects of including them have not been investigated in detail in this thesis.
Another advantage of hazard models is that, in addition to estimating the hazard and the risk factors, they also allow companies to be compared using their hazard profiles. The estimated hazards in individual periods can be used to estimate the probability of "surviving" up to a certain period, which can be very useful in estimating the risk of loans with longer maturities. Last but not least, the estimated survival probabilities allow an alternative approach to assessing discrimination power, using the time-dependent discrimination indices AUC(t) and C^td. Since they provide insight into the discrimination power of the model both in each period and over the whole follow-up period, the time-dependent measures prove to be more informative about the model's ability to discriminate between firms with and without operating difficulties. They also point to a weakness of the established measure of discrimination power, which may depend too heavily on how the number and rate of events are distributed across time periods.
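The link between period-specific hazards and survival probabilities can be illustrated with a minimal sketch. It assumes the standard discrete-time relation S(t) = (1 - h_1)(1 - h_2)...(1 - h_t); the function name `survival_curve` and the hazard values are hypothetical and are not taken from the thesis.

```python
from itertools import accumulate
from operator import mul


def survival_curve(hazards):
    """Return [S(1), ..., S(T)], where S(t) is the product of (1 - h_j) for j <= t."""
    for h in hazards:
        if not 0.0 <= h <= 1.0:
            raise ValueError("each hazard must lie in [0, 1]")
    # Running product of the per-period probabilities of not experiencing the event
    return list(accumulate((1.0 - h for h in hazards), mul))


# Hypothetical per-period hazards for a single firm (illustrative values only)
hazards = [0.02, 0.03, 0.05]
survival = survival_curve(hazards)

# Cumulative probability of operating difficulties by the end of each period
cumulative_risk = [1.0 - s for s in survival]
```

Under these assumptions, `cumulative_risk[-1]` would be the quantity of interest for a loan whose maturity spans all three periods.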