Journal of Operational Risk
ISSN:
1744-6740 (print)
1755-2710 (online)
Editor-in-chief: Marcelo Cruz
Need to know
- Machine learning classifiers are widely used in financial applications. Due to the nature of certain classification problems, special care should be taken when dealing with imbalanced data.
- Resampling is a common technique in building machine learning models to address imbalanced data issue. This study shows that balanced data does not necessarily result in the best model performance.
- The degree to which different algorithms are impacted by class imbalance is not equal. It is demonstrated that certain classification algorithms, especially tree-based algorithms perform very well in the presence of imbalance, while other algorithms such as neural network and logistic regression perform very poorly.
Abstract
Machine learning classifiers are widely used in financial applications. Due to the nature of certain classification problems, special care should be taken when dealing with imbalanced data. In practice, many model developers and validators fail to take this into account in their model development and validation. In addition, resampling is a common technique to address imbalanced data issues when building traditional logistic regression models. However, there has been no specific discussion regarding the resampling ratio used to rebalance the data or how the issue of imbalance impacts different kinds of machine learning classifiers, especially the more advanced ones. This paper aims to outline the special characteristics of the classifiers, compare different methods in dealing with imbalanced data issues and provide best practice in model development, evaluation and validation to avoid common pitfalls. Although the methods discussed in this paper can apply to general machine learning classifiers in applications with imbalanced data issues, by using a case study in credit card fraud detection this paper calls practitioners’ attention to the imbalanced data problems therein, where class imbalance is often mistreated and lacks theoretical discussion.
Copyright Infopro Digital Limited. All rights reserved.
As outlined in our terms and conditions, https://www.infopro-digital.com/terms-and-conditions/subscriptions/ (point 2.4), printing is limited to a single copy.
If you would like to purchase additional rights please email info@risk.net
Copyright Infopro Digital Limited. All rights reserved.
You may share this content using our article tools. As outlined in our terms and conditions, https://www.infopro-digital.com/terms-and-conditions/subscriptions/ (clause 2.4), an Authorised User may only make one copy of the materials for their own personal use. You must also comply with the restrictions in clause 2.5.
If you would like to purchase additional rights please email info@risk.net