A New Method to Evaluate the Performance of Models Trained with Synthetic Data When They are Applied to Real-World Data

On Jan 22, 2023

Credit scoring models are crucial for assessing and managing credit risk within financial institutions. However, research on these models is limited because data is hard to obtain: financial institutions restrict access to it in order to protect borrowers’ private information. Generative models can provide a solution by creating synthetic data that resembles real-world data, allowing for research without compromising privacy. Synthetic data can also improve the accuracy of credit scoring models by augmenting limited real-world data.

So far, the use of synthetic data in credit scoring has been mainly limited to addressing class imbalance in classification problems, using techniques such as SMOTE, variational autoencoders, and generative adversarial networks. Recent studies have used these methods to generate synthetic samples that balance the minority class and improve the accuracy of credit scoring models. A recent paper goes further and introduces a framework for training credit scoring models on synthetic data and applying them to real-world data, while also analyzing the models’ ability to handle data drift. The main findings suggest that a model trained on synthetic data can perform well on real-world data, but that working in a privacy-preserving environment comes at a cost: a loss of predictive power.

The proposed work uses a dataset provided by a financial institution, which includes borrower financial information and social interaction features over two periods, January 2018 and January 2019, each containing 500,000 individuals. The borrowers are labeled based on their payment behavior in the following 12-month observation period. To generate synthetic data that mimics real-world behavior and maintains privacy, two state-of-the-art synthetic data generators, CTGAN and TVAE, are compared using different configurations, and the best one is selected. Then, a new synthesizer is trained using the best configuration, and the feature set is expanded with social interaction features. Finally, a framework to estimate borrowers’ creditworthiness is proposed, using feature selection and a K-fold cross-validation scheme. The performance is evaluated using various metrics, such as AUC, KS, and F1-score.
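Two of the metrics named above, AUC and KS, can be illustrated with a minimal pure-Python sketch. This is not the authors' evaluation code: the function names `auc_score` and `ks_statistic`, the toy labels and scores, and the convention that label 1 marks a defaulting borrower are all assumptions made for illustration.

```python
from bisect import bisect_right


def auc_score(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the sample size, so only
    suitable for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ks_statistic(labels, scores):
    """KS as the largest gap between the score CDFs of the two classes."""
    pos = sorted(s for y, s in zip(labels, scores) if y == 1)
    neg = sorted(s for y, s in zip(labels, scores) if y == 0)
    if not pos or not neg:
        raise ValueError("both classes must be present")
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = bisect_right(pos, t) / len(pos)  # share of positives <= t
        cdf_neg = bisect_right(neg, t) / len(neg)  # share of negatives <= t
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Toy data invented for illustration (assumption: 1 = defaulter).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_score(toy_labels, toy_scores)
toy_ks = ks_statistic(toy_labels, toy_scores)
```

On this toy input, three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75, and the largest gap between the two cumulative distributions is 0.5.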

The authors implemented the methodology using Python’s NetworkX and Synthetic Data Vault (SDV) libraries. The performance of the two synthetic data generators, CTGAN and TVAE, was compared using two different architectures and different feature sets. The results show that TVAE had faster execution times and better performance in synthesizing both continuous and categorical features. Additionally, a logistic regression model was trained to distinguish between real and synthetic data, and the results indicate that TVAE again achieved the best performance, although this performance decreased as more features were included in the synthesizer.

The authors then compared creditworthiness assessment models trained on synthetic data with models trained on real-world data. First, they trained classifiers on real-world data and tested them on holdout datasets; here the gradient boosting algorithm outperformed logistic regression. Next, they trained classifiers on synthetic data and applied them to real-world data. The results indicate that the models’ performance was similar when trained on synthetic data, except in one case. Overall, the comparison shows a cost to using synthetic data: a loss of predictive power of approximately 3% when measured in AUC and approximately 6% when measured in KS.
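The idea of training a logistic regression to tell real rows from synthetic ones can be sketched in pure Python as follows. This is an illustrative stand-in, not the authors' implementation: the Gaussian toy data replaces both the real dataset and the CTGAN/TVAE output, the hand-written gradient descent replaces whatever logistic regression implementation the paper used, and all function names and hyperparameters are hypothetical. A discriminator AUC close to 0.5 means the synthetic rows are hard to tell apart from the real ones.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=0.1, epochs=200):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, k = len(rows), len(rows[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def discriminator_auc(real, synthetic, seed=0):
    """Train on a random half of the pooled rows, report AUC on the other half."""
    data = [(x, 0) for x in real] + [(x, 1) for x in synthetic]  # 1 = synthetic
    random.Random(seed).shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    w, b = fit_logistic([x for x, _ in train], [y for _, y in train])
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x, _ in test]
    return auc([y for _, y in test], scores)


# Toy stand-ins (assumption): two Gaussian features per row.
rng = random.Random(42)


def sample(mean, n):
    return [[rng.gauss(mean, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]


real_rows = sample(0.0, 300)
faithful_rows = sample(0.0, 300)  # mimics a synthesizer that matches the real data
shifted_rows = sample(2.0, 300)   # mimics a synthesizer that distorts one feature

auc_faithful = discriminator_auc(real_rows, faithful_rows)
auc_shifted = discriminator_auc(real_rows, shifted_rows)
```

In this sketch the faithful sample should yield an AUC near 0.5 and the shifted sample a much higher one. Scoring on held-out rows rather than the training rows keeps the discriminator from looking better than it is through overfitting.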

This article presented a study that uses synthetic data generation to enable credit scoring research while protecting borrowers’ privacy. The proposed framework trains models on synthetic data and applies them to real-world data while analyzing their ability to handle data drift. The results show that models trained on synthetic data can perform well, but at the cost of some predictive power. The study also found that TVAE performed better than CTGAN.

Check out the Paper. All credit for this research goes to the researchers on this project.

Mahmoud is a PhD researcher in machine learning. He also holds a
bachelor’s degree in physical science and a master’s degree in
telecommunications and networking systems. His current areas of
research concern computer vision, stock market prediction and deep
learning. He has produced several scientific articles about person
re-identification and the study of the robustness and stability of deep
networks.
