Communications - Scientific Letters of the University of Zilina 2022, 24(3):D105-D115 | DOI: 10.26552/com.C.2022.3.D105-D115

Smote vs. Random Undersampling for Imbalanced Data - Car Ownership Demand Model

Wuttikrai Chaipanha, Patiphan Kaewwichian*
Department of Civil Engineering, Faculty of Engineering, Rajamangala University of Technology Isan, Khon Kaen, Thailand

Because the number of cars reflects each person's travel behavior in a specific location, the car ownership demand model plays a central role in travel demand analysis, helping to explain individual and household travel behavior in each area. However, the data from the study project for the master plan of the Khon Kaen expressway were imbalanced; that is, the majority class and the minority class were unequal in size. This study therefore proposes balancing the data with oversampling and undersampling techniques before a machine learning model is developed. The combination of SMOTE (Synthetic Minority Oversampling Technique) and kNN (k-nearest neighbors, k = 5) performed better than the other algorithms studied. For the rural and suburban areas, two region types with very different imbalance ratios, the TPR (true positive rate) was 46.9 % and 46.4 %, respectively, before the data were balanced, and rose to 63.5 % and 54.4 %, respectively, after balancing.
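
The two balancing strategies and the TPR metric named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy two-feature data, and the class labels are hypothetical, and the oversampler is a simplified SMOTE-style interpolation between a minority sample and one of its k nearest same-class neighbors.

```python
import math
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class."""
    rng = random.Random(seed)
    target = min(Counter(y).values())
    keep = []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_oversample(X, y, k=5, seed=0):
    """Add synthetic points until every class matches the largest class."""
    rng = random.Random(seed)
    target = max(Counter(y).values())
    X_out, y_out = list(X), list(y)
    for label in sorted(set(y)):
        pts = [x for x, lab in zip(X, y) if lab == label]
        if len(pts) < 2:
            continue  # at least two points are needed to interpolate
        for _ in range(target - len(pts)):
            base = rng.choice(pts)
            # up to k nearest same-class neighbors of the base point
            others = [p for p in pts if p is not base]
            neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
            mate = rng.choice(neighbors)
            gap = rng.random()  # position along the segment base -> mate
            X_out.append([b + gap * (m - b) for b, m in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


def true_positive_rate(y_true, y_pred, positive):
    """TPR (recall) for one class: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2],
     [2.0, 2.0], [2.2, 2.1], [2.1, 2.3]]
y = [0] * 6 + [1] * 3

X_over, y_over = smote_oversample(X, y, k=5)
X_under, y_under = random_undersample(X, y)
print("oversampled class counts:", Counter(y_over))
print("undersampled class counts:", Counter(y_under))
print("TPR for class 1:", true_positive_rate([1, 1, 1, 0], [1, 0, 1, 0], 1))
```

Oversampling keeps every original observation and adds interpolated minority points, whereas undersampling discards majority observations. In practice a tested library implementation would normally be preferred over a hand-written sketch like this one.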

Keywords: tour-based model, multiclass classification, k-nearest neighbors, activity-based model

Received: July 28, 2021; Accepted: January 31, 2022; Prepublished online: March 25, 2022; Published: July 1, 2022

Chaipanha, W., & Kaewwichian, P. (2022). Smote vs. Random Undersampling for Imbalanced Data - Car Ownership Demand Model. Communications - Scientific Letters of the University of Zilina, 24(3), D105-D115. doi: 10.26552/com.C.2022.3.D105-D115

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.