Application of Python and machine learning in processing and classifying import-export data
Abstract
Vietnam’s import-export data is growing substantially in both scale and complexity, which makes it difficult to standardize and classify customs declaration information. This study proposes an automated data-processing pipeline implemented in Python, with the objective of making the analysis of customs information more efficient and more consistent. The input dataset comprises more than 10,000 real-world import-export records, which are processed through a structured sequence of steps: product name normalization, unit conversion, computation of quantitative indicators, and keyword-based product group labeling.
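The four steps named above can be illustrated with a minimal sketch. It is not the authors' implementation. The field names (`product_name`, `quantity`, `unit`, `value_usd`), the conversion table, the keyword lists, and the choice of unit price as the quantitative indicator are all hypothetical and serve only to show the shape of such a pipeline.

```python
import re
import unicodedata

# Hypothetical conversion factors to kilograms (not taken from the study).
UNIT_TO_KG = {"kg": 1.0, "g": 0.001, "tonne": 1000.0}

# Hypothetical keyword rules; the first group with a matching keyword wins.
GROUP_KEYWORDS = {
    "steel": ("steel", "thep"),
    "fertilizer": ("urea", "npk", "fertilizer"),
}


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    lowered = name.lower().replace("đ", "d")  # 'đ' has no Unicode decomposition
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", no_marks).strip()


def to_kg(quantity: float, unit: str) -> float:
    """Convert a quantity to kilograms; unknown units raise ValueError."""
    key = unit.strip().lower()
    if key not in UNIT_TO_KG:
        raise ValueError(f"unknown unit: {unit!r}")
    return quantity * UNIT_TO_KG[key]


def label_group(normalized_name: str) -> str:
    """Return the first group whose keyword appears as a whole token."""
    tokens = set(normalized_name.split())
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in tokens for keyword in keywords):
            return group
    return "other"


def process_record(record: dict) -> dict:
    """Run one declaration record through all four steps."""
    name = normalize_name(record["product_name"])
    weight_kg = to_kg(record["quantity"], record["unit"])
    return {
        "product_name": name,
        "weight_kg": weight_kg,
        # Illustrative indicator: declared value per kilogram.
        "usd_per_kg": record["value_usd"] / weight_kg,
        "group": label_group(name),
    }


example = process_record(
    {"product_name": "Thép cuộn, cán-nóng", "quantity": 2,
     "unit": "Tonne", "value_usd": 1200.0}
)
print(example)
```

Matching keywords against whole tokens rather than substrings avoids accidental hits inside longer words, at the cost of missing compound spellings; which trade-off suits real declaration data is an open design choice.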
The experimental results indicate that the pipeline operates effectively on medium-scale, high-complexity datasets, while considerably improving classification accuracy and keeping product categories uniform. Based on these findings, the authors propose integrating machine learning models as a supplementary tool to improve generalization and adaptability to exceptional cases, which is particularly relevant in a trade environment where product names are increasingly diverse, unstandardized, and continuously evolving.
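One way to read the proposal of machine learning as a supplementary tool is a rule-first design with a learned fallback: keyword rules decide when they match, and a text classifier handles names that no rule covers. The sketch below assumes that reading. It uses a small multinomial naive Bayes classifier written with the standard library so that it is self-contained. The abstract does not name a model, and the rules and training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical keyword rules; product names are assumed to be normalized.
RULES = {"steel": {"steel"}, "fertilizer": {"urea", "fertilizer"}}


class NaiveBayes:
    """Tiny multinomial naive Bayes over whitespace tokens (add-one smoothing)."""

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for name, label in zip(names, labels):
            self.token_counts[label].update(name.split())
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, name):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in name.split():
                if token in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify(name, model):
    """Apply keyword rules first; fall back to the model if none match."""
    tokens = set(name.split())
    for group, keywords in RULES.items():
        if tokens & keywords:
            return group, "rule"
    return model.predict(name), "model"


# Hypothetical labeled examples, e.g. records already labeled by the rules.
train_names = ["hot rolled steel coil", "galvanized coil sheet",
               "urea fertilizer bag", "npk granular bag"]
train_labels = ["steel", "steel", "fertilizer", "fertilizer"]
model = NaiveBayes().fit(train_names, train_labels)

print(classify("urea 46 percent", model))   # matched by a keyword rule
print(classify("cold rolled coil", model))  # no rule matches; model decides
```

Returning the decision source alongside the label keeps rule-based and model-based labels distinguishable, so model predictions can be reviewed separately before being trusted.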
1. The Author assigns all copyright in and to the article (the Work) to the Petrovietnam Journal, including the right to publish, republish, transmit, sell and distribute the Work in whole or in part in electronic and print editions of the Journal, in all media of expression now known or later developed.
2. By this assignment of copyright to the Petrovietnam Journal, reproduction, posting, transmission, distribution or other use of the Work in whole or in part in any medium by the Author requires a full citation to the Journal, suitable in form and content as follows: title of article, authors’ names, journal title, volume, issue, year, copyright owner as specified in the Journal, DOI number. Links to the final article published on the website of the Journal are encouraged.