Application of Python and machine learning in processing and classifying import-export data

  • Do Hong Hanh, Vietnam Petroleum Institute (VPI)
  • Doan Tien Quyet, Vietnam Petroleum Institute (VPI)
  • Doan Trong Sinh, Vietnam National Industry - Energy Group (PVN)
  • Nguyen Bang Linh, Vietnam Petroleum Institute (VPI)
Keywords: Import, export, Python, machine learning, data normalization, product classification, TF-IDF, random forest, automation

Abstract

Vietnam’s import-export data is growing substantially in both scale and complexity, which makes standardizing and classifying customs declaration information increasingly difficult. This study proposes an automated data-processing pipeline implemented in Python to improve the efficiency and consistency of customs data analysis. The input dataset comprises more than 10,000 real-world import-export records, which pass through a sequence of steps: product name normalization, unit conversion, computation of quantitative indicators, and keyword-based product group labeling.
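The four steps named above can be illustrated with a minimal sketch. It is not the authors' implementation: the field names (`product_name`, `quantity`, `unit`, `value_usd`), the conversion table, the keyword rules, and the choice of unit price as the quantitative indicator are all hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical conversion factors to kilograms; real declarations use many more units.
UNIT_TO_KG = {"kg": 1.0, "g": 0.001, "ton": 1000.0}

# Hypothetical keyword rules; the first group with a matching keyword wins.
GROUP_KEYWORDS = {
    "Petroleum products": ("diesel", "gasoline", "fuel oil"),
    "Fertilizers": ("urea", "npk"),
    "Plastics": ("polypropylene", "polyethylene"),
}


def normalize_name(raw):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    # The Vietnamese letter d-with-stroke has no Unicode decomposition, so map it by hand.
    text = raw.replace("đ", "d").replace("Đ", "D")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def to_kg(quantity, unit):
    """Convert a quantity to kilograms, or return None for an unknown unit."""
    factor = UNIT_TO_KG.get(unit.strip().lower())
    return None if factor is None else quantity * factor


def label_group(name):
    """Assign a product group by keyword lookup on the normalized name."""
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return group
    return "Unclassified"


def process(record):
    """Run one record through normalization, conversion, indicator, and labeling."""
    name = normalize_name(record["product_name"])
    weight_kg = to_kg(record["quantity"], record["unit"])
    usd_per_kg = record["value_usd"] / weight_kg if weight_kg else None
    return {
        "name": name,
        "weight_kg": weight_kg,
        "usd_per_kg": usd_per_kg,
        "group": label_group(name),
    }


# Made-up records, for illustration only.
records = [
    {"product_name": "  DIESEL Oil 0.05%S ", "quantity": 2, "unit": "TON", "value_usd": 1500.0},
    {"product_name": "Phân Urê (Urea) hạt đục", "quantity": 500, "unit": "kg", "value_usd": 200.0},
]
processed = [process(record) for record in records]
```

Records whose names match no keyword fall into `"Unclassified"`, which is the kind of exceptional case a supplementary model would need to handle.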
The experimental results show that the pipeline operates effectively on medium-scale, high-complexity datasets, considerably improving classification accuracy and ensuring uniformity across product categories. Based on these findings, the authors propose integrating machine learning models as a supplementary tool to improve generalization and adaptability to exceptional cases, which is particularly relevant in a trade environment where product names are increasingly diverse, unstandardized, and continuously evolving.
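One way such a supplementary model could look, given the TF-IDF and random forest keywords, is sketched below with scikit-learn. The training names, group labels, and hyperparameters are hypothetical, and the abstract does not state how the authors configure their model. The sketch assumes that names already labeled by the keyword rules serve as training data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical normalized product names with their rule-based group labels.
train_names = [
    "diesel oil 0 05 s",
    "gasoline ron 95",
    "fuel oil 180 cst",
    "urea granular fertilizer",
    "npk 16 16 8 fertilizer",
    "urea prilled bag 50kg",
    "polypropylene resin pellets",
    "polyethylene hdpe film grade",
    "polypropylene homopolymer",
]
train_groups = [
    "Petroleum products",
    "Petroleum products",
    "Petroleum products",
    "Fertilizers",
    "Fertilizers",
    "Fertilizers",
    "Plastics",
    "Plastics",
    "Plastics",
]

# Character n-grams are an assumption here: they tolerate misspellings and
# unseen word forms better than whole-word tokens.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(train_names, train_groups)

# Names that an exact keyword lookup could miss (misspelling, abbreviation).
new_names = ["diezel oil 0 05 s", "polypropylen pp resin"]
predicted = model.predict(new_names)
```

With a training set this small the predictions only illustrate the mechanics. A realistic evaluation would hold out part of the labeled records and compare the model against the keyword rules.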

Published
2025-09-30
How to Cite
Do, H. H., Doan, T. Q., Doan, T. S., & Nguyen, B. L. (2025). Application of Python and machine learning in processing and classifying import-export data. Petrovietnam Journal, 3, 41-50. https://doi.org/10.47800/PVSI.2025.03-05