AN INVESTIGATION OF VIETNAMESE DOCUMENT CLASSIFICATION

Bui Duc Tho; Nguyen Hoang Diep; Do Thi Thu Trang; Nguyen Thi Hai Nang; Ngo Thanh Huyen; Minh-Tien Nguyen; Van-Hau Nguyen

Bui Duc Tho Hung Yen University of Technology and Education
Nguyen Hoang Diep Hung Yen University of Technology and Education
Do Thi Thu Trang Hung Yen University of Technology and Education
Nguyen Thi Hai Nang Hung Yen University of Technology and Education
Ngo Thanh Huyen Hung Yen University of Technology and Education
Minh-Tien Nguyen Hung Yen University of Technology and Education
Van-Hau Nguyen Hung Yen University of Technology and Education

Abstract

Automatic text classification is one of the most interesting tasks in data mining. This task has to deal with a huge amount of data. Many studies have addressed English; however, work on Vietnamese is still at an early stage. This paper investigates several text classification methods, namely Support Vector Machine, Naive Bayes, K-Nearest Neighbors, Multi-layer Perceptron, Decision Tree, and Random Forest, all using TF-IDF features. Experiments on Vietnamese datasets show that Support Vector Machine and Multi-layer Perceptron perform better than the other methods.
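
To make the TF-IDF plus classifier setup concrete, the following is a minimal sketch of one of the listed methods, K-Nearest Neighbors over TF-IDF vectors, written in plain Python. It is an illustration only and not the authors' implementation: the toy corpus, the labels, the helper names (`tfidf_vectors`, `weigh`, `cosine`, `knn_predict`), the smoothed IDF variant, the whitespace tokenization (which splits Vietnamese into syllables rather than words), and the choice of k = 3 are all assumptions.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """TF-IDF weights for one token list, L2-normalised; unseen terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(docs):
    """Fit IDF on docs and return (vectors, idf); each vector is a dict term -> weight."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, one common variant (assumption, not taken from the paper)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [weigh(tokens, idf) for tokens in tokenized], idf


def cosine(a, b):
    """Dot product of two sparse unit-length vectors, which equals their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(text, train_vectors, train_labels, idf, k=3):
    """Majority vote among the k training documents most similar to text."""
    query = weigh(text.lower().split(), idf)
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only
train_docs = [
    "đội bóng thắng trận đấu",
    "cầu thủ ghi bàn trận đấu",
    "huấn luyện viên đội bóng",
    "ngân hàng tăng lãi suất",
    "thị trường chứng khoán giảm",
    "lãi suất ngân hàng giảm",
]
train_labels = ["sport", "sport", "sport", "economy", "economy", "economy"]

train_vectors, idf = tfidf_vectors(train_docs)
print(knn_predict("đội bóng ghi bàn", train_vectors, train_labels, idf))
print(knn_predict("lãi suất ngân hàng", train_vectors, train_labels, idf))
```

In practice, a word segmenter for Vietnamese and a library implementation of TF-IDF and the classifiers would likely replace these hand-written helpers; the sketch only shows how documents become weighted vectors that a classifier can compare.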
