VIETNAMESE TEXT SUMMARIZATION BASED ON BERT METHODS
Abstract
This paper presents text summarization in two directions, extractive and abstractive, using a pre-trained language model. For the extractive task, we use the BERTSum model, which employs BERT (Bidirectional Encoder Representations from Transformers) to encode the input sentences and an LSTM (Long Short-Term Memory network) to represent the relationships between sentences. For the abstractive task, we use BERT to encode the semantics of the input text and generate a suitable summary from that representation. We tested the methods on the Vietnamese dataset shared by VNDS (A Vietnamese Dataset for Summarization) [19] and evaluated them with ROUGE (Recall-Oriented Understudy for Gisting Evaluation). The experimental results show that, of the two tasks, BERT is more effective for abstractive summarization.
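To make the extractive pipeline concrete, the following is a minimal sketch, not the authors' implementation: it assumes PyTorch, the Hugging Face transformers library, and the bert-base-multilingual-cased checkpoint (the paper does not name the checkpoint or hyperparameters it uses). BERT encodes each sentence, an LSTM runs over the sentence embeddings to model the relationships between sentences, and a sigmoid head scores each sentence for inclusion in the extract.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ExtractiveScorer(nn.Module):
    # BERTSum-style scorer: BERT sentence embeddings -> LSTM -> sigmoid score.
    # Checkpoint and hidden size are illustrative assumptions, not the paper's.
    def __init__(self, encoder="bert-base-multilingual-cased", hidden=256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder)
        self.bert = AutoModel.from_pretrained(encoder)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, sentences):
        # Encode each sentence; the [CLS] vector serves as the sentence embedding.
        batch = self.tokenizer(sentences, padding=True, truncation=True,
                               max_length=128, return_tensors="pt")
        cls = self.bert(**batch).last_hidden_state[:, 0, :]  # (n_sents, 768)
        # The LSTM over the sentence sequence captures inter-sentence context.
        ctx, _ = self.lstm(cls.unsqueeze(0))                  # (1, n_sents, 2*hidden)
        return torch.sigmoid(self.head(ctx)).squeeze()        # one score per sentence

scorer = ExtractiveScorer()
doc = ["Câu đầu tiên giới thiệu chủ đề.",
       "Câu thứ hai bổ sung chi tiết.",
       "Câu cuối cùng kết luận."]
with torch.no_grad():
    print(scorer(doc))  # after training, keep the top-k sentences as the extract

The ROUGE evaluation can likewise be sketched with the rouge-score package (pip install rouge-score); the ROUGE variants and preprocessing used in the paper are assumptions here, and the package's default tokenizer is ASCII-oriented, so Vietnamese text would need a custom tokenizer.

from rouge_score import rouge_scorer

# score(reference, candidate) returns precision/recall/F1 per ROUGE variant.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score("the cat sat on the mat",   # reference summary
                      "the cat was on the mat")   # system summary
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")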
References
[1] Nguyễn Nhật An, Nghiên cứu phát triển các kỹ thuật tự động tóm tắt văn bản tiếng Việt [Research and development of techniques for automatic Vietnamese text summarization], PhD dissertation, Viện Khoa học và Công nghệ Quân sự, 2015, pp. 8-23.
[2] Đoàn Xuân Dũng, Tóm tắt văn bản sử dụng các kỹ thuật trong Deep Learning [Text summarization using deep learning techniques], Master's thesis, Trường Đại học Công nghệ, Đại học Quốc gia Hà Nội, 2018, pp. 1-8.
[3] Nguyễn Viết Hạnh, Nghiên cứu tóm tắt văn bản tự động và ứng dụng [A study on automatic text summarization and its applications], Master's thesis, Trường Đại học Công nghệ, Đại học Quốc gia Hà Nội, 2018, pp. 12-16.
[4] Đỗ Thị Thu Trang, Trịnh Thị Nhị, Ngô Thanh Huyền, "Sử dụng BERT cho tóm tắt trích rút văn bản" [Using BERT for extractive text summarization], Tạp chí Khoa học và Công nghệ, Trường Đại học Sư phạm Kỹ thuật Hưng Yên, No. 26, June 2020, pp. 74-79.
[5] Lâm Quang Tường, Phạm Thế Phi, and Đỗ Đức Hào, "Tóm tắt văn bản tiếng Việt tự động với mô hình sequence-to-sequence" [Automatic Vietnamese text summarization with a sequence-to-sequence model], Tạp chí Khoa học Trường Đại học Cần Thơ, special issue on Information Technology, 2017, pp. 125-132.
[6] B. Boser, I. Guyon, and V. Vapnik, "A Training Algorithm for Optimal Margin Classifiers", Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992, pp. 144-152.
[7] C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, Vol. 2, No. 2, 1998, pp. 121-167.
[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", CoRR, abs/1409.0473, 2014. Retrieved from https://arxiv.org/abs/1409.0473
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv:1810.04805v2 [cs.CL], 2018.
[10] Adventures in Machine Learning, "A Word2Vec Keras Tutorial", 2017. Retrieved from https://adventuresinmachinelearning.com/word2vec-keras-tutorial/
[11] Kevin Knight and Daniel Marcu, "Statistics-Based Summarization - Step One: Sentence Compression", Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence (AAAI-2000), 2000, pp. 703-710.
[12] C.-Y. Lin and E. H. Hovy, "Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics", Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT), Volume 1, 2003, pp. 71-78.
[13] M.-T. Nguyen, H.-D. Nguyen, T.-H.-N. Nguyen, and V.-H. Nguyen, "Towards State-of-the-Art Baselines for Vietnamese Multi-Document Summarization", 10th International Conference on Knowledge and Systems Engineering (KSE), 2018, pp. 85-90.
[14] A. F. Martins and N. A. Smith, "Summarization with a Joint Model for Sentence Extraction and Compression", Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, Association for Computational Linguistics, 2009, pp. 1-9.
[15] Minh-Tien Nguyen, A Study on Web Document Summarization by Exploiting Its Social Context, Doctoral dissertation, School of Information Science, Japan Advanced Institute of Science and Technology, 2018.
[16] A. Nenkova and K. McKeown, "A Survey of Text Summarization Techniques", in Mining Text Data, Springer, 2012, pp. 43-76.
[17] Google AI Blog, "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing", 2018. URL: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
[18] Lisa F. Rau and Paul S. Jacobs, "Creating Segmented Databases from Free Text for Text Retrieval", SIGIR '91: Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 1991, pp. 337-346.
[19] Van-Hau Nguyen, Minh-Tien Nguyen, Thanh-Chinh Nguyen, and Xuan-Hoai Nguyen, "VNDS: A Vietnamese Dataset for Summarization", IEEE Conference Proceedings, 2019, pp. 375-380.
[20] William B. Dolan and Chris Brockett, "Automatically Constructing a Corpus of Sentential Paraphrases", Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.
[21] Xin Rong, "word2vec Parameter Learning Explained", arXiv:1411.2738v4, 2016.
[22] A. Nenkova, "Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference", AAAI, 2005, pp. 1436-1441.
[23] Y. Gong and X. Liu, "Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis", SIGIR, 2001, pp. 19-25.
[24] G. Erkan and D. R. Radev, "LexRank: Graph-Based Lexical Centrality as Salience in Text Summarization", Journal of Artificial Intelligence Research, 22, 2004, pp. 457-479.
[25] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Computer Networks and ISDN Systems, 30(1-7), 1998, pp. 107-117.
[26] S. Sripada and J. Jagarlamudi, "Summarization Approaches Based on Document Probability Distributions", PACLIC, 2009, pp. 521-529.
[27] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova, "Beyond SumBasic: Task-Focused Summarization with Sentence Simplification and Lexical Expansion", Information Processing & Management, 43(6), Elsevier, 2007, pp. 1606-1618.
[28] C. Cortes and V. Vapnik, "Support-Vector Networks", Machine Learning, 20(3), 1995, pp. 273-297.