USING BERT FOR EXTRACTIVE TEXT SUMMARIZATION

Do Thi Thu Trang; Trinh Thi Nhi; Ngo Thanh Huyen

Do Thi Thu Trang, Hung Yen University of Technology and Education
Trinh Thi Nhi, Hung Yen University of Technology and Education
Ngo Thanh Huyen, Hung Yen University of Technology and Education

Keywords: Text summarization, language processing, machine learning, deep learning, unsupervised learning

Abstract

This paper presents a method for extractive text summarization using BERT. We formulate extractive summarization as a sentence-level binary classification problem: each sentence is represented as a feature vector produced by BERT and is then classified, and the sentences judged important are selected to form the summary. We evaluate the method on three datasets in two languages (English and Vietnamese). Experimental results show that the method performs well compared with other models.
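The pipeline described in the abstract (encode each sentence, classify it as summary or non-summary, keep the selected sentences) can be illustrated with a minimal sketch. This is not the authors' implementation. The function `toy_embed` is a hypothetical stand-in for a BERT sentence encoder, used only so the example runs without a pretrained model. The logistic-regression classifier, the names `train_classifier` and `summarize`, and the top-k selection rule are also assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_embed(sentence: str, dim: int = 16) -> Vector:
    """Hypothetical stand-in for a BERT encoder: a normalized hashed bag of words."""
    vec = [0.0] * dim
    for word in sentence.lower().split():
        # Deterministic bucket from character codes (Python's hash() is salted).
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_classifier(
    features: Sequence[Vector],
    labels: Sequence[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[Vector, float]:
    """Fit a logistic-regression sentence classifier with plain SGD.

    labels[i] is 1 if sentence i belongs in the summary, else 0.
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def summarize(
    sentences: Sequence[str],
    weights: Vector,
    bias: float,
    embed: Callable[[str], Vector] = toy_embed,
    k: int = 2,
) -> List[str]:
    """Score every sentence and return the k highest-scoring ones in document order."""
    scores = [
        sigmoid(sum(w * xi for w, xi in zip(weights, embed(s))) + bias)
        for s in sentences
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Toy usage: labels are invented for illustration only.
train_sents = ["a b c", "d e f", "g h i"]
train_labels = [1, 0, 0]
w, b = train_classifier([toy_embed(s) for s in train_sents], train_labels)
print(summarize(train_sents, w, b, k=1))
```

In a realistic setting, `toy_embed` would be replaced by vectors taken from a pretrained BERT model, and the labels would come from an annotated summarization dataset; the selection step could also use a probability threshold instead of a fixed k.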
