PERFORMANCE ANALYSIS OF PARALLEL COMPUTING ARCHITECTURE USING HADOOP FRAMEWORK

Nguyen Minh Quy; Ho Khanh Lam; Nguyen Xuan Truong; Nguyen Dinh Han; Nguyen Van Hau; Do Anh Tuan; Pham Quoc Hung

Nguyen Minh Quy, Hung Yen University of Technology and Education
Ho Khanh Lam, Hung Yen University of Technology and Education
Nguyen Xuan Truong, Hung Yen University of Technology and Education
Nguyen Dinh Han, Hung Yen University of Technology and Education
Nguyen Van Hau, Hung Yen University of Technology and Education
Do Anh Tuan, Hung Yen University of Technology and Education
Pham Quoc Hung, Hung Yen University of Technology and Education

Keywords: Hadoop, HiveQL, parallel processing, packet flow analysis, performance analysis

Abstract

In the past few years, parallel computing architectures based on PC clusters running the Hadoop framework have received great attention, because they offer high computing power and are easy to expand, especially when handling extremely large data sets amounting to terabytes or even petabytes. However, the performance of this architecture depends on many factors, such as how the algorithms are implemented, how the HiveQL queries are optimized, and how the cluster is organized. This paper proposes solutions that use Hive on the Hadoop framework as a parallel computing architecture for processing network packet flows in order to detect the presence of viruses. It also analyzes and presents a number of solutions for improving the performance of the computing system.
