PERFORMANCE COMPARISON BETWEEN HIVE AND SPARK IN STRUCTURED BIG DATA ANALYSIS

Do Thi Dao; Nguyen Van Quyet

Do Thi Dao, Faculty of Information Technology, Hung Yen University of Technology and Education
Nguyen Van Quyet, Faculty of Information Technology, University of Transport and Communications

Keywords: Big Data, Performance Analysis Tools, Big Data Analytics

Abstract

With the explosion of IoT devices, the volume of data being generated has rendered traditional data analysis platforms inadequate. Big data analytics platforms have emerged in response, enabling organizations to uncover hidden value in their data and make business decisions quickly. Among these, Apache Hive and Apache Spark are two modern platforms commonly used for analyzing structured big data. Several studies have compared them and found that, in most cases, Spark SQL is faster than Hive on MapReduce. However, for some queries involving many join operations among large tables, Spark SQL is slower than Hive on MapReduce. Recently, Spark SQL has received numerous optimizations that could improve query performance, such as an upgraded query processing scheduler and optimizer. In this paper, we present a performance comparison between Hive on MapReduce and Spark SQL. To this end, we study how to implement a performance evaluation of big data queries using BigBench and the performance analysis tool PAT. We aggregate and analyze key performance metrics, including query execution time, disk usage, memory usage, CPU usage, and the amount of data read and written over the network. The experimental results show that Spark SQL outperforms Hive on MapReduce on all 20 BigBench query sets.
