Performance Evaluation of Apache Spark Vs MPI: A Practical Case Study on Twitter Sentiment Analysis

Research Article Open Access

Deepa S Kumar¹ and M Abdul Rahman²

¹ Karpagam University, India
² APJ Abdul Kalam Technological University, India

Abstract

The many processing frameworks that have emerged under big data technologies are a response to the tremendous size and complexity of modern datasets. Execution speed is much higher with High Performance Computing (HPC) frameworks than with big data processing frameworks. However, because the majority of big data jobs are data intensive rather than computation intensive, HPC paradigms have not been used in big data processing. This paper reviews two distributed and parallel computing frameworks: Apache Spark and MPI. Sentiment analysis of Twitter data is chosen as the test case application for benchmarking and is implemented in Scala for Spark and in C++ for MPI. Experiments were conducted on Google Cloud virtual machines with three dataset sizes (100 GB, 500 GB and 1 TB) to compare execution times. The results show that MPI outperforms Apache Spark in parallel and distributed cluster computing environments; the higher performance of MPI can therefore be exploited in big data applications to improve speedups.

Journal of Computer Science

Volume 13 No. 12, 2017, 781-794

DOI: https://doi.org/10.3844/jcssp.2017.781.794

Submitted On: 12 September 2017; Published On: 23 December 2017

How to Cite: Kumar, D. S. & Rahman, M. A. (2017). Performance Evaluation of Apache Spark Vs MPI: A Practical Case Study on Twitter Sentiment Analysis. Journal of Computer Science, 13(12), 781-794. https://doi.org/10.3844/jcssp.2017.781.794

Copyright: © 2017 Deepa S Kumar and M Abdul Rahman. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Big Data
High Performance Computing
Apache Spark
MPI
Sentiment Analysis
Scala Programming
Cluster Computing