STREAMLINING BIG DATA PROCESSING PIPELINES VIA UNIX MEMORY TOOLS, PERSISTENT SPARK DATASETS, AND THE APACHE IGNITE IN-MEMORY FILE SYSTEM

Blair, Walter

Abstract

Modern big data processing pipelines speed up individual applications by executing them in fast in-memory frameworks like Apache Spark that run atop distributed disk systems like Apache Hadoop. Many big-data problems, however, are solved by pipelines composed of many applications, and the conventional use of hard disk I/O to link one application to the next presents a significant performance bottleneck. Pipelines composed of conventional legacy applications as well as modern in-memory applications can instead be executed with a memory-only approach that combines Apache Spark, Apache Ignite’s persistent cache and in-memory file system, and Unix tools like shared memory and named pipes. We compared the performance of the conventional disk-I/O approach to that of our memory-only approach in a short Spark-based pipeline as well as in the conventional Tuxedo workflow for RNA-seq processing. Our results demonstrate the benefits of reusing existing legacy applications and of adopting memory-only alternatives to disk when developing state-of-the-art high-performance processing pipelines. Future work will extend the current approach across several big-data processing use cases.
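
To illustrate the named-pipe idea mentioned above, the following is a minimal sketch, not the implementation evaluated in this work. It assumes a POSIX system and uses a hypothetical two-stage pipeline: the function name `run_two_stage_pipeline`, the FIFO file name, and the uppercase-converting consumer (a stand-in for a legacy tool that expects a file path) are all illustrative assumptions. The point is only that the second stage opens what looks like an ordinary file while the data actually streams through kernel memory rather than being written to disk.

```python
import os
import subprocess
import sys
import tempfile


def run_two_stage_pipeline(records):
    """Stream records from a producer stage to a consumer stage through a FIFO.

    Hypothetical example: the consumer stands in for a legacy application
    that only knows how to read its input from a file path.
    """
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "stage1_to_stage2.fifo")
    os.mkfifo(fifo_path)  # POSIX only; creates a named pipe, not a regular file
    try:
        # Stage 2: a separate process that opens the FIFO as if it were a
        # normal input file and writes the uppercased contents to stdout.
        consumer = subprocess.Popen(
            [
                sys.executable,
                "-c",
                "import sys; sys.stdout.write(open(sys.argv[1]).read().upper())",
                fifo_path,
            ],
            stdout=subprocess.PIPE,
            text=True,
        )
        # Stage 1: opening the FIFO for writing blocks until the consumer
        # has opened it for reading, so the two stages run concurrently.
        with open(fifo_path, "w") as fifo:
            for record in records:
                fifo.write(record + "\n")
        # Closing the FIFO signals end-of-input to the consumer.
        output, _ = consumer.communicate()
    finally:
        os.remove(fifo_path)
        os.rmdir(workdir)
    return output.splitlines()


if __name__ == "__main__":
    print(run_two_stage_pipeline(["acgt", "ttga"]))
```

In this sketch the producer never materializes an intermediate file; a real pipeline would also need to handle a consumer that fails before opening the pipe, which this example does not attempt.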