STREAMLINING BIG DATA PROCESSING PIPELINES VIA UNIX MEMORY TOOLS, PERSISTENT SPARK DATASETS, AND THE APACHE IGNITE IN-MEMORY FILE SYSTEM

Authors
Blair, Walter
Abstract
Modern big data processing pipelines speed up individual applications by executing them in fast in-memory frameworks like Apache Spark that run atop distributed disk systems like Apache Hadoop. Many big-data problems are solved by pipelines composed of many applications, and the conventional use of hard-disk I/O to link one application to the next presents a significant performance bottleneck. Processing pipelines composed of conventional legacy applications as well as modern in-memory applications can be executed with a memory-only approach using a combination of Apache Spark, Apache Ignite's persistent cache and in-memory file system, and Unix tools such as shared memory and named pipes. We compared the performance of the conventional disk-I/O approach to our memory-only approach in a short Spark-based pipeline as well as in the conventional Tuxedo workflow for RNA-seq processing. Our results demonstrate the benefits of reusing existing legacy applications and of adopting memory-only alternatives to disk in the development of state-of-the-art high-performance processing pipelines. Future work will extend the current approach across several big-data processing use cases.
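The abstract's core idea of linking one application to the next through a Unix named pipe rather than an intermediate disk file can be sketched minimally. The stage names, FIFO path, and record format below are illustrative assumptions, not taken from the thesis:

```python
import os
import tempfile
import threading

def producer(fifo_path):
    # Hypothetical stage 1: stream records into the named pipe.
    # open() for writing blocks until a reader opens the other end.
    with open(fifo_path, "w") as f:
        for i in range(5):
            f.write(f"record-{i}\n")

def consumer(fifo_path, results):
    # Hypothetical stage 2: consume records as they arrive, without
    # the intermediate data ever being written to a regular disk file.
    with open(fifo_path) as f:
        results.extend(line.strip() for line in f)

tmpdir = tempfile.mkdtemp()
fifo = os.path.join(tmpdir, "stage1_to_stage2.fifo")
os.mkfifo(fifo)  # create the named pipe (POSIX systems)

out = []
writer = threading.Thread(target=producer, args=(fifo,))
writer.start()
consumer(fifo, out)   # reads until the producer closes its end
writer.join()
os.remove(fifo)
```

Because both legacy stages see an ordinary file path, neither needs modification; the kernel streams the data between them in memory, which is the property the memory-only approach exploits.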