STREAMLINING BIG DATA PROCESSING PIPELINES VIA UNIX MEMORY TOOLS, PERSISTENT SPARK DATASETS, AND THE APACHE IGNITE IN-MEMORY FILE SYSTEM

dc.contributor.author: Blair, Walter
dc.date.accessioned: 2018-08-21T17:14:46Z
dc.date.available: 2018-08-21T17:14:46Z
dc.date.updated: 2018-08-21T17:14:46Z
dc.description.abstract: Modern big data processing pipelines speed up individual applications by executing them in extremely fast in-memory frameworks such as Apache Spark that run atop distributed disk systems such as Apache Hadoop. Many big data problems are solved by pipelines composed of many applications, and the conventional use of hard-disk I/O to link one application to the next presents a significant performance bottleneck. Processing pipelines composed of conventional legacy applications as well as modern in-memory applications can be executed with a memory-only approach using a combination of Apache Spark, Apache Ignite’s persistent cache and in-memory file system, and Unix memory mechanisms such as shared memory and named pipes. We compared the performance of the conventional disk-I/O approach to our memory-only approach in a short Spark-based pipeline as well as in the conventional Tuxedo workflow for RNA-seq processing. Our results demonstrate the benefits of reusing existing legacy applications and of adopting memory-only alternatives to disk in the development of state-of-the-art high-performance processing pipelines. Future work will extend the current approach across several big data processing use cases. (Illustrative sketches of the named-pipe and Spark persistence techniques follow this record.)
dc.identifier.uri: http://hdl.handle.net/123456789/3709
dc.language.rfc3066: en
dc.title: STREAMLINING BIG DATA PROCESSING PIPELINES VIA UNIX MEMORY TOOLS, PERSISTENT SPARK DATASETS, AND THE APACHE IGNITE IN-MEMORY FILE SYSTEM
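
As a hedged illustration of the memory-only linking the abstract describes: the sketch below uses a Unix named pipe (FIFO) so one pipeline stage streams its output directly into the next through kernel memory rather than through an intermediate disk file. The tool names legacy_aligner and legacy_counter and their flags are hypothetical stand-ins, not the thesis's actual pipeline; this shows only the general technique.

    import os
    import subprocess
    import tempfile

    # Create a named pipe (FIFO); bytes written to it are buffered in
    # kernel memory and are never persisted to disk.
    fifo_dir = tempfile.mkdtemp()
    fifo_path = os.path.join(fifo_dir, "stage1_to_stage2.fifo")
    os.mkfifo(fifo_path)

    # Hypothetical legacy tools: the producer writes its result to the
    # FIFO path and the consumer reads from it as an ordinary file, so
    # the two stages run concurrently with no intermediate disk file.
    producer = subprocess.Popen(["legacy_aligner", "--output", fifo_path])
    consumer = subprocess.Popen(["legacy_counter", "--input", fifo_path])

    producer.wait()
    consumer.wait()
    os.remove(fifo_path)

Similarly, a minimal PySpark sketch of keeping an intermediate dataset resident in memory between stages, in the spirit of the persistent Spark datasets the abstract mentions; the input path and column name are illustrative assumptions:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("memory-only-pipeline").getOrCreate()

    # Illustrative input path and schema.
    reads = spark.read.parquet("hdfs:///data/reads.parquet")
    filtered = reads.filter(reads["quality"] >= 30)

    # Pin the intermediate result in executor memory so downstream
    # stages reuse it without re-reading from disk.
    filtered.persist(StorageLevel.MEMORY_ONLY)
    filtered.count()  # materialize the cached dataset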