STREAMLINING BIG DATA PROCESSING PIPELINES VIA UNIX MEMORY TOOLS, PERSISTENT SPARK DATASETS, AND THE APACHE IGNITE IN-MEMORY FILE SYSTEM
dc.contributor.author | Blair, Walter | |
dc.date.accessioned | 2018-08-21T17:14:46Z | |
dc.date.available | 2018-08-21T17:14:46Z | |
dc.date.updated | 2018-08-21T17:14:46Z | |
dc.description.abstract | Modern big data processing pipelines speed up individual applications by executing them in extremely fast in-memory frameworks like Apache Spark that run atop distributed disk systems like Apache Hadoop. Many big data problems are solved by pipelines composed of many applications, and the conventional use of hard disk I/O to link one application to the next presents a significant performance bottleneck. Processing pipelines composed of conventional legacy applications as well as modern in-memory applications can be executed with a memory-only approach using a combination of Apache Spark, Apache Ignite’s persistent cache and in-memory file system, and Unix tools like shared memory and named pipes. We compared the performance of the conventional disk-I/O approach to our memory-only approach in a short Spark-based pipeline as well as in the conventional Tuxedo workflow for RNA-seq processing. Our results demonstrate the benefits of reusing existing legacy applications as well as of adopting memory-only alternatives to disk in the development of state-of-the-art high-performance processing pipelines. Future work will extend the current approach across several big data processing use cases. | |
dc.identifier.uri | http://hdl.handle.net/123456789/3709 | |
dc.language.rfc3066 | en | |
dc.title | STREAMLINING BIG DATA PROCESSING PIPELINES VIA UNIX MEMORY TOOLS, PERSISTENT SPARK DATASETS, AND THE APACHE IGNITE IN-MEMORY FILE SYSTEM |
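The abstract describes linking legacy, file-oriented applications in memory rather than through hard disk I/O, using Unix named pipes. A minimal sketch of that technique follows; the two `tr`/`printf` stages here are hypothetical stand-ins for real pipeline tools (not the thesis's actual Tuxedo applications), and the paths are assumptions:

```shell
#!/bin/sh
# Sketch: connecting two pipeline stages through a named pipe (FIFO)
# instead of an intermediate disk file. Data moves through kernel
# buffers in memory; no intermediate file is ever written to disk.
set -e

# On most Linux systems /dev/shm is a RAM-backed tmpfs mount; /tmp is
# used here only so the sketch runs anywhere.
PIPE="${TMPDIR:-/tmp}/stage_link_$$.fifo"
mkfifo "$PIPE"

# Hypothetical "stage 1": emit a short nucleotide sequence into the
# pipe in the background (a writer blocks until a reader opens it).
printf 'ACGT\n' > "$PIPE" &

# Hypothetical "stage 2": read the pipe exactly as if it were an
# ordinary input file, complementing each base.
tr 'ACGT' 'TGCA' < "$PIPE"

wait
rm -f "$PIPE"
```

Because the consumer opens the FIFO with ordinary file semantics, neither legacy application needs modification; only the path it is handed changes, which is what makes the approach applicable to unmodified tools in the Tuxedo workflow.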