STREAMLINING BIG DATA PROCESSING PIPELINES VIA UNIX MEMORY TOOLS, PERSISTENT SPARK DATASETS, AND THE APACHE IGNITE IN-MEMORY FILE SYSTEM

dc.contributor.author: Blair, Walter
dc.date.accessioned: 2018-08-21T17:14:46Z
dc.date.available: 2018-08-21T17:14:46Z
dc.date.updated: 2018-08-21T17:14:46Z
dc.description.abstract: Modern big data processing pipelines speed up individual applications by executing them in extremely fast in-memory frameworks such as Apache Spark that run atop distributed disk systems such as Apache Hadoop. Many big data problems are solved by pipelines composed of many applications, and the conventional use of hard-disk I/O to link one application to the next presents a significant performance bottleneck. Processing pipelines composed of conventional legacy applications as well as modern in-memory applications can be executed with a memory-only approach using a combination of Apache Spark, Apache Ignite’s persistent cache and in-memory file system, and Unix memory mechanisms such as shared memory and named pipes. We compared the performance of the conventional disk-I/O approach to our memory-only approach in a short Spark-based pipeline as well as in the conventional Tuxedo workflow for RNA-seq processing. Our results demonstrate the benefits of reusing existing legacy applications and of adopting memory-only alternatives to disk in the development of state-of-the-art high-performance processing pipelines. Future work will extend the current approach across several big data processing use cases. (Illustrative sketches of the named-pipe and Spark persistence techniques follow this record.)
dc.identifier.uri: http://hdl.handle.net/123456789/3709
dc.language.rfc3066: en
dc.title: STREAMLINING BIG DATA PROCESSING PIPELINES VIA UNIX MEMORY TOOLS, PERSISTENT SPARK DATASETS, AND THE APACHE IGNITE IN-MEMORY FILE SYSTEM
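
As a hedged illustration of the memory-only linking the abstract describes: the sketch below uses a Unix named pipe (FIFO) so one pipeline stage streams its output directly into the next through kernel memory rather than through an intermediate disk file. The tool names legacy_aligner and legacy_counter and their flags are hypothetical stand-ins, not the thesis's actual pipeline; this shows only the general technique.

    import os
    import subprocess
    import tempfile

    # Create a named pipe (FIFO); bytes written to it are buffered in
    # kernel memory and are never persisted to disk.
    fifo_dir = tempfile.mkdtemp()
    fifo_path = os.path.join(fifo_dir, "stage1_to_stage2.fifo")
    os.mkfifo(fifo_path)

    # Hypothetical legacy tools: the producer writes its result to the
    # FIFO path and the consumer reads from it as an ordinary file, so
    # the two stages run concurrently with no intermediate disk file.
    producer = subprocess.Popen(["legacy_aligner", "--output", fifo_path])
    consumer = subprocess.Popen(["legacy_counter", "--input", fifo_path])

    producer.wait()
    consumer.wait()
    os.remove(fifo_path)

Similarly, a minimal PySpark sketch of keeping an intermediate dataset resident in memory between stages, in the spirit of the persistent Spark datasets the abstract mentions; the input path and column name are illustrative assumptions:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("memory-only-pipeline").getOrCreate()

    # Illustrative input path and schema.
    reads = spark.read.parquet("hdfs:///data/reads.parquet")
    filtered = reads.filter(reads["quality"] >= 30)

    # Pin the intermediate result in executor memory so downstream
    # stages reuse it without re-reading from disk.
    filtered.persist(StorageLevel.MEMORY_ONLY)
    filtered.count()  # materialize the cached dataset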