Intro to Apache Spark
July 16, 2015
There is a lot of discussion these days about disk becoming outdated as the prime storage medium for high-performance computing and the need to shift to memory. This is especially true for analytical use cases, where numerous in-memory solutions already exist for databases and even computing appliances. The latest entrant in this space is Apache Spark, an open-source framework for in-memory processing. Like Hadoop, Spark is a cluster computing system, but it caches data in memory rather than on disk (see this link to learn more about Spark) and offers better support for programming models other than MapReduce.
Spark should not be thought of as a replacement for Hadoop but rather as a complementary solution. An enterprise may use a Hadoop environment as a central place to store all types of data (a data lake) before that data is sent to other environments, such as ETL staging areas for formalized reporting or Spark for ad-hoc analysis. Thanks to its in-memory nature, Spark is an ideal choice for quick, iterative analysis, with reported performance up to 100x faster than Hadoop MapReduce (and up to 10x faster even when running on disk).
As companies adopt this hybrid mindset, initial Spark use cases will likely follow a cloud-bursting model, where an existing environment pushes certain data to an off-premises Spark-as-a-service offering. Once the processing finishes, the results can be pushed back to the main environment and the Spark instance deprovisioned. Given the currently higher price of memory, this pay-per-use model is a good fit for what is likely to be specialized, short batches of analysis.
As a note, Spark as a service will be offered on IBM Bluemix, and prospective users can sign up to be among the first to use it at the following link: http://www.spark.tc/beta/.