Intro to Apache Spark

July 16, 2015

There is a lot of discussion these days about disk becoming outdated and the need to shift to memory as the primary storage mechanism for high-performance computing. This is especially true of analytical use cases, where there are already numerous in-memory solutions in the form of databases and even computing appliances. The latest entrant to this space is Apache Spark, an open source framework for in-memory processing. Like Hadoop, Spark is a cluster computing system, but it caches data in memory rather than on disk (see this link to learn more about Spark) and offers better support for programming models other than MapReduce.

Spark should not be thought of as a replacement for Hadoop but rather as a complementary solution. An enterprise may have a Hadoop environment designed as a central place to store all types of data (a data lake) before the data is sent to other environments, such as ETL staging areas for formalized reporting or Spark for ad hoc analysis. Because it works in memory, Spark is well suited to quick, iterative analysis: reported performance is up to 100x faster than Hadoop MapReduce in memory and about 10x faster even on disk.
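To make the caching idea concrete, here is a minimal sketch in plain Python of why keeping data in memory helps iterative analysis. It is not Spark code: the `DiskSource` class, the sample data, and the helper functions are all hypothetical stand-ins, and the "disk" is simulated by a counter. In Spark itself, the comparable step would be calling `cache()` on a dataset before querying it repeatedly.

```python
import csv
import io

# Hypothetical stand-in for a file on disk; the contents are made up for illustration.
RAW_DATA = "region,sales\neast,10\nwest,20\neast,5\n"


class DiskSource:
    """Simulates a disk-backed dataset and counts how often it is re-read."""

    def __init__(self, text):
        self._text = text
        self.reads = 0

    def load(self):
        # Every call stands in for a full scan from disk.
        self.reads += 1
        return list(csv.DictReader(io.StringIO(self._text)))


def total_sales(rows, region):
    # One "query" in an ad hoc analysis session.
    return sum(int(r["sales"]) for r in rows if r["region"] == region)


def analyze_without_cache(source, regions):
    # Each query goes back to "disk" and rescans the data.
    return {region: total_sales(source.load(), region) for region in regions}


def analyze_with_cache(source, regions):
    # Load once, keep the rows in memory, and reuse them for every query.
    rows = source.load()
    return {region: total_sales(rows, region) for region in regions}


uncached = DiskSource(RAW_DATA)
cached = DiskSource(RAW_DATA)
regions = ["east", "west"]

result_uncached = analyze_without_cache(uncached, regions)
result_cached = analyze_with_cache(cached, regions)

print(result_uncached, "disk reads:", uncached.reads)
print(result_cached, "disk reads:", cached.reads)
```

Both approaches return the same answers, but the uncached version rescans the source once per query while the cached version reads it only once. The more passes an analysis makes over the same data, the more that difference matters.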

As companies start to adopt this hybrid mindset, initial Spark use cases will likely rely on a cloud bursting model, in which an existing environment pushes certain data to an off-premises Spark-as-a-service offering. As soon as the processing is finished, the results can be pushed back to the main environment and the Spark instance can be deprovisioned. Given the currently higher price of memory, this pay-per-use model is a good fit for what are likely to be specialized, short batches of analysis.

As a note, Spark as a service will be offered on IBM Bluemix and prospective users can sign up to be among the first to use it at the following link:




You are currently reading Intro to Apache Spark at Thoughts and Insights from IBM Cloud Advisors.

