Apache Spark – my next programming conquest
November 4, 2015
I have been a developer (at least at heart) all of my professional career. I have always found ways to keep my hands dirty in some type of coding effort. However, the older you get, the more removed you become. Development is a young man’s game. Still, I think I have found my next programming playground.
If you haven’t noticed by now, Big Data is kind of a big deal. When you hear things like “each day we produce 2.5 quintillion bytes of data” and “90% of the world’s data has been created in the last 2 years,” it doesn’t take a genius to hypothesize that there might be some hidden value in all that data. It appears that the storage industry has no problem keeping up with this demand, and networking is also getting better and better (I thought physics was involved, but apparently we keep finding ways to get more through the same pipe).
So the problem to solve falls to the foot soldiers of every technical problem: the developers. And the default landscape that all developers maneuver and work in is open source. The Apache Spark project is another great example of the open source ecosystem gaining unfathomably quick traction in solving a problem.
I got to go to the IBM Insight conference last week and I took that opportunity to learn some things (between my booth duty stints). There were many Spark sessions to attend, but most were full. They turned away many people at many Spark-based sessions. And being an IBMer, I got sent to the end of the line for any walk-up lab spots. However, I was able to attend a few sessions and learn some good things. I am beginning my self-education by taking some Big Data University courses online (check them out here).
I have spent a lot of my time in the developer community not only pushing out code, but also constantly tweaking and consulting on the interactions between developers and the rest of the larger team. Spark brings a new interesting dimension to this dynamic. As we talked about before, the reason we are here is that there is lots of value in all that data. So Spark was created as a programming model to make it simple to carry out highly compute-intensive data manipulation. The difference here is that someone typically asks a question that potentially has a simple answer, but getting it goes beyond the typical programming models/platforms that exist today.
Let’s examine some of the differences that this problem set brings to the party.
- The user interface is not important. We have spent so much effort on UI design, frameworks, etc. due to the boom of the mobile device. Application look-and-feel and user experience are so important in the mobile space due to the intense competition between vendors. In the big data space, the answer is really the only thing we care about.
- The requirements are simple and the results are typically simple; getting there is the hard part. The vast majority of the work done by Spark applications is the chunking of data. A very simple application (very few lines) can perform massive amounts of processing. The Spark platform is the ultimate effort in pushing all of the complexity below the development experience.
- Data scientists are the big data analysts. The data scientist sits between the line of business and the developer. Data scientists know the data that is being captured and help translate the question being asked to the Spark developer. As a matter of fact, with tools like the Data Scientist Workbench, we are providing a platform for data scientists to learn enough about Spark to do the work themselves.
- The art is in understanding how Spark works and programming the effort accordingly. Understanding how Spark divides up the work, when and where to store intermediate data (if at all), and tuning the program accordingly is where a Spark developer brings the value. Programming a web application can be done with a single-user mentality in mind. Scaling the application can be tackled at other levels of the architecture. As I said before, the requirements for Big Data apps are typically simple and the answer is typically also simple, but the time it takes for the application to get to that answer is all that counts.
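To make the "few lines, lots of chunking" point a little more concrete, here is a minimal sketch in plain Python of the split, map, and reduce pattern described in the bullets above. This is not Spark code and it uses no Spark API. The helper names (`split_into_partitions`, `count_words`, `merge_counts`, `word_count`) and the sample lines are made up for illustration, and everything runs in a single process.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Divide the records into chunks (hypothetical helper, not a Spark API)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Map step: count the words found in a single partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines, num_partitions=4):
    partitions = split_into_partitions(lines, num_partitions)
    # Each partition is processed independently, so on a cluster
    # this step could be spread across many workers.
    partial_counts = [count_words(p) for p in partitions]
    return reduce(merge_counts, partial_counts, Counter())


# Illustrative sample data, not from any real dataset.
lines = [
    "big data is a big deal",
    "spark makes big data simple",
]
result = word_count(lines, num_partitions=2)
print(result.most_common(2))  # [('big', 3), ('data', 2)]
```

The question ("which words appear most?") and the answer are both simple. The work is in dividing the data and combining the partial results, and the number of partitions changes how the work is spread without changing the answer.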
As I explore Spark more, I will keep you posted along the way. Let me know your thoughts.