Programming with Apache Spark and Cassandra (draft)
This section puts together the knowledge gained so far as a set of frequently asked questions: questions that many may ask, and that we have asked ourselves.
1.1 What is the need for using Spark?
Spark gives you horizontal scalability in a programmer-friendly way.
1.2 But what about other options?
There are other options as well. The table below lists them and highlights Spark's place in the architecture.
| Type | Level of Granularity | Description |
| --- | --- | --- |
| Load balancer (nginx, HAProxy) | Request level (usually HTTP requests) | Works well for request-response client-server protocols, and also in the context of microservices on the application side. However, it is inadequate for scaling the processing inside the application programs. |
| Task managers (Celery, other MQ-based) | Task level | Help to scale processing in the application program and take care of task handling. However, the onus is on the developer to split the application logic into independent tasks, and usually only the simplest things are really split into tasks. Combining the outputs is an equally hard problem. |
| Cluster computing (Apache Spark, Hadoop) | Application level, function level | Helps to scale processing in the application layer across nodes and takes care of all the above. The onus is still on the developer to use it properly; however, if the few core APIs (map, foreach, reduce and groupBy/partitionBy) are used, the program can be written as if it were running in a single node, in a single thread (a short sketch follows the table). The system manages shared RAM across multiple nodes, shared cores, task scheduling, multithreading etc. P.S. Spark has an extensive library for machine learning as well, which could be the gateway for the future. |
| Multithreading | Function level | Helps to scale the processing across the cores inside a single node. Usually has to be done with care to avoid the threading-related problems that many programmers are unaware of. |
| Green threads | Function level/stack level | Example: greenlets in Python. Good for switching stacks in IO-bound applications, e.g. a socket server. Not really parallel, but the wait time in one stack frame can be used by other stacks waiting to execute. Rather too specific for general-purpose usage. |
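To make that "single node, single thread" style concrete, here is a minimal PySpark sketch. It is only an illustration, assuming a local Spark installation; the input file name events.txt and the word-count logic are made up. The flatMap/map/reduceByKey calls read like ordinary sequential code, while Spark distributes them across the cluster.

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

lines = sc.textFile("events.txt")                 # distributed dataset (RDD)
words = lines.flatMap(lambda line: line.split())  # runs on many cores/nodes
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)    # Spark handles the shuffle

for word, n in counts.take(10):                   # bring back a small sample only
    print(word, n)

sc.stop()
```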
How stable are Apache Spark and Apache Cassandra?
Speaking from our limited experience running the prototype, all of the Spark and Cassandra JVMs survived 20 days of load runs, network problems and application exceptions that we threw at them, and that too in a low-end cloud lab. They look to be well-written systems.
1.5 What is the most important thing to take care of when using Apache Cassandra?
Data modelling, and connected to that, the primary key and partition key design. It is important to design your primary key and partition key so that writes are distributed and reads are fast. This is explained well by a Cassandra expert here -> http://www.planetcassandra.org/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key/
The hash of the partition key is used by Cassandra to identify the node on which a row is stored, so choosing a partition key that distributes the load equally among the nodes prevents write hotspots. An example can be seen on the performance run page.
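As a hedged illustration of such a design, here is a sketch using the DataStax Python driver; the keyspace, table and column names are made up. The composite partition key (sensor_id, day) spreads the writes of even a single busy sensor across nodes day by day, while the clustering column event_time keeps rows within a partition ordered for fast reads.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # address of any node in the cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# ((sensor_id, day)) is the partition key: its hash picks the storage node,
# so readings are spread by sensor and day and no single node becomes a
# write hotspot. event_time is a clustering column for ordered reads.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id  text,
        day        text,
        event_time timestamp,
        value      double,
        PRIMARY KEY ((sensor_id, day), event_time)
    )
""")
```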
P.S. There are a few simple but important things as well, like writing the commit log and the data (SSTables) to different disk partitions. This link gives basic info about the write path.
1.6 What is the most important thing to take care of when using Apache Spark?
We have not come across a single most important thing as such, but here are a couple of pointers; a short sketch for each follows the list.
1. Avoid doing any major work in the Spark driver. rdd.collect(), or the slightly better rdd.toLocalIterator(), are not good ideas and do not scale; you soon get an OOM error (first sketch below).
2. There is no way to share state, like counters, between the driver and the workers, though in the code it may seem so. The only way is via accumulators, and even those the workers cannot read (second sketch below).
3. The way you partition the RDD may be important for performance, especially for operations like groupBy; we need to test and understand this better (third sketch below).
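First sketch, for pointer 1: the driver pitfall. This is illustrative PySpark; the file and directory names are made up.

```python
from pyspark import SparkContext

sc = SparkContext(appName="driver-pitfall-sketch")
rows = sc.textFile("big-input.txt")

# Bad: materialises the whole RDD inside the driver JVM; OOM at scale.
# all_rows = rows.collect()

# Slightly better, but still streams every element through the driver:
# for row in rows.toLocalIterator(): ...

# Better: keep the work on the executors and write straight to storage.
rows.filter(lambda r: r.strip()).saveAsTextFile("output-dir")

sc.stop()
```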
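Second sketch, for pointer 2: accumulators. A plain Python counter silently fails because each worker increments its own serialized copy; an accumulator works, but workers may only add to it, never read it.

```python
from pyspark import SparkContext

sc = SparkContext(appName="accumulator-sketch")
rows = sc.textFile("big-input.txt")   # made-up input

bad = 0
def broken_count(line):
    global bad
    bad += 1                          # mutates a copy on the worker

rows.foreach(broken_count)
print(bad)                            # still 0 in the driver

blank_lines = sc.accumulator(0)       # created in the driver
def check(line):
    if not line.strip():
        blank_lines.add(1)            # workers can only add, never read

rows.foreach(check)
print("blank lines:", blank_lines.value)  # .value is readable only in the driver

sc.stop()
```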
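Third sketch, for pointer 3: partitioning. Pre-partitioning a pair RDD lets later keyed operations reuse the partitioner instead of re-shuffling; the key extraction and the partition count of 64 are assumptions to be tuned and measured on your own data.

```python
from pyspark import SparkContext

sc = SparkContext(appName="partitioning-sketch")
rows = sc.textFile("big-input.txt")                      # made-up input

pairs = rows.map(lambda line: (line.split(",")[0], 1))   # key on first field
partitioned = pairs.partitionBy(64).cache()              # 64 partitions: an assumption

# Operations keyed the same way now avoid a second full shuffle:
counts = partitioned.reduceByKey(lambda a, b: a + b)
groups = partitioned.groupByKey()                        # also reuses the partitioner

print(counts.take(5))
sc.stop()
```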