Programming with Apache Spark and Cassandra (draft)
This section puts together the knowledge gained so far as a set of frequently asked questions: questions that many may ask, and that we have asked ourselves.
1.1 What is the need for using Spark?
Spark gives you horizontal scalability in a programmer-friendly way.
1.2 But what about other options?
There are other options as well. The table below lists them and highlights Spark's place in the architecture.
| Type | Level of Granularity | Description |
| --- | --- | --- |
| Load balancer (nginx, HAProxy) | Request level (usually HTTP requests) | Works well for request-response client-server protocols, and also in the context of microservices on the application side. However, it is inadequate for scaling the processing inside the application programs. |
| Task managers (Celery, other MQ-based) | Task level | Help to scale processing in the application program and take care of task handling. However, the onus is on the developer to split the application logic into independent tasks, and usually only the simplest things are really split into tasks. Combining the outputs is an equally hard problem. |
| Cluster computing (Apache Spark, Hadoop) | Application level, function level | Helps to scale processing in the application layer across nodes and takes care of all the above. The onus is still on the developer to use it properly; however, if the few core APIs (map, foreach, reduce and groupBy/partitionBy) are used, the program can be written as if it were running in a single node, in a single thread (a short sketch follows the table). The system manages shared RAM across multiple nodes, shared cores, task scheduling, multithreading etc. P.S. Spark has an extensive library for machine learning as well, which could be the gateway for the future. |
| Multithreading | Function level | Helps to scale the processing across the cores inside a single node. Usually has to be done with care to avoid the threading-related problems that many programmers are unaware of. |
| Green threads | Function level/stack level | Example: greenlets in Python. Good for switching stacks in IO-bound applications, e.g. a socket server. Not really parallel, but the wait time in one stack frame can be used by other stacks waiting to execute. Rather too specific for general-purpose usage. |
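To make that "single node, single thread" style concrete, here is a minimal PySpark sketch. It is only an illustration, assuming a local Spark installation; the input file name events.txt and the word-count logic are made up. The flatMap/map/reduceByKey calls read like ordinary sequential code, while Spark distributes them across the cluster.

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

lines = sc.textFile("events.txt")                 # distributed dataset (RDD)
words = lines.flatMap(lambda line: line.split())  # runs on many cores/nodes
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)    # Spark handles the shuffle

for word, n in counts.take(10):                   # bring back a small sample only
    print(word, n)

sc.stop()
```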
How stable are Apache Spark and Apache Cassandra?
Speaking from our limited experience running the prototype, all of the Spark and Cassandra JVMs survived 20 days of load runs, network problems and application exceptions that we threw at them, and that too in a low-end cloud lab. They look to be well-written systems.
1.5 What is the most important thing to take care of when using Apache Cassandra?
Data modelling, and connected to that, the primary key and partition key design. It is important to design your primary key and partition key so that writes are distributed and reads are fast. This is explained well by a Cassandra expert here -> http://www.planetcassandra.org/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key/
The hash of the partition key is used by Cassandra to identify the node on which a row is stored, so choosing a partition key that distributes the load equally among the nodes prevents write hotspots. An example can be seen on the performance run page.
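As a hedged illustration of such a design, here is a sketch using the DataStax Python driver; the keyspace, table and column names are made up. The composite partition key (sensor_id, day) spreads the writes of even a single busy sensor across nodes day by day, while the clustering column event_time keeps rows within a partition ordered for fast reads.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # address of any node in the cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# ((sensor_id, day)) is the partition key: its hash picks the storage node,
# so readings are spread by sensor and day and no single node becomes a
# write hotspot. event_time is a clustering column for ordered reads.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id  text,
        day        text,
        event_time timestamp,
        value      double,
        PRIMARY KEY ((sensor_id, day), event_time)
    )
""")
```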
P.S. There are a few simple but important things as well, like writing the commit log and the data (SSTables) to different disk partitions. This link gives basic info about the write path.
1.6 What is the most important thing to take care of when using Apache Spark?
We have not come across a single most important thing as such, but here are a couple of pointers; a short sketch for each follows the list.
1. Avoid doing any major work in the Spark driver. rdd.collect(), or the slightly better rdd.toLocalIterator(), are not good ideas and do not scale; you soon get an OOM error (first sketch below).
2. There is no way to share state, like counters, between the driver and the workers, though in the code it may seem so. The only way is via accumulators, and even those the workers cannot read (second sketch below).
3. The way you partition the RDD may be important for performance, especially for operations like groupBy; we need to test and understand this better (third sketch below).
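First sketch, for pointer 1: the driver pitfall. This is illustrative PySpark; the file and directory names are made up.

```python
from pyspark import SparkContext

sc = SparkContext(appName="driver-pitfall-sketch")
rows = sc.textFile("big-input.txt")

# Bad: materialises the whole RDD inside the driver JVM; OOM at scale.
# all_rows = rows.collect()

# Slightly better, but still streams every element through the driver:
# for row in rows.toLocalIterator(): ...

# Better: keep the work on the executors and write straight to storage.
rows.filter(lambda r: r.strip()).saveAsTextFile("output-dir")

sc.stop()
```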
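Second sketch, for pointer 2: accumulators. A plain Python counter silently fails because each worker increments its own serialized copy; an accumulator works, but workers may only add to it, never read it.

```python
from pyspark import SparkContext

sc = SparkContext(appName="accumulator-sketch")
rows = sc.textFile("big-input.txt")   # made-up input

bad = 0
def broken_count(line):
    global bad
    bad += 1                          # mutates a copy on the worker

rows.foreach(broken_count)
print(bad)                            # still 0 in the driver

blank_lines = sc.accumulator(0)       # created in the driver
def check(line):
    if not line.strip():
        blank_lines.add(1)            # workers can only add, never read

rows.foreach(check)
print("blank lines:", blank_lines.value)  # .value is readable only in the driver

sc.stop()
```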
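Third sketch, for pointer 3: partitioning. Pre-partitioning a pair RDD lets later keyed operations reuse the partitioner instead of re-shuffling; the key extraction and the partition count of 64 are assumptions to be tuned and measured on your own data.

```python
from pyspark import SparkContext

sc = SparkContext(appName="partitioning-sketch")
rows = sc.textFile("big-input.txt")                      # made-up input

pairs = rows.map(lambda line: (line.split(",")[0], 1))   # key on first field
partitioned = pairs.partitionBy(64).cache()              # 64 partitions: an assumption

# Operations keyed the same way now avoid a second full shuffle:
counts = partitioned.reduceByKey(lambda a, b: a + b)
groups = partitioned.groupByKey()                        # also reuses the partitioner

print(counts.take(5))
sc.stop()
```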