Sunday, February 05, 2017

Programming with Apache Spark and Cassandra -draft


This post puts together the knowledge gained so far, along with frequent questions that many may ask and that we have asked ourselves.
Spark gives you horizontal scalability in a programmer-friendly way.
There are other options as well; I have listed them below to describe and highlight Spark's place in the architecture.

Type: Load Balancer (nginx, haproxy)
Granularity: Request level (usually HTTP requests)
Description: Works well for request-response client-server protocols, and also in the context of microservices on the application side. However, it is inadequate for scaling the processing inside the application programs.

Type: Task Managers (Celery, other MQ-based)
Granularity: Task level
Description: Helps to scale processing in the application program and takes care of task handling. However, the onus is on the developer to split the application logic into independent tasks, and usually only the simplest things are really split into tasks. Combining the outputs is an equally hard problem.

Type: Cluster Computing (Apache Spark, Hadoop)
Granularity: Application level, function level
Description: Helps to scale processing across the application layer and takes care of all of the above. The onus is still on the developer to use it properly; however, if the few core APIs* - map, foreach, reduce and groupBy/partitionBy - are used, the program can be written as if it were running on a single node, in a single thread. The system manages shared RAM across multiple nodes, shared cores, task scheduling, multithreading etc. (*P.S. - Spark also has an extensive machine-learning library, which could be the gateway for the future.)

Type: Multithreading
Granularity: Function level
Description: Helps to scale the processing across the cores inside a single node. Usually has to be done with care to avoid the complexity of threading-related problems, which many programmers are unaware of.

Type: Green Threads
Granularity: Function level / stack level
Description: E.g. greenlets in Python. Good for switching stacks in IO-bound applications, for example a socket server. Not really parallel, but the wait time in one stack frame can be used by other stacks waiting to execute. Rather too specific for general-purpose usage.
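To make Spark's place concrete: the map/reduce/groupBy style that Spark distributes across a cluster is the same pattern you can write single-threaded. A minimal plain-Python word count in that style (illustrative only - Spark would run the same logic on RDDs partitioned across nodes):

```python
from collections import defaultdict

lines = ["spark scales out", "spark is friendly", "scales out easily"]

# map step: emit a (word, 1) pair for every word
pairs = [(word, 1) for line in lines for word in line.split()]

# group-and-reduce step: sum the counts per key
# (Spark's reduceByKey does this across partitions)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

Spark's value is that the same two steps run unchanged when `lines` is terabytes spread over many nodes.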

It takes less than 10 minutes to download and set up Spark. For software this capable, there are surprisingly few steps. Please follow the steps in Working with the EE Cloud.


How stable are Apache Spark and Apache Cassandra?
Speaking from our limited experience running the prototype, all of the Spark and Cassandra JVMs survived the 20 days of load runs, network problems and application exceptions we threw at them - and that too in a low-end EE cloud lab. They look to be well written.
Data modelling is closely connected to primary key and partition key design. It is important to design your primary key and partition key so that writes are distributed and reads are fast. This is explained well by the Cassandra expert here -> http://www.planetcassandra.org/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key/
The hash of the partition key is used by Cassandra to identify the node on which to store the row. So choosing a partition key that distributes the load equally among nodes prevents write hotspots. An example can be seen in the performance run page.
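The hashing idea can be sketched in a few lines. Note this is purely illustrative - MD5 and a modulo stand in for Cassandra's actual Murmur3 partitioner and token ring, and the node names are made up:

```python
import hashlib

nodes = ["node-a", "node-b", "node-c"]

def node_for(partition_key):
    # hash the partition key, then map the hash onto a node;
    # Cassandra does this in spirit with Murmur3 tokens on a ring
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# a well-chosen key (e.g. a sensor id) spreads writes across nodes;
# a constant key would send every write to one node - a write hotspot
placement = {k: node_for(k) for k in ["sensor-1", "sensor-2", "sensor-3"]}
print(placement)
```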
P.S. - There are a few trivial but important things, like writing the commit log and the data (SSTables) to different disk partitions. This link gives basic info about the write path.
We have not come across a single most important thing as such, but here are a couple of pointers:
1.    Avoid doing any major work in the Spark driver; rdd.collect() or even the somewhat better rdd.toLocalIterator() are not good ideas and don't scale - you soon get OOM errors.
2.    There is no way to share state like counters between the driver and the workers, though in the code it may seem so. The only way is via accumulators, and the workers cannot read those.
3.    The way you partition the RDD may be important for performance, especially for operations like groupBy; we need to test and understand this better.
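On pointer 3, the effect can be illustrated without a cluster: keys are assigned to partitions by hashing, so a skewed key distribution leaves one partition (and hence one task) doing most of the work in a groupBy. A plain-Python sketch (the hash-to-partition scheme here is illustrative, not Spark's exact HashPartitioner):

```python
from collections import Counter
import zlib

def partition_of(key, num_partitions=4):
    # stable hash of the key -> partition index (illustrative;
    # Spark's HashPartitioner uses the key's hash the same way)
    return zlib.crc32(key.encode()) % num_partitions

# a skewed dataset: one "hot" key dominates, so whichever partition
# it hashes to holds ~97% of the records, and the task processing
# that partition becomes the bottleneck
keys = ["hot"] * 97 + ["a", "b", "c"]
load = Counter(partition_of(k) for k in keys)
print(load)
```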


Saturday, February 04, 2017

Compiling OpenCV with CUDA (GPU) using CMake (GUI)


I have a tendency to choose the exact wrong thing every time when given a choice; I have sometimes wondered why. A good thing about doing things almost wrong is that you get to learn about things. I have a feeling that doing things wrong, getting feedback and correcting is somehow fundamental to the way the learning process happens.
Hopefully this article will save you some time and some hair pulling. I started out with Windows and then moved to Linux, but most things are common.

Before we start, just a very short introduction into the why part - trust me, just the bare essentials.

OpenCV operates on images, which in computers (at least the ones we have now) are stored as pixel matrices. The various algorithms that OpenCV provides, for example for object detection, do a lot of matrix operations. These operations are 'embarrassingly parallel' - data parallel - and could be speeded up if executed on the GPU.
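To see the data parallelism: in a grayscale conversion, each output pixel is a pure function of the corresponding input pixel, so every pixel could in principle be computed simultaneously on its own GPU core. A toy sketch in plain Python (the weights are the standard luminance coefficients; real code would of course use OpenCV's cvtColor):

```python
# a tiny 2x2 "image": each pixel is an (R, G, B) tuple
image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]

def to_gray(pixel):
    # each output pixel depends only on one input pixel - no pixel
    # depends on any other, which is what makes the operation
    # 'embarrassingly parallel' on a GPU
    r, g, b = pixel
    return int(0.299 * r + 0.587 * g + 0.114 * b)

gray = [[to_gray(p) for p in row] for row in image]
print(gray)
```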

Now NVIDIA GPUs have a parallel programming API called CUDA which can help in speeding up matrix multiplication, and OpenCV has support for it; to use it, however, you need to compile OpenCV with CUDA. CUDA is NVIDIA proprietary and works only with NVIDIA GPUs.

There is an open API which should work with different types of GPU cards, and that is OpenCL. However, it may not be as tuned for a particular card. OpenCV has support for OpenCL too; for now, however, we will use CUDA.

Finally, one more thing: CUDA uses BLAS libraries, and the CUDA SDK provided by NVIDIA ships the cuBLAS libraries for this. Don't ask me why I chose to compile OpenBLAS for it; as I said before, CMake gives a lot of choices, and if you don't know as much as the above, you are sure to do some totally unnecessary but very instructive things.

Okay, now on to the how.

On Windows

First check if your PC or laptop has an NVIDIA card. The easiest way is via the dxdiag Windows utility.


Now see if your card supports CUDA. There is a good utility from TechPowerUp called GPU-Z that will show this information among others, like GPU load, which will help later to see if the programs are really using the GPU. Or you could check the NVIDIA website for the card and see if it is supported; I guess most cards are. Or you could check the very detailed Wikipedia page which lists the various generations of the processors: https://en.wikipedia.org/wiki/CUDA



The next step is to download the CUDA SDK from NVIDIA: https://developer.nvidia.com/cuda-downloads. If you have a 64-bit system, download the 64-bit SDK. Choose the defaults and install it.

Then download the OpenCV source code from Git and download the CMake tool. You also need MS Visual Studio Community edition for the C++ compiler.

The main thing for a correct compilation is to choose the right settings in CMake. First, these are the minimum WITH_ variables that need to be configured.



Miss a few or mess with a few and you will have a lot of errors coming.
I tried guessing, removed some, and got a lot of errors while running the program. WITH_CUDA is mandatory. If you need to see the video images in a GUI, make sure to select WITH_WIN32UI and WITH_FFMPEG. I am still not sure if some of the others are needed or why they are there. Please don't feel appalled; I learn this way - I have no clue initially and I figure it out the hard way. It is something to do with being stupid. The reason I removed the defaults was to cut the compile time down from the better half of a day to something more reasonable.

Then I found that the best way to reduce the compile time was to limit the architecture setting to the compute capability I thought the GPU card supported. In my case, for the GeForce GT 720M card, the CUDA wiki page gave the architecture code name as Fermi and the compute capability as 2.1. That did not work; so I gave 2.0, and the compile time decreased considerably.
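For reference, the GUI settings above correspond to a command-line invocation along these lines. WITH_CUDA, WITH_FFMPEG, WITH_WIN32UI and CUDA_ARCH_BIN are real OpenCV CMake options, but the exact combination, generator and paths here are an assumption from my setup, so adjust them to yours:

```
cmake -G "Visual Studio 14 2015 Win64" ^
  -D WITH_CUDA=ON ^
  -D CUDA_ARCH_BIN=2.0 ^
  -D CUDA_ARCH_PTX=2.0 ^
  -D WITH_FFMPEG=ON ^
  -D WITH_WIN32UI=ON ^
  -D BUILD_TESTS=OFF ^
  D:\opencv
```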


After that you Configure; make sure you select the 64-bit Visual Studio compiler. Select 32-bit, or make some other mistake, and you will be led to a lot of configuration errors.



With the 64-bit compiler selected, CMake will automatically pick the 64-bit libraries from the CUDA SDK. Otherwise it will try to take the 32-bit libraries and you may get a configuration error about BLAS.



With that you should be able to compile your OpenCV program. Note that when I used the default CUDA_ARCH_BIN setting, which goes all the way from 1 to 5, I got some linker errors -

Severity Code Description Project File Line Suppression State
Error LNK2019 unresolved external symbol __cudaRegisterLinkedBinary_54_tmpxft_000028d8_00000000_15_gpu_mat_compute_37_cpp1_ii_71482d89 referenced in function "void __cdecl __sti____cudaRegisterAll_54_tmpxft_000028d8_00000000_15_gpu_mat_compute_37_cpp1_ii_71482d89(void)" (?__sti____cudaRegisterAll_54_tmpxft_000028d8_00000000_15_gpu_mat_compute_37_cpp1_ii_71482d89@@YAXXZ) opencv_core D:\build\opencv2\modules\core\cuda_compile_generated_gpu_mat.cu.obj 1


For your program using the OpenCV built above, usually most of the libraries given below are needed. If your build of OpenCV is proper, you should get this many DLLs in the output folder. If some are missing, try to build them from Visual Studio.


opencv_calib3d320.lib
opencv_core320.lib
opencv_features2d320.lib
opencv_flann320.lib
opencv_highgui320.lib
opencv_imgcodecs320.lib
opencv_imgproc320.lib
opencv_ml320.lib
opencv_objdetect320.lib
opencv_shape320.lib
opencv_ts320.lib
opencv_video320.lib
opencv_videoio320.lib
opencv_cudaimgproc320.lib
opencv_cudaarithm320.lib
opencv_cudabgsegm320.lib
opencv_cudacodec320.lib
opencv_cudalegacy320.lib
opencv_cudaobjdetect320.lib
opencv_cudawarping320.lib
opencv_cudev320.lib
opencv_cudafilters320.lib

If you get include errors, see the link below.

http://answers.opencv.org/question/29885/does-opencv_moduleshpp-exist-in-opencv248/

Finally, check with GPU-Z to see if running the program really uses the GPU.


Note: for building your OpenCV solutions (1) using these libs, the following headers have to be added to the project settings.

(1) People detection example - https://gist.github.com/alexcpn/aeb8a4b8304639d8f91cc2fbc0c1c7df

Include Directories

C/C++ --> General --> Additional Include Directories

- D:\opencv\modules\calib3d\include;D:\opencv\modules\videoio\include;D:\opencv\modules\video\include;D:\opencv\modules\imgcodecs\include;D:\opencv\modules\cudaoptflow\include;D:\opencv\modules\cudastereo\include;D:\build\opencv4;d:\opencv\modules\core\include;D:\opencv\modules\cudawarping\include;D:\opencv\include;D:\opencv\modules\cudaobjdetect\include;D:\opencv\modules\cudaimgproc\include;D:\opencv\modules\imgproc\include;D:\opencv\modules\highgui\include;D:\opencv\modules\objdetect\include;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include

Note that opencv2/opencv_modules.hpp comes from the OpenCV build folder d:\build\opencv4\opencv2
and not from the OpenCV Git source; this directory should be in the include path
(d:\build\opencv4\ is the output directory specified in CMake).




Libs
Linker-->Input--> Additional Dependencies --> opencv_calib3d320.lib;opencv_core320.lib;opencv_features2d320.lib;opencv_flann320.lib;opencv_highgui320.lib;opencv_imgcodecs320.lib;opencv_imgproc320.lib;opencv_ml320.lib;opencv_objdetect320.lib;opencv_shape320.lib;opencv_ts320.lib;opencv_video320.lib;opencv_videoio320.lib;opencv_cudaimgproc320.lib;opencv_cudaarithm320.lib;opencv_cudabgsegm320.lib;opencv_cudacodec320.lib;opencv_cudalegacy320.lib;opencv_cudaobjdetect320.lib;opencv_cudawarping320.lib;opencv_cudev320.lib;opencv_cudafilters320.lib;%(AdditionalDependencies)

Lib Directories : - D:\build\opencv4\lib\Release

Note: this is not needed on Linux, as the includes will already be in /usr/include. Please see a sample CMake file for Linux with OpenCV:

cmake_minimum_required(VERSION 2.8)
project( XXX )
find_package( OpenCV REQUIRED )
add_executable(XXX VideoTestHaar.cpp)
add_executable(YYY ColorTracker.cpp)
include_directories("/usr/local/include/opencv2")
include_directories("${OpenCV_INCLUDE_DIRS}")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Wall")
target_link_libraries( XXX ${OpenCV_LIBS} )
target_link_libraries( YYY  ${OpenCV_LIBS} )


Here is what I did to install the latest OpenCV on an x86 64-bit machine running Ubuntu.

    //  The below should be done at the beginning; I did not do this and got some broken-package errors, so I did it then; you learn the hard way :)
     sudo apt-get -y update
     sudo apt-get -y upgrade
     sudo apt-get -y autoremove
// Video codecs and other libs; this many are not listed on the OpenCV site, but I got them from some other blog; I am not sure what the bare minimum is, and some may not be needed. It also depends on how you configure the OpenCV build in CMake.
     sudo apt-get install -y libtbb-dev libeigen3-dev
     sudo apt-get install cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
     sudo apt-get install libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev
     sudo apt-get install -y qt5-default
     sudo apt-get install -y zlib1g-dev libjpeg-dev libwebp-dev libpng-dev libtiff5-dev libjasper-dev libopenexr-dev libgdal-dev

In Ubuntu (16.04), make sure that you install the NVIDIA driver for the card. Check the latest driver version on the NVIDIA site for your card, then add the relevant repository and install. Please follow this: http://alexpunnen.blogspot.in/2017/03/installupgrade-nvidi-driver-in-ubuntu.html

sudo apt-add-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-3xx

sudo modprobe nvidia (I also ran this before the restart)

Check via the nvidia-smi command:

alex@alex-Lenovo-G400s-Touch:~$ nvidia-smi
Tue Feb 28 15:10:50 2017    
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 720M     Off  | 0000:01:00.0     N/A |                  N/A |
| N/A   51C    P0    N/A /  N/A |    271MiB /  1985MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                             
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+


Install the samples and test via deviceQuery after making the samples --> http://xcat-docs.readthedocs.io/en/stable/advanced/gpu/nvidia/verify_cuda_install.html

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 720M"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1985 MBytes (2081685504 bytes)
  ( 2) Multiprocessors, ( 48) CUDA Cores/MP:     96 CUDA Cores
  GPU Max Clock rate:                            1550 MHz (1.55 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 720M
Result = PASS


