Compiling OpenCV with CUDA (GPU) using CMake (GUI)

I have a tendency to choose exactly the wrong thing every time I am given a choice; I have sometimes wondered why. A good thing about doing things almost wrong is that you get to learn about them. I have a feeling that doing things wrong, getting feedback, and correcting is somehow fundamental to the way learning happens.
However, I hope this article will save you some time and some hair pulling. I started out with Windows and then moved to Linux, but most things are common to both.

Before we start, just a very short introduction to the why part; trust me, just the bare essentials.

OpenCV operates on images, which in computers (at least the ones we have now) are stored as pixel matrices. The various algorithms that OpenCV provides, for example for object detection, do a lot of matrix operations. These operations are 'embarrassingly parallel' (data parallel) and can be sped up if executed on the GPU.
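To make "embarrassingly parallel" concrete, here is a minimal pure-Python sketch (not OpenCV code; the tiny image and the function names are made up for illustration). Each output pixel depends only on its own input pixel, so the image can be cut into chunks that are processed independently, which is exactly the kind of work a GPU can spread over many cores.

```python
# A tiny grayscale "image" as a matrix of pixel intensities (0-255).
# The values are invented for this illustration.
image = [
    [12, 200, 90],
    [255, 30, 128],
    [64, 180, 0],
]


def threshold_pixel(value, cutoff=127):
    """Binarize one pixel; it needs no information from any other pixel."""
    return 255 if value > cutoff else 0


def threshold_rows(rows, cutoff=127):
    """Apply the per-pixel operation to a block of rows."""
    return [[threshold_pixel(v, cutoff) for v in row] for row in rows]


# Sequential: process the whole image in one go.
whole = threshold_rows(image)

# Parallel in spirit: split into independent chunks, process each chunk
# on its own (a GPU would do this simultaneously), then stitch the results.
top, bottom = image[:2], image[2:]
stitched = threshold_rows(top) + threshold_rows(bottom)

print(whole == stitched)
```

Because no chunk needs data from another chunk, the order (or simultaneity) of processing does not change the result.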

Now, NVIDIA GPUs have a parallel programming API called CUDA, which can help in speeding up matrix multiplication. OpenCV has support for it; to use it, however, you need to compile OpenCV with CUDA. CUDA is proprietary to NVIDIA and works only with NVIDIA GPUs.

There is an open API that should work with different types of GPU cards, and that is OpenCL. However, it may not be as well tuned for a particular card. OpenCV has support for OpenCL too; however, for now we will use CUDA.

Finally, one more thing: CUDA uses BLAS libraries. The CUDA SDK provided by NVIDIA has the cuBLAS libraries for it. Don't ask me why I chose to compile OpenBLAS for it; as I said before, CMake gives a lot of choices, and if you don't know at least as much as the above, you are sure to do some totally unnecessary but very instructive things.

Okay, now to the how.

On Windows

First check if your PC or laptop has an NVIDIA card. The easiest way to do this is via the dxdiag Windows utility.

Now see if your card supports CUDA. GPU-Z, a good utility from TechPowerUp, shows this information among other things such as GPU load, which will help later to see whether your programs are really using the GPU. Or you could check the NVIDIA website for the card and see if it supports CUDA (I guess most cards do), or you could check the very detailed page on Wikipedia that lists the various generations of the processors.

The next step is to download the CUDA SDK from NVIDIA. If you have a 64-bit system, download the 64-bit SDK. Choose the defaults and install it.

Then download the OpenCV source code from Git and download the CMake tool. You also need to download MS Visual Studio Community edition for the C++ compiler.

The main thing for a correct compilation is to choose the right settings in CMake. First, these are the minimum WITH variables that need to be configured.

Miss a few or mess with a few and you will get a lot of errors.
I tried guessing and removed some, and got a lot of errors while running the program. WITH_CUDA is mandatory. If you need to see the video images in a GUI, make sure to select WITH_WIN32UI and WITH_FFMPEG. I am still not sure whether some of them are needed or why they are here. Please don't be appalled; I learn this way. I have no clue initially and I figure it out the hard way. It is something to do with being stupid. The reason I removed the defaults was to cut down the compile time from the better part of a day to something more reasonable.

Then I found that the best way to reduce the compile time was to limit the architecture to the number I thought the GPU card supported. In my case, for the GeForce GT 720M card, the CUDA Wikipedia page gave the architecture code name as Fermi and the compute capability as 2.1. That did not work, so I gave 2.0 and found that the compile time decreased considerably.

After that, press Configure and make sure you select the 64-bit Visual Studio compiler. Select 32-bit or make some other mistake and you will be led to a lot of configuration errors.

If that is the case, CMake will automatically select the 64-bit libraries from the CUDA SDK. Otherwise it will try to take the 32-bit libraries and you may get a configuration error about BLAS.
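The same choices can also be passed to CMake on the command line instead of the GUI. The sketch below only builds the argument list; it does not run anything. WITH_CUDA, WITH_WIN32UI, WITH_FFMPEG and CUDA_ARCH_BIN are the OpenCV options discussed above. The helper name is hypothetical, and the generator string "Visual Studio 14 2015 Win64" is an assumption on my part: use whichever 64-bit generator matches your installed Visual Studio. D:\opencv is the source folder that appears in the include paths later in this post.

```python
def cmake_configure_command(source_dir, arch_bin, generator):
    """Build the argument list for a CMake configure run.

    The command is meant to be executed from inside the build directory.
    This helper is illustrative only; it is not part of OpenCV or CMake.
    """
    options = {
        "WITH_CUDA": "ON",          # mandatory for the GPU build
        "WITH_WIN32UI": "ON",       # needed to show video in a GUI window
        "WITH_FFMPEG": "ON",        # needed for video decoding
        "CUDA_ARCH_BIN": arch_bin,  # limit to one architecture to cut build time
    }
    command = ["cmake", "-G", generator]
    command += [f"-D{name}={value}" for name, value in options.items()]
    command.append(source_dir)
    return command


command = cmake_configure_command(
    source_dir=r"D:\opencv",
    arch_bin="2.0",
    generator="Visual Studio 14 2015 Win64",  # assumed; pick your own 64-bit generator
)
print(" ".join(command))
```

If you want to actually run it, pass the list itself to subprocess.run so that the generator name, which contains spaces, stays a single argument.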

With that you should be able to compile OpenCV. Note that when I used the default ARCH_BIN setting, which goes all the way from 1 to 5, I got some linker errors:

Severity Code Description Project File Line Suppression State
Error LNK2019 unresolved external symbol __cudaRegisterLinkedBinary_54_tmpxft_000028d8_00000000_15_gpu_mat_compute_37_cpp1_ii_71482d89 referenced in function "void __cdecl __sti____cudaRegisterAll_54_tmpxft_000028d8_00000000_15_gpu_mat_compute_37_cpp1_ii_71482d89(void)" (?__sti____cudaRegisterAll_54_tmpxft_000028d8_00000000_15_gpu_mat_compute_37_cpp1_ii_71482d89@@YAXXZ) opencv_core D:\build\opencv2\modules\core\ 1

A program using the OpenCV built above usually needs most of the libraries given below. If your build of OpenCV is proper, you will get this many DLLs in the output folder. If some are missing, try to build them from Visual Studio.


If you get include errors, see the link.

Finally, check with GPU-Z whether running the program really uses the GPU.

Note: for building your own OpenCV solutions (1) with these libs, the following headers have to be added to your project.

(1) People detection example -

Include Directories

C/C++ --> General --> Additional Include Directories

- D:\opencv\modules\calib3d\include;D:\opencv\modules\videoio\include;D:\opencv\modules\video\include;D:\opencv\modules\imgcodecs\include;D:\opencv\modules\cudaoptflow\include;D:\opencv\modules\cudastereo\include;D:\build\opencv4;d:\opencv\modules\core\include;D:\opencv\modules\cudawarping\include;D:\opencv\include;D:\opencv\modules\cudaobjdetect\include;D:\opencv\modules\cudaimgproc\include;D:\opencv\modules\imgproc\include;D:\opencv\modules\highgui\include;D:\opencv\modules\objdetect\include;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include

Note: opencv2/opencv_modules.hpp comes from the OpenCV build folder d:\build\opencv4\opencv2
and not from the OpenCV Git source, so d:\build\opencv4 should be in the include path
(d:\build\opencv4\ is the output directory specified in CMake).

Linker-->Input--> Additional Dependencies --> opencv_calib3d320.lib;opencv_core320.lib;opencv_features2d320.lib;opencv_flann320.lib;opencv_highgui320.lib;opencv_imgcodecs320.lib;opencv_imgproc320.lib;opencv_ml320.lib;opencv_objdetect320.lib;opencv_shape320.lib;opencv_ts320.lib;opencv_video320.lib;opencv_videoio320.lib;opencv_cudaimgproc320.lib;opencv_cudaarithm320.lib;opencv_cudabgsegm320.lib;opencv_cudacodec320.lib;opencv_cudalegacy320.lib;opencv_cudaobjdetect320.lib;opencv_cudawarping320.lib;opencv_cudev320.lib;opencv_cudafilters320.lib;%(AdditionalDependencies)

Lib Directories : - D:\build\opencv4\lib\Release
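Typing that long dependency list by hand is error-prone. The file names follow a pattern: opencv_, the module name, the version with the dots removed, then .lib (Windows debug builds add a trailing d). A small sketch that generates the string for the linker field is shown below; the function name and the short module list are only for illustration.

```python
def opencv_lib_names(modules, version="3.2.0", debug=False):
    """Return Windows import-library names for the given OpenCV modules."""
    suffix = version.replace(".", "")  # "3.2.0" -> "320"
    tail = "d" if debug else ""        # debug builds are named e.g. opencv_core320d.lib
    return [f"opencv_{module}{suffix}{tail}.lib" for module in modules]


# A few of the modules from the list above; extend as your program needs.
modules = ["core", "imgproc", "highgui", "objdetect", "cudaobjdetect", "cudev"]

# Semicolon-separated, ready to paste into Additional Dependencies.
dependencies = ";".join(opencv_lib_names(modules))
print(dependencies)
```

If you rebuild against a different OpenCV version, only the version argument changes.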

Note: this is not needed on Linux, as the headers will already be in /usr/include. Here is a sample CMake file for Linux with OpenCV:

cmake_minimum_required(VERSION 2.8)
project( XXX )
find_package( OpenCV REQUIRED )
add_executable(XXX VideoTestHaar.cpp)
add_executable(YYY ColorTracker.cpp)
target_link_libraries( XXX ${OpenCV_LIBS} )
target_link_libraries( YYY  ${OpenCV_LIBS} )

Here is what I did to install the latest OpenCV on an x86 64-bit machine running Ubuntu:

Update: I have updated the Dockerfile and image. Using this image will simplify your tasks.

# The below should be done at the beginning. I did not do this and got a broken-package error, so I did it; you learn the hard way :)
sudo apt-get -y update
sudo apt-get -y upgrade
sudo apt-get -y autoremove
# Video codecs and other libs. Not all of these are listed on the OpenCV site; I got them from another blog
# and am not sure what the bare minimum is. Some are not needed. It also depends on how you configure CMake for the OpenCV build.
sudo apt-get install -y cmake pkg-config \
  zlib1g-dev ffmpeg libwebp-dev \
  libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev \
  libgtk2.0-dev libavcodec-dev libavformat-dev libswscale-dev
# These extras do not seem to be needed:
# qt5-default libtiff5-dev libopenexr-dev libgdal-dev libdc1394-22-dev libeigen3-dev

On Ubuntu (16.04), make sure that you install the NVIDIA driver for the card. Check the latest driver version for your card on the NVIDIA site. Then add the relevant repository and install. Please follow this:

sudo apt-add-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-3xx

sudo modprobe nvidia   # I also ran this before restarting

Check via the nvidia-smi command:

alex@alex-Lenovo-G400s-Touch:~$ nvidia-smi
Tue Feb 28 15:10:50 2017    
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GT 720M     Off  | 0000:01:00.0     N/A |                  N/A |
| N/A   51C    P0    N/A /  N/A |    271MiB /  1985MiB |     N/A      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    0                  Not Supported                                         |
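If you want the same check from a script rather than by eye, nvidia-smi can print selected fields as CSV through its --query-gpu and --format options. The following is a minimal sketch, assuming nvidia-smi is on the PATH. The function names are mine, and the sample line is constructed from the values in the output above, not captured from a real run. On an older card like this one, some fields may come back as not supported.

```python
import subprocess

# Ask nvidia-smi for three fields as plain CSV, one GPU per line.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output into a list of dicts, one per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "driver": driver, "memory": memory})
    return gpus


def query_gpus():
    """Run nvidia-smi and parse its output (needs the NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Sample line assembled from the values shown above (not real captured output).
sample = "GeForce GT 720M, 375.39, 1985 MiB\n"
print(parse_gpu_csv(sample))
```

An empty list from query_gpus(), or an exception from the subprocess call, would suggest the driver is not loaded.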

Install the CUDA samples and, after building them, test with deviceQuery:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 720M"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1985 MBytes (2081685504 bytes)
  ( 2) Multiprocessors, ( 48) CUDA Cores/MP:     96 CUDA Cores
  GPU Max Clock rate:                            1550 MHz (1.55 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 720M
Result = PASS
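The two lines worth pulling out of this output are the compute capability (the number you need for the ARCH_BIN setting) and the final result. Here is a small sketch that extracts both from saved deviceQuery text; the function name is hypothetical and the sample is a shortened copy of the output above.

```python
import re


def parse_device_query(text):
    """Extract the compute capability and PASS/FAIL status from deviceQuery output."""
    capability = re.search(
        r"CUDA Capability Major/Minor version number:\s*(\d+)\.(\d+)", text
    )
    result = re.search(r"^Result = (\w+)", text, re.MULTILINE)
    return {
        # (major, minor) tuple, or None if the line is missing
        "capability": (int(capability.group(1)), int(capability.group(2)))
        if capability
        else None,
        # True only when the final line reads "Result = PASS"
        "passed": bool(result) and result.group(1) == "PASS",
    }


# Shortened copy of the deviceQuery output shown above.
sample = """\
Device 0: "GeForce GT 720M"
  CUDA Capability Major/Minor version number:    2.1
Result = PASS
"""

info = parse_device_query(sample)
print(info)
```

Remember that the reported capability is not always the value the build accepts: as described earlier, this card reports 2.1 but I had to give 2.0 in CMake.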

