Tuesday, September 25, 2012

Finally Holidays... and some GPU code!

During the next week I'll be visiting London.

In the meanwhile enjoy a code preview of A GPU Exercise part 1 and part 2, and stay tuned for part 3 :)

To get, compile, and test the code, clone my funproject repository on github.
git clone git://github.com/cybercase/funproject.git
To compile the code is required the NVIDIA CUDA Toolkit. Then run:
cd funproject
cmake .
If CUDA Toolkit was found by CMake, then the lev_distance target should be built into the experiments directory. To quickly test the performances run:
cd experiments
./lev_distance lev_distance.cu thread.h
The following result is achieved on my late 2010 MBA
lev_distance.cu size is 8716 bytes
thread.h size is 17818 bytes
elapsed time: 9.240 (s)
elapsed time: 0.872 (s)
Results matches!
Distance: 14332
Please note that this is an educational implementation, and several improvements can be done to exploit all the features of the underlying architecture...

But before talking about this... holidays! :)

Sunday, September 16, 2012

A GPU Exercise - part 2

This is the sequel of previous post: A GPU Exercise - part 1

Let's resume from the algorithm for computing the Levenshtein distance among two strings. To fill every cell of the subrpoblems matrix, we can allocate a 2 dimensional array and then proceed like this:

// C-Like syntax: don't expect it compiles :)
int lev_distance(char* string_one, char* string_two) {
 int m = strlen(string_one)+1;
 int n = strlen(string_two)+1;
 int matrix[m][n];

 for (int j=0; j<n; ++j) { // Step1: Init first row
  matrix[0][j] = j;

 for (int i=0; i<m; ++i) { // Step2: Init first column
  matrix[i][0] = i;

 for (int i=1; i<m; ++i) { // Step3: Fill every cell
  for (int j=1; j<n; ++j) {
   if (string_one[i-1] == string_two[j-1]) {
    matrix[i][j] = matrix[i-1][j-1];
   } else {
    matrix[i][j] = min(matrix[i][j-1],
          matrix[i-1][j-1]) + 1;
 return matrix[m-1][n-1]; // cell containing cost of the real problem

But, if we look carefully at Step3 inner loop, it's easy to see that there is not need to store the entire matrix since, for each cell, we need just the immediately preceding ones. Hence, to reduce allocated memory, it's possible to change the algorithm like this:

int lev_distance(char* string_one, char* string_two)
 int m = strlen(string_one); // Note: removed +1
 int n = strlen(string_two)+1;

    int current[n];
    int previous[n];
    for (int j=0; j<n; ++j) {
        previous[j] = j;

    for (int i=0; i<m; ++i) {
        current[0] = i+1;
        for (int j=1; j<n; ++j) {
         // Note: the if is replaced by the conditional assignment
            current[j] = (previous[j]+1, current[j-1]+1, 
                previous[j-1] + (string_one[i] != string_two[j-1] ? 1 : 0) );
        swap(current, previous);
    return previous[n-1];

Now, only 2 array of size strlen(string_two)+1 are stored.

This algorithm is quite good for a singlecore cpu implementation, since it has good data locality... Small strings can be stored entirely inside the cache memory to achive the best performances. However, think about optimising this algorithm for a multicore CPU... It's easy to notice the strong data dependancy: we can't compute the value of a cell if we didn't already computed the previous ones.

To overcome this problem and take the full advantage of using multiple cores we need to change the algorithm. And these are the changes (can you find another way?).

Instead of walking over the matrix by rows and then by columns, we can walk by diagonals to avoid some data dependancies. Look at the followings steps:

  1. Compute the value of cell $c_{1,1}$ (previous needed values $c_{1,0}$ $c_{0,1}$ and $c_{0,0}$ comes from initialisation) $$\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & y_3 & \cdots \\ \hline \# & 0 & 1 & 2 & 3 \\ \hline x_1 & 1 & \mathbf{c_{1,1}} \\ \hline x_2 & 2\\ \hline x_3 & 3 \\ \hline \vdots\\ \hline \end{array}
  2. Compute cells $c_{2,1}$ and $c_{1,2}$: cell $c_{2,1}$ requires cells $c_{1,1}$ (computed in the previous step), cell $c_{2,0}$ and $c_{1,0}$ (both comes from initialisation). Same for cell $c_{1,2}$. $$\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & y_3 & \cdots \\ \hline \# & 0 & 1 & 2 & 3 \\ \hline x_1 & 1 & c_{1,1} & \mathbf{c_{1,2}} \\ \hline x_2 & 2 & \mathbf{c_{2,1}} \\ \hline x_3 & 3 \\ \hline \vdots \\ \hline \end{array}
  3. Compute cell $c_{3,1}$ $c_{2,2}$ and $c_{1,3}$: cell $c_{2,2}$ requires cells $c_{1,1}$ (computed in step 1) $c_{2,1}$ and $c_{1,2}$ (from step 2) $$\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & y_3 & \cdots \\ \hline \# & 0 & 1 & 2 & 3 \\ \hline x_1 & 1 & c_{1,1} & c_{1,2} & \mathbf{c_{1,3}} \\ \hline x_2 & 2 & c_{2,1} & \mathbf{c_{2,2}} \\ \hline x_3 & 3 & \mathbf{c_{3,1}} \\ \hline \vdots \\ \hline \end{array}
And so on...
At a given step, each cell refers to the cells computed in the previous steps, and never to the ones of the same. Precisely, filling the cells in step $i$ requires filling cells at step $i-1$ and $i-2$.

This allow us to rewrite the algorithm in a way such that all of the cells in a certain diagonal can be filled concurrently since the data dependancy has been moved among diagonals.

Unfortunately I don't have enough time to show the pseudo code of this algorithm in this post because, even if this new algorithm can be conceptually simple to understand, a good implementation requires the handling of a some special cases that will make the code snippet definitely too long to explain in this blogpost...

Instead, I'm going to push (and talk about) the code on my github account during the next blogpost.

... suspense is raising, isn't it? :)

Thursday, September 13, 2012

A GPU Exercise - part 1

It's been several months since the last time I put my hands on something related to GPUs. Sadly, I get very rarely the chance to get this kind of jobs.
So, to keep myself trained (just in case I will get one), during the last two weeks I worked on an "exercise project".

For this exercise project I choosed to port to GPU an algorithm that I studied during my University years. Here you can find some informations:

I'll try to explain in few words and with a little example what is the Levenshtein distance.
Given two strings, the Levenstein distance is the minimum number of character insertion, deletion or substitution, to change a string into the other, ie:

The Levenshtein distance between RISOTTO and PRESTO is 4:

  1. Insert P
  2. Substitute I with E
  3. Delete O
  4. Delete T

Finding the shortest list of operations to change a string into another is a problem that can be solved by using the Dynamic Programming technique. The main trouble is that, in many cases, this technique requires an huge amount of memory to store intermediate results and avoid recomputing.

In fact, this algorithm requires a amount of memory proportional to the product of the lengths of the 2 strings: if len(string1) is m, and len(string2) is n, then the memory and operations requirement is O(m*n).

The required memory amount can be reduced in several ways... expecially exploiting some of the upper bounds of the optimal solution (see the wikipedia link at the top of this article).

However, if we compute just the number of operations (without actually knowing which operations are performed), the memory requirement can be reduced up to O(max(m, n)), even though the number of operations remains unchanged at O(m*n).

Let's try to understand how the algorithm works, and hence, why computing the optimal cost, in terms of number of changes, is cheaper (in terms of memory) than computing the entire operations list.

Given two string $X$ and $Y$ such that $X=x_1,x_2,...,x_m$ and $Y=y_1,y_2,...,y_n$ , we can use a matrix to represent the solutions of each subproblem.

$\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & \cdots & y_n \\ \hline \#\\ \hline x_1\\ \hline x_2\\ \hline \vdots\\ \hline x_m\\ \hline \end{array}$

Where cell $i,j$ will contain the cost of the optimal solution for the subproblem of changing string $X_i=\#,x_1,x_2,...,x_i$ into $Y_i=\#,y_1,y_2,...,y_j$ with $i \leq m$ and $j \leq n$. The $ \# $ character represents the empty string.
Note that the first row and column of this matrix, are quite easy to fill:

  • The first column contains the costs of changing any of the $X$ substrings into the empty $\#$ string. So, substring $X_1$ (formed by the empty string plus the $x_1$ character) will require 1 deletion to change into substring $Y_0$ (the empty string)... $X_2$ will require 2 deletion, $X_3$ is 3, and so on...
  • The first row contains the cost of changing the empty string into the substring $Y_j$. So, the cost of changing the empty string into $Y_1$, will be 1 insertion, $Y_2$ is 2, and so on...
  • Obviusly, the cell $0,0$ containing the cost of changing $\#$ in $\#$, that is 0.
\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & \cdots & y_n \\ \hline \# & 0 & 1 & 2 & \cdots & n\\ \hline x_1 & 1\\ \hline x_2 & 2\\ \hline \vdots\\ \hline x_m & m\\ \hline \end{array}
Now that we have defined the base cases, we can look at how it's possible to fill the remaining cells, and thus finding the costs of the optimal solutions. Let's take as example the cell $1,1$ corresponding to subproblem of changing $X_1=\#,x_1$ into $Y1=\#,y_1$.

The easiest case is when $x_1=y_1$. In this case we don't need to perform any operation over the previous subproblem of changing $X_0$ into $Y_0$, hence the costs are the same.

If $x_1$ differs from $y_1$, then we can choose the cheapest among the three following options:

  • Substitute $x_1$ with $y_1$. The cost will be 1 plus the cost of changing $X_0$ into $Y_0$.
  • Insert the character $y_1$ to $Y_0$. The cost will be 1 plus the cost of changing $X_1$ into $Y_0$
  • Delete the character $x_1$ from the string $X_1$. the cost will be 1 plus the cost of changing $X_0$ into $Y_1$
In the same way it's possible to compute the values for all the other cells, by looking at the cells associated to the immediately preceding subproblems. In the end we will get the value of subproblem $(X_m, Y_n)$, that is the problem from which we started from.

Let's make some observations:
- The value of cell $(j,i)$ depends only from the comparison between $x_i$ and $y_j$ and value in cells $(j-1,i-1)$, $(j-1,i)$, $(j,i-1)$.

- The operation that was performed to produce the value in cell $(i,j)$ can be deduced only by looking at $x_i$ and $y_j$ and value inside cells $(j-1,i-1)$, $(j-1,i)$, $(j,i-1)$.

This makes easy computing the value of a cell $i,j$ since we have to remember just the values of the immediatly preceeding subproblems.
On the other hand, to retrive all the operations associated to the optimal solution, it's necessary to store (or recompute) the values of all the preceeding cells, since it's not possible to know which cells you will need, until you know the "subproblems path" to the optimal solution.

I hope you enjoied this simple introduction to the algorithm. In the next post I'm going straight into the implementation and the problems (and the solutions) of make run this algorithm on a GPU... attaching a lot of code of course :)

And naturally, if you spot errors, I would be really happy to receive reports ;)

Saturday, September 1, 2012

Debug pointer: quick overview

Let's focus on some keypoints of my debug_ptr:

  • What a debug_ptr is used for?
    A debug_ptr can be used like a standard pointer type to hold the address of a dynamically allocated object:
    typedef_pointer(int*, int_p);
    int_p p = new int(0);
  • How could it be useful?
    If you forget to delete a dynamically allocated object before loosing its last reference, you are probably creating a memory leak. A debug_ptr warns you about this fact using a warning policy.
    struct ThrowPolicy {
        ThrowPolicy() {
            throw std::runtime_error("lost reference to undeleted object");
    struct PrintPolicy {
        PrintPolicy() {
            std::cerr << "WARNING: lost reference "
                         "to undeleted object" << std::endl;
  • How much does it cost? (in performance terms)
    The features of debug_ptr are mostly useful in "debug mode". In a production release, you may not want use the features of a debug_ptr class (although you can), because of the performance losses due to the reference counting.
  • How did I address this issue?
    You can compile your software by defining ENABLE_DEBUG_TYPES to get the debug_ptr feature turned on. Otherwise all the debug_ptr features are turned off by a macro that will substitute any occurrence of debug_ptr<T> with a T*. This will avoid any overhead at compile and run time.
    #define typedef_pointer(Type, Alias, ...) typedef Type Alias
    #define typedef_array(Type, Alias, ...) typedef Type Alias
  • How debug_ptr syntax differs by the one of a standard pointer type?
    debug_ptr it's intended to be used just like any other smart_ptr, and in most of the cases you should not notice any difference in creating, copying, dereferencing, assigning, and deleting...
    typedef_pointer(int*, int_p);
    int_p p = new int(0);
    int_p p_other = new int(0);
    *p = 63;
    int_p p_other = p;
    delete p_other;
  • How did you get the same delete syntax of a standard pointer type?
    By defining a default cast operator to a pointer-to-deleter-class. The cast operator will be implicitly called and a new deleter object returns to operator delete. So, delete operator, deletes the deleter-class.
    operator deleter*() const {
        return deleter::new_deleter(pd_);

And that's all folks!

If you have any question, just leave a comment.
If you like this project, or have any other ideas about what could be the next step of the debug_ptr, fork the project on github and have fun :)
git clone git://github.com/cybercase/debug_ptr.git # clone the repo
cd debug_ptr
g++ -Wall -o test debug_ptr_test.cc -DENABLE_DEBUG_TYPES