## Saturday, November 10, 2012

A month has passed since the end of my holidays in London.

I still have not found the time to write the final chapter of my GPU Exercise series (although all the code is available on github)...
I'm currently committed to a job that turned out to be more interesting of what expected: improving the performance of a Sparc VI / SunOS System software. I won't likely be able to write anything until this job ends.

However today I found something I wanted to share...

## Tuesday, September 25, 2012

### Finally Holidays... and some GPU code!

During the next week I'll be visiting London.

In the meanwhile enjoy a code preview of A GPU Exercise part 1 and part 2, and stay tuned for part 3 :)

To get, compile, and test the code, clone my funproject repository on github.
git clone git://github.com/cybercase/funproject.git

To compile the code is required the NVIDIA CUDA Toolkit. Then run:
cd funproject
cmake .
make

If CUDA Toolkit was found by CMake, then the lev_distance target should be built into the experiments directory. To quickly test the performances run:
cd experiments
./lev_distance lev_distance.cu thread.h
The following result is achieved on my late 2010 MBA
lev_distance.cu size is 8716 bytes
thread.h size is 17818 bytes
LevDistance...
elapsed time: 9.240 (s)
CudaLevDistance...
elapsed time: 0.872 (s)
Results matches!
Distance: 14332
Please note that this is an educational implementation, and several improvements can be done to exploit all the features of the underlying architecture...

## Sunday, September 16, 2012

### A GPU Exercise - part 2

This is the sequel of previous post: A GPU Exercise - part 1

Let's resume from the algorithm for computing the Levenshtein distance among two strings. To fill every cell of the subrpoblems matrix, we can allocate a 2 dimensional array and then proceed like this:

// C-Like syntax: don't expect it compiles :)
int lev_distance(char* string_one, char* string_two) {
int m = strlen(string_one)+1;
int n = strlen(string_two)+1;
int matrix[m][n];

for (int j=0; j<n; ++j) { // Step1: Init first row
matrix[0][j] = j;
}

for (int i=0; i<m; ++i) { // Step2: Init first column
matrix[i][0] = i;
}

for (int i=1; i<m; ++i) { // Step3: Fill every cell
for (int j=1; j<n; ++j) {
if (string_one[i-1] == string_two[j-1]) {
matrix[i][j] = matrix[i-1][j-1];
} else {
matrix[i][j] = min(matrix[i][j-1],
matrix[i-1][j],
matrix[i-1][j-1]) + 1;
}
}
}
return matrix[m-1][n-1]; // cell containing cost of the real problem
}

But, if we look carefully at Step3 inner loop, it's easy to see that there is not need to store the entire matrix since, for each cell, we need just the immediately preceding ones. Hence, to reduce allocated memory, it's possible to change the algorithm like this:

int lev_distance(char* string_one, char* string_two)
int m = strlen(string_one); // Note: removed +1
int n = strlen(string_two)+1;

int current[n];
int previous[n];
for (int j=0; j<n; ++j) {
previous[j] = j;
}

for (int i=0; i<m; ++i) {
current[0] = i+1;
for (int j=1; j<n; ++j) {
// Note: the if is replaced by the conditional assignment
current[j] = (previous[j]+1, current[j-1]+1,
previous[j-1] + (string_one[i] != string_two[j-1] ? 1 : 0) );
}
swap(current, previous);
}
return previous[n-1];
}

Now, only 2 array of size strlen(string_two)+1 are stored.

This algorithm is quite good for a singlecore cpu implementation, since it has good data locality... Small strings can be stored entirely inside the cache memory to achive the best performances. However, think about optimising this algorithm for a multicore CPU... It's easy to notice the strong data dependancy: we can't compute the value of a cell if we didn't already computed the previous ones.

To overcome this problem and take the full advantage of using multiple cores we need to change the algorithm. And these are the changes (can you find another way?).

Instead of walking over the matrix by rows and then by columns, we can walk by diagonals to avoid some data dependancies. Look at the followings steps:

1. Compute the value of cell $c_{1,1}$ (previous needed values $c_{1,0}$ $c_{0,1}$ and $c_{0,0}$ comes from initialisation) $$\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & y_3 & \cdots \\ \hline \# & 0 & 1 & 2 & 3 \\ \hline x_1 & 1 & \mathbf{c_{1,1}} \\ \hline x_2 & 2\\ \hline x_3 & 3 \\ \hline \vdots\\ \hline \end{array}$$
2. Compute cells $c_{2,1}$ and $c_{1,2}$: cell $c_{2,1}$ requires cells $c_{1,1}$ (computed in the previous step), cell $c_{2,0}$ and $c_{1,0}$ (both comes from initialisation). Same for cell $c_{1,2}$. $$\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & y_3 & \cdots \\ \hline \# & 0 & 1 & 2 & 3 \\ \hline x_1 & 1 & c_{1,1} & \mathbf{c_{1,2}} \\ \hline x_2 & 2 & \mathbf{c_{2,1}} \\ \hline x_3 & 3 \\ \hline \vdots \\ \hline \end{array}$$
3. Compute cell $c_{3,1}$ $c_{2,2}$ and $c_{1,3}$: cell $c_{2,2}$ requires cells $c_{1,1}$ (computed in step 1) $c_{2,1}$ and $c_{1,2}$ (from step 2) $$\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & y_3 & \cdots \\ \hline \# & 0 & 1 & 2 & 3 \\ \hline x_1 & 1 & c_{1,1} & c_{1,2} & \mathbf{c_{1,3}} \\ \hline x_2 & 2 & c_{2,1} & \mathbf{c_{2,2}} \\ \hline x_3 & 3 & \mathbf{c_{3,1}} \\ \hline \vdots \\ \hline \end{array}$$
And so on...
At a given step, each cell refers to the cells computed in the previous steps, and never to the ones of the same. Precisely, filling the cells in step $i$ requires filling cells at step $i-1$ and $i-2$.

This allow us to rewrite the algorithm in a way such that all of the cells in a certain diagonal can be filled concurrently since the data dependancy has been moved among diagonals.

Unfortunately I don't have enough time to show the pseudo code of this algorithm in this post because, even if this new algorithm can be conceptually simple to understand, a good implementation requires the handling of a some special cases that will make the code snippet definitely too long to explain in this blogpost...

Instead, I'm going to push (and talk about) the code on my github account during the next blogpost.

... suspense is raising, isn't it? :)

## Thursday, September 13, 2012

### A GPU Exercise - part 1

It's been several months since the last time I put my hands on something related to GPUs. Sadly, I get very rarely the chance to get this kind of jobs.
So, to keep myself trained (just in case I will get one), during the last two weeks I worked on an "exercise project".

For this exercise project I choosed to port to GPU an algorithm that I studied during my University years. Here you can find some informations:

I'll try to explain in few words and with a little example what is the Levenshtein distance.
Given two strings, the Levenstein distance is the minimum number of character insertion, deletion or substitution, to change a string into the other, ie:

The Levenshtein distance between RISOTTO and PRESTO is 4:

1. Insert P
2. Substitute I with E
3. Delete O
4. Delete T

Finding the shortest list of operations to change a string into another is a problem that can be solved by using the Dynamic Programming technique. The main trouble is that, in many cases, this technique requires an huge amount of memory to store intermediate results and avoid recomputing.

In fact, this algorithm requires a amount of memory proportional to the product of the lengths of the 2 strings: if len(string1) is m, and len(string2) is n, then the memory and operations requirement is O(m*n).

The required memory amount can be reduced in several ways... expecially exploiting some of the upper bounds of the optimal solution (see the wikipedia link at the top of this article).

However, if we compute just the number of operations (without actually knowing which operations are performed), the memory requirement can be reduced up to O(max(m, n)), even though the number of operations remains unchanged at O(m*n).

Let's try to understand how the algorithm works, and hence, why computing the optimal cost, in terms of number of changes, is cheaper (in terms of memory) than computing the entire operations list.

Given two string $X$ and $Y$ such that $X=x_1,x_2,...,x_m$ and $Y=y_1,y_2,...,y_n$ , we can use a matrix to represent the solutions of each subproblem.

$\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & \cdots & y_n \\ \hline \#\\ \hline x_1\\ \hline x_2\\ \hline \vdots\\ \hline x_m\\ \hline \end{array}$

Where cell $i,j$ will contain the cost of the optimal solution for the subproblem of changing string $X_i=\#,x_1,x_2,...,x_i$ into $Y_i=\#,y_1,y_2,...,y_j$ with $i \leq m$ and $j \leq n$. The $\#$ character represents the empty string.
Note that the first row and column of this matrix, are quite easy to fill:

• The first column contains the costs of changing any of the $X$ substrings into the empty $\#$ string. So, substring $X_1$ (formed by the empty string plus the $x_1$ character) will require 1 deletion to change into substring $Y_0$ (the empty string)... $X_2$ will require 2 deletion, $X_3$ is 3, and so on...
• The first row contains the cost of changing the empty string into the substring $Y_j$. So, the cost of changing the empty string into $Y_1$, will be 1 insertion, $Y_2$ is 2, and so on...
• Obviusly, the cell $0,0$ containing the cost of changing $\#$ in $\#$, that is 0.
\begin{array}{|c|c|c|c|c|c|} \hline _{X}\backslash^{Y} & \# & y_1 & y_2 & \cdots & y_n \\ \hline \# & 0 & 1 & 2 & \cdots & n\\ \hline x_1 & 1\\ \hline x_2 & 2\\ \hline \vdots\\ \hline x_m & m\\ \hline \end{array}
Now that we have defined the base cases, we can look at how it's possible to fill the remaining cells, and thus finding the costs of the optimal solutions. Let's take as example the cell $1,1$ corresponding to subproblem of changing $X_1=\#,x_1$ into $Y1=\#,y_1$.

The easiest case is when $x_1=y_1$. In this case we don't need to perform any operation over the previous subproblem of changing $X_0$ into $Y_0$, hence the costs are the same.

If $x_1$ differs from $y_1$, then we can choose the cheapest among the three following options:

• Substitute $x_1$ with $y_1$. The cost will be 1 plus the cost of changing $X_0$ into $Y_0$.
• Insert the character $y_1$ to $Y_0$. The cost will be 1 plus the cost of changing $X_1$ into $Y_0$
• Delete the character $x_1$ from the string $X_1$. the cost will be 1 plus the cost of changing $X_0$ into $Y_1$
In the same way it's possible to compute the values for all the other cells, by looking at the cells associated to the immediately preceding subproblems. In the end we will get the value of subproblem $(X_m, Y_n)$, that is the problem from which we started from.

Let's make some observations:
- The value of cell $(j,i)$ depends only from the comparison between $x_i$ and $y_j$ and value in cells $(j-1,i-1)$, $(j-1,i)$, $(j,i-1)$.

- The operation that was performed to produce the value in cell $(i,j)$ can be deduced only by looking at $x_i$ and $y_j$ and value inside cells $(j-1,i-1)$, $(j-1,i)$, $(j,i-1)$.

This makes easy computing the value of a cell $i,j$ since we have to remember just the values of the immediatly preceeding subproblems.
On the other hand, to retrive all the operations associated to the optimal solution, it's necessary to store (or recompute) the values of all the preceeding cells, since it's not possible to know which cells you will need, until you know the "subproblems path" to the optimal solution.

I hope you enjoied this simple introduction to the algorithm. In the next post I'm going straight into the implementation and the problems (and the solutions) of make run this algorithm on a GPU... attaching a lot of code of course :)

And naturally, if you spot errors, I would be really happy to receive reports ;)

## Saturday, September 1, 2012

### Debug pointer: quick overview

Let's focus on some keypoints of my debug_ptr:

• What a debug_ptr is used for?
A debug_ptr can be used like a standard pointer type to hold the address of a dynamically allocated object:
typedef_pointer(int*, int_p);
int_p p = new int(0);

• How could it be useful?
If you forget to delete a dynamically allocated object before loosing its last reference, you are probably creating a memory leak. A debug_ptr warns you about this fact using a warning policy.
struct ThrowPolicy {
ThrowPolicy() {
throw std::runtime_error("lost reference to undeleted object");
}
};
struct PrintPolicy {
PrintPolicy() {
std::cerr << "WARNING: lost reference "
"to undeleted object" << std::endl;
}
};

• How much does it cost? (in performance terms)
The features of debug_ptr are mostly useful in "debug mode". In a production release, you may not want use the features of a debug_ptr class (although you can), because of the performance losses due to the reference counting.
• How did I address this issue?
You can compile your software by defining ENABLE_DEBUG_TYPES to get the debug_ptr feature turned on. Otherwise all the debug_ptr features are turned off by a macro that will substitute any occurrence of debug_ptr<T> with a T*. This will avoid any overhead at compile and run time.
#ifndef ENABLE_DEBUG_TYPES
#define typedef_pointer(Type, Alias, ...) typedef Type Alias
#define typedef_array(Type, Alias, ...) typedef Type Alias
#else // ENABLE_DEBUG_TYPES
...
#endif

• How debug_ptr syntax differs by the one of a standard pointer type?
debug_ptr it's intended to be used just like any other smart_ptr, and in most of the cases you should not notice any difference in creating, copying, dereferencing, assigning, and deleting...
typedef_pointer(int*, int_p);
int_p p = new int(0);
int_p p_other = new int(0);
*p = 63;
int_p p_other = p;
delete p_other;

• How did you get the same delete syntax of a standard pointer type?
By defining a default cast operator to a pointer-to-deleter-class. The cast operator will be implicitly called and a new deleter object returns to operator delete. So, delete operator, deletes the deleter-class.
operator deleter*() const {
return deleter::new_deleter(pd_);
}


And that's all folks!

If you like this project, or have any other ideas about what could be the next step of the debug_ptr, fork the project on github and have fun :)
git clone git://github.com/cybercase/debug_ptr.git # clone the repo
cd debug_ptr
g++ -Wall -o test debug_ptr_test.cc -DENABLE_DEBUG_TYPES
./test


## Wednesday, August 22, 2012

### Debug pointer

As I mentioned in the last post, lately I'm working on a C++/Boost software (aka speeding up and refactoring an unmaintainable software). This software is just a kind of parser. It parses several binary files and, for each one, produces a xml file. Also it needs to merge and summarise some of the fields contained in each one of these files to produce a "Final Report" xml file.

There may be several way to write a software for a task like this, and the customer choosed to use threads. A big mistake, because (IMHO) Who wrote the software did not have any idea of how use threads efficently, and he ended up using a lot of locks and semaphores everywhere... but this is another story.

The most impressive thing that I saw in this software was the inconsiderate abuse of Boost shared pointers. These were used almost everywhere in the software in every nonsense way they could be used... passing them around by copy or just like the following:

 for (int i=0; i<SOMEINT; ++i) {
// Why not: SOMECLASS sp; ???
boost::shared_ptr<SOMECLASS> sp(new SOMECLASS);
sp->SomeMethod();
}


Just by fixing these kind of misuses, I've got a speedup of over 2x.

Shared pointers can be really useful but:
• Should be used with care because they can become really expensive
• Should not be used just to avoid thinking about memory management in your software
And the last point is the one that I felt was the case of this software. For this reason, and for helping myself to refactor and remove shared pointers from this software, I wrote my own smart pointer: debug_ptr.

debug_ptr is a smart pointer that should help the developer to manage the memory correctly (a delete for every new). A debug_ptr just warn the developer when he forgets to delete an dynamically allocated object, so that he can insert the correct delete (or delete[]) statement.
Also, debug_ptr is designed in a way such that, once you have tested your software and no memory leaks warning are raised anymore, it can be automatically substituted with a native pointer type, so that no overhead is added both at run and compile time.

For now you can check the debug_ptr source code and have a look at the few test cases that show how it can
be used. It's located in my debug_ptr repository on github:
git clone git://github.com/cybercase/debug_ptr.git # clone the repo
cd debug_ptr
g++ -Wall -o test debug_ptr_test.cc -DENABLE_DEBUG_TYPES
./test

In the next post, I'll try to focus on some interesting parts of this very simple smart pointer.

## Saturday, July 28, 2012

### Notime

No matter how hard I try, I still can't find enough time to keep up to date this little blog. Even simple technical writing takes a lot of time (especially in a foreign language).

My last blogpost was a sort of introduction to my first experience with Javascript. Until now, because of a new job based on an entirely different technology (C++ / boost), I still have not found the time to do a little summary of this experience.

So, for now, everything I can show you is in the CYBERBUBBLE widget at the bottom right side of this page.

Have fun!

Ps.
It might not work on your browser, especially if using Internet Explorer ;)

## Thursday, June 7, 2012

### A Javascript experiment - part one

It's been a couple of months since my last writing and, despite the lack of posts, something has happened.

I have jumped into the javascript world!

I would never have thought about javascript as my next language to learn. Actually, I always though about javascript as a sort of lame-language... don't know really why... Maybe because of all the times that I've been tricked in some "funny" web page that started to endlessly popping up annoying message boxes.

However, when one of my most respectable colleague gave me some hints about this language, I decided to give javascript a chance.
The most attractive thing about javascript is obviously that, together with HTML5, it can run on every machine supporting at least one of the many available modern web browsers... And I'm not just talking about personal computers with different operating systems, but also about mobile and tablet devices.

"No deploy, no multi platform issues, write once run everywhere... really everywhere!"

This is what I was dreaming just before starting my adventure :)

Now, obviously, my dream has been resized by the reality that such a language is an Utopia. In the next blogpost (I think it will not take too long) I'll tell you of my first HTML5-javascript experience.

In the meantime I can suggest the Javascript: The Good Parts book to anyone looking for a enough in deep overview about the javascript "good features".

## Saturday, March 10, 2012

### The concept of minimum

It's been a while from my last post on this blog... Too much working commitments and so few free time :-/

However lately, I'm reading a lot! It's much easier for me to find out the time to read an article, or source code from some project, than writing my own code... Writing good code takes actually not little time, so I thought that I can share some of my readings with the few readers of this blog. Better than nothing, isn't it?

So, let's start with this old article from Dr.Doobs http://drdobbs.com/cpp/184403774 I just finished to read.
I would never thought that a so simple question about implementing the min and max C macros using C++ templates would require such a long answer (and code).

Even though lately I'm starting to think not-so-good about C++ (but this is not the post to talk about this), I find this kind of problems so mind teasing.

And also... look at the author of the article! Yes, it's him :)

## Sunday, January 15, 2012

### Sorting priorities

Let's say you have your 3d colored point type:

struct MyPoint
{
// A 3d colored point
MyPoint(int c, int x, int y, int d) :
color(c), x(x), y(y), d(d) {}
int color;
int x;
int y;
int d;
};


You define a vector of MyPoint objects carefully ordered by color, depth, height and width.

int width = 100;
int height = 100;
int depth = 5;
int colors = 2;

int len = width * height * colors * depth;
std::vector<MyPoint> pts;
pts.reserve(len);

for (int c=0; c<colors; ++c)
for (int d=0; d<depth; ++d)
for (int y=0; y<height; ++y)
for (int x=0; x<width; ++x)
pts.push_back(MyPoint(c, x, y, d));


Then one of your funny colleagues put a

std::random_shuffle(pts.begin(), pts.end());


just the line before you are scanning the ordered array of MyPoint... If it's not a colleague, could be that you need to apply a transformation matrix to all of your points, or whatever else that will break the color, depth, height and width ordering of your set.

To order back your vector by the same priorities, you can use the STL sort algorithm with some custom comparison function.
The the first function I wrote was this:

bool mypoint_all_sort_fn(const MyPoint& p0, const MyPoint& p1)
{
if (p0.color < p1.color)
return true;
else if (p0.color == p1.color)
if (p0.d < p1.d)
return true;
else if (p0.d == p1.d)
if (p0.y < p1.y)
return true;
else if (p0.y == p1.y)
if (p0.x < p1.x)
return true;
return false;
}


and then

std::sort(pts.begin(), pts.end(), mypoint_all_sort_fn);

I found this solution the ugliest among the others. It's not scalable, has too many indentations and strong code dependancy... And also if mypoint_all_sort_fn was defined as operator inside MyStruct declaration I wouldn't like to have such a function inside my software.

The second solution I came to was using several stable_sort

bool mypoint_x_sort_fn(const MyPoint& p0, const MyPoint& p1)
{
return p0.x < p1.x;
}
bool mypoint_y_sort_fn(const MyPoint& p0, const MyPoint& p1)
{
return p0.y < p1.y;
}
bool mypoint_d_sort_fn(const MyPoint& p0, const MyPoint& p1)
{
return p0.d < p1.d;
}
bool mypoint_c_sort_fn(const MyPoint& p0, const MyPoint& p1)
{
return p0.color < p1.color;
}


and then

std::stable_sort(pts.begin(), pts.end(), mypoint_x_sort_fn);
std::stable_sort(pts.begin(), pts.end(), mypoint_y_sort_fn);
std::stable_sort(pts.begin(), pts.end(), mypoint_d_sort_fn);
std::stable_sort(pts.begin(), pts.end(), mypoint_c_sort_fn);


I found this solution not too bad by the code style point of view, but it's not the same about performances. More are the values you want to sort by, more are the stable_sort steps you have to do.

The last solution, which I found the most fascinating one, was suggested to me by two of my collegues, and it took me a just a small bunch of time to write the code down.

template <typename T> int element(const T&, int i);

template <typename T, int I> struct Orderer
{
static bool compare(const T& t0, const T& t1)
{
if (element(t0, I) == element(t1, I))
return Orderer<T, I-1>::compare(t0, t1);
else
return element(t0, I) < element(t1, I);
}
};
template <typename T> struct Orderer<T, -1>
{
static bool compare(const T& t0, const T& t1)
{
return false;
}
};


Using templates it's possible to make the compiler generate the ugly if-else code. Then the task of providing the right element for the sorting is demanded to an element(int) function. Also, it's possible substitute the element(int) function with an element(int) member function (or maybe an operator) in MyPoint definition. This is how it work:

template <> int element<MyPoint>(const MyPoint& p, int i)
{
// Watch out!
// The higher is the i-value, the most significant is the member
switch(i)
{
case 3:
return p.color;
case 2:
return p.d;
case 1:
return p.y;
case 0:
return p.x;
default:
throw std::invalid_argument("Undefined element");
}
}
bool mypoint_sort(const MyPoint& t0, const MyPoint& t1)
{
return Orderer<MyPoint, 3>::compare(t0, t1);
}


and then

std::sort(pts.begin(), pts.end(), mypoint_sort);


That's it! A more readable switch-case statement provides the priority of sorting.... and the mypoint_sort function call the template right specialization of Orderer Struct static compare function.

Here how you can find, compile and run the code for this example:

git clone git://github.com/cybercase/funproject.git
cd funproject/other
g++ order.cpp -o order -Wall


One interesting thing I discovered while writing the code of this example, is that you can't use template function specialization. I found a discussion about this topic at http://www.gotw.ca/publications/mill17.htm. The Peter Dimov and Dave Abrahams example shows why template specialization can't be done with functions.