Tuesday, March 31, 2009

Linear Algebra and Matrices in C++ and Boost

Quite frequently you have to deal with matrices, eigenvectors and linear solvers. Boost offers (at least) two different and complementary approaches:
  1. uBLAS, a template interface over the traditional set of BLAS libraries. This is very useful for people who are used to Fortran BLAS and who want to use it from C++. uBLAS provides BLAS level 1, 2, 3 functionality for dense, packed and sparse matrices. Linear algebra is supported by interfacing directly with the underlying BLAS libraries. Here is an overview of Matrix and Vector Operations. Other bindings to different libraries are also available.
  • Pros: vector and matrix types are defined by using templates; BLAS is very efficient;
  • Cons: personally, I find the BLAS notation a little difficult to use.
  2. MTL4, another template interface, which can use either BLAS or its internal pure C++ numerical algorithms.
  • Pros: I found MTL quite intuitive: matrices have many useful type definitions, and the matrix operations are rather complete. Performance is quite good, although the distribution site does not report extensive testing;
  • Cons: MTL is not stable and is not part of the standard Boost distribution.
I would love to hear what your favourite solution is for Linear Algebra and Matrices in C++.
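For reference, here is a minimal uBLAS sketch (a BLAS level 2 matrix-vector product); the sizes and values are just for illustration:

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <boost/numeric/ublas/io.hpp>
#include <iostream>

int main() {
    using namespace boost::numeric::ublas;
    matrix<double> A(2, 2);
    A(0, 0) = 1; A(0, 1) = 2;
    A(1, 0) = 3; A(1, 1) = 4;
    vector<double> x(2);
    x(0) = 1; x(1) = -1;
    vector<double> y = prod(A, x);   // y = A * x, BLAS level 2
    std::cout << y << std::endl;     // prints [2](-1,-1)
    return 0;
}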

Monday, March 30, 2009

Decision Trees (part I)

Let's talk about our beloved Decision Trees (DTs). I have a very good friend who always suggests using a decision tree in every data mining task. After all, we use DTs every day in our real life.

What is a decision tree? Nothing more than a long sequence of "if .. then .. else .. if .. then .. else ..." ;-) This definition does not do justice to the power of DTs. The sequence of best choices is not pre-defined: it is the result of a machine learning process on the training dataset. In DTs, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. The whole binary structure of the DT is learnt automatically.

The training dataset is seen as a matrix, where each row is a record. For each feature, and for each candidate split value, the dataset is split into two parts A and B. An auxiliary score function (such as entropy or Gini impurity) is used to compute the information gain of the split (A, B). The feature obtaining the best information gain is used as the split feature [e.g. if (... feature_chosen ...) then ... else ...]. A node N is created containing the splitting feature. The left child and right child of N are built recursively by applying the same learning process to the subsets A and B, respectively (the scoring step is sketched below).
DTs can work with both numerical and categorical (e.g. strings) features.
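As a sketch of the scoring step described above, assuming binary labels and base-2 entropy (the function names are mine):

#include <cmath>
#include <vector>

// entropy of a set of binary labels (0/1)
double entropy(const std::vector<int>& labels) {
    if (labels.empty()) return 0.0;
    double p = 0.0;
    for (size_t i = 0; i < labels.size(); ++i) p += labels[i];
    p /= labels.size();
    if (p == 0.0 || p == 1.0) return 0.0;
    return -p * std::log(p) / std::log(2.0)
           - (1.0 - p) * std::log(1.0 - p) / std::log(2.0);
}

// information gain of splitting "all" into the two parts A and B
double information_gain(const std::vector<int>& all,
                        const std::vector<int>& A,
                        const std::vector<int>& B) {
    double wA = double(A.size()) / all.size();
    double wB = double(B.size()) / all.size();
    return entropy(all) - wA * entropy(A) - wB * entropy(B);
}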

Here you have the C++ code for building the DT (note that features are represented with boost::variant).

In a future posting, we will discuss pruning DTs and machine learning classification using DTs.

Sunday, March 29, 2009

Linear Suffix Array

One article out of three thousand is beautiful. "Simple Linear Work Suffix Array Construction" (Juha Karkkainen and Peter Sanders, 2003) presents a simple and elegant algorithm for building a suffix array in linear time. When I saw the article for the very first time, back in 2003, I thought: "This is a recursive diamond".

Suffix Arrays (SA) are a powerful tool used in many different fields such as genetics, string matching, compression, and many others. The suffix array SA(T) of a text T is just a sorted array of the suffixes of T. To avoid quadratic space we store only the indices of the suffixes, instead of the full suffixes of T. Given SA(T), one can search for occurrences of a pattern P directly on the suffix array using binary search in O(|P| lg |T|) time. This can be improved to O(|P| + lg |T|).
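As an illustration, here is a minimal sketch of that binary search over a pre-built suffix array; to keep it self-contained, the construction below is the naive O(N^2 log N) one, while the article's point is precisely to do better:

#include <algorithm>
#include <string>
#include <vector>

// naive suffix array: sort suffix indices by comparing full suffixes
std::vector<int> build_naive_sa(const std::string& T) {
    std::vector<int> sa(T.size());
    for (size_t i = 0; i < T.size(); ++i) sa[i] = (int)i;
    std::sort(sa.begin(), sa.end(), [&T](int a, int b) {
        return T.compare(a, std::string::npos, T, b, std::string::npos) < 0;
    });
    return sa;
}

// O(|P| lg |T|): binary search for the first suffix whose prefix is >= P
bool occurs(const std::string& T, const std::vector<int>& sa, const std::string& P) {
    std::vector<int>::const_iterator it =
        std::lower_bound(sa.begin(), sa.end(), P,
            [&T](int i, const std::string& p) { return T.compare(i, p.size(), p) < 0; });
    return it != sa.end() && T.compare(*it, P.size(), P) == 0;
}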

SA were originally developed by Gene Myers (ex-Celera) and Udi Manber (now at Google). For more than a decade no direct linear-time construction was known: SA were built in linear time only indirectly, by first building an intermediate suffix tree (which is expensive from the point of view of memory). The SA is equivalent to an in-order traversal of the leaves of the suffix tree.

Here is a succinct description of the algorithm:
  1. Sort Σ, the alphabet of T, in linear time by using radix sort;
  2. Replace each letter in the text with its rank among the letters in the text;
  3. Divide the text T into 3 parts:
  • T0 = < (T[3i], T[3i + 1], T[3i + 2]) : i >= 0 >
  • T1 = < (T[3i + 1], T[3i + 2], T[3i + 3]) : i >= 0 >
  • T2 = < (T[3i + 2], T[3i + 3], T[3i + 4]) : i >= 0 >
  4. Recurse on the concatenation of T0 and T1. When this recursive call returns, we get the suffixes of T0 and T1 sorted in a suffix array. After that, all we need is to sort the suffixes of T2, and merge them with the others to get the suffixes of T. This is the point where the algorithm becomes a diamond: the construction of the suffixes of T2 and the linear merge;
  5. Note that by construction T2[i:] = T[3i+2:] = (T[3i+2], T[3i+3:]) = (T[3i+2], T0[i+1:]). Therefore we can obtain the sorted suffixes of T2 from the already-sorted suffixes of T0 in linear time, by radix-sorting the pairs (T[3i+2], rank of T0[i+1:]);
  6. Merge the sorted suffixes of T0, T1, and T2. Remember that the suffixes of T0 and T1 have been sorted together, so comparing a suffix from T0 with a suffix from T1 takes constant time. To compare against a suffix from T2, we decompose it again to get a suffix from either T0 or T1;
Since this sorting and merging is done in linear time, we get the recursion formula T(n) = T(2n/3) + O(n), which gives linear time. You can prove it by substitution -- intuitively, note that 2/3 < 1, so the geometric series converges. Have a look at the C++ code for suffix array construction contained in the article: it is neat, simple and instructive.

Saturday, March 28, 2009

N cooperative girls wearing a Hat

This is a classical and instructive puzzle. In a room there are N girls, each wearing either a Red or a Black hat. Each color is decided by tossing a coin.

Each girl cannot see her own hat; she can only see the hats worn by the other girls. The girls cannot communicate with each other. The team wins if at least one of them guesses correctly, and none of them guesses incorrectly. What is the team's best strategy? (In other words, who should talk and who should stay silent?)

Friday, March 27, 2009

A comment on this blog

manthrax@reddit "Man I like this dudes style. Short and sweet explanations/references of the algorithms and then an implementation in gods own C++. Respek! And thanks for posting!"

Genetic Programming

Computer Scientists like to borrow ideas from other disciplines! I already reviewed Simulated Annealing, which was inspired by Physics and the movement of atoms. Now a quick discussion about Genetic Programming (and Evolutionary Computing), which was inspired by Biological evolution and Genetics.

Genetic Programming is another meta-heuristic for Optimization. The key idea is to find a (locally) optimal solution by starting with a feasible population (a collection of non-optimal solutions). The initial population evolves over several iterations. Two key steps are used to evolve: a solution can be mutated, or two solutions can be crossed over. The best solution is selected among the evolutions of the initial population. A toy evolutionary loop is sketched below.
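Here is a toy sketch of such an evolutionary loop on bitstring genomes with the classical OneMax fitness (count the bits set to 1); genome length, population size, selection scheme and rates are all simplifying assumptions of mine:

#include <algorithm>
#include <bitset>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

const size_t LEN = 32;
typedef std::bitset<LEN> Genome;

double fitness(const Genome& g) { return (double)g.count(); }  // OneMax

Genome mutate(Genome g) {                             // flip one random bit
    g.flip(std::rand() % LEN);
    return g;
}

Genome crossover(const Genome& a, const Genome& b) {  // one-point crossover
    size_t cut = std::rand() % LEN;
    Genome child;
    for (size_t i = 0; i < LEN; ++i) child[i] = (i < cut) ? a[i] : b[i];
    return child;
}

int main() {
    std::srand((unsigned)std::time(0));
    std::vector<Genome> pop(20);                      // random initial population
    for (size_t i = 0; i < pop.size(); ++i)
        for (size_t j = 0; j < LEN; ++j) pop[i][j] = std::rand() % 2;

    for (int gen = 0; gen < 100; ++gen) {
        // selection: keep the fittest half, rebuild the rest by evolution
        std::sort(pop.begin(), pop.end(),
                  [](const Genome& a, const Genome& b) { return fitness(a) > fitness(b); });
        for (size_t i = pop.size() / 2; i < pop.size(); ++i) {
            const Genome& p1 = pop[std::rand() % (pop.size() / 2)];
            const Genome& p2 = pop[std::rand() % (pop.size() / 2)];
            pop[i] = mutate(crossover(p1, p2));
        }
    }
    std::cout << "best fitness: " << fitness(pop[0]) << std::endl;
    return 0;
}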

If you want to know more about Genetic Programming and Evolutionary Computing, there are thousands of books and papers. Personally, I like the unified approach presented in Evolutionary Computation by Kenneth A. De Jong, 2002. For those interested in Web search, let me point out that Genetic Programming has been successfully used for Learning to Rank (see "Optimizing evaluation measures in learning to rank" and the references within).

Here you have the code for Genetic Programming in C++

Thursday, March 26, 2009

Good guys and bad guys

You are given a set G of n good guys and a set B of m bad guys. Whenever a good guy meets a good guy, happiness increases and nothing changes. Whenever a bad guy meets a good guy, with probability p the bad guy kills the good guy (while with probability 1-p, the good guy survives). If two bad guys meet each other, neither of them survives.

Question is: who will survive, and with what probability?

Wednesday, March 25, 2009

Simulated Annealing

Simulated Annealing (SA) is a smart (meta)-heuristic for Optimization. Given a cost function in a large search space, SA replaces the current solution with a random "nearby" solution. The nearby solution is accepted with a probability that depends on the difference between the corresponding function values and on a global parameter T (a.k.a. the temperature). T is gradually decreased during the process. The current solution changes almost randomly when T is large, but increasingly "downhill" as T goes to zero. The allowance for "uphill" moves saves the method from becoming stuck at local minima. A minimal sketch is given below.
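A minimal sketch of the loop, assuming a toy 1-d cost function and a geometric cooling schedule of my choosing:

#include <cmath>
#include <cstdlib>
#include <ctime>
#include <iostream>

double cost(double x) { return (x - 3.0) * (x - 3.0) + 2.0; }  // toy cost, minimum at x = 3

double uniform01() { return std::rand() / (RAND_MAX + 1.0); }

int main() {
    std::srand((unsigned)std::time(0));
    double x = 0.0;                                  // current solution
    for (double T = 10.0; T > 1e-4; T *= 0.99) {     // geometric cooling
        double y = x + (uniform01() - 0.5);          // random "nearby" solution
        double delta = cost(y) - cost(x);
        // downhill moves are always accepted; uphill with probability exp(-delta/T)
        if (delta < 0 || uniform01() < std::exp(-delta / T))
            x = y;
    }
    std::cout << "minimum found near x = " << x << std::endl;
    return 0;
}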

This approach has some similarity with Physics, where heat causes the atoms to become unstuck from their initial positions and to wander randomly through states of higher energy; the slow cooling gives them more chances of finding configurations with lower internal energy than the initial one.

Simulated Annealing is also used for many different data mining tasks such as Classification, Clustering, and Web search Ranking.

Here you have the code for Simulated Annealing in C++

Tuesday, March 24, 2009

Book review: Introduction to Video Search Engines

Introduction to Video Search Engines by David C. Gibbon and Zhu Liu (2008) is a nice book about video search engines, written by AT&T people.

I recommend the book if you want an introduction to the Video search world. The book discusses important topics such as video standards, internet video, video classification by sources and by topics, the anatomy of a video search engine, and many others. Particularly interesting are the four chapters about Media Processing, Video Processing, Audio Processing and Text Processing.

In Video Processing, I appreciated the discussions about shot boundary determination and representative image selection, with evaluation based on the TrecVid dataset. Other interesting topics are Face detection, which can be used for ranking, and Face recognition.

In Audio Processing, I appreciated the discussions about Speaker segmentation and Speaker recognition.

Text Processing is a more standard chapter, with discussions about Named Entity recognition and Part of Speech tagging.

I would have appreciated a more detailed discussion of Video Search Ranking (something the SEO community would also have welcomed), and a deeper treatment of the anatomy of a video search engine.

Anyway, I definitely recommend reading this book if you are interested in Video Search.

Leaders' competition

Leaders have a hard time these days: expectations are high and hope is great. This is a little puzzle with a lesson inside: a cooperative strategy is important.

You have a list of N Leaders, ordered from the most important to the least important. Each Leader comes with a proposal about how to partition the money among the different projects proposed by the other leaders. Here are the rules of the game:
  1. Leaders make proposals in order of their importance;
  2. The proposal of Leader i is voted on by Leaders i, i + 1, ..., N;
  3. A proposal is accepted if it gets the majority of the voting Leaders;
  4. If a proposal is rejected, then the Leader i who proposed it is no longer a Leader, and the proposal of Leader i+1 is voted on next;
Suppose you are Leader 0 (i.e. the most important one): what is your cooperative strategy for getting the other Leaders to vote for your proposal? And if you are the i-th Leader, can you have your own plan approved?

Monday, March 23, 2009

Sparse Sets with O(1) insert, delete operations and (sub)linear union, intersection

The paper "An efficient representation for sparse sets" (1993) by Preston Briggs and Linda Torczon introduces an efficient data structure for representing a sparse set over a static universe (with u elements inserted out of a maximum of n). Insert and delete operations have constant time complexity, while union and intersection are (sub-)linear in n. The space complexity is linear in n. The table on the right compares the complexity of this data structure with a traditional bit vector.

The key intuition is to maintain two vectors such that dense[sparse[k]] = k, plus a variable holding the current number of elements in the set. sparse[k] records how many elements were in the set when element k was inserted. In this way, elements inserted consecutively are mapped into contiguous positions of the dense vector. As a consequence, insert and delete operations are constant time, while intersection and union operations are sub-linear (a minimal sketch follows).
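Here is a minimal sketch of the structure over the universe [0, n), following the membership test dense[sparse[k]] == k described above:

#include <vector>

class SparseSet {
    std::vector<unsigned> dense, sparse;
    unsigned count;                       // number of elements currently in the set
public:
    explicit SparseSet(unsigned n) : dense(n), sparse(n), count(0) {}

    bool contains(unsigned k) const {     // O(1)
        return sparse[k] < count && dense[sparse[k]] == k;
    }
    void insert(unsigned k) {             // O(1)
        if (contains(k)) return;
        dense[count] = k;
        sparse[k] = count++;
    }
    void erase(unsigned k) {              // O(1): move the last element over k
        if (!contains(k)) return;
        unsigned last = dense[--count];
        dense[sparse[k]] = last;
        sparse[last] = sparse[k];
    }
    void clear() { count = 0; }           // O(1), no re-initialization needed
};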

This is the code for Sparse Sets representation in C++.

Sunday, March 22, 2009

Introsort: mixing the two worlds of quicksort and heapsort

Introsort is a simple and effective algorithm which guarantees a sort as fast as Quicksort, but with a Θ(N lg N) worst case. It's well known that Quicksort has O(N lg N) average complexity and O(N^2) worst case; this is true even for the randomized variant of Quicksort, and for the median-of-3 pivoting strategy. Anyway, Quicksort is 2-5 times faster than Heapsort, which has a proven Θ(N lg N) worst case.

In 1997, David Musser proposed to start with Quicksort and switch to Heapsort when the recursion depth exceeds a threshold proportional to log N. This simple variant makes the Introsort algorithm Θ(N lg N) in the worst case. Introsort is currently adopted as the standard unstable sort by the Standard Template Library in stl_algo.h.
Musser, David (1997). "Introspective Sorting and Selection Algorithms". Software: Practice and Experience (Wiley) 27 (8): 983-993.

Saturday, March 21, 2009

An array of integers sorted and shuffled

You are given a sorted array A of N integers. For each index i, a random index j is extracted and A[i], A[j] are swapped. When this shuffling is complete:
  • What is the probability that there exists a k such that A[k] = k?
  • What is the probability that A is still sorted?

Friday, March 20, 2009

The beauty of linear k-order selection

One out of three thousand papers is a beautiful paper. I believe that "Time bounds for selection" (M. Blum, R.W. Floyd, V. Pratt, R. Rivest and R. Tarjan, J. Comput. System Sci. 7 (1973) 448-461) is an old and beautiful pearl.

It's well known that K-selection is a problem that can be solved in O(N) expected time by using an approach similar to the one of QuickSort. Anyway, the worst case is quadratic, even when you adopt a randomization step (here you have the C++ code for Quickselect).

The above paper introduces a beautiful variant of classical selection. The key idea is to partition the input data array A into groups of five elements. Then:
  1. Within each group G[i], find median(G[i]) in constant time by direct comparison of the 5 elements; over all groups, this step is at most linear;
  2. Take the median of each group and find the median of medians M = median(median(G[0]), ..., median(G[ceil(n/5)-1])) by a recursive call; this step takes T(n/5);
  3. Use M as the pivot to partition the input, recursing quickselect-style on the part containing the k-th element; this step takes at most T(7n/10) -- see below;
At each step of the recursive call, Quickselect drops either the elements greater than M (call them GTM) or the ones less than M (call them LTM). Let's assume we keep the elements less than M. How many elements can survive the step?
  • Among the n/5 medians, n/10 are larger than M, since M is their median;
  • For each of those medians, at least two other elements in its group are at least as large as the median, and therefore larger than M, by construction;
  • Therefore, GTM contains at least 3n/10 elements;
  • By symmetry, LTM also contains at least 3n/10 elements, so the part we keep has at most 7n/10 elements;
As a consequence the recursive call to QuickSelect takes at most T(7n/10);

The total cost is therefore T(n) <= O(n) + T(n/5) + T(7n/10)

This recurrence is linear (it can be proven by the substitution method). The intuition is that 1/5 + 7/10 = 9/10 < 1, and therefore the geometric series converges. A compact sketch of the whole procedure follows.
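A compact (and deliberately wasteful, copy-heavy) sketch of the procedure on integers, with the two T() terms of the recurrence made explicit:

#include <algorithm>
#include <vector>

int select_kth(std::vector<int> v, size_t k);       // k-th smallest, 0-based

static int median_of_medians(const std::vector<int>& v) {
    std::vector<int> medians;
    for (size_t i = 0; i < v.size(); i += 5) {
        size_t len = std::min<size_t>(5, v.size() - i);
        std::vector<int> g(v.begin() + i, v.begin() + i + len);
        std::sort(g.begin(), g.end());              // constant work on <= 5 items
        medians.push_back(g[len / 2]);
    }
    return select_kth(medians, medians.size() / 2); // the T(n/5) recursive call
}

int select_kth(std::vector<int> v, size_t k) {
    if (v.size() <= 5) {
        std::sort(v.begin(), v.end());
        return v[k];
    }
    int M = median_of_medians(v);
    std::vector<int> less, equal, greater;          // partition around M
    for (size_t i = 0; i < v.size(); ++i) {
        if (v[i] < M) less.push_back(v[i]);
        else if (v[i] > M) greater.push_back(v[i]);
        else equal.push_back(v[i]);
    }
    if (k < less.size()) return select_kth(less, k);              // at most 7n/10
    if (k < less.size() + equal.size()) return M;
    return select_kth(greater, k - less.size() - equal.size());   // at most 7n/10
}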

Thursday, March 19, 2009

An array of n bits, again

I love simple problems on the simplest data structure: the infamous array of N bits. The linear scan of the array is O(N), and everyone knows this. Here I am asking: what is the probability of finding the first bit set to 1 in the K-th position, with all the preceding bits set to 0?

Now, suppose that two sources are sending a stream of packets on the same wire and that they are able to detect collisions. After K failures, they suspend their transmission for a random number of milliseconds in [0, 2^K - 1]. What is the probability of having another failure?

Wednesday, March 18, 2009

Adaboost : improve your weak performance

Adaboost is one of my favorite Machine Learning algorithms. The idea is quite intriguing: you start from a set of weak classifiers and learn how to linearly combine them so that the error is reduced. The result is a strong classifier built by boosting the weak classifiers.

Mathematically, given a set of labels y_i in {+1, -1} and a training dataset x_i, Adaboost minimizes the exponential loss function sum_i exp(-y_i * f(x_i)). The function f = sum_t (alpha_t * h_t) is a linear combination of the h_t classifiers with weights alpha_t. For those who love Optimization Theory, Adaboost is a classical application of Gradient Descent.
The algorithm is quite simple: it was included among the top 10 data mining algorithms in 2007, and its authors received the Gödel prize in 2003. After Adaboost, Boosting became quite popular in the data mining community, with applications in Ranking and Clustering. A toy sketch is given below.
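A toy sketch on 1-d data with threshold "stumps" as the weak classifiers; the dataset, the number of rounds and the stump family are illustrative assumptions of mine:

#include <cmath>
#include <cstdio>
#include <vector>

// weak classifier: h(x) = polarity * sign(x - theta)
struct Stump { double theta; int polarity; double alpha; };

int stump_predict(const Stump& s, double x) {
    return (x - s.theta > 0 ? 1 : -1) * s.polarity;
}

int main() {
    double x[] = { 0.5, 1.2, 2.3, 3.1, 4.7, 5.5 };   // toy training set
    int    y[] = {  -1,  -1,   1,   1,  -1,   1 };
    const int n = 6, rounds = 10;
    std::vector<double> w(n, 1.0 / n);               // uniform initial weights
    std::vector<Stump> H;

    for (int t = 0; t < rounds; ++t) {
        Stump best = { 0.0, 1, 0.0 }; double best_err = 1.0;
        for (int i = 0; i < n; ++i)                  // thresholds at the data points
            for (int pol = -1; pol <= 1; pol += 2) {
                Stump s = { x[i], pol, 0.0 };
                double err = 0.0;
                for (int j = 0; j < n; ++j)
                    if (stump_predict(s, x[j]) != y[j]) err += w[j];
                if (err < best_err) { best_err = err; best = s; }
            }
        if (best_err >= 0.5) break;                  // no weak learner beats chance
        best.alpha = 0.5 * std::log((1.0 - best_err) / (best_err + 1e-10));
        H.push_back(best);
        double z = 0.0;                              // re-weight: boost the mistakes
        for (int j = 0; j < n; ++j) {
            w[j] *= std::exp(-best.alpha * y[j] * stump_predict(best, x[j]));
            z += w[j];
        }
        for (int j = 0; j < n; ++j) w[j] /= z;       // normalize
    }
    double score = 0.0;                              // f(x) = sum_t alpha_t h_t(x)
    for (size_t t = 0; t < H.size(); ++t) score += H[t].alpha * stump_predict(H[t], 2.0);
    std::printf("f(2.0) classifies as %+d\n", score > 0 ? 1 : -1);
    return 0;
}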

Here you have the code for AdaBoosting in C++ and Boost.

Beautiful Algorithmic Books

A beautiful course on-line from MIT

Tuesday, March 17, 2009

Fast String Hashing

Gio introduced me to the magic of FNV Hash, a special kind of hash function well suited for hashing strings. The operations used for FNV hashing are bit-shifts and xors, so it is pretty fast and produces few collisions. No code this time, since the above web page has it already.

A comprehensive list of hash functions is available here.

Update: Gio introduced me to the ultra-fast HSIEH string hash function, which outperforms FNV hash. Again, the code is in the linked page.

Serializing a Generic Tree

Boost Serialization offers a flexible tool for generic serialization. What if you could serialize a whole tree T with something as simple as
oa << root 
and have the library serialize each node in T, regardless of the tree shape and of the type of each generic node? This code explains how to serialize a generic tree. (Thanks Gio for fixing that nasty bug!)
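For reference, here is a bare-bones boost::serialization round trip on a single node type; the polymorphic generic tree of the post needs extra machinery (e.g. registering the derived classes):

#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <boost/serialization/string.hpp>
#include <fstream>
#include <string>

struct Node {
    std::string label;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & label;                        // list here every member to (de)serialize
    }
};

int main() {
    Node n; n.label = "root";
    const Node& cn = n;
    {
        std::ofstream ofs("node.txt");
        boost::archive::text_oarchive oa(ofs);
        oa << cn;                          // same << syntax used in the post
    }
    Node m;
    std::ifstream ifs("node.txt");
    boost::archive::text_iarchive ia(ifs);
    ia >> m;                               // m.label == "root"
    return 0;
}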

Monday, March 16, 2009

Segment covering

Given several line segments with coordinates [Li, Ri], choose the minimal number of them such that they completely cover the segment [0, M].

Generic Tree

Boost::ptr_container is the standard solution for containers of pointers to heap-allocated objects, managed in an exception-safe manner and with minimal overhead.
The applications are very interesting. For instance, it is easy to write Generic Tree code and use it as in:

Inner<std::string> root("root");                            // an inner node holding a string
root.add_child(new Leaf<std::string>("test son of root"));  // ownership passes to the container
Inner<int> * n1 = new Inner<int>(1);                        // an inner node holding an int
root.add_child(n1);                                         // children of different types
n1->add_child(new Leaf<std::string>("test son of integer"));
depth_first(root);                                          // generic depth-first visit


I'm working on a version of the Generic Tree which supports serialization, but this is hard to achieve, according to this post.

Sunday, March 15, 2009

Pokemon

There are M^3 Pokemons characterized by three different skills ("attack", "defense", "beauty"), each ranging from 1 to M. A subset containing N of the Pokemon are the bad pets. We want to select one of the other Pokemon to be a Hero. The Hero must beat each of the bad pets in at least one skill (not necessarily the same skill for all the bad pets).

The task is to compute the number of Pokemon that can be selected to be the Hero.

Saturday, March 14, 2009

Different sea shells

You are given 12 sea shells with the same shape and colour. 11 of them have the same weight, while 1 has a different weight. Can you identify the different one by using a balance scale just three times?

Friday, March 13, 2009

Memory Mapped files in Boost and C++

Memory-mapped I/O is a crucial methodology for increasing I/O performance. The standard I/O approach is costly due to system call overhead and memory copying. Anyway, the memory-mapped approach has its own cost in page faults, which should be carefully handled.

Before Boost there was no portable API for memory-mapped I/O. POSIX-compliant systems, such as UNIX, Linux, Mac OS X or OpenVMS, support the POSIX function mmap(), while Microsoft Windows supports CreateFileMapping().

Boost C++ provides a portable implementation of memory-mapped files for Microsoft Windows and POSIX-compliant platforms.
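A minimal read-only example with boost::iostreams (the file name and the scan are placeholders of mine):

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <iostream>

int main() {
    boost::iostreams::mapped_file_source file("data.bin");
    if (!file.is_open()) return 1;
    const char* begin = file.data();          // the mapped buffer
    std::size_t size = file.size();
    std::size_t ones = 0;
    for (std::size_t i = 0; i < size; ++i)    // pages are faulted in lazily
        if (begin[i] == '1') ++ones;
    std::cout << ones << " bytes equal to '1'" << std::endl;
    return 0;
}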

Here you have the code for portable memory mapped file reading in Boost.

Fibonacci, a mathematician from Pisa

Leonardo Pisano (Leonardo Fibonacci, or simply Fibonacci, from "filius Bonacci", the son of Bonacci) was born in Pisa, Italy, in about 1170 (Pisa is the place where I live). He is well known for his Fibonacci Numbers.

Can you compute the Fibonacci numbers in optimal time?

Thursday, March 12, 2009

Beautiful C++ books

C++
C++ STL and Boost
C++ Templates and Generic Metaprogramming

Stack of bits, what will remain?

You have a stack A of 256 bits: 13 are set to 0 and 243 are set to 1. In addition, you have a stack B of 128 bits, all of them set to 1. Until A is empty, you extract two bits from the top of stack A:
  • if the two extracted bits are different, then you discard the bit set to 1 and push the bit set to 0 back onto stack A;
  • if the two extracted bits are equal, then you discard both of them and push onto stack A a bit extracted from B;
What will be the last bit left on stack A?

Wednesday, March 11, 2009

An array of N bits, split in two parts with the same amount of bits set to one

You are given an array A of N bits. I tell you that K << N of them are set to one, and the remaining are set to 0. You don't know which bits are set to one and which are not; you just know the amount.

What is the optimal time and space complexity to split the array into two parts such that the number of bits set to one is equal in both parts? The two parts may or may not be contiguous. In the latter case, you can provide two sets of indexes of A as the answer.

A beautiful multi-core optimization

The article Fast String Search on Multicore Processors describes an interesting implementation of the classical Aho-Corasick string matching algorithm on top of the many-core Cell processor.
It's impressive what speed-up you can reach when you hash both the automaton states and the input symbol ids. This reduces contention in main memory access and allows you to leverage all the power of your PlayStation.

Tuesday, March 10, 2009

Beatiful Books in IR, Data Mining and Machine Learning

Information Retrieval & Web IR
Data Mining
Machine Learning
Programming

Hashing, Shingling and HashTrees

Hashing is my favorite methodology for data mining, machine learning and information retrieval. For computing "similarities" within a set of N objects (such as texts, images, videos, web pages, etc.), the direct approach needs two steps:
  1. define a distance metric among objects;
  2. compute the matrix of distances in O(N^2);
In many situations this is too expensive. Here comes the magic of hashing. Each object is hashed to a bucket: the hash becomes the fingerprint of the object. Then, you can safely compare only the objects belonging to the same bucket. In many practical applications the resulting complexity is almost linear. The process becomes:
  1. define the fingerprint of the object;
  2. map the object to buckets; one different bucket for each different fingerprint;
  3. compare directly the objects in the same bucket.
Shingling is one of my favorite hashing techniques. Here a fragment of text (a shingle) is hashed with multiple hash functions. Intuitively, if you take two different fragments, the more similar they are, the more frequently the hash functions will produce identical results when applied to the fragments. Shingling has a rock-solid mathematical foundation. In a previous post, I enclosed C++ code for shingling; a minimal min-hashing sketch is also given below.
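Here is a small min-hashing sketch over sets of shingles; the multiplicative string hash and the seeds are simplifying assumptions of mine, and the fraction of matching min-hashes estimates the Jaccard similarity of the two sets:

#include <algorithm>
#include <climits>
#include <set>
#include <string>
#include <vector>

// toy string hash, parameterized by a seed
unsigned hash_str(const std::string& s, unsigned seed) {
    unsigned h = seed;
    for (size_t i = 0; i < s.size(); ++i) h = h * 101 + (unsigned char)s[i];
    return h;
}

// min-hash signature: the minimum of each of the k hash functions over the set
std::vector<unsigned> signature(const std::set<std::string>& shingles, int k) {
    std::vector<unsigned> sig(k, UINT_MAX);
    for (int i = 0; i < k; ++i)
        for (std::set<std::string>::const_iterator it = shingles.begin();
             it != shingles.end(); ++it)
            sig[i] = std::min(sig[i], hash_str(*it, 0x9e3779b9u * (i + 1)));
    return sig;
}

// P[min-hashes match] = Jaccard(A, B), so the match rate estimates the similarity
double estimated_jaccard(const std::vector<unsigned>& a, const std::vector<unsigned>& b) {
    int match = 0;
    for (size_t i = 0; i < a.size(); ++i) if (a[i] == b[i]) ++match;
    return double(match) / a.size();
}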

Another interesting hashing technique is the HashTree. Here an ordered set of items (an itemset) is hashed with a hash function: the hash of the item in position i is used to select a specific child of the node currently being visited. In short, hashing is used to drive the visit along a path from the root to a leaf. Each leaf of the HashTree contains a bucket, where similar itemsets are stored. The main difference with shingling is that order is preserved, and that two itemsets are considered candidates for similarity only if their hashes collide in the first k positions. It turns out that this property is very useful for computing frequent itemsets in the Apriori algorithm. Here you have the C++ code for HashTree.

Monday, March 9, 2009

Tower of cubes

You have a set of solid 3-d cubes. Each cube has 6 faces, with a different color on each face. Cubes have different weights. One cube can be put on top of another if and only if (a) the two touching faces have the same color, and (b) the lighter cube is above the heavier one.

Find the height of longest tower that you can build, in optimal space and time.

Sunday, March 8, 2009

Network of proxies

Suppose that you have N web clients connected to a network, represented by a graph G = (N, E). A certain number of clients can cache a small amount of the content they download. In this case, they can act as proxies for other clients.
  1. How do you choose the clients which are promoted to proxies?
  2. How do they index their content?
  3. What content do you prefer to cache?
Can you think about metrics to evaluate your proposed methodologies for (1, 2, 3)?

Saturday, March 7, 2009

Myriad of virtues of Weka data mining toolkit

Weka is the Swiss Army knife for data miners, with support for classification, clustering, regression, feature selection and filtering. The Data Mining book by Ian H. Witten and Eibe Frank discusses the 2005 version of Weka. I suggest using Weka for a rapid evaluation of your data mining prototypes, and then a robust C++ implementation of the specific algorithm chosen.
Here you have the Weka documentation available online, with the Weka 3.6.0 manual.

Friday, March 6, 2009

Edit step ladder

Yet another edit step game. This time you have a dictionary of words D. A word wi can mutate into a word wj if a single character is added, deleted, or changed: this operation is called an edit step. An edit step ladder is a lexicographically ordered sequence of words w1 ... wn, such that the transformation from wi to wi+1 is an edit step. Find the longest edit step ladder in D in optimal space and time.

Thursday, March 5, 2009

Web mis-spell, edit distance and graph search

"Britnay spers" who is that girl? ... Hmm this name is a common misspell of the "Britney Spears". Misspelling is an important part of Web search. You may want to suggest the "correct" and "closer" alternative query, when a user submits the "wrong" one. Under certain conditions, you may want to re-write the query in a transparent way and provide directly search results for the "correct" alternative query.

Here is a set of spelling-related problems.
  1. Given two strings s, t such that |s| = n, |t| = m, compute edit_distance(s, t). You may assume different costs for insert, delete, and substitution. You can do this in O(mn) time and O(min{m, n}) space (see the sketch after this list);
  2. Given a set of words S = { w1, ..., wk }, an acceptable bidirectional mutation (wi <-> wj) for the word wi is a word wj such that edit_distance(wi, wj) < T. Suppose you can access a large query log Q of queries previously submitted to a search engine, and suppose that a user rewrites the word wi into the word wj within a window of D queries submitted during her search session. In this case, you call the mutation (wi -> wj) real. Note that this mutation is in one direction only. A real acceptable mutation is a mutation which is both real and acceptable. A frequent real mutation is a mutation which appears in at least F different users' sessions. Given a configuration (Q, T, D, F), find all the frequent, real and acceptable word mutations. What is the optimal complexity in time and space?
  3. Find the diameter of the graph G formed by considering all the frequent, real, acceptable word mutations (wi -> wj).
  4. A sentence is an ordered sequence of words. For instance, "britney spears" is different from "spears britney". How can you deal with misspelling in the case of sentences?
Here you have the C/C++ code for edit distance.
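For problem 1, here is a sketch of the O(mn)-time, O(min{m, n})-space dynamic program with unit costs (generalizing the three costs is immediate):

#include <algorithm>
#include <string>
#include <vector>

int edit_distance(const std::string& s, const std::string& t) {
    const std::string& a = (s.size() < t.size()) ? s : t;    // roll over the shorter string
    const std::string& b = (s.size() < t.size()) ? t : s;
    std::vector<int> prev(a.size() + 1), cur(a.size() + 1);
    for (size_t j = 0; j <= a.size(); ++j) prev[j] = (int)j; // distance from the empty prefix
    for (size_t i = 1; i <= b.size(); ++i) {
        cur[0] = (int)i;
        for (size_t j = 1; j <= a.size(); ++j) {
            int subst = prev[j - 1] + (a[j - 1] == b[i - 1] ? 0 : 1);
            cur[j] = std::min(subst, std::min(prev[j] + 1, cur[j - 1] + 1));
        }
        prev.swap(cur);                                      // keep only two rows
    }
    return prev[a.size()];
}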

Wednesday, March 4, 2009

K-means in C++

K-means is a classical clustering algorithm.
Here you have a C++ code for K-means clustering.
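For reference, a tiny 1-d sketch of Lloyd's iterations (naive initialization and a fixed iteration count, both simplifications; it assumes k <= number of points):

#include <cmath>
#include <vector>

std::vector<double> kmeans(const std::vector<double>& pts, int k, int iters) {
    std::vector<double> centroid(pts.begin(), pts.begin() + k);  // naive init
    std::vector<int> assign(pts.size(), 0);
    for (int it = 0; it < iters; ++it) {
        for (size_t i = 0; i < pts.size(); ++i) {     // assignment step
            double best = 1e300;
            for (int c = 0; c < k; ++c) {
                double d = std::fabs(pts[i] - centroid[c]);
                if (d < best) { best = d; assign[i] = c; }
            }
        }
        std::vector<double> sum(k, 0.0);
        std::vector<int> cnt(k, 0);
        for (size_t i = 0; i < pts.size(); ++i) { sum[assign[i]] += pts[i]; ++cnt[assign[i]]; }
        for (int c = 0; c < k; ++c)                   // update step
            if (cnt[c] > 0) centroid[c] = sum[c] / cnt[c];
    }
    return centroid;
}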

(Edit: 12/05/2013)
See also my more recent posting A new lighter implementation of K-means

Sunday, March 1, 2009

You can solve Rubik in 25 moves

Tomas Rokicki has proved that Twenty-Five Moves Suffice for Rubik's Cube. Actually, he proved that no configuration takes 26 moves. He also gives a nice greedy algorithm. My only perplexity is that a 180-degree rotation is considered one move. When I was young, I thought it was two moves.