Saturday, January 31, 2009

Bandwidth reduction, permutation of docs, fast pagerank

A common problem with research is that same results are achieved again and again with different names and in different contexts.

Matrix Bandwidth reduction is a well-known numerical analysis technique to reduce the bandwidth of a matrix. A quite famous method is the Cuthill-McKee algorithm. Other well-known techniques are the king ordering and the min degree ordering.

Recently, the similar techniques for bandwidth reductions have been studied in the context of document ids permutation for indexing compression (Index Compression through Document Reordering, Investigating the Effectiveness of Clickthrough Data for Document Reodering)

Lately, an interesting application has been suggested for Fast PageRank Computation

Boost provides an elegant and flexible support for Generic Graph representation. It also provides support for matrix bandwidth reduction.
Here you have the C++ code for boost matrix reodering.


Friday, January 30, 2009

Multiplying the other elements of an array in linear time

You are given an array X with n elements. Can you compute in linear time a new array A with n elements, where A[i] is the product of all elements in X, but X[i] is excluded. You cannot use division, you may use additional space

Thursday, January 29, 2009

Magy: how can you improve the ranking time?

Magy!'s engineers are spending too much time in ranking search results. Their AlgoGuru said "well, we have n results for this search and this will cost us O(n * log n), we cannot be better than this. We must resort into heuristics and approximations such as stopwords, caching, index pruning, and the like".

A young and proud eng stood up and he said "Hey, I know that you are wrong. We can make this better from an algorithmic point of view and still get exact results".

Do you know why?

Wednesday, January 28, 2009

Finding a loop in a list

Suppose you have a list of integers and you don't want to use any additional storage.
  1. Can you detect if the list has a loop?
  2. If it has, can you identify the node with two incoming edges?

Counting frequent words in Boost

Counting frequent words contained in a file is a quite common operation.
It is surprisingly easy if you adopt Boost::Unordered_map and Boost::Tokenizer.
Here you have an example of code

Tuesday, January 27, 2009

Counting cubes and hypercubes

Consider a cube C1 (k=3 dimension) and suppose it has the length of a side n=3 (as the (g)old Rubik's cube).
  • How many cubes there are on the surface of C1?
Now consider a cube C2 (k=3 dimension) and suppose it has the length of a side n (see Rubik Variations in wikipedia for examples of n=4, 5, 6, etc)
  • How many cubes there are on the surface of C2?
Now consider an hypercube, a generalization of a cube in k dimensions.
  • Let's start with k=4 dimensions. How many hypercubes there are on the surface of n * n * n * n hypercube with k=4?
  • Now, generalize to k generic dimensions. How many hypercubes there are on the surface of n * n ..... n * n (k times) hypercube with k dimensions?
I like this problem since it's hard to guess a formula for the final question, but it is easy to derive it if you solve simpler problems first.

Monday, January 26, 2009

More than linear?

You have 10 bags full of colored balls. Nine bags contain N balls, and each ball weights 10 grams. One bag contains M balls, and here each ball weights 9 grams.

Can you identify the bag with M balls using a balance just one time?

Design Patterns : C++ full collection of Gamma's patterns

Full collection of Gamma's patterns in c++:
  1. Creational: Abstract Factory, Builder, Factory, Prototype, Object Pool, Singleton,
  2. Structural: Adapter, Bridge, Composite, Decorator, Facade, Flyweight, Proxy patterns,
  3. Behavioral: Chain of Responsibility, Command, Iterator, Mediator, Memento, Observer, State, Strategy, Template, Visitor, and Interpreter

Sunday, January 25, 2009

Better than logarithmic ?

You have 8 balls: 7 have the same weight, 1 has a different weight.

Can you identify the heavier ball by using a scale balance just two times?

Saturday, January 24, 2009

Friday, January 23, 2009

Magy! how can you identify adult images?

Magy! wants to launch an image search engine, where adult content is filtered.

Can you suggest few techniques for filtering adult images? Please, use your own fantasy ;-)

Thursday, January 22, 2009

Design Patterns: C++ Behavioral Design Patterns (part one)

I continue my sum-up of design patterns. This time an initial collection of Behavioral Design Patterns in C++. Namely, Chain of Responsibility, Command, Iterator, and Mediator.

Wednesday, January 21, 2009

Bank system: better to conquer or to defend?

Bank system is having troubles. You are given a collection of K banks. Each bank has a certain amount of money (Mi). A bank i can decide to buy a bank j if its capital Mi is greater than the half of Mj (Mi > Mj/2). When a bank is controlled, its capital can be used by the controller.

A good CEO should decide what strategy to adopt for enlarging its control on other banks, and avoid to be controlled by other banks.
  • Can you describe an (heuristic) algorithm for help the CEO?
  • Given a solution, can you identify the best CEO (different criteria can be given: who has control on the largest number of banks, who has the largest amount of money, etc)?

Tuesday, January 20, 2009

Bill Gates ' paper on sort

Bill Gates published one academic paper on sort. You are given an array A of unsorted integers and the only allowed operation is to "flip" a sub-array. A flip(i) is an operation which reverses all the sub-array starting in position A[i]. Namely,

(A[1], .... A[i-1], A[i], A[i+1] ...., A[n-1], A[n])

is changed into

(A[1], .... A[i-1], A[n], A[n-1] ...., A[i+1], A[i])

Can you suggest a sorting algorithm based just on flip() operations?

PS: this sorting method has an impact on genetic studies and evolution of species.

PS: the paper is W. Gates and C. Papadimitriou, Bounds for sorting by prefix reversal (1976)

Monday, January 19, 2009

Design Patterns: C++ Structural patterns

I continue my sum-up of Design Patters. This time you have a collection of structural pattern skeletons.. Namely, Adapter, Bridge, Composite, Decorator, Facade, Flyweight and Proxy patterns.

Sunday, January 18, 2009

Design Patterns : C++ creational patterns

Design Patterns are the swiss knife of any programmer. I read many times the book of the gang of four. Yes it is a seminal book, but there are too many words there and too little C++ code.

Therefore, I wrote sum up the patterns in c++ skeletons. Here you have the c++ skeletons for creational design patterns. Examples of Abstract Factory, Builder, Factory, Prototype, Object Pool, Singleton are given.

Saturday, January 17, 2009

Dreaming a DARPA-like Obama plan for energy

I dreamed of a DARPA-like plan for energy. Obama has a great asset in Internet based technology companies. He can use this asset as one of the pillar for re-vitalizing the economy.

I hope that the other pillar will built as a result of a massive investment in energy innovation. I dream of seeing dozens of companies such as Tesla.

Data parallel multi-core C++ computations

Threading Building Blocks is a collection of generic containers and algorithms for performing data parallel multi-core computations. The key ideas is that the programmer should not deal with low-level thread and syncronization issues, but with an higher data-parallel paradigm. This reminds me somehow the old-time, old-school skeleton parallel computation, where the language gives a fixed number of parallel constructs which are optimized by the compiler.

TBB has a rich set of containers such as concurrent_vector, concurrent_queue, concurrent_hash_map, and data-parallel algorithms such as parallel_for, parallel_sort, parallel_while, parallel_scan, parallel_reduce, and compositions such as pipeline and task (see Tutorial | Reference Manual | Code Samples). I am playing with the library and I found it versatile enough.

C++ Multithread concurrent programming

Boost is my favorite choice and I suggest you to read the code of BOOST_THREAD class because it is very instructive. In version 1.36/37 there are some improvements. Here, I report the ones I like the most

1. You can create a thread and pass arguments to it (internally it uses boost::bind) as in
int k;
void thread_function (int i, std::string s, int& j);
boost::thread t(thread_function, 10,"hello world", boost::ref(k));

Note that this is making a copy of the function, so any reference should be passed via boost::ref().

2. You can have a multiple-readers/one-writer lock which is useful in many situations, just have a look to shared_mutex class. Another useful features is boost::mutex::scoped_lock (which was already available in previous versions). It implements the RAII concept and therefore when the lock variable is out-of-scope, it is automatically unlocked.

3. You can try to lock up to five resources and if one of them is locked all the others are released. This is useful to avoid deadlocks. Have a look to try_lock()

4. Interruptions for threads

Other improvements are described here. Here you find a code that uses a number of these features.

Friday, January 16, 2009

Computing efficiently the Erdős numbers

From wikipedia: "In order to be assigned an Erdős number, an author must co-write a mathematical paper with an author with a finite Erdős number. Paul Erdős is the one person having an Erdős number of zero. If the lowest Erdős number of a coauthor is k, then the author's Erdős number is k + 1."
  • You are given a collection of K papers, can you write an efficient algorithm for computing the ErdÅ‘s number associated to each author?

Thursday, January 15, 2009

Rapa nui and the mohais

The inhabitants of the remote Rapa Nui island want to build a pyramid of mohais. There are K mohais in the island and each of them has a weight wi and a strength si. The i-th mohai can tolerate a weight no larger than si on his shoulders. What is the maximum height of the pyramid? (PS: there are two options either you subtract the weight of the mohai from his strength, or you don't subtract it).

Wednesday, January 14, 2009

a rant to the academic community

I like to serve conferences as PC member or to review journal papers. Anyway, I keep seeing papers with elegant theoretical solutions and very poor experimental settings. For instance, you can see a new elegant text clustering algorithm tested in an unrealistic environment such as "Let's suppose we take this collection of 4000 articles and cluster them in just 5 clusters, well then my algorithm is better than the state-of-art of about 3% in precision".

Please, do not accept this type of papers. Web reality is different you have hundred thousands or millions or more documents and, certainly, you don't have just 5 clusters. Plus, data is evolving and you may want to consider temporal constrains, as well.

Tuesday, January 13, 2009

Size of the web: how many servers to store the whole web?

Back in 2005 someone said that the web is more than 11.5 billion of pages. Maybe it is a bit larger now.

  • Can you estimate what is the amount of disk storage for storing it?
  • Can you estimate what is the amount of disk storage for storing different snapshots (see thewaybackmachine)

Magy!: how can we monitors most frequent queries?

Magy! receives a huge amount of queries everyday. They don't want to store all of them, since many of the queries are unique (submitted just one time by just one user). Anyway, for caching reason they want to understand what are the queries whose frequency exceed s=0.1%. How do you solve the problem? Remember we don't want to store all the queries.

Monday, January 12, 2009

Magy!: how many servers for a video search engine?

Magy! is the ultimate search engine. It won all the battles and survived. They have an ultra-social video search engine, which receives 500 new videos per second. Videos are user-contributed.
  • Can estimate what is the disk space needed to store one year of videos?
  • Can you propose a server infrastructure for storing the videos?
  • What about caching.
Yet another back of the envelope computation.

Sunday, January 11, 2009

Random samples and permutations (which terminate)

I am enjoying Jon Bentley's book. And well... I always made the same mistake he made for generating random samples and permutations (see Programming pearls: a sample of brilliance). Floyd's algorithms are pretty and elegant solution to the problem. Here you have a C++, STL implementation.

Saturday, January 10, 2009

Back of the envelope computation: how large should be an hash table?

Jon Bentley's More Programming Pearls: Confessions of a Code has a nice chapter about back of the envelope computations. It is amazing how many people are having difficulties with simple math, while they know about algorithms and design patterns. So here is one simple problem:
  • You have a collection of 50Millions words and you want to hash them into an hash table. How large the hash table should be?
More problems to follow.

Problem solving: bridges

Four guys are on the left side of a bridge and they need to go to the right side. It is night and the guys have just one electric torch light. Each guy has a different walking time. Namely, 10 second, 5 second, 2 second, and 1 second. At most 2 guys can be on the bridge at the same time, and when they walk in pair they have the speed of the slowest guy.

What it the minimum time to have the four guys on the right side of the bridge?

Friday, January 9, 2009

Problem solving: Light bulbs and switches

In a R1 room there are three switches and in another separate room R2 there are three light bulbs. The three switches control the light bulbs, but you cannot see the lights in R2 from R1.

You need to discover what switch controls each bulb, but you can enter in the room R2 only one time.

Generic Skip list (skiplist)

Skip lists are an interesting randomize data structure for storing pairs of . Skip lists have logaritmic search and insertion time. Here I sketch a generic skip list code

Thursday, January 8, 2009

Problem solving: Array of bits, and their state

Suppose to have an array of k bits. At step 1 all the bits are set to 1, at step 2 all the even bits change their state, at step 3 all the bits which are multiple of 3 change their state, at step k all the bits which are a multiple of k change their state. How many bits will be set to 1 at step k?

There is a simple quadratic algorithm, but it turns out that there is a closed formula to compute the number of bits set to 1 at step k.

Tuesday, January 6, 2009

Bits, endianess, and stack growing

Here some little bit oriented tricks.

bool endianess(){
int test_num = 1;
char *ptr = (char *) &test_num; // this will return a byte
return (*ptr)
}
  • How to count the number of bits set to 1 in the internal representation of a number?
int num_ones_in_binary_representation(int num){
int num_ones = 0;
while (num){
if (num & 1)
num_ones++;
num = num >> 1;
}
return num_ones;
}
A more elegant solution:
int num_ones_in_binary_representation(int num){
int num_ones = 0;
while (num){
num = num & (num - 1); // consider the binary representation of num - 1 and subtract them
num_ones++
}
}
  • Is your stack growing up or down?
Call two functions and take the address of their arguments, or consider the address of two consecutive variables within a function.

Generic Graph search

Two classical graph search algorithms. DFS and BFS.

Sunday, January 4, 2009

Generic Graph

Another generic data structure. This time a labeled direct graph. Labels can be associated to nodes and to edges with different types. Nodes are stored in an STL vector and new nodes can be dynamically added to the collection. Edges are stored in STL vectors, one vector for each different node.

Saturday, January 3, 2009

Generic list and list iterators

Iterators are a nice paradigm for traversing a collection of elements. In C++ they can easily be expressed by using suitable struct contained within a class. STL's iterators are implemented this way (It is a useful reading to study the code in your distribution). Here I present a generic list implementation which supports list iterators.

Thursday, January 1, 2009

Generic heap

I believe that heap is one of the most fascinating basic data structure. It's simple and yet elegant. Here you have a generic implementation, which also provides generic heapsort.