Monday, October 31, 2011

Playing with languages, part I

Sometime ago I asked to my friend Charisma why she was studying Italian and she said that she wanted to learn Spanish. I found the answer somehow confusing because the two languages are different. The point is that I missed to understand the main point ;-) You can actually learn and improve a language  if you study different languages.

These days I am enjoying reading  "Seven languages in seven weeks" a book which compares  Ruby, Io, Prolog, Scala, Erlang, Clojure, Haskell. You can learn things such as Concurrency based on Actors in Ruby, or mixing OO and Functional programming in Scala, higher order programming in Haskel, logic programming in Prolog, having a language based on prototype patterns and messages as in Io, currying and clojures and many other things. Definitively a recommended reading.

The more I read, the more I understand and improve my C++. So Charisma you are right, you can actually study Italian for improving Spanish. It took me a while to understand that.

Sunday, October 30, 2011

Can we use Facebook messages to predict how the stock market is going?

This would be an interesting Machine Learning problem on a massive dataset. Can we use Facebook's messages to predict how the stock and bond market will evolve? Here a couple of examples of the raw dataset.

Web server performance

A web server W is running a service S which provides information I when it receive queries Q. How can you guarantee a time < s for the 95 percentile?

Friday, October 28, 2011

Two colors

Given a undirect graph G=(N, E) a two colour scheme assigns to each node either the colour black or white. A node is diverse if it has the opposite colour of at least half of its neighbours. How can you assign a diverse colour to all the nodes in a graph?

tricky one

Wednesday, October 26, 2011

Assign employees to bosses

You are given a list of employees and a list of potential bosses. Each employee expresses his ranked list of bosses' preferences. Each boss expresses is ranked list of employees' preferences. Match the two classes so that the maximum number of preferences is satisfied.

Tuesday, October 25, 2011

Got assigned a new patent: how to bootstrap a learning to rank system with real traffic information

Just got assigned a patent filed back in 2005, "Sampling internet user traffic to improve search results". The problem we were trying to address was about bootstrapping a learning to rank system in absence of user search information. This is a typical problem you have when you are not the incumbent search engine, and you don't have already accumulated usage information and user behaviour activities (see more information here Calculating Search Rankings with User Web Traffic Data). How can you compensate this bootstrap situation?

The key intuition was to use web traffic information collected by the way web proxies  and observing user traffic and navigational information, including traffic performed querying other search engines. A similar traffic can be observed by minging a collection of web logs for the HTTP_Referer tag,  The methodology was used for improving the freshness, the coverage, the ranking and the clustering of search engine results and, more generically,  may include monitoring web traffic on remote web servers on the communications network

Sunday, October 23, 2011

Friday, October 21, 2011

Find a way to detect whether two dna strings are similar

One way of doing this is to find the longest common subsequence, since we can allow to have to strings that are also having different genes interleaved. So suppose you have {ACGGAGGGAA} and {ACGAAGG} what is the longest common subsequence? {ACGAGG}

Solution: either direct DP or reducing this to Longest non decreasing subsequence problem.

Thursday, October 20, 2011

Find the longest non decreasing subsequence

Solution: s_i = max ( max (s_j + 1), 1) for j < i and A[j] <  A[i]. So take an already available longest non decreasing subsequence and if you find a subsequent A[i] > A[j] then sum 1. Remember that the minimum is 1

Wednesday, October 19, 2011

Sort a huge set of log files in C++ and STL

you are given a huge set of log files sorted by timestamp and you need to merge them. You have unlimited disk, but fixed amount of memory. In case of ties, break it by consider a progressive number for the log files

Solution: 

good to recall how a min-heap is defined in STL, it's just an heap with a custom comparison. An heap is nothing more that a normal std::vector and you build it with std::make_heap, std::push_back, std::push_heap(), std::front(), std::pop_heap(). It's always a bit confusing for me to consider that you need to push_back in the vector and than push_heap, but that's not so strange considering that an heap is built upon the vector itself ;-)

   bool greater (const std::pair & a, const std::pair & b) 

Tuesday, October 18, 2011

Compute the running median and the running average

You are given a very long stream and you need to compute the running median for each group of 1k numbers. Same question for the running median. What is the complexity?

Suppose you know a priori the number of elements in the stream.

Solution:
given a running average, you can update the group by adding a new element and removing the oldest one. This will give you a solution that is linear in the number of elements in the stream

Similar intuition is for the median, you can use a balanced BST where every node has the number of nodes contained in the subtree rooted at it.

Monday, October 17, 2011

A feature suggestion for facebook: Social calendar

Nothing is more social than our calendar. By definition you invite your colleagues and friends to meet and discuss. So why not integrating a calendar in Facebook, I bet that this would be the 3rd killer app after News and Images.

PS: I love Microsoft, but this idea is simple and needed so I wonder Should I create a startup? ;)

Sunday, October 16, 2011

Winner takes it all

Given a set of 128 players, suppose that x "beats" y has the transitive property. How many games do you need to find the winner? and how many games for finding the second one?

Saturday, October 15, 2011

Friday, October 14, 2011

given a vector v generate all the subsets of v with no repetition

useful for a quick test of your STL knowledge. Then, try the same with repetition

Thursday, October 13, 2011

Social graph: create the links

You are given a snapshot of the social graph G = (N, E, t) at time t. Each node has a set of attributes (a1, a2, ... an). Three problems:

1. Create an edge if two nodes have exactly the same set of attributes
2. Create an edge if two nodes have similar attributes
3. What if a set of weights it's known a-priori for the attributes?

Wednesday, October 12, 2011

Where are my Friends? Cool Ios5 app similar to this experiment.

Apple just released a cool app into the new Ios5 which is based on the location provided by your phone. This is somehow similar to some experiments I made with Facebook location and Bing Maps. In my case, your facebook friends are mapped on a Bing map according to their Facebook location. In any case, mine was just  an experiment and nothing more

http://www.headtagger.com/WhereAreMyFriends/map.html

This is where my Facebook friends are located in the world


Here my friends in Europe

These are my friends in San Francisco area


And this is a view of a friend of mine in New York City


Here is the Apple's preview


Tuesday, October 11, 2011

Maximum span in an array

Given an array A of numbers, find a pair of numbers A[i] and A[j], such that A[i] < A[j] and j-i is maximized.

solution: There is a trivial O(N^2) solution and a tricky O(N) based on a very simple intuition, where is the minimum?

Saturday, October 8, 2011

Find the missing number

Given a db with about 300Millions social numbers where each number is  9 digits find a number that is missing. You have 2MB of Ram and unlimited disk

Solution
300Millions = 300 * 10^6 = 3 * 10^8

Suppose we take 10 counters and "hash" each number considering only the highest order digit, incrementing the counter if a given number has that starting digit.  One missing number would be located by the counter with the lowest value. That is the intuition, we can now use a bitmap which fits 2MB of Ram and apply the same principle.

Thursday, October 6, 2011

Write a function for computing next permutation of a string

This is a little pearl piece of code

void printVector(int * value, int N)
{
 for (int i = 0; i < N; i++)
  std::cout << value[i] << " " ;

 std::cout << std::endl;
}

void visit(int * value, int N, int K)
{
 static int level = -1;

 level++; value[K] = level;    // the level is current i 

 if (level == N)               // a complete permutation
  printVector(value, N);
 else
  for (int i = 0; i < N; i++)  // 0... N-1
   if (value[i] == 0)             // not used before
    visit(value, N, i);

 level--; value[K] = 0;   // nothing else to explore for the level, ready for next round of usage
}

The above is cooler than the classical recursive solution
A permutation of 1,2,3,4 would be:
1 + permutation_of(2,3,4)
2 + permutation_of(1,3,4)
3 + permutation_of(2,1,4)
4 + permutation_of(2,3,1)

Wednesday, October 5, 2011

Intervals

Given a list of non overlapping intervals, design a data structure for inserting, finding and deleting them (e.g. [10, 30), [35, 50), [70, 100) ...)

Tuesday, October 4, 2011

Feature suggestion: Do we need a new search engine? Certainly not, but Facebook may offer a new definition of search

Bing, Yahoo and Google are already offering a complete search experience and there is probably no need of having a new search engine. Certainly, we already have a fast growing set of new engines for emerging markets such as Yandex in Russia and Baidu in China, but that's because they offer specific expertise in those markets (even if I would be not surprised of seeing them coming more and more in western countries).

Anyway, there is something new and Facebook has a unique position for offering a new set of search services, which could be integrated within the Facebook's popular notification system. Let's start from some examples:

I search for new software development positions in london. Facebook has all the public postings made by their social users. The data is there but the ranking is very poor. Plus this type of search is buried and you need to click 3 different links for accessing it.


I am looking for a musical in London, first result is totally irrelevant even if they have the location of the people posting, a full access to their internal social graph, a full access to my position in the graph, the number of like within facebook and in any external web site embedding the like button, many forms of temporal information, the links you shared, the title and the caption you changed, user importance, how many time I've interacted with my friends, any picture I have changed from automatic crawling,  and a lot more (wow, not so bad). Many of those information are not necessarily available to other search engines



Let's see other examples


Let's suppose we want to listen some music


Or get some news


I assume that you understood that the data is there, but a quality ranking is not.  You may say that this is nothing new because we already had some interest in the past for real time search, and Bing is already including social likes information.

True. Anyway, Facebook as already a quite popular Notification system -- you know that white number on a red background -- which is updated anytime something new is happening in your social graph. So why not allowing myself to register for a new (and relevant please!) update for a query term that I've searched before? That would be a useful feature to have and a new definition of search where results are pushed to me instead of searching them again and again. Yes, I know I am lazy and I like when software is making something on my behalf because I save time for something else.

Monday, October 3, 2011

Fist lessons for Stanford ML couse

http://www.ml-class.org/course/video/list

So far, it's pretty good and I'd recommend attending it.

Given two vectors of integer, merge them in place

Given two vectors of integer, merge them in place (e.g. relocate the second vector to the sum of the vectors and merge the vectors in place).

Here is the code

Sunday, October 2, 2011

Given a large collection of social numbers (e.g. 360Millions)

how to efficiently find a single missing number. You have a very large disk available for storing the number but just 2Mb of Ram

Saturday, October 1, 2011

LCS of very long sequences

Compression is always looking for repetition. Suppose you have two very long files stored in disk. how will you find the LCS with a limited amount of RAM?