Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search

Monday, May 24, 2010

Detecting the dominant (>50%) symbol in a stream.

You are given a stream that you cannot hold in memory. At each instant you want to determine the dominant symbol observed in the stream (e.g. appearing more than 50% of times).

A binary heap with max ordering will do the job. For each encountered symbol, the heap will store its frequency and the most frequent symbol would reside in the heap's head.

if its over 50% of the time then on average, the symbol should occur every OTHER position in the stream. i think its pretty easy from there, isn't it? (ps: more than likely i'm completely wrong, i'm a total n00b)

You could maintain a counter (count) of the symbols seen so far. Then you could use an array F of N slots (its size would be N x4bytes or 8 bytes), where N is the size of the dictionary of your symbols (I'm assuming a fixed size dictionary). F[i] will store the number of times the i-th symbol has been seen so far.

At any time you have the frequency of each symbol as F[i]/count and evaluate if there is a dominant symbol (F[i]/count > 0.5).

The problem is also about how to pick up the bigger element in the array F_t/count_t where _t means the configuration of the array F and the value of "count" at time "t".

You could sort it at time t (being N small this should a very fast operation to perform in memory).

Alternatively you could maintain the values of F into an max-heap. At any time you can pik up the most frequent symbol in O(1).

A binary heap with max ordering will do the job. For each encountered symbol, the heap will store its frequency and the most frequent symbol would reside in the heap's head.

ReplyDeleteThe number of symbols can be infinite (e.g. larger than any memory you have in the heap). this one is not a valid solution

ReplyDeleteif its over 50% of the time then on average, the symbol should occur every OTHER position in the stream. i think its pretty easy from there, isn't it? (ps: more than likely i'm completely wrong, i'm a total n00b)

ReplyDeleteI am aware of COUNT SKETCH DS as described in http://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf

ReplyDeleteIs there a simple solution

You could maintain a counter (count) of the symbols seen so far. Then you could use an array F of N slots (its size would be N x4bytes or 8 bytes), where N is the size of the dictionary of your symbols (I'm assuming a fixed size dictionary). F[i] will store the number of times the i-th symbol has been seen so far.

ReplyDeleteAt any time you have the frequency of each symbol as F[i]/count and evaluate if there is a dominant symbol (F[i]/count > 0.5).

The problem is also about how to pick up the bigger element in the array F_t/count_t where _t means the configuration of the array F and the value of "count" at time "t".

You could sort it at time t (being N small this should a very fast operation to perform in memory).

Alternatively you could maintain the values of F into an max-heap. At any time you can pik up the most frequent symbol in O(1).

That's all... i'm missing something?