Friday, October 30, 2009

How do you evaluate a search engine's performance?

"Cumulated Gain Based Indicators of IR performances" is a very interesting survey on the topic. I recommend it to anyone interested in search engine metrics.
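To give a flavour of the idea, here is a minimal sketch of cumulated gain (CG) and discounted cumulated gain (DCG) as I understand them: each result gets a graded relevance score, and results found late in the ranking are discounted by the logarithm of their rank. The function names, the gain values and the choice of base 2 are my own assumptions for illustration, not taken from the survey.

```python
import math

def cumulated_gain(gains):
    """Running sum of graded relevance scores, rank by rank."""
    total, out = 0, []
    for g in gains:
        total += g
        out.append(total)
    return out

def discounted_cumulated_gain(gains, base=2):
    """Like cumulated gain, but a gain at rank i >= base is divided by log_base(i)."""
    total, out = 0.0, []
    for rank, g in enumerate(gains, start=1):
        # Ranks below the base are not discounted.
        total += g if rank < base else g / math.log(rank, base)
        out.append(total)
    return out

# Hypothetical relevance grades (0..3) for the first six results of a query.
gains = [3, 2, 3, 0, 1, 2]
print(cumulated_gain(gains))
print(discounted_cumulated_gain(gains))
```

A ranking that puts the highly relevant documents first gets a higher DCG than the same documents in a worse order, which is the whole point of the discount.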

Thursday, October 29, 2009

A Facebook Fan Page for any site on the web?

OpenGraph is coming. "Ethan Beard, laid out the Open Graph as essentially a Facebook Fan Page for any site on the web. So you can imagine that you might be able to create a Facebook-style Wall to include on your site, but able to update your statuses from your site, leave comments, like items, etc. Again, it’s like a Facebook Page, but it would be on your site. And you can only include elements you want, and leave out others."

Wednesday, October 28, 2009

Twitter Growth (by way of Alessio)

Alessio started his own blog, and the first data is really interesting: Twitter receives 26M tweets per day, and 22% of them contain URLs.

"According to my most recent studies Twitter currently receives about 26 Million tweets per day. It is impressive, especially if you consider that in January 2009 they were hovering around “only” 2.4 Million daily tweets!"

Alessio is the Director of Search @ OneRiot. See my previous posting.

Tuesday, October 27, 2009

Yahoo goes real time with OneRiot

Hey Alessio, I am very proud of you, my friend (if this rumor is confirmed).

"Techcrunch said Tuesday that Yahoo is planning to partner with OneRiot, which operates a real-time search engine and develops browser add-ons that do pretty much the same thing. The possible deal comes on the heels of separate plans announced by Microsoft and Google last week to integrate Twitter pages into their search results."

Sunday, October 25, 2009

Similarities: which is better?

Similarity functions are at the core of many machine learning and data mining algorithms (hmm, not just clustering and recommendation systems). There are many similarity measures out there.

What is the best one?

It depends. Anyway, cosine similarity behaves very well in a large-scale experiment run by Google in the paper "Evaluating Similarity Measures: A Large-Scale Study in the Orkut Social Network". Other measures were evaluated as well, such as L1-norm, Pointwise Mutual Information, Pointwise Mutual Information with negative feedback, TF*IDF, and LogOdds. The dataset for the experiment is Orkut, and 4,106,050 community pages with recommendations were considered. The cosine measure was the best one at finding correct correlations between recommendations. I am pretty sure that different measures can perform differently on other datasets. Anyway, this is another example of why I love to KISS.

Keep it simple baby, KISS.
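Just to show how simple it is, here is a minimal sketch of cosine similarity on sparse vectors stored as dictionaries. The community names and memberships are made up for illustration; they are not data from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical communities, each described by the users who joined it.
chess = {"ann": 1, "bob": 1, "cid": 1}
go = {"bob": 1, "cid": 1, "dan": 1}
cooking = {"eve": 1}

print(cosine_similarity(chess, go))       # two shared members out of three
print(cosine_similarity(chess, cooking))  # no shared members
```

The score is 1 for vectors pointing in the same direction and 0 when they share nothing, regardless of how large the communities are.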

Saturday, October 24, 2009

Starting to be interested in Ad Research

Ad research is one of the topics I know the least about. So it has been a pleasure to read this post by Muthu @ Google, with a lot of useful tutorials such as ads 101: ad exchanges, real-time bidding.

Friday, October 23, 2009

Yahoo: Carl Icahn Says His Work Is Done, Resigns From Yahoo’s Board

What would be the forecast for the stock? Hard to say.

"It’s hard to determine whether Icahn is throwing in the towel on Bartz or this is actually a vote of confidence. If he really believes in where Bartz can take the company after the search deal is done, then you’d think he’d keep his board seat to have a stronger influence on the company’s direction. But Icahn has always been a transaction-oriented investor. He tries to push companies to do things that will move the stock in a big way, and then he takes his profits and he leaves."


Thursday, October 22, 2009

Twitter: everyone wants to have it.

Bing made a deal; Google made a deal; and there are tons of real-time search engines out there. It is so cool to be real time these days.

The Web was open in the past: you went out there and crawled the public content.
It is no longer what it used to be: Facebook and Twitter are the fastest-growing parts of the Web and their content is not public. Unlike traditional online newspapers, they don't need to make it public to attract traffic from search engines.

So they keep their content private and they sell it. Is this good for the community? I don't think so. Is this good for Facebook and Twitter investors? Very, very much.

Wednesday, October 21, 2009

Facebook numbers, quite impressive I say

How impressive? Schroepfer threw out some huge numbers. Among them:

  • Users spend 8 billion minutes online every day using Facebook
  • There are some 2 billion pieces of content shared every week on the service
  • Users upload 2 billion photos each month
  • There are over 20 billion photos now on Facebook
  • During peak times, Facebook serves 1.2 million photos a second
  • Yesterday alone, Facebook served 5 billion API calls
  • There are 1.2 million users for every engineer at Facebook
And they are doing this with just PHP, MySQL and memcached?

nahhhh

Tuesday, October 20, 2009

A little game to make your friends wonder whether you are a wizard

Ask your math pal to think of a math formula, or better, a quadratic polynomial p(x). Ask her not to disclose it, but to compute p(0), p(1), p(2) and give you the results. How can you surprise your pal by guessing the correct p(x)?
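Spoiler: one possible trick is to write p(x) = a x^2 + b x + c and notice that the three values pin down the three coefficients: c = p(0), a = (p(2) - 2 p(1) + p(0)) / 2, and b = p(1) - p(0) - a. A minimal sketch follows; the function name and the example polynomial are my own choices.

```python
from fractions import Fraction

def guess_quadratic(p0, p1, p2):
    """Recover (a, b, c) of p(x) = a*x^2 + b*x + c from p(0), p(1), p(2)."""
    p0, p1, p2 = Fraction(p0), Fraction(p1), Fraction(p2)
    c = p0                       # p(0) = c
    a = (p2 - 2 * p1 + p0) / 2   # the second difference equals 2a
    b = p1 - p0 - a              # p(1) - p(0) = a + b
    return a, b, c

# Your pal secretly picks p(x) = 3x^2 - 2x + 5 and tells you 5, 6, 13.
a, b, c = guess_quadratic(5, 6, 13)
print(f"p(x) = {a}x^2 + {b}x + {c}")
```

Fractions keep the arithmetic exact, so the trick also works when your pal picks non-integer coefficients.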

Monday, October 19, 2009

Cutting pancakes..

What is the maximum number of pieces into which a pancake can be cut by n straight lines, each of which crosses all the others?
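Spoiler: here is one way to reason about it, assuming no three lines meet at the same point. The k-th line crosses the k-1 previous ones, so it is divided into k segments, and each segment splits an existing piece in two. That gives 1 + (1 + 2 + ... + n) = n(n+1)/2 + 1 pieces. A small sketch that checks the step-by-step count against the closed form (function names are mine):

```python
def max_pieces(n):
    """Count pieces by adding the lines one at a time."""
    pieces = 1  # the uncut pancake
    for k in range(1, n + 1):
        pieces += k  # the k-th line adds k new pieces
    return pieces

def max_pieces_closed_form(n):
    """Closed form of the same count: n(n+1)/2 + 1."""
    return n * (n + 1) // 2 + 1

for n in range(6):
    print(n, max_pieces(n), max_pieces_closed_form(n))
```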

Sunday, October 18, 2009

Do you need a search book but you don't want to bother with so much math?

I definitely recommend the new Algorithms of the Intelligent Web. It gives a very concise explanation of all the modern search, data mining, and machine learning techniques. Some of the contents of the book:
  • evaluation: precision, recall, F1, and ROC curves;
  • ranking: PageRank, DocRank, click-through ranking;
  • classification: Naive Bayes, neural networks, decision trees, bagging, boosting, etc.;
  • collaborative filtering and recommendations;
  • clustering: k-means, ROCK, DBSCAN.

Monday, October 12, 2009

Random walks for species multiple coextinctions

If a single species is lost, what is the impact on the other species? Is there a cascade effect? A strange answer comes from applying eigenvector computations in "Googling Food Webs: Can an Eigenvector Measure Species' Importance for Coextinctions?"

Sunday, October 11, 2009

Puzzle game with dice

On average, how many times do you need to roll a die before all six faces have appeared?
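Spoiler: this is the classic coupon collector argument. Once you have already seen k distinct faces, each roll shows a new face with probability (6 - k)/6, so on average you wait 6/(6 - k) rolls for the next new face. Summing over k = 0..5 gives the answer. A minimal sketch, with a function name of my own choosing:

```python
from fractions import Fraction

def expected_rolls(faces=6):
    """Expected number of rolls of a fair die until every face has appeared."""
    # With `seen` distinct faces so far, the wait for a new one
    # averages faces / (faces - seen) rolls.
    return sum(Fraction(faces, faces - seen) for seen in range(faces))

answer = expected_rolls(6)
print(answer, float(answer))
```

The same function works for a coin (faces=2) or for any other fair die.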

Saturday, October 10, 2009

C++ check list

So, you are doing an interview. Well, I'm really upset when you don't know the pros and cons of defining a method as:
  • operator or non operator
  • free or class member
  • virtual or non virtual
  • pure virtual or virtual
  • static or non static
  • const or non-const
  • public, protected, private
And about arguments:
  • return by value, pointer, const pointer, reference, const reference
  • return const or non const
  • passing by value, pointer, reference
  • passing const or non const
And about constructor, destructor, copy:
  • private, public, implicit, explicit constructor
  • private, public, implicit, explicit copy operator
  • virtual, non virtual destructor
Do you know exactly how and when to use all the above?

Friday, October 9, 2009

50 Years of C++

ACM Multimedia Center on 50 Years of C++ presented by Bjarne Stroustrup at ACM DC Chapter meeting. A wonderful overview of C++ evolution given by the creator of the language.

Thursday, October 8, 2009

Top world-wide universities

Italy is not on the list. London has many universities in good positions.

Wednesday, October 7, 2009

DailyBeast and the power of good business deals

When I was at Ask.com, my team co-led the launch of DailyBeast. One lesson I learnt from that experience: never underestimate the importance of a good business deal.

DailyBeast received another wonderful prize. That is not just because there is some good technology behind it. It is mostly because they know what news is and they are focused on that business. I believe that technology is neutral and the focus must come from business deals.

Congrats, Tina

very much deserved.

Tuesday, October 6, 2009

Well, you can't do this... Google vs Ask.com

If true it is funny. Auletta relates: "[Diller] said to Larry, 'Is this boring?' 'No. I'm interested. I always do this,' Page said. 'Well, you can't do this.' Diller said. 'Choose.' 'I'll do this,' Page said matter-of-factly, not lifting his eyes from his hand-held device. 'So I talked to [Google co-founder] Sergey [Brin],' Diller said."

Monday, October 5, 2009

Web Random Bits:

My proposal for these problems. What is yours?

1. What about using a randomness extractor, with Twitter search as a (high-entropy) source? There are applications like this based on RF noise, lava lamps, and similar funny sources. I would adopt either the MD5 or the SHA-1 hash function. This gives all the benefits of a cryptographic hash function, plus bias reduction.
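A minimal sketch of this idea, under stated assumptions: the snippets below are hard-coded, made-up strings standing in for whatever a Twitter search returns (fetching them is left out), and hashing with SHA-1 is a heuristic way to condense them into bits, not a randomness extractor with a formal guarantee.

```python
import hashlib

def bits_from_snippets(snippets):
    """Condense a list of text snippets into 160 bits using SHA-1."""
    h = hashlib.sha1()
    for s in snippets:
        h.update(s.encode("utf-8"))
        h.update(b"\x00")  # separator, so ["ab", "c"] differs from ["a", "bc"]
    # Render the 20-byte digest as a string of '0' and '1' characters.
    return "".join(f"{byte:08b}" for byte in h.digest())

# Hypothetical search results; real ones would come from a live query.
tweets = [
    "just had the best pancakes ever",
    "reading about real-time search",
    "rolling dice all night",
]
bits = bits_from_snippets(tweets)
print(len(bits), bits[:32])
```

Any bias in the tweets themselves is spread over the whole digest, but the output is only as unpredictable as the input: if someone can guess the tweets, they can compute the bits.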

2. Syncing is a bit harder, since the query is not enough. Time is another important factor: the results you see at time t are potentially different from the ones seen at time t + ε. The index can change, and the load-balancing mechanism in search can send you to a different search cluster. I guess you must rely on some proxy that captures a snapshot of Twitter, and then you must synchronize on the query and the time. There are a couple of services like this out there.
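A rough sketch of the synchronization step, assuming both people agree in advance on a query and on a time window, and that some snapshot proxy (hypothetical here) returns the same text to anyone asking for the same key. The function names, the 60-second window and the sample text are my own assumptions.

```python
import hashlib

def snapshot_key(query, unix_time, window_seconds=60):
    """Map (query, time) to a key both parties compute identically."""
    # Round the time down to the start of the agreed window.
    bucket_start = (int(unix_time) // window_seconds) * window_seconds
    return f"{query.strip().lower()}@{bucket_start}"

def shared_bits(snapshot_text, key):
    """Hash the snapshot content together with its key into 160 bits."""
    digest = hashlib.sha1((key + "\n" + snapshot_text).encode("utf-8")).digest()
    return "".join(f"{byte:08b}" for byte in digest)

# Two people ask within the same minute and get the same key.
key_a = snapshot_key("Pancakes ", 1254700805)
key_b = snapshot_key("pancakes", 1254700859)
print(key_a == key_b)

# Stand-in for the text a snapshot proxy would return for that key.
snapshot = "just had the best pancakes ever"
print(shared_bits(snapshot, key_a) == shared_bits(snapshot, key_b))
```

The rounding only removes small clock differences; two requests that fall on opposite sides of a window boundary still get different keys, so the parties should agree on the window itself rather than on "now".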

Sunday, October 4, 2009

CopyCat: Web Random Bits

I found these problems by Muthu (@Google) quite interesting and funny, so I would like to repost them:

"We often need random bits. People toss a coin or use noise from transistors or whatever to obtain random bits in reality. Here are two tasks.
  • Devise a method to get random bits from the web. The procedure should not use network phenomenon such as round trip delay time to random hosts etc, but rather should be at application level. Likewise, the procedure should not be to go to a website which tosses coins for you. Rather, it should use some existing web phenomenon or content. Btw, quantifying/discussing potential biases will be nice.
  • Devise a method for two people to coordinate and get public random bits from the web. The method can't assume that two people going to a website simultaneously will see identical content or be able to synchronize perfectly.
Of course, it should be fast and fun."

Saturday, October 3, 2009

Tweetmeme provides a stats service on Twitter

Tweetmeme has an interesting stats service on Twitter:
  • Click-through data
  • Retweet data
  • Trees of retweet reach
  • Potential visibility
  • Influential users
  • Tweet Locations
  • Referring domains
  • User stats
This is useful for dynamic ranking of Twitter users and postings.

Friday, October 2, 2009

Yummy chocolate bar

You are given a chocolate bar of m x n unit squares. You want to separate all the unit squares. How many breaks do you need? (Suppose that when you break a piece, you must break along either a full column or a full row.)
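Spoiler: every break turns one piece into two, so the number of pieces grows by exactly one per break. You start with 1 piece and must end with m x n pieces, so you need m x n - 1 breaks whatever strategy you use. The sketch below assumes pieces are broken one at a time (no stacking) and checks the formula against one particular strategy; the function names are mine.

```python
def breaks_needed(m, n):
    """Closed form: each break adds exactly one piece."""
    return m * n - 1

def count_breaks(m, n):
    """Count breaks when always splitting the longer side roughly in half."""
    if m * n == 1:
        return 0  # a single unit square needs no break
    if m >= n:
        top = m // 2  # break along a full row
        return 1 + count_breaks(top, n) + count_breaks(m - top, n)
    left = n // 2  # break along a full column
    return 1 + count_breaks(m, left) + count_breaks(m, n - left)

for m, n in [(1, 1), (2, 3), (4, 6)]:
    print(m, n, breaks_needed(m, n), count_breaks(m, n))
```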

Thursday, October 1, 2009