Monday, November 30, 2009
Pretty elegant application of Karush–Kuhn–Tucker conditions
Sunday, November 29, 2009
"Unlike Bing and Yahoo, Google does not have a permanent left-hand sidebar with additional links for narrower searches. Instead there is a link at the top of the page called “Show options”.
Click on it and Google will add a sidebar which helps you refine your search query. You may, for instance, limit your search to new pages from the last hour.
Search Engine Land reports that Google will change its search result pages next year and give them a more coherent look and feel.
Most importantly: It seems the sidebar will become a permanent feature on all search result pages.
The sidebar will include links to Images, News, Books, Maps and “More”, as well as related searches and links that let you limit the search to a specific time period.
Google will give you the alternatives (or “modes”) it thinks are most relevant to your search.
Ask.com launched search result pages like this in 2007. Because of this, Ask.com became one of our favorite search engines. Ask later abandoned its “3D” search in order to become more like Google!"
Saturday, November 28, 2009
Friday, November 27, 2009
Thursday, November 26, 2009
1. how do you partition the documents?
2. what is "good"?
3. what is "enough"?
Please formalize your answers.
Wednesday, November 25, 2009
Facebook claims to have more than 300 million users worldwide. I sampled their ads user database and found the following geographical distribution of users:
- 34.32% are in U.S.
- 8.15% are in U.K.
- 5.05% are in France
- 4.99% are in Canada
- 4.50% are in Italy
- 4.42% are in Indonesia
- 2.65% are in Spain
Tuesday, November 24, 2009
Monday, November 23, 2009
Sunday, November 22, 2009
- What is the probability that team i beats team j?
- What is the probability of an entire season's outcome (each team plays against all the others)?
- Find an algorithm to rank the teams
Saturday, November 21, 2009
- Different training sets are generated from the N objects in the original training set by a bootstrap procedure, which may randomly sample the same example multiple times. Each sample generates a different tree, and all the trees together are seen as a forest;
- The random trees classifier takes the input feature vector, classifies it with every tree in the forest, and outputs the class label that received the majority of “votes”;
- Each node of each tree is trained on a random subset of the variables. The size of this subset is a training parameter (in general sqrt(#features)). The best split criterion is chosen considering only the randomly sampled variables;
- Due to the above random sampling, some training elements are left out of each bootstrap sample and can be used for evaluation. In particular, for each left-out vector, find the class that got the majority of votes in the trees and compare it to the ground-truth response;
- The classification error estimate is computed as the ratio of misclassified left-out vectors to all the vectors in the original data.
Friday, November 20, 2009
It is true that the Internet has grown drastically in the past few years and has become more complex, but search engines are still evolving. In order to make search engines a more reliable information resource for users, Microsoft launched Bing in June 2009.
Bing was launched under a Beta tag in the UK. Microsoft at the time promised to remove the tag only under one condition: if its experience proved different from the competition and its results outperformed it in terms of UK relevance.
The Bing team reached this objective on November 12, 2009, and the credit goes to the London-based Search Technology Center. Microsoft says that the 60 engineers behind the project in Soho did an extensive job of localizing the Bing global experience for UK users in just 5 months.
Thursday, November 19, 2009
This technology has been adopted by Fast and later on by Yahoo.
Wednesday, November 18, 2009
Tuesday, November 17, 2009
- During October, Bing represented 9.9% of the market, up from 9.4% in September, according to comScore.
- Yahoo got slammed, losing almost a full percentage point of the market, to 18.0%, down from 18.8% in September.
- Google gained a bit of share, to 65.4% in October, up from 64.9% in September.
Total search volume increased 13.2% in October, below the 17.3% growth seen in September.
More information here
Monday, November 16, 2009
More information here
Sunday, November 15, 2009
Saturday, November 14, 2009
"Experimental results show that the methods based on direct optimization of evaluation measures can always outperform conventional methods of Ranking SVM and RankBoost. However, no significant difference exists among the performances of the direct optimization methods themselves."
In this case, my preference goes to AdaRank for its simplicity and the clear understanding of the key intuitions behind it.
Friday, November 13, 2009
- Rule Based (production rules, decision tree, boosting DT, Random Forest)
- Neural Network
- Support Vector Machine
- Distance Based
Classifiers are generally combined in an ensemble of classifiers. Many of the above methods are implemented in OpenCV.
Thursday, November 12, 2009
Wednesday, November 11, 2009
I am trying to understand a bit more about the language. Garbage collection is there, type inference is there, lambdas/closures are there, but where are modern things like generics/collections and exceptions?
Generics are a commonly accepted programming paradigm that any modern programmer uses (C++ has them, Java has them, etc.)
BTW, there was already a programming language called Go, and Google missed it?
Tuesday, November 10, 2009
Monday, November 9, 2009
Some DBSCAN advantages:
- DBSCAN does not require you to know the number of clusters in the data a priori, as opposed to k-means;
- DBSCAN can find arbitrarily shaped clusters;
- DBSCAN has a notion of noise;
- DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database.
Some disadvantages:
- DBSCAN needs to materialize the distance matrix for finding the neighbors. This has a complexity of O((n²-n)/2), since only an upper triangular matrix is needed. Within the distance matrix, the nearest neighbors can be detected by selecting a tuple with minimum functions over the rows and columns. Databases solve the neighborhood problem with indexes specifically designed for this type of application. For large-scale applications, you cannot afford to materialize the distance matrix;
- Finding neighbors is an operation based on distance (generally the Euclidean distance), and the algorithm may run into the curse-of-dimensionality problem.
Here you have a DBSCAN implementation in C++ with boost and the STL
Sunday, November 8, 2009
Saturday, November 7, 2009
Friday, November 6, 2009
Thursday, November 5, 2009
- Euclidean distance
- Cosine distance
Here is the code
Wednesday, November 4, 2009
Tuesday, November 3, 2009
Inheritance is a great thing, but sometimes you don't want to pay the overhead of a virtual method. I mean, every time you override a method, the compiler adds a new entry in the virtual table, and at run time a pointer must be dereferenced. When performance is crucial, or when you call a method many times in your code, you may want to save this cost.
In these situations, modern C++ programmers prefer to adopt a kind of compile-time variant of the strategy pattern. At compile time, the appropriate class and method are selected during template instantiation. In this sense, policy-based design is very useful when you want to save the cost of the virtual table.
In this example, I wrote a distance class and a distance method which is potentially called very frequently by any clustering algorithm. The method can be a Euclidean distance or any other kind of distance, and there is no need for any additional virtual table; every choice is made statically at compile time.
Here you find the code.
Monday, November 2, 2009
Sunday, November 1, 2009
1. Microsoft's (MSFT) new Bing search engine picked up 1.5 percentage points of market share in August to hit 9.5%, according to market researcher Hitwise, while Google's share fell from 71.4% to 70.2%.
2. But longer term, Twitter, Facebook, and related services may pose a more fundamental threat to Google: a new center of the Internet universe outside of search. Twitter, now with 55 million monthly visitors, and Facebook, with 300 million, hint at an emerging Web in which people don't merely read or watch material but communicate, collaborate with colleagues, and otherwise get things done using online services.
3. Meanwhile, Google's very success and size are starting to work against it. In the past year the company has been the target of three U.S. antitrust inquiries and one in Italy. Most recently the Justice Dept. on Sept. 18 said Google's controversial settlement with authors and publishers, which would have allowed it to scan and sell certain books, must be changed to avoid breaking antitrust laws. Even Google's own paying customers—advertisers and ad agencies—say they're eager for alternatives to blunt Google's power. Says Roger Barnette, president of search marketing firm SearchIgnite: "People want a No. 2 that has heft and scale."
4. Most of the search quality group's contributions are less visible because its work is focused mostly on the underlying algorithms, the mathematical formulas that determine which results appear in response to a particular query. Google conducts some 5,000 experiments annually on those formulas and makes up to 500 changes a year.