Thursday, June 25, 2009

Stephen Wolfram talking about Wolfram|Alpha at Pisa

Stephen Wolfram gave a public talk in Pisa about Wolfram|Alpha, his latest research project. This was his first public talk anywhere in the world since the product launched about 5 weeks ago; he had already given other talks to restricted audiences.

I liked the approach he adopted for the presentation. He went straight to the point, giving a demonstration of the innovative aspects of Wolfram|Alpha.

He started with queries related to his research activities, such as Mathematica, NKS, Computational Knowledge Engine, and maths. Then he showed some aspects of natural language queries, such as "integrate cosx dx" and "find the integral of six x cos y". This was a first step beyond the traditional search engine's kingdom.

I believe the audience was impressed by that, and I heard people asking "Is this just a nice demo, or can we move to other domains as well?" Stephen gave an impressive answer with queries like "gdp france" (economy) and "geoip" (internet), and with simple but effective computations on top of this data, such as "what is the gdp of spain?", "gdp france/ italy", "italy internet users", "Pisa", "Lexington" and "Pisa Lexington" (where the engine assumes a travel intent), "sun", and many others.

Then he moved to more sophisticated forms of computation, such as "jupiter pisa dec 10 1608", a tribute to Galileo, who was born in Pisa and observed Jupiter. Other examples included metric computations such as "5 miles/sec"; chemical computations such as "pentane 2 atm 400 centigrade"; genetics, such as "aggttgagaggatt" (with an impressive approximate pattern matching technology); finance, with "msft apple" (a nice comparison between the two stocks); nutrition, with "apple" (where the meaning is disambiguated between the stock and the fruit) and nice nutrition computations such as "potassium 3 apple + 1 cheddar cheese"; and personal finance computations such as "5% mortgage 50000 euros 20 years". I liked it very much when the computational engine tried to guess mathematical formulae from input like "1 1 2 3 5", the famous Fibonacci numbers: another tribute to another famous man born in Pisa.
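
To give a feeling for what "guessing a formula" from a few numbers can mean, here is a minimal sketch in Python. It is my own toy illustration, not how Wolfram|Alpha works (nothing about its method was disclosed in the talk): the function name guess_next and the three candidate rules are assumptions I picked for the example.

```python
from fractions import Fraction

def guess_next(seq):
    """Try a few simple rules on a list of integers.

    Returns (rule_name, predicted_next_term), or None if no rule fits.
    Hypothetical helper for illustration only.
    """
    if len(seq) < 3:
        return None  # too short to tell the rules apart

    # Rule 1: constant difference between consecutive terms.
    d = seq[1] - seq[0]
    if all(b - a == d for a, b in zip(seq, seq[1:])):
        return ("arithmetic", seq[-1] + d)

    # Rule 2: constant ratio between consecutive terms (exact, via Fraction).
    if all(x != 0 for x in seq):
        r = Fraction(seq[1], seq[0])
        if all(Fraction(b, a) == r for a, b in zip(seq, seq[1:])):
            return ("geometric", seq[-1] * r)

    # Rule 3: each term is the sum of the two previous ones (Fibonacci-like).
    if all(seq[i] == seq[i - 1] + seq[i - 2] for i in range(2, len(seq))):
        return ("additive recurrence", seq[-1] + seq[-2])

    return None

print(guess_next([1, 1, 2, 3, 5]))
```

For "1 1 2 3 5" only the third rule fits, so the sketch proposes 8 as the next term. A real engine would presumably search a much larger space of candidate formulas.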

After this initial demo, Stephen discussed at a very high level the 4 fundamental pieces composing Wolfram|Alpha.
  • A very large number of real time data feeds, which are continuously cleaned up by human editors who are experts in each distinct domain. Stephen provided no precise number about how many people are involved in this cleansing step;
  • A computational engine used to map the cleansed real data feeds into some internal representation. Stephen said that any information is mapped into "short fragments", which are not necessarily NLP units;
  • A query engine which maps the user queries into the internal representation. He claimed that currently about 22% of queries show no matching answers;
  • A collection of ranking algorithms for selecting the best answer among the potential ones.
He said that the whole code base is made up of ~6 million lines of Mathematica code, and that more than 50K algorithms have been implemented. Unfortunately, no other scientific information was provided.

Here is a collection of questions and a synthesis of the answers.

Q: How do you position W|A compared with the rest of NLP technology?
A: The problems we face are quite different from traditional NLP ones. We don't have well-structured documents, and we don't have well-formed syntactic sentences with subject, verb, and so on. We used Wikipedia quite extensively for understanding entities and pre-aggregated data. Apart from that, Wikipedia is of little use; even the infoboxes are not good from a quality point of view. One thing we want to investigate is letting users upload their own data to our engine and perform some computation on top of that data. An additional research field we are investigating is what we call "minimum preserving transformation", a methodology to map different but semantically similar queries into our internal fragment-based representation.

Q: What are the resources involved in W|A in terms of people and servers?
A: We have 4 different co-location sites (data centers). Every single request goes to 8 CPUs in parallel. At the very beginning we started with about 10,000 servers, and now we are increasing that number. For each request, the first result is sent to the client after about 5 ms; other results are then injected with AJAX technology. 100 people worked on the project for about 3 years. Some of them have now returned to our core Mathematica development, but we are running a massive hiring campaign. We are now starting to mine our query logs, and there is a huge amount of information there. People want to search, and not just to test knowledge.

Q: How important is the caching strategy?
A: Our queries are quite different from the ones submitted to search engines, where traditionally at least 25% of queries can be cached. We have a rather low percentage, almost zero. One very promising idea is to use past successful queries to suggest related queries to new users.

Q: Can you search Wolfram?
(vanity query worked quite well)

Q: What about a deal with Google or other engines?
A: The future is pretty interesting, and we have nice relations with media and news. A bunch of new things will come in the next months...

Q: Do you think that Wolfram|Alpha will have a negative impact on homework, with people getting lazy about studying?
A: Any new technology raises the same questions. I believe we had similar questions when the librarians came, when we saw encyclopedias on CD, or with Wikipedia. Actually, I strongly believe that Wolfram|Alpha can encourage people to study more and more.

Q: I am quite interested in the type of queries and questions you receive. How many of them are about facts which are happening right now? For instance, "iran election".
A: A large number of the queries we get are about real time data. We are investigating this sector. Potentially, we need a large number of feeds, and we need to clean them in real time and perform real time computation. This is interesting, and we get a lot of queries like this, even if our query log analysis is just 5 weeks old.
(Disclaimer: the question above was my own. I noticed that W|A performs poorly on facts that are happening right now or that happened just a few weeks ago. It is good that they want to address the issue.)

Q: How do you see the future of W|A? How difficult is it to run a project like this?
A: Wolfram is an independent company, and we have the money and the craziness to throw away hundreds of millions on a project that at the beginning had no future at all. I wanted to answer a question: "With current technology, is a computation engine feasible?" At the beginning I thought "No way", but I invested the money, and after two years I said "Maybe"; one year ago I started to think "Yes, it is". Many things are necessary to drive a project like this. You need the Web, with all the feeds and the information you can get from there. You need the craziness, and you need the power to take decisions freely, with a limited number of people involved in the decision. Basically, you need to drive it. Today, we are just at time t + 5 weeks. The answer is yes, anyway.

Q: Do you think that this technology can be used to make guesses?
A: What do you mean?
Q: I mean that humans are used to reasoning and making guesses. Suppose someone asks me "Is Los Angeles larger than Tallassee?" I have never heard of Tallassee, but I've heard of Los Angeles many times. Therefore, I would use the frequency of the name as an indicator of how large the city is. I would use a different measure to infer something that I don't know.
A: (Stephen started to mumble... then he continued to mumble, and I saw some computations being drawn on his face. Then he answered.) Quite an interesting question, I never thought about it. I guess you are referring to rules of thumb. So my question is: how many rules are out there? I will think about it. We make some initial guesses when we try to infer mathematical formulas, as in "13 5445656 32", or when you ask for things like "5000 words". Personally, I believe that "Human reason is great, but science is better".

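The rule of thumb in the question above (guess that the more frequently mentioned city is the larger one) can be sketched in a few lines of Python. This is only my illustration of the questioner's idea, not something W|A was said to implement; the function name guess_larger and the mention counts are invented for the example.

```python
def guess_larger(city_a, city_b, mention_counts):
    """Guess which city is larger using only how often each name was seen.

    mention_counts maps a city name to a (hypothetical) number of times
    the guesser has come across it. Returns the guessed city, or None
    when the counts are equal and the heuristic has nothing to say.
    """
    seen_a = mention_counts.get(city_a, 0)  # never heard of it: count 0
    seen_b = mention_counts.get(city_b, 0)
    if seen_a == seen_b:
        return None
    return city_a if seen_a > seen_b else city_b

# Made-up counts, purely for illustration.
mentions = {"Los Angeles": 1200, "Tallassee": 0}
print(guess_larger("Los Angeles", "Tallassee", mentions))
```

The guess uses a proxy measure (familiarity) instead of the missing fact (population), which is exactly the kind of inference the question was about.
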
Q: Do you believe that W|A can answer philosophical questions? Like the ultimate question: "Does god exist?" He asked the question and got this answer:

"I'm sorry, but I don't think a poor computational knowledge engine, no matter how powerful, is capable of providing a simple answer to that question."

(Just after that his phone rang. I thought he had received a call from above!) Was this a prepared question? Maybe, I don't know.

So, the conclusion. The presentation was quite interesting, and we are definitely in front of something new. When W|A gives an answer, it is generally quite impressive, and you cannot stop playing with it. The precision is good, but the recall is quite low: they have low coverage within certain specific domains. Anyway, we are just at "t+5 weeks", as Stephen said. Therefore, it's too early to express a definitive judgment.
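
For readers not familiar with the two terms, here is a small Python sketch of what I mean by precision and recall for an answer engine. The counts are toy numbers I made up, not measurements of W|A, and I assume for simplicity that every query has a correct answer in principle.

```python
def precision_recall(answered_correctly, answered_wrongly, unanswered):
    """Return (precision, recall) for a batch of queries.

    precision: share of the given answers that are correct.
    recall:    share of all queries that received a correct answer
               (assuming every query is answerable in principle).
    """
    answered = answered_correctly + answered_wrongly
    total = answered + unanswered
    precision = answered_correctly / answered if answered else 0.0
    recall = answered_correctly / total if total else 0.0
    return precision, recall

# Toy example: 10 queries, 7 answered (6 of them correctly), 3 unanswered.
p, r = precision_recall(answered_correctly=6, answered_wrongly=1, unanswered=3)
print(f"precision={p:.2f} recall={r:.2f}")
```

An engine that answers only when it is sure will score high on precision, while every unanswered query pulls the recall down. This is the pattern I had in mind above.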

I can say that computation is quite effective when you are navigating through specific domains such as Maths, Physics, Nutrition, Geography, Finance, and a bunch of others. There are domains where they have no data and hence no computation at all. And, as Stephen said, they only know English and no other language at the moment.

I have a request for the W|A team. I understand the need to protect the IP of what you are doing with patents and the like. Still, I hope that you will publish a little more about your results in scientific publications, as all the other big engines are doing (e.g. Google Research, Microsoft Research, and Yahoo Research). I hope that you are not heading in the direction of keeping all the industrial results secret. I remember hearing the term "knowledge engine" before, in '98, and it is no longer around. I believe that this was because they decided to adopt a rather obscure way of describing their technology (personal opinion).

W|A is quite a different story. I want to see it evolving and opening up to the rest of the world. I will keep monitoring your results.

1 comment:

  1. see some other information about W|A here http://www.technologyreview.com/printer_friendly_article.aspx?id=22834&channel=web&section
