Tuesday, June 30, 2009

Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos

Can you detect the correct order of a bunch of videos recorded at the same event by different people?

Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos is a Yahoo paper trying to answer this question on a testbed of YouTube videos.

The key idea is to use fingerprints based on the audio track and then cluster them into a time sequence of videos.
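To make the idea concrete, here is a minimal sketch (my own simplification, not the paper's fingerprinting scheme): compute a coarse spectral fingerprint per audio frame, then estimate the relative offset between two clips as the shift that maximizes fingerprint agreement. The clip arrays and parameters are assumptions for illustration.

```python
# Minimal sketch: align two clips by coarse audio fingerprints.
# `signal` is assumed to be a mono PCM numpy array at a known sample rate.
import numpy as np

def fingerprint(signal, rate, frame_sec=0.1, bands=16):
    """One coarse hash per frame: sign of energy differences across bands."""
    frame = int(rate * frame_sec)
    n = len(signal) // frame
    prints = []
    for i in range(n):
        spec = np.abs(np.fft.rfft(signal[i * frame:(i + 1) * frame]))
        band_energy = np.array([b.sum() for b in np.array_split(spec, bands)])
        bits = (np.diff(band_energy) > 0).astype(np.uint8)
        prints.append(int("".join(map(str, bits)), 2))
    return np.array(prints)

def best_offset(fp_a, fp_b):
    """Relative offset (in frames) of clip B inside clip A that maximizes matches."""
    best, best_score = 0, -1
    for off in range(-len(fp_b) + 1, len(fp_a)):
        a = fp_a[max(0, off):off + len(fp_b)]
        b = fp_b[max(0, -off):max(0, -off) + len(a)]
        score = int(np.sum(a == b))
        if score > best_score:
            best, best_score = off, score
    return best, best_score

# Ordering a collection then reduces to sorting the clips by their estimated
# offsets relative to a reference clip (or by chaining pairwise offsets).
```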

Monday, June 29, 2009

Large Scale Multi-Label Classification via MetaLabeler

SVMs are very effective for binary classification. In order to handle classification with many categories, there are also extensions for multi-class SVM. Anyway, a more frequently used approach is to adopt binary classification in a one-against-all fashion (each category is considered against all the others, and n categories are handled by leveraging n binary classifiers).

Large Scale Multi-Label Classification via MetaLabeler suggests extending the one-against-all approach by adopting an auxiliary classifier which learns how many of the top-scoring labels are relevant.
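A rough sketch of the idea with scikit-learn (my own simplification of the paper's SVM setup): one-vs-rest classifiers produce a score per label, and a "meta" classifier predicts how many of the top-scoring labels to keep for each instance.

```python
# Sketch of the MetaLabeler idea: one-vs-rest scoring plus a meta model
# that predicts the number of labels per instance (an illustrative setup,
# not the paper's exact configuration).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

def fit_metalabeler(X, Y):
    """X: (n_samples, n_features); Y: binary indicator matrix (n_samples, n_labels)."""
    ovr = OneVsRestClassifier(LinearSVC()).fit(X, Y)
    label_counts = Y.sum(axis=1)                 # number of true labels per sample
    meta = LinearSVC().fit(X, label_counts)      # treat the count as a class
    return ovr, meta

def predict_metalabeler(ovr, meta, X):
    scores = ovr.decision_function(X)            # one score per label
    ks = meta.predict(X).astype(int)             # predicted number of labels
    preds = []
    for row, k in zip(scores, ks):
        top = np.argsort(row)[::-1][:max(k, 1)]  # keep at least one label
        preds.append(sorted(top.tolist()))
    return preds
```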

Sunday, June 28, 2009

Efficient Multiple-Click Models in Web Search

Can you use past clicks on search results to predict future clicks? Efficient Multiple-Click Models in Web Search is a Microsoft paper aiming at providing an answer to this question.

Two different models are evaluated: an Independent Click Model (ICM), where clicks are independent of each other (i.e. position is not taken into account), and a Dependent Click Model (DCM), where position is taken into account. The models are built on a strong theoretical background based on log-likelihood maximization. Anyway, the final formulas derived are very simple to compute, require linear time and space, and can be computed incrementally.
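As a flavor of how cheap the incremental estimation can be, here is a minimal sketch in the spirit of the independent click model (my simplification, not the paper's exact estimator): per (query, document) relevance is simply clicks over impressions, updated in constant time per observation.

```python
# Incremental, position-agnostic click probability estimation (ICM-style sketch).
from collections import defaultdict

class IndependentClickModel:
    def __init__(self):
        self.impressions = defaultdict(int)   # (query, doc) -> times shown
        self.clicks = defaultdict(int)        # (query, doc) -> times clicked

    def update(self, query, shown_docs, clicked_docs):
        """One search session: the docs shown and the subset that was clicked."""
        clicked = set(clicked_docs)
        for doc in shown_docs:
            self.impressions[(query, doc)] += 1
            if doc in clicked:
                self.clicks[(query, doc)] += 1

    def click_probability(self, query, doc, prior=0.5, prior_weight=2.0):
        """Smoothed estimate so unseen pairs fall back to the prior."""
        n = self.impressions[(query, doc)]
        c = self.clicks[(query, doc)]
        return (c + prior * prior_weight) / (n + prior_weight)
```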

Saturday, June 27, 2009

Generate all the permutations of a multiset

A multiset has fewer than n! distinct permutations, since elements can be repeated. Generate all of them in optimal space and time.
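One standard solution sketch: recurse over element counts so that each distinct permutation is produced exactly once, using O(n) extra space beyond the output.

```python
# Generate all distinct permutations of a multiset via count-based backtracking.
from collections import Counter

def multiset_permutations(items):
    counts = Counter(items)
    keys = sorted(counts)
    items = list(items)
    current = []

    def backtrack():
        if len(current) == len(items):
            yield tuple(current)
            return
        for k in keys:
            if counts[k] > 0:
                counts[k] -= 1
                current.append(k)
                yield from backtrack()
                current.pop()
                counts[k] += 1

    yield from backtrack()

# Example: list(multiset_permutations("aab")) ->
# [('a', 'a', 'b'), ('a', 'b', 'a'), ('b', 'a', 'a')]
```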

Friday, June 26, 2009

Michael Jackson, I can't stop loving you

Off-topic: Michael, thanks for making so many beautiful moments in my life.

Thursday, June 25, 2009

Stephen Wolfram talking about Wolfram|Alpha at Pisa

Stephen Wolfram gave a public talk in Pisa about Wolfram|Alpha, his latest research pillar. This was the first public talk he gave worldwide after the product launch, about 5 weeks ago. He had already given other talks to restricted audiences.

I liked the approach he adopted for the presentation. He went straight to the point, giving a demonstration of the innovative aspects of Wolfram|Alpha.

He started with queries related to his research activities, such as Mathematica, NKS, Computational Knowledge Engine, and maths. Then, he showed some aspects of natural language queries such as "integrate cos x dx" and "find the integral of sin x cos y". This was a first step beyond the traditional search engines' kingdom.

I believe the audience was impressed by that, and I heard people asking "is this just a nice demo or can we move to other domains as well?" Stephen gave an impressive answer with queries like "gdp france" (economy), "geoip" (internet), and with simple but effective computations on top of this data, such as "what is the gdp of spain?", "gdp france / italy", "italy internet users", "Pisa", "Lexington" and "Pisa Lexington" (where the engine assumes a travel intent), "sun" and many others.

Then, he moved to more sophisticated forms of computation, such as "jupiter pisa dec 10 1608", a tribute to Galileo, who was born in Pisa and observed Jupiter; other metric computations such as "5 miles/sec"; chemical computations such as "pentane 2 atm 400 centigrade"; genetics such as "aggttgagaggatt" (with an impressive approximate pattern matching technology); finance with "msft apple" (a nice comparison between the two stocks); nutrition with "apple" (where the meaning is disambiguated between the stock and the fruit) and nice nutrition computations such as "potassium 3 apple + 1 cheddar cheese"; and personal finance computations like "5% mortgage 50000 euros 20 years". I liked it very much when the computational engine tried to guess mathematical formulae like "1 1 2 3 5", the famous Fibonacci numbers, another tribute to another famous man born in Pisa.

After this initial demo, Stephen discussed at a very high level the four fundamental pieces composing Wolfram|Alpha.
  • A very large number of real-time data feeds, which are continuously cleaned up by human editors who are experts in their own domains. Stephen provided no precise number about how many people are involved in this cleansing step;
  • A computational engine used to map the cleansed data feeds into some internal representation. Stephen said that any piece of information is mapped into "short fragments", which are not necessarily NLP units;
  • A query engine which maps the user queries into the internal representation. He claimed that currently about 22% of queries show no matching answer;
  • A collection of ranking algorithms for selecting the best answer among the potential ones.
He said that the whole codebase is made up of ~6 million lines of Mathematica code, and that more than 50K algorithms have been implemented. Unfortunately, no other scientific details were provided.

Here is a collection of the questions and a synthesis of the answers.

Q: how do you position W|A compared with the rest of NLP technology?
A: The problems we face are quite different from traditional NLP ones. We don't have well-structured documents, we don't have well-formed syntactical sentences with subject, verb, and so on. We adopted Wikipedia quite extensively for understanding entities and pre-aggregated data. Anyway, apart from that, Wikipedia is of little use; even infoboxes are not good from the quality point of view. One thing that we want to investigate is letting users upload their own data to our engine, and performing some computation on top of that data. One additional research field that we are investigating is what we call "minimum preserving transformation", a methodology to map different, but semantically similar, queries into our internal fragment-based representation.

Q: What are the resources involved in W|A in terms of people and servers?
A: We have 4 different co-locations (data centers). Every single request goes to 8 CPUs in parallel; at the very beginning we started with about 10,000 servers, and now we are increasing that number.
For each request, we have a starting serving time of about 5 ms, when the first result is sent to the client. Then other results are injected with AJAX technology. 100 people worked on the project for about 3 years. Now some of them have returned to our core Mathematica development, but we are doing a massive hiring campaign. We are now starting to mine our query logs, and there is a huge amount of information there. People want to search and not just to test knowledge.

Q: How important is the caching strategy?
A: Our queries are quite different from the ones submitted to search engines, where traditionally at least 25% of queries can be cached. We have a rather low percentage, almost zero. One thing that is very promising is to use past successful queries to suggest related queries to new users.

Q: Can you search Wolfram?
(the vanity query worked quite well)

Q: What about a deal with Google or other engines?
A: The future is pretty interesting and we have nice relations with media and news. A bunch of new things will come in the next months...

Q: Do you think that Wolfram|Alpha will have a negative impact on homework, making people lazy about studying?
A: Any new technology faces the same questions. I believe we had similar questions when libraries came, or when we saw CDs with encyclopedias, or with Wikipedia. Actually, I strongly believe that Wolfram|Alpha can encourage people to study more and more.

Q: I am quite interested in the types of queries and questions you receive. How many of them are about facts which are happening right now? For instance, the Iran election.
A: A large number of the queries we get are about real-time data. We are investigating this sector. Potentially, we need a large number of feeds, to clean them in real time, and to perform real-time computation. This is interesting and we get a lot of queries like this, even if our query log analysis is just 5 weeks old.
(Disclaimer: the above was my own question. I noticed that W|A has poor performance on facts that are happening right now, or that happened just a few weeks ago. It's good that they want to address the issue.)

Q: How do you see the future of W|A? How difficult is it to run a project like this?
A: Wolfram is an independent company and we have the money and the craziness to throw hundreds of millions at a project that at the beginning appeared to have no future at all. I wanted to answer a question: "With current technology, is a computational engine feasible?" At the beginning I thought "no way", but I invested the money and after two years I said "maybe"; one year ago I started to think "yes, it is". Many things are necessary to drive a project like this. You need the Web, with all the feeds and the information you can get from there. You need the craziness, and you need the power to take decisions freely, with a limited number of people involved in the decision. Basically, you need to drive it. Today, we are just at time t + 5 weeks. The answer is yes, anyway.

Q: Do you think that this technology can be used to make guesses?
A: What do you mean?
Q: I mean humans are used to reasoning and making guesses. Suppose that someone asks me "Is Los Angeles larger than Tallassee?" I have never heard of Tallassee, but I've heard about Los Angeles many times. Therefore, I would use the frequency of the name as an indicator of how large the city is. I would use a different measure to infer something that I don't know.
A: (Stephen started to mumble... then he continued to mumble, and I saw some computations drawing on his face.) Then he said: "Quite an interesting question, I never thought about it. I guess you are referring to rules of thumb. So my question is: how many such rules are out there? I will think about it. We have some initial guessing when we try to infer mathematical formulas like in "13 5445656 32", or when you ask for things like "5000 words". Personally, I believe that human reason is great, but science is better."

Q: Do you believe that W|A can answer philosophical questions? Like the ultimate one: "Does God exist?" He asked the question and got this answer:

"I'm sorry, but I don't think a poor computational knowledge engine, no matter how powerful, is capable of providing a simple answer to that question.
"

(Just after that his phone rang. I thought he had received a call from above!) Was this a prepared question? Maybe, I don't know.

So, the conclusion. The presentation was quite interesting and we are definitely in front of something new. When W|A gives an answer, it is generally quite impressive and you cannot stop playing with it. The precision is good, but the recall is somewhat low: they have low coverage within certain specific domains. Anyway, we are just at "t + 5 weeks", as Stephen said. Therefore, it's too early to express a definitive judgment.

I can say that the computation is quite effective when you are navigating through specific domains such as Maths, Physics, Nutrition, Geography, Finance, and a bunch of others. There are domains where they have no data; hence, no computation at all. And, as Stephen said, they just support English and no other language at the moment.

I have a request for the W|A team. I understand the need to protect the IP of what you are doing with patents and the like. Anyway, I hope that you will publish a little more about your results in scientific publications, as all the other big engines are doing (e.g. Google Research, Microsoft Research, and Yahoo Research). I hope that you are not going in the direction of keeping all the industrial results secret. I remember hearing the term "knowledge engine" before, back in '98, and that engine is no longer there. I believe this was because they decided to adopt a rather obscure way of describing their technology (personal opinion).

W|A is a quite different story; I want to see it evolving and opening up to the rest of the world. I will keep monitoring your results.

Tuesday, June 23, 2009

Stock market and search

Is the stock market anticipating the real economy?


Real time spammers

Spammers are invading Twitter search by stuffing in a lot of spam text.

What is happening in Washington? Real-time text and other signals

I strongly believe real-time search should adopt new ranking signals, and not be just text-based.



Monday, June 22, 2009

How Much Can Behavioral Targeting Help Online Advertising?

How Much Can Behavioral Targeting Help Online Advertising? is a Microsoft paper about Internet monetization. The paper shows how behavioral targeting can help improve monetization and click-through rate (CTR).

In other words, users with similar search needs tend to click similar ads. Segmenting users by means of clustering is therefore a good strategy for increasing the revenues of a search engine. Users are profiled by means of their search queries and clicked URLs. Standard clustering algorithms (such as K-Means, min-wise hashing and CLUTO) are then used to segment the users.
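A rough sketch of the segmentation step with scikit-learn (my own illustration, not the paper's pipeline): represent each user by a bag-of-words over their queries and clicked-URL tokens, then segment with K-Means.

```python
# Segment users by their search behavior (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def segment_users(user_histories, n_segments=20):
    """user_histories: dict user_id -> 'query terms and clicked url tokens ...'"""
    ids = list(user_histories)
    X = TfidfVectorizer().fit_transform(user_histories[u] for u in ids)
    labels = KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit_predict(X)
    return dict(zip(ids, labels))

# Ad CTR can then be estimated per segment instead of globally,
# which is where the reported improvement comes from.
```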

Obtained results are pretty impressive.

Sunday, June 21, 2009

Speeding up Algorithms on Compressed Web Graphs

Speeding up Algorithms on Compressed Web Graphs is a Microsoft paper @ WSDM09, which leverages a smart transformation for compressing the Web graph. The idea is pretty simple: a heuristic based on frequent itemset mining is used for transforming cliques into star connections.
As a consequence, the number of edges can be dramatically reduced at the cost of introducing some new virtual nodes.

The paper then introduces an optimized matrix-vector multiplication which reorders the adjacency matrix of the transformed Web graph. This multiplication is then used to speed up PageRank, HITS, and SALSA on the compressed graph.
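A toy sketch of the clique-to-star idea (the paper discovers the dense subgraphs with frequent-itemset mining; here the clique is assumed to be given): the k(k-1) directed edges inside a clique are replaced by 2k edges through a new virtual node, so a matrix-vector product can route the same contributions in two hops instead of one.

```python
# Replace a clique's internal edges with a star through a virtual node.
# (An exact PageRank implementation also has to discount the spurious
# u -> virtual -> u paths; this is only the edge-count intuition.)
def clique_to_star(edges, clique, virtual_node):
    """edges: set of (u, v) directed edges; clique: iterable of nodes; returns new edge set."""
    clique = set(clique)
    kept = {(u, v) for (u, v) in edges
            if not (u in clique and v in clique and u != v)}
    star = {(u, virtual_node) for u in clique} | {(virtual_node, v) for v in clique}
    return kept | star

# Example: a 4-clique has 12 directed edges; the star version has 8,
# and the saving grows quadratically with the clique size.
```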

Saturday, June 20, 2009

Iran Protests


Steve Ballmer on Search

Nice article on Bing, the new Microsoft search engine.

"Sometimes the error you make is what you don't do or what you don't start soon enough," he said. "Most of our mistakes came not because we didn't see the technology change that was coming. Ironically, we didn't see the business change that was coming."

"He blames Microsoft's corporate heft, in part. Microsoft had spent richly on research and development. Its R&D budget comes to $9 billion this year alone. And the company had plenty of people working on producing a search engine. But Microsoft had lost its startup brashness. New companies have an edge: They have to succeed big or go bankrupt. That forces them to take risks fast, before it's too late."

"We've got our mojo going now. We're rolling. We're the little engine that could."

Google Flipper

I like the idea behind Google Flipper; we had some visual ideas and a patent here.

Thursday, June 18, 2009

Real time search (what about Ranking?)

The number of real-time search engines is increasing. Yesterday, two new engines joined the race. CrowdEye is from Ken Moss, who ran search engineering at Microsoft and built the new engine himself. Collecta is from Gerry Campbell, who was a search executive at AOL and Reuters, as well as an adviser to Summize (now Twitter Search). OneRiot is run by Kimbal Musk and my old friend Alessio Signorini. Other engines are Topsy, Tweetmeme and Scoopler, not to mention Twitter Search itself.

This reminds me of the old times when Excite, Lycos, Altavista and a lot of newcomers joined the search race, back in 1997. Search is alive and kicking, with new exciting stuff to play with.

I would expect more and more academic publications on real-time search problems, such as real-time ranking.

Carol and Tim

The Tradeoffs Between Open and Traditional Relation Extraction

The Tradeoffs Between Open and Traditional Relation Extraction is a paper about the extraction of relations between entities on a massive scale. The system is based on a self-training, unsupervised approach derived from Conditional Random Fields (CRF) theory. A set of training examples is extracted from a training corpus by means of hand-crafted parsing rules. The training examples are then used to train a linear-chain CRF.
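A minimal sketch of the training side, under my own simplifying assumptions (it uses sklearn-crfsuite rather than the paper's own O-CRF implementation, and a tiny feature set): heuristically labeled token sequences train a linear-chain CRF that tags the relation phrase between two entities.

```python
# Train a linear-chain CRF on heuristically (self-)labeled sentences.
import sklearn_crfsuite

def token_features(tokens, i):
    """Very small per-token feature set: identity, capitalization, neighbors."""
    feats = {"word": tokens[i].lower(), "is_cap": tokens[i][0].isupper()}
    if i > 0:
        feats["prev"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next"] = tokens[i + 1].lower()
    return feats

def train_crf(sentences, labels):
    """sentences: list of token lists; labels: matching lists of B/I/O tags
    produced by the hand-crafted parsing heuristics (the self-training step)."""
    X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, labels)
    return crf
```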

Results are pretty impressive both in terms of precision and recall.

Monday, June 15, 2009

Fresh Correlations: Valentino Rossi and Jorge Lorenzo

Valentino is correlated to Jorge, the algorithm says. Here is the motivation:

"The world champion's win at the Catalan Grand Prix came on the final corner of a last lap prize-fight with Jorge Lorenzo, his Fiat Yamaha team-mate, and means the pair are tied at the top of the MotoGP standings with Casey Stoner on 106 points. "

Sunday, June 14, 2009

Go Lakers Go!

Our multimedia clustering algorithm captured the fresh correlations between news articles, blogs, images, videos, faces and names. All in real time.

Twitter, OneRiot, Google

Interesting article from the NYT on real-time search:

"TECHNOLOGY blogs have wondered whether Google is a lumbering giant in this Twitter moment, unable to handle streams of tweets that were broadcast just seconds earlier."

"A number of search start-ups have appeared recently that differentiate their offerings from older search engines’ by playing up their specialized focus on the real-time Web. For example, OneRiot, based in Boulder, Colo., covers Twitter among other social media, but it has an intriguing means of reducing Twitter spam: it does not index the text in tweets — it plucks only the links, reasoning that the videos, news stories and blog posts that are being shared are what others will be most interested in."

"Google’s almost-real-time search provides much higher-quality results than does literal real-time search. When speaking about the need to index the Web “every second,” Mr. Page acknowledged the usefulness of taking a wee bit of time to analyze the gathered information. “If you really want up-to-the-second information, it’s not going to be as good as if you’re willing to wait a couple of minutes,” he said. “I’m not sure everybody needs to be seeing this stuff every second.”

Friday, June 12, 2009

Efficient Interactive Fuzzy Keyword Search

Efficient Interactive Fuzzy Keyword Search is a pretty nice paper about a new information-access paradigm, called "interactive, fuzzy search", in which the system searches the underlying data on the fly as the user types in query keywords. This is useful for the search suggestion services provided while users are typing their queries.

The underlying data structure is a combination of tries, inverted lists, and smart caching techniques used for achieving good performance. The basic data structures are then extended to support ranking and synonyms.
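To give the flavor of the technique, here is a small sketch (my own simplification, not the paper's exact active-node algorithm): a trie is walked with a Levenshtein DP row, so completions whose stored prefix is within a small edit distance of what the user has typed so far are returned.

```python
# Fuzzy prefix search over a trie with edit-distance pruning.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None                      # set on terminal nodes

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def collect(node, out):
    """All completions stored below `node`."""
    if node.word is not None:
        out.append(node.word)
    for child in node.children.values():
        collect(child, out)

def fuzzy_prefix_search(root, query, max_edits=1):
    results = []

    def walk(node, row):
        if row[-1] <= max_edits:              # path fuzzily matches the typed prefix
            collect(node, results)
            return
        if min(row) > max_edits:              # pruning: no descendant can match
            return
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                new_row.append(min(new_row[i - 1] + 1, row[i] + 1, row[i - 1] + cost))
            walk(child, new_row)

    walk(root, list(range(len(query) + 1)))
    return results

# Example: insert "iceland", "island", "icecream", then
# fuzzy_prefix_search(root, "islnd", max_edits=1) -> ["island"].
```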

I enjoyed the paper, but the index sizes are so small that all the data structures can fit in memory (even without compression).

Inverted Index Compression and Query Processing with Optimized Document Ordering

Inverted Index Compression and Query Processing with Optimized Document Ordering is a paper describing state-of-the-art compression techniques for inverted list data structures after document reordering. A reference read.

I wonder if this work can be extended by taking into account that modern search engines store the whole index in memory, including the inverted lists. Therefore, one could explore the benefits of having different memory hierarchies.
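A toy illustration of why document reordering helps (my own example, not the paper's encoders): postings are stored as gaps, and smaller gaps cost fewer bytes under a variable-byte code, so an ordering that clusters the documents of a term shrinks the index.

```python
# Delta-gap a sorted posting list and variable-byte encode the gaps.
def delta_gaps(doc_ids):
    """doc_ids must be sorted; the first gap is the first id itself."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)          # high bit marks the last byte of a number
    return bytes(out)

# After a good reordering the doc ids of a term are close together,
# so most gaps fit in a single byte.
posting_list = [5, 7, 8, 2000, 2003]
print(len(vbyte_encode(delta_gaps(posting_list))))   # 6 bytes for 5 postings
```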

Thursday, June 11, 2009

Yahoo Distribution of Hadoop

Yahoo announced the availability of the Yahoo Distribution of Hadoop, a source-only version of Apache Hadoop that Yahoo uses within its own search engine.

Congratulations to Doug Cutting, a very friendly man who is behind cool things such as Nutch, Lucene and Hadoop.

When will you return to Italy to visit us? This is an official invitation!

Fiat, Chrysler deal done: news, videos, blogs and images

When you see a fresh topic, you may want to have all the different ingredients covered: news articles, blogs, fresh images, faces and videos. Here is an example of our multimedia clustering technology working in real time on thousands of sources. This is coming from Italy:

Tuesday, June 9, 2009

Bing is growing

A nice article from TechCrunch: "According to comScore, Microsoft Sites increased its average daily penetration among searchers in the United States from 13.8% during the period of May 26-30 to 15.5% during the period of June 2-6, 2009, an indication that the search engine is reaching more people than before. Microsoft's share of search engine results pages (SERPs) in the U.S. increased from 9.1% to 11.1% during the same time frame."

Here is the original comScore article.

Microsoft Sites Search Performance
6/2/09-6/6/09 vs. 5/26/09-5/30/09
Total U.S. – Home/Work/University Locations
Source: comScore qSearch

Metric                               5/26/09-5/30/09   6/2/09-6/6/09   Point Change
Searcher Penetration (Avg. Daily)    13.8%             15.5%           1.7
Share of Search Results Pages        9.1%              11.1%           2.0

Monday, June 8, 2009

OneRiot is growing steadily

OneRiot, the real-time search engine, is growing steadily. They closed a very good deal with Microsoft.

Congratulations to my old friend Alessio and to Kimbal, who likes to mix his passion for technology and new business ideas with some great food experiences.

Alessio, way to go!

Exploiting Web Search Engines to Search Structured Databases

Exploiting Web Search Engines to Search Structured Databases is a Microsoft paper about the integration of verticals in Web search results. The key idea is that each vertical is a-priori associated with a list of relevant entities. When a user submits a query, those entities are extracted from the search result snippets. In this way, the query itself is implicitly expanded with vertical-related entities found in the search results (note that a similar idea has been investigated by Yahoo in a paper on advertising).

The paper discusses many off-line pre-processing techniques for extracting entities before query time. A mixed approach based on trie pattern-matching and SVM classification is proposed. Extracted entities are then ranked by taking into account proximity, frequency counts, and the relevance of documents. For a given query, a vertical result is then triggered when 'good enough' entities are retrieved. A case study based on a product search vertical and Microsoft Live Search is discussed. Entities are extracted from Wikipedia, TREC QA, and IMDB.
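A rough sketch of the online step, under my own simplifying assumptions (plain substring matching instead of the paper's trie/SVM machinery): scan the result snippets for known vertical entities and score them by frequency, discounted by the rank of the document they appear in.

```python
# Score vertical entities found in search result snippets (illustrative sketch).
from collections import defaultdict

def score_entities(snippets, known_entities):
    """snippets: list of snippet strings ordered by rank;
    known_entities: set of lowercase entity names associated a-priori with the vertical."""
    scores = defaultdict(float)
    for rank, snippet in enumerate(snippets, start=1):
        text = snippet.lower()
        for entity in known_entities:
            count = text.count(entity)
            if count:
                scores[entity] += count / rank     # frequency, discounted by rank
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# The vertical is triggered only when the top score clears a threshold,
# mirroring the paper's "good enough entities" condition.
```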

The approach is quite effective and the performance is pretty good. Anyway, it may show some limits with real-time verticals (such as Twitter or news), where entities are not known a priori.

Saturday, June 6, 2009

Traffic vs Tracking

An interesting study about user tracking

Is Bing growing?

Source: StatCounter Global Stats - Search Engine Market Share



The company's analysis for Thursday finds that in the U.S. Bing overtook Yahoo to take second place on 16.28%, with Yahoo Search currently at 10.22%. For the sake of comparison: Google's U.S. market share is pegged at 71.47%, and its worldwide share at a whopping 87.62% (vs. 5.62% for Bing and 5.13% for Yahoo).

Friday, June 5, 2009

Goodbye Rajeev Motwani

Although I don't have a direct connection with Rajeev Motwani, he was a source of inspiration for me to start studying computer science many years ago. Thanks for your books, your papers, and the amazing contribution you gave to the Silicon Valley community. Keep us creative from above.

Social Networking Growth

Thursday, June 4, 2009

Personalized Recommendation on Dynamic Content Using Predictive Bilinear Models

Personalized Recommendation on Dynamic Content Using Predictive Bilinear Models is a Yahoo! paper about suggesting relevant news stories for Yahoo's "Today" module. The paper describes a bilinear model leveraging both instantaneous, dynamic click-through information and static content-based features. The adopted objective function is optimized with a gradient descent approach.
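A minimal sketch of a bilinear click model in the spirit of the paper (my own notation and loss, not the paper's exact formulation): the score of user u for item x is u^T W x, and W is learned by gradient steps on a logistic loss over click/no-click events.

```python
# Bilinear click prediction with stochastic gradient updates (sketch).
import numpy as np

def predict(W, u, x):
    """u: user feature vector, x: item (content) feature vector."""
    return 1.0 / (1.0 + np.exp(-(u @ W @ x)))

def sgd_step(W, u, x, clicked, lr=0.05, reg=1e-4):
    p = predict(W, u, x)
    grad = (p - clicked) * np.outer(u, x) + reg * W   # gradient of the log loss wrt W
    return W - lr * grad

# Usage: W = np.zeros((n_user_features, n_item_features)), then loop over
# (u, x, clicked) events calling sgd_step; dynamic CTR features can simply
# be appended to x as they arrive.
```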

The authors collected about 40 million click/view events by about 5 million users from the random bucket before a certain timestamp for training. One evaluation metric adopted is the number of clicks in each rank position; the intuition is that a good predictive model should have more clicks on the top-ranked positions and fewer clicks on the lower-ranked ones. Results are quite interesting: the model shows the benefits of adopting dynamic click-through data.

Unfortunately, no information is given about the time needed to process the data, a rather important point when you process (near) real-time data.

Wednesday, June 3, 2009

Visual Diversification of Image Search Results

Visual Diversification of Image Search Results is a Yahoo paper about image clustering that I liked very much. The paper deploys lightweight clustering techniques to best capture the discriminative aspects of the resulting set of images that is retrieved.

For each image, a set of representative features is extracted, such as color histogram, color layout, scalable color, CEDD, edge histogram, and Tamura. Strangely enough, the authors do not consider SIFT and wavelet signatures, which are quite popular these days.

The authors evaluated three different algorithms:
  • The folding algorithm takes into account the original ranking of the search results as returned by the textual retrieval model. Images higher in the ranking have a larger probability of being selected as a cluster representative. In one linear pass the representatives are selected, and the clusters are then formed around them.
  • The maxmin approach also performs representative selection prior to cluster formation, but discards the original ranking and finds representatives that are visually different from each other (see the sketch after this list).
  • Reciprocal election lets all the images cast votes for the other images that they are best represented by. Strong voters are then assigned to their corresponding representatives and taken off the list of candidates. This process is repeated as long as there exist unclustered images.
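Here is a compact sketch of the maxmin selection step (a plain farthest-point selection over generic feature vectors; my own simplification, not the paper's exact variant): each new representative is the image farthest from the representatives chosen so far.

```python
# Farthest-point (maxmin) selection of cluster representatives.
import numpy as np

def maxmin_representatives(features, k):
    """features: (n_images, n_dims) array of visual descriptors; returns k indices."""
    reps = [0]                                          # seed with an arbitrary image
    dist = np.linalg.norm(features - features[0], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(dist))                      # farthest from current reps
        reps.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return reps

# Clusters are then formed by assigning every remaining image
# to its nearest representative.
```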
As you can imagine, the number of clusters is never fixed, because it is impossible to predict a priori what a good value is. Evaluation is carried out by using 200 manually created clusters as ground truth.

The evaluation measures are the Fowlkes-Mallows index and the Variation of Information criterion. Folding is the best algorithm under the FM index, and Reciprocal election is the best one according to the Variation of Information criterion.

I recommend reading the paper if you are interested in image search and clustering. Just one observation for the authors: since the target of this work is use in production, I would have appreciated a comparison of the time needed to extract the features and to cluster the results with the three different algorithms.

Binging Google, Yahoo and Ask

Why is Bing exposing a search box which uses Google, Yahoo, or Ask.com?

Are they suggesting a comparison?


Tuesday, June 2, 2009

Academic works on News and Freshness

Steven Skiena is quite famous in the algorithms community for his beautiful books. His research on freshness and, in general, on stream data analysis is less known.
  • TextMap identifies trends in the temporal and geographic interest in entities such as people, places, and things by analyzing roughly 1000 daily English language newspapers.
  • TextMed identifies the relationships between medical or biological entities through an analysis of PubMed/Medline abstracts.
  • TextBiz uses random-walk models to generate a probability distribution on the future prices for all NASDAQ, NYSE, and AMEX stocks.
If you are interested, I suggest visiting this page with a list of academic papers.

Monday, June 1, 2009

Real Time Gossip: Susan Boyle

Query log search suggestions and News search suggestions

Sometimes query-log-based search suggestions show problems. Sergio Marchionne is the CEO of Fiat S.p.A.; he was discussing with Angela Merkel about buying Opel, General Motors' German assets.