2008-04-26

Navigating the Ocean of Information - Past, Present and Future

In 1945, Vannevar Bush, considered the grandfather of hypertext, was already concerned about the information explosion in which we live today. In his essay As We May Think he wrote: "The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships."

Since then, with the emergence of the Internet and the evolution of search engines, great progress has been made in this area. In the beginning, when the available information was very limited, simple pattern matching on the searched words was enough to return the relevant documents to users. Because the number of documents was small, the user could quickly inspect the results and decide whether he had found the information he was looking for.
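To make that early approach concrete, here is a minimal sketch of keyword matching over a toy collection; the documents and query are made up for illustration:

```python
# Minimal keyword matching: a document is "relevant" if it contains
# every word of the query. Toy data, purely illustrative.
documents = {
    "doc1": "square-rigged ships crossed the ocean",
    "doc2": "search engines index the world wide web",
}

def matches(query, docs):
    """Return the names of the documents that contain every query word."""
    terms = query.lower().split()
    return [name for name, text in docs.items()
            if all(term in text.lower().split() for term in terms)]

print(matches("search engines", documents))  # ['doc2']
```

With only a handful of documents, reading through everything such a match returned was perfectly feasible.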

As more information was added to the World Wide Web, more advanced techniques for ranking search results were needed, since it was no longer viable for the user to examine every document that contained the keyword. Many heuristics are used to estimate the relevance of a document with respect to a keyword, such as the keyword's presence in the title of the page, the number of times it appears divided by the total number of words in the document, and the distance between the searched terms within the document. In the same manner, information external to the document is also used to determine its relevance, such as the anchor texts of other sites that link to it. A rough combination of some of these signals is sketched below.
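This is an illustration only, not any real engine's formula: a small scoring function mixing three of the heuristics above (presence in the title, term frequency, and proximity between the query terms), with arbitrary weights:

```python
# Toy relevance score mixing three heuristics; the weights are arbitrary.
def score(query_terms, title, body):
    words = body.lower().split()
    terms = [t.lower() for t in query_terms]

    # 1. Presence of the query terms in the title of the page.
    title_hits = sum(t in title.lower() for t in terms)

    # 2. Term frequency: occurrences divided by the total number of words.
    tf = sum(words.count(t) for t in terms) / max(len(words), 1)

    # 3. Proximity: the closer the terms appear to each other, the higher.
    positions = [[i for i, w in enumerate(words) if w == t] for t in terms]
    if all(positions):
        span = max(p[0] for p in positions) - min(p[0] for p in positions)
        proximity = 1.0 / (1 + span)
    else:
        proximity = 0.0

    return 2.0 * title_hits + 10.0 * tf + proximity

print(score(["ocean", "information"],
            "Navigating the Ocean of Information",
            "an ocean of information keeps growing every day"))
```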

Ranking algorithms that take better advantage of external information tend to produce better results. With the increasing volume of information available, companies want better visibility in search engine results, which gave birth to the field called Search Engine Optimization. The information contained inside a document is easily manipulated, so it is easy for a company to create a site with poor-quality content that appears among the first results for specific queries. Changing the relative importance that other people attribute to a site, on the other hand, is more difficult. This led search engines to adopt algorithms that give more weight to pages linked to by other important pages, the most famous being PageRank, developed by the Google founders. In this case, the relevance of a page is determined collaboratively, with every site in the search engine's database participating. This lowers the impact that local optimization techniques can have on the global results, bringing better results to all users of the search engine.
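Here is a minimal sketch of the PageRank idea, a simple power iteration over a tiny made-up link graph with the usual 0.85 damping factor; this is an illustration of the principle, not Google's actual implementation:

```python
# Power-iteration sketch of PageRank: a page's weight is fed by the
# weights of the pages that link to it. Graph and constants are illustrative.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            # Each page distributes its current rank among its outgoing links.
            share = damping * rank[page] / len(outgoing) if outgoing else 0.0
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

# "a" is linked to by both "b" and "c", so it ends up with the highest rank.
print(pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}))
```

Because a page's score depends on the scores of the pages linking to it, inflating one's own site is not enough: the web as a whole, in aggregate, votes on what matters.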

Yet specialists in Search Engine Optimization keep developing techniques to cheat even the most advanced algorithms, for instance by buying several domains that link to each other, or by paying high-visibility sites to include links to theirs. It is a constant war between search engines and spammers: the former try to improve their algorithms and increase their computing power, while the latter study new ways to increase the visibility of their sites.

However, a new model for determining what is interesting on the Internet has been gaining ground in recent years. Instead of leaving the job of determining a document's relevance solely to an algorithm that performs a superficial analysis of everything published on the Internet, communities of people interested in the subject take on this task, constantly providing and updating data about the quality of the documents. Many people believe this will be the future of search engines. One such example is Wikia.

But how will people and algorithms interact, in some form of human-based computation, to achieve the best results? This question is far from having a definitive answer, but it rests in the hands of the many people interested in working together on a solution.
