2008-04-26

Navigating the Ocean of Information - Past, Present and Future

In 1945, Vannevar Bush, considered the grandfather of hypertext, was already concerned about the information explosion in which we live today. In his essay As We May Think he wrote: "The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships."

Since then, with the emergence of the Internet and the evolution of search engines, great progress has been made in this area. In the beginning, when the available information was very limited, simple pattern matching on the searched words was enough to return the relevant documents to users. Because the number of documents was small, a user could rapidly inspect the results and decide whether he had found the information he was looking for.

As more information was added to the World Wide Web, more advanced techniques to rank search results became necessary, since it was no longer viable for the user to inspect every document that contained the keyword. Many heuristics are used to try to determine the relevance of a document with respect to a keyword, such as the keyword's presence in the title of the page, the number of times it appears divided by the total number of words, and the distance between the searched terms within the document. Information external to the document is also used to determine its relevance, such as the anchor text of other sites that link to it.
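As a rough illustration of how such heuristics can be combined, here is a small Python sketch that scores a document against a query using the three signals mentioned above: presence in the title, term frequency and term proximity. The field names and weights are hypothetical, chosen only for this example; real search engines combine many more signals.

def score(document, query_terms):
    # "document" is assumed to be a dict with hypothetical "title" and "body" fields.
    title_words = document["title"].lower().split()
    body_words = document["body"].lower().split()
    total_words = len(body_words) or 1

    # Signal 1: how many query terms appear in the title.
    title_hits = sum(1 for term in query_terms if term in title_words)

    # Signal 2: term frequency (occurrences divided by the total number of words).
    frequency = sum(body_words.count(term) for term in query_terms) / total_words

    # Signal 3: reward documents where the query terms appear close together.
    positions = [i for i, word in enumerate(body_words) if word in query_terms]
    proximity = 1.0 / (max(positions) - min(positions) + 1) if len(positions) > 1 else 0.0

    # The weights below are arbitrary; tuning them is part of each engine's secret sauce.
    return 2.0 * title_hits + 10.0 * frequency + proximity

# Example: score a tiny made-up document against the query "square ships".
doc = {"title": "Square-rigged ships", "body": "ships with square rigged sails were slow"}
print(score(doc, ["square", "ships"]))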

Ranking algorithms that take better advantage of external information tend to produce better results. With the increasing volume of information available, companies want better visibility in search engine results, which gave birth to the field called Search Engine Optimization. The information contained inside a document is easily manipulated, so it is easy for a company to create a site with poor-quality content that still appears among the first results for specific queries. Changing the relative importance that other people attribute to a site, on the other hand, is more difficult. This led search engines to adopt algorithms that give more weight to pages linked to by other important pages, the most famous being PageRank, developed by the Google founders. In this case, the relevance of a link is determined through a collaborative process in which all sites in the search engine's database participate, which lowers the impact that local optimization tricks can have on the global results and brings better results to all users of the search engine.
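To make the idea more concrete, below is a minimal Python sketch of the iterative link analysis behind PageRank, applied to a hypothetical three-page web. The damping factor of 0.85 is the value commonly cited in the literature; this is an illustration of the principle, not Google's actual implementation.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Every linked page is assumed to also appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Each page starts with the small "random surfer" share...
        new_rank = {page: (1.0 - damping) / n for page in pages}
        # ...and then receives a share of the rank of every page linking to it.
        for page, outgoing in links.items():
            if not outgoing:
                # Pages with no outgoing links spread their rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Tiny hypothetical web: three sites linking to each other.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(web).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))

Pages that receive links from other highly ranked pages end up with higher scores, which is exactly the property that makes the result hard to manipulate from within a single site.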

Yet specialists in Search Engine Optimization develop techniques to try to cheat even the most advanced algorithms, for instance by buying several domains that link to each other, or by paying sites with high visibility to include links to theirs. It is a constant war between search engines and spammers: the former try to improve their algorithms and increase their computing power, while the latter study new ways to increase the visibility of their sites.

However, a new model for determining what is interesting on the Internet has been gaining ground in recent years. Instead of leaving the job of determining the relevance of a document solely to an algorithm that makes a superficial analysis of the whole Internet's production, communities of people interested in the subject take on this task, constantly providing and updating data about the quality of the documents. Many people believe this will be the future of search engines. One such example is Wikia.

But how will people and algorithms interact, in a kind of human-based computation, to achieve the best results? This question is far from having a definitive answer, and it rests with the many people interested in working together on a solution.

2008-04-14

Permissive licenses and the restrictions placed upon them

Permissive licenses like the BSD and MIT licenses impose few restrictions on the use and redistribution of the software. Created in an academic environment, they are based on the principle of publishing and reusing ideas with as much freedom as possible.

The absence of stronger conditions on the distribution of software under these licenses leads to limitations when it is combined with licenses that follow the copyleft principle, such as the GNU GPL, which demands that any derived work that is to be distributed be licensed under the same terms as the original license, as stated in section 2 of GPLv2:
You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License

Still, this restriction applies only to uses that are protected under copyright law. As the GPL itself says, you don't have to accept the license, but nothing else grants you permission to do what would otherwise be prohibited by law. The main exclusive rights under copyright law are to copy, to distribute and to create derivative works. Other restrictions may apply through software patents, but these aren't valid in many countries (including mine), so I won't discuss them any further.

Therefore, it is worth noting that the simplest form of use of software, merely running it, is free under any circumstances. This means that using an operating system or an integrated development environment licensed under the GPL while developing your software doesn't force you to license your software under GPL terms.

Other uses, such as distribution, contribution, creation of derivative works and use through linking, may be subject to specific conditions. Most of the licenses aren't very precise about the meaning of each of these terms, so the parties involved are left with an interpretation problem. Even though the Free Software Foundation explains in other documents what its intention was when it wrote the license, those clarifications have no legal value unless we are considering a piece of software owned by the FSF. As a result, because of the ambiguities in the text of the licenses, it is often not clear what the result of litigation would be.

Nevertheless, it is worthwhile to try to understand the position of the FSF, to serve as a reference in the discussion about licensing software under a permissive license when it is related to other software that uses the GPL or LGPL. Let's consider a few cases:

  • Modification of the source code to be distributed as a derivative work or to be returned as a contribution to the original work: it must be released under a compatible license. In the case of the LGPL, the new work must be distributed as a library;
  • Creation of a new work that uses a library licensed under the GPL: this use is highly controversial, because some people consider that when a program uses a library, a collective work composed of the program plus the library is being created, rather than the program being a derivative work of the library. According to the FSF, however, the "viral clause" applies in this case, because the program as it is actually run includes the library (see the FAQ). The GPL does say that if there are identifiable sections of the work that are not derived from the program protected under the license, and those sections can reasonably be considered independent, then they are not required to be licensed under the GPL when distributed as separate works. In fact, many developers try to escape from the GPL by distributing their code without the required libraries. This may not work, though, because one could argue that the code isn't really independent from the GPL-licensed work. Moreover, if the work is distributed as part of a whole based on the work licensed under the GPL, then the distribution of the whole must be under the terms of the GPL, whose permissions for other licensees extend to the entire whole, and thus to each and every part of it;
  • Creation of a new work that calls functions from software licensed under the GPL, but that doesn't need it to compile or to run: there is even more controversy in this case, but in general it is subject to the same restrictions; that is, taking the more conservative approach, it is necessary to license the work under the GPL terms;
  • Creation of a new work that uses a library licensed under the LGPL: in this case, since the LGPL is a variation of the GPL created precisely to allow the use of libraries by software that doesn't comply with the GPL, the work may be licensed under other terms, such as the BSD license.


To conclude, we should also note that the BSD license itself isn't perfectly clear about what the licensee is allowed to do either, since the terms "redistribution and use" and "with or without modification" can't be mapped directly onto the exclusive rights granted under copyright law. It is implicit that all the rights are being granted, but other licenses, such as the MIT license, are more explicit about such things.

2008-04-09

Blog 2.0

This will be my second attempt at writing a blog in English.