Brigitte Vallee - Information Theory: Models, Algorithms, Analysis
Monday, April 12, 2010
Every student of a basic algorithms course is taught that, on average, the complexity of Quicksort or the path-length of a Binary Search Tree is O(n log n), that of QuickSelect is O(n), and that of RadixSort is O(n log n). Such statements are simple, and have the obvious merit of offering an easy-to-grasp picture of the complexity landscape. However, they are based on specific assumptions, that differ from each to the other: for the first two, the unit cost is the comparison between data items and, for the third one, the unit cost is the is the comparison between symbols.
This does not make possible a precise assessment of the relative merits of algorithms and data structures in a way that would satisfy the requirements of either information theory or algorithms engineering. Furthermore, such simplified analyses say little about a great many application contexts, in databases or natural language processing, for instance, where information is highly non-atomic, in the sense that it does not plainly reduce to a single machine word.
In order to perform a realistic analysis of such algorithms, we have to define an adapted framework, first in information theory, where the basic object is a source; a source is just a random mechanism which emits symbols of a given alphabet. We provide here a general modeling for a source, which encompasses the simple sources (memoryless sources or Markov chains), together with more complex sources, where the correlation between symbols may depend on the whole previous history. The inputs of the algorithm are just words produced by the same source; the unit cost is, for all the three algorithms, the comparison between symbols, and the cost of the algorithm is thus the total number of comparisons performed between symbols.
We revisit, in such a unified and realistic model, the probabilistic analysis of three main algorithms: QuickSort, QuickSelect, and dictionary algorithms based on the trie structure. This talk is based in joint works with Julien Clement, James Fill and Philippe Flajolet. 1