Discussion

We presented a novel algorithm for the induction of hierarchical structure in labeled graphs and demonstrated successful induction of a multi-level structure of a DNA molecule from raw data--a PDB file stripped down to the list of atoms and covalent bonds.

The significance of this result is that the hierarchical structure inherent in a multi-level nested network can indeed be inferred from first principles with no domain knowledge or context required. It is particularly surprising that the result is obtained in the absence of any kind of functional information on the structural units. Thus, Graphitour can automatically uncover the structure of unknown complex molecules and ``suggest'' functional units to human investigators. While, there is no universal way to decide which components of the resulting structure are "important" across all domains and data, frequencies of occurrence is one natural indicator.

One exciting area of application which would naturally benefit from Graphitour is automated drug discovery. Re-encoding large lists of small molecules from PBD-like format into grammar-like representation would allow for a very efficient candidate filtering at the top levels of hierarchy with the added benefit of creating structural descriptors to be matched to functional features with machine learning approaches.

Analysis of proteins benefits from the same multi-level repetition of structure as DNA, and Graphitour gets similar results for individual peptides. Analyzing a large set of protein structures together is our work in progress. The hope is to reconstruct some of the domain nomenclature and possibly re-define the hierarchy of the domain assembly.

The current version of the Graphitour corresponds to a so-called non-lossy or lossless compression. This guarantees that the algorithm will induce a grammar corresponding precisely to the input data, without over- or under-generating. The drawback of this approach is that the algorithm is sensitive to noise and has rather poor abstraction properties. Future work would require developing lossy compression variants, based on the approximation of description length minimization.

Finally, it would be quite interesting to analyze the outcome of the algorithm from a cognitive plausibility point of view, that is to compare structures found by human investigator to these descovered by the algorithm on various application examples.

Leonid Peshkin 2007-03-23