Results

In this section we take a closer look at the results of running Graphitour on raw data obtained from a PDB file of a DNA molecule by stripping off all the information except for the list of atoms and the list of of covalent bonds. To simplify the representation of results, hydrogen atoms and corresponding edges were removed from the graph. Additionally, to support the symmetry, we artificially linked the DNA into a loop by extra bonds from 3' to 5' ends. The resulting list of nodes and edges in the incidence matrix is presented as input to Graphitour and is depicted in Figure 5. Figure 6.left and 6.right show two sample non-terminals from the actual output from the Graphitour software, illustrating the discovery of basic DNA elements.

**Figure 5:** Layout of original graph corresponding to a small DNA molecule.
[width=12.5cm,height=9.cm]WholeDNA.ps

**Figure 6:** **left:** The compound object induced by the *Graphitour* algorithm, which corresponds to the base of Guanine. All four DNA bases were identified as non-terminal nodes in the resulting graph grammar. **right:** The compound object induced by the *Graphitour* algorithm, which corresponds to the backbone of the molecule: phosphate and sugar
[height=7cm]Guanine.eps[height=7cm]SugarPhosphate.eps

Figure 6.left shows the compound object induced by the Graphitour algorithm, which corresponds to the base of Guanine. All four DNA bases were identified as non-terminal nodes in the resulting graph grammar. This particular structure is identified as a non-terminal by the number in a list of non-terminals and described as type in the lexicon of compounds. It is at the depth in the hierarchy tree of the whole graph, being joined with the corresponding backbone phosphate & sugar one level higher in the upstream; subsequently joined with the complete nucleotide adjacent to it, and so forth.

Figure 6.right shows the compound object induced by the Graphitour algorithm, which corresponds to the backbone of the molecule: joined phosphate and sugar. This particular structure is identified as a non-terminal by the number in a list of non-terminals and described as type in the lexicon of compounds. It is at the depth in the hierarchy tree of the whole graph, same depth as compound in Figure 6.left which makes sense. Note that although the number of lexicon entries is over a hundred, the vast majority of numbers were only used for intermediate entries, which got merged into composite elements. There is no reusing of the available lexicon slots. Such re-use could give some extra compression efficiency (compare to Sequitour [6]). The resulting lexicon is small and all the entries in the final lexicon are biologically relevant compounds.

Graphitour actually discovers the hierarchical structure corresponding to the kind of zoom-in hierarchical description we gave in the introduction. The algorithm identifies all of these structural elements without exception and does not select any elements which would not make biological sense. The significance of such result might not be immediately clear if one does not take into account that the algorithm does not have any background knowledge, any initial bias, nothing at all except for the list of nodes and edges.

**Figure 7:** Compositional elements of E.coli regulatory network.
[width=6cm]Star4Alon.eps[width=6cm]Star3Alon.eps [width=6cm]Star2Alon.eps[width=6cm]X1Alon.eps

As an example of a rather different application domain, we tried Graphitour on escherichia coli transcriptional regulation network, since it is one of the well known attempts to discover "network motifs" or building blocks in a graph as over-represented sub-graphs by Alon et al. [9]. In this network there are nodes corresponding to the genes or groups of genes jointly transcribed (operons). Edges correspond to the known interactions: each edge is directed from an operon that encodes a transcription factor to an operon that it directly regulates. For more information on the biological significance of the motif discovery in regulatory networks we refer you to the cited paper [9] and supplementary materials at the Nature Internet site. Alon et al. find that much of the network could be composed of repeated appearances of three highly significant motifs, that is a sub-graph, of a fixed topology, in which nodes are instantiated with different node labels. One example is what they call a feedforward loop $X\rightarrow Y \rightarrow Z$ combined with $X \rightarrow Z$ , where are instantiated by particular genes. In order to compare to the results of Alon et al. we label nodes simply according to cardinality, omitting gene identities, and take into account directionality of the edges. No hierarchical structure or feedforward loops were distilled from the data. Rather, Graphitour curiously disassembled the network back into the sets of star-like components as illustrated by Figure 7, which ought to have been initially put together to create such network. It leaves open the question of what is a useful definition of a network component in biological networks, if not the one requiring the most parsimonious description of the network.