Algorithm

The algorithm we present is dubbed "Graphitour" to acknowledge a strong influence of the Sequitour algorithm [6] for discovering the hierarchical structure in sequences. In particular, Graphitour just like Sequitour restricts the set of candidate structures to bigrams, which allows it to eliminate a rather costly lexicon lookup procedure, as well as sub-graph isomorphism resolution, which is NP-hard. Note that Graphitour in principle admits labeled nodes and edges, particularly since unlabeled graphs turn into labeled at the first iteration of the algorithm.

The algorithm is presented in Table 2 and best understood through the illustration of every step on a trivial example of Figure 2 as described below.

**Figure 2:** **left:** A simple graph with labeled nodes corresponding to an imaginary small molecule and a corresponding sample edge lexicon with type frequency counts. **right:** Non-bipartite maximum cardinality matching.
[height=5.5cm]53.eps [height=5.5cm]YX.eps

**Figure 3:** A graph decomposition into subgraphs induced by edges of different type.
[height=5.3cm]54.eps

**Figure 4:** A sequence of abstraction steps, right to left: a new entry is made into the lexicon, called "HC" which corresponds to an abstracted edge.
[height=5.3cm]60.eps

The next step is to select which kind of edge will be abstracted into a (hyper)node corresponding to a new compound element. A greedy choice would have picked the most frequent type of edge, disregarding potential conflicts due to shared node interactions. Instead Graphitour proceeds by decomposing the graph into three graphs each containing only edges of one type, which is illustrated by Figure 3, and counting how many edges of each type could be abstracted, taking into account node interactions.

One of the key contributions of the algorithm described here is in relating the graph grammar induction to the maximum cardinality non-bipartite matching problem for which there is a polynomial complexity algorithm due to Gabow [4]. The problem is formulated as follows: given a graph, find the largest set of edges, such that no two edges share a node. Figure 2.right illustrates this on an arbitrary non-bipartite graph. Dashed edges represent a valid matching. The description and implementation of the algorithm is quite technically involved and does not belong to the scope of this paper. Note that while applying Gabow's elegant algorithm essentially allows us to extend the horizon from one-step-look-ahead greedy to two-step-look-ahead, the bottom-up abstraction process still remains a greedy heuristic. Only one partial graph of Figure 3, containg "C-H" edges, makes a non-degenerate case for a maximal non-bipartite matching since edges interact at the node "H", obviously returning only two out of four edges subject to abstraction, marked with an oval shape on the figure. A new entry is made into the lexicon, called "HC" which corresponds to an abstracted edge, as illustrated by two steps going right to left in Figure 4.

The leftmost sub-figure of Figure 4 is an input for the next iteration of the algorithm, which detects the new repetitive compound: edge "HC-C", which is abstracted again, ultimately producing a graph grammar shown in the Figure 1 on the right. Note that as the algorithm iterates through the graph, intermediate structures get merged with the higher level structures.