Trie Again

I have been doing some more thinking (and, of course, coding) about tries. 

I have realised, somewhat belatedly, that the important thing about a trie is the way it uses a series of integers (in my case) to navigate a tree of nodes to reach a point unique to every different series of integers.  What gets stored there is irrelevant to the trie but important to the user.  So a trie is best thought of as an implementor of a map (java.util.Map in my case).  The tricky thing is that the supplied key is not used directly, but that the creator of the map has to provide a means of converting the key to a series of integers.

I have implemented TrieMap in this way. 

public final class TrieMap<K, V> implements Map<K, V> {

So that a TrieMap<K,V> can be used anywhere Map<K,V> is required.  The tricky bit is informing thee TrieMap how to turn instances of K into a series of integers.  This has been done by supplying the constructor with an instance of TriePathIteratorFactory, thus:

public TrieMap(final TriePathIteratorFactory<K> pathIteratorFactory)

Where a TriePathIteratorFactory is responsible solely for generating instances of TriePathIterators:

public interface TriePathIteratorFactory<K> {
     public TriePathIterator<K> pathIterator(K k);
}

And the important part of the TriePathIterator is the generation of the series of integers:

public interface TriePathIterator<K> {

/**
* @return true if the receiver has more elements in it path, false otherwise
*/
public abstract boolean hasNext();

public abstract int nextIndex();

}

Now when the user calls a standard method from java.util.Map that takes a key (K), the TrieMap creates a TriePathIterator instance and uses it to traverse the data structure.

Using the underlying structure from the TrieMap I have implemented a TrieSet – simply a subclass of TrieMap<K,V> where K === V.  The performance exceeds that of the supplied java.util.TreeSet.  I will provide some hard data in another post, but the work-in-progress is available from the git repository on the ‘map’ branch.

Generic Java Trie

I have been looking at Tries again.  Thats right ‘Trie’.  Wikipedia has an excellent article.  Tries are used to sort data (usually strings) that can be found/retrieved/removed quickly. 

The idea is simple: imagine the list of names in your address book.  One page for all the names starting with ‘A’, one for all the names starting with ‘B’ and so on. 26 pages, and we can jump straight to the right page instead of looking at each name in turn.  But what if there are lots of names on the ‘S’ page? Simple, we create a section on the page for every second letter ‘(S)A’, ‘(S)B’ and so on.  Now when we have a name starting with ‘SM’ we jump to the ‘S’  page, look at the second letter and jump to the ‘M’ section.  Now imagine doing this for every letter – we would have lots of pages each with only one name on it.  That is a ‘trie’.

Why would we want to use a trie?  Well they are very quick to find names, or remove names, and if we work our way through all the pages we get an alphabetic list of all the names.

Why might we not want to use a trie? Well, for one they can be slower when we are adding names as we have a lot more work to do.  And for  small numbers of names finding and removing are not much faster than simple lists.  They also use more memory – though for most modern machines this may not amount to a significant difference.

The slower addition is the main drawback, but as usual things are not as simple.  If you do not want your list of names sorted, or you are happy to sort them each time you get the list, then this slower addition may rule out using tries. But if you want to get the sorted list often then the cost of sorting makes the decision much closer.  My initial experiments in Java showed that adding elements to a normal list and then sorting it took about the same time as adding the same elements to a trie.  Which is not surprising, since the task is about the same in each case.  The difference being that sorting subsequent additions to a list is a lot slower than adding extra elements to a trie.  And again this is obvious.  If I have a list of 10000 names and add 100, a list may have to sort 10100, but a trie only has to find the right place for each of the 100.

I have implemented a Trie  (licenced under GPL3) in Java (git repository) that adopts the java.util.Collection interface.  This means that it can be used interchangeably in your app wherever you have already used another collection.

I have a bunch of things I still want to do, and maybe I will get round to them:

  • Publish some performance figures.  Currently for large sets find/remove are about 1000 times faster than java.util.ArrayList.  But these times and the insertion costs are dependent upon how ‘dense’ the data is.
  • Demonstrate the use of the Trie to store data other than strings, for example Event objects with a comment String and Date, sorted by date component eg year,month,day,hour, minute, second,millisecond.
  • Sort out what to do with elements with duplicate paths – are they the same, or should they be added in a list?