OK, so there is more to Tries than I thought

I thought that I had done with tries….. but I was wrong.  I chanced upon a subtle refinement which makes the solution even more general. To save repeating myself I will assume that you have read my previous posts about tries.

Instead of indexing the tries with an Iterator<Integer> I have modified the code to use Iterator<Comparable>.  It still works with Iterator<Integer> as Integer implements Comparable.  The advantages are significant.  For example you could index sentences by the contained words…. it was this question in StackOverflow that prompted me.

Now you can create tries that are indexed by any type at all as long as that type implements comparable.  Ordered lists of playing cards, steps in martial arts routines, key sequences on a keyboard – your imagination is the only limit.  What you get is an order-N insertion/retrieval of a sorted set.

 

Trie Finally

Yep, I know, tries are nothing new.  They are a very good way of storing sorted collections with order-N insertion and lookup times.  There are many implementations of tries available in most of the common languages (and probably the uncommon ones too, but I did not look too closely).  There are, however two problems with the existing solutions:

  1. They all store strings: – the sequence of values used to define the path to the stored value are always the characters of a string converted to integer values.
  2. They all store strings: all anybody ever stores in the trie are strings.  More specifically the same strings as they used to define the path to the storage position.

Now storing strings indexed against their contents is fine – you get an order-N sorted string storage mechanism.  And if that is all that you want then go for it.  But a trie could be so much more.  To prove this I have implemented a variant of the trie where I have broken from tradition in two ways:

  1. You do not have to store strings: the sequence of integers used to define the position in the trie is specified with an Iterator<Integer>.  You can easily derive this iterator from a string (I have a class for that) and use the string as a key, but this is not all you can do.  You could derive the iterator from a date, and yield year, month day, hour, minute, secon, millisecond to quickly sort and retreive by date (I will soon have a class for that).  Or you could define your own mapping as long as you can generate a sequence of integers from it.
  2. You do not have to store strings: Each position in the trie can store an instance of your desired type – trie is generic.
  3. The underlying structure is exposed: (yep, point three of two) By exposing the underlying structure some operations become faster, and the semantics of addition and removal become more flexible.

So if you want to store your Customer instances sorted against surname, or date of birth you can now do so with order-N insertion, lookup and removal in a trie (for name, or for date, or one of each).  You can use the tries to extract all customers with surnames beginning with ‘Smi’, or all customers born in january 1987.

And it is simple – there are really only three methods to think of.

Three methods? that sounds too simple – I exagerate for dramtic purposes – there are several other methods to support query and iteration, but the essential methods are:

  • public IndexTrie<T> putTrie(Iterator<Integer>) : returns the trie at the position specified, creating and inserting a new one if necessary.
  • public IndexTrie<T> getTrie(Iterator<Integer>) : returns the trie at the position specified or null if there is not one there.
  • public void remove() : removes the specified trie and all its children.

So if you are storing a list of Foo at each position you do not need to get the value and if null insert the value paying the navigation cost twice.  You can ‘put’ the trie at the position and modify its contents directly – paying the navigation cost once.

Other methods allow you to iterate the tries or values from any parent (and as we know from all those string implementation this gives us the members with a given prefix).

The code is available under Apache V2 licence.

Trie Again

I have been doing some more thinking (and, of course, coding) about tries. 

I have realised, somewhat belatedly, that the important thing about a trie is the way it uses a series of integers (in my case) to navigate a tree of nodes to reach a point unique to every different series of integers.  What gets stored there is irrelevant to the trie but important to the user.  So a trie is best thought of as an implementor of a map (java.util.Map in my case).  The tricky thing is that the supplied key is not used directly, but that the creator of the map has to provide a means of converting the key to a series of integers.

I have implemented TrieMap in this way. 

public final class TrieMap<K, V> implements Map<K, V> {

So that a TrieMap<K,V> can be used anywhere Map<K,V> is required.  The tricky bit is informing thee TrieMap how to turn instances of K into a series of integers.  This has been done by supplying the constructor with an instance of TriePathIteratorFactory, thus:

public TrieMap(final TriePathIteratorFactory<K> pathIteratorFactory)

Where a TriePathIteratorFactory is responsible solely for generating instances of TriePathIterators:

public interface TriePathIteratorFactory<K> {
     public TriePathIterator<K> pathIterator(K k);
}

And the important part of the TriePathIterator is the generation of the series of integers:

public interface TriePathIterator<K> {

/**
* @return true if the receiver has more elements in it path, false otherwise
*/
public abstract boolean hasNext();

public abstract int nextIndex();

}

Now when the user calls a standard method from java.util.Map that takes a key (K), the TrieMap creates a TriePathIterator instance and uses it to traverse the data structure.

Using the underlying structure from the TrieMap I have implemented a TrieSet – simply a subclass of TrieMap<K,V> where K === V.  The performance exceeds that of the supplied java.util.TreeSet.  I will provide some hard data in another post, but the work-in-progress is available from the git repository on the ‘map’ branch.

Generic Java Trie

I have been looking at Tries again.  Thats right ‘Trie’.  Wikipedia has an excellent article.  Tries are used to sort data (usually strings) that can be found/retrieved/removed quickly. 

The idea is simple: imagine the list of names in your address book.  One page for all the names starting with ‘A’, one for all the names starting with ‘B’ and so on. 26 pages, and we can jump straight to the right page instead of looking at each name in turn.  But what if there are lots of names on the ‘S’ page? Simple, we create a section on the page for every second letter ‘(S)A’, ‘(S)B’ and so on.  Now when we have a name starting with ‘SM’ we jump to the ‘S’  page, look at the second letter and jump to the ‘M’ section.  Now imagine doing this for every letter – we would have lots of pages each with only one name on it.  That is a ‘trie’.

Why would we want to use a trie?  Well they are very quick to find names, or remove names, and if we work our way through all the pages we get an alphabetic list of all the names.

Why might we not want to use a trie? Well, for one they can be slower when we are adding names as we have a lot more work to do.  And for  small numbers of names finding and removing are not much faster than simple lists.  They also use more memory – though for most modern machines this may not amount to a significant difference.

The slower addition is the main drawback, but as usual things are not as simple.  If you do not want your list of names sorted, or you are happy to sort them each time you get the list, then this slower addition may rule out using tries. But if you want to get the sorted list often then the cost of sorting makes the decision much closer.  My initial experiments in Java showed that adding elements to a normal list and then sorting it took about the same time as adding the same elements to a trie.  Which is not surprising, since the task is about the same in each case.  The difference being that sorting subsequent additions to a list is a lot slower than adding extra elements to a trie.  And again this is obvious.  If I have a list of 10000 names and add 100, a list may have to sort 10100, but a trie only has to find the right place for each of the 100.

I have implemented a Trie  (licenced under GPL3) in Java (git repository) that adopts the java.util.Collection interface.  This means that it can be used interchangeably in your app wherever you have already used another collection.

I have a bunch of things I still want to do, and maybe I will get round to them:

  • Publish some performance figures.  Currently for large sets find/remove are about 1000 times faster than java.util.ArrayList.  But these times and the insertion costs are dependent upon how ‘dense’ the data is.
  • Demonstrate the use of the Trie to store data other than strings, for example Event objects with a comment String and Date, sorted by date component eg year,month,day,hour, minute, second,millisecond.
  • Sort out what to do with elements with duplicate paths – are they the same, or should they be added in a list?