Underappreciated Perl Modules--Tree::Trie
2013-09-09
Before the Perl community settled on the term “hashes”, many called them “associative arrays”. This was a common term for an array that looked up elements by a string among dynamic languages in the old days, such as Tcl.
Although the hash algorithm is the popular choice in modern times, it wasn’t always this way. Binary search trees were also popular, but have the disadvantage that they have a worst-case time of O(n)
, and an average time of O(log(n))
(where n is the number of elements). Hashes lookup keys in constant time; they stay the same speed no matter how many elements you have.
There’s a caveat to hashes though: they have large constant factors, so a binary search tree tends to be faster for a small number of elements. They also don’t come back in any obvious order unless you make them that way by alternative means.
Hashes also have a problem where the value a string hashes to may collide with another string’s hash. When that happens, the value is appended to the list. On lookup, we then have to walk that list. This means that while hashes are generally close to constant time, they actually have a worst-case time of O(n)
, just like binary search trees. Since hashes are used so much on Perl, it tends to be very careful about colliding hash values, but it’s still a weakness in the algorithm.
This is where the trie comes in (usually pronounced “try”). A trie will lookup keys in O(m)
time, where m is the length of the lookup string. Note that hashes actually have this limitation, too. It takes at least O(m)
time to hash the string so we can look it up.
Oh, and you can get back the keys in sorted order by doing a pre-order traversal.
Oh, and you can get back a list of keys that match a prefix with practically no additional overhead.
All else being equal, a trie will be on par or better than the speed of hashes. However, in Perl, not all is equal. Perl’s hash implementation is handled by the language internally at the C level. Any trie library you use will go through subroutine/method lookups, which will be substantially slower. If we were rewritting Perl from scratch, maybe you could argue for replacing hashes with tries. But if you were rewritting Perl from scratch, that’s probably pretty low on the list of changes.
But never mind that. Sorted returns and prefix matching are still pretty cool if you need that sort of thing.
Tree::Trie looks like a good place to start experimenting with tries. To get behavior equivalent to hashes, you need to use the methods marked *_data
. The other methods only store keys without a value.
Some example code
my $trie = Tree::Trie->new;
$trie->add_data( foo => 1 );
$trie->add_data( bar => 2 );
$trie->add_data( qux => 3 );
$trie->add_data( quuux => 4 );
say $trie->lookup_data( 'bar' ); # Prints "2"
my @qu = $trie->lookup( 'qu' );
# Prints:
# qux: 3
# quux: 4
say "$_: " . $trie->lookup_data( $_ ) for @qu;
$trie->delete_data( 'qu' ); # Deletes 'qux' AND 'quux'
That’s all I have to say on tries. If you have suggestions for future underappreciated Perl modules, feel free to leave a comment below.