Make all operations O(|key|+lgN)
#2
Open
Hello!
I'm new to Rust, so please forgive me if there are some memory-management issues or non-idiomatic constructs.
This contribution improves the time complexity guarantees of the data-structure.
Previously, the tree could be arbitrarily skewed, in the worst case forcing you to perform `Omega(|Sigma|*|key|)` operations just to locate a key, for example for the set `{a,b,c,...,z,za,zb,...,zz,zza,zzb,...,zzz,...}`. In particular, loading a dictionary in lexicographic order would lead to a skewed tree.

This pull request ensures that each internal node `self` is balanced in the following sense:

`max{|left|,|right|} < 3/4 * |self|`

where `|T|` denotes the number of keys in a tree `T`, which is `|left|+|right|+|middle|+(if value.is_none() {0} else {1})`.

This condition is sufficient to ensure that the access time is `O(|key|+lg |root|)`, where `|key|` is the number of characters in `key`. To see this, observe that going left or right decreases the number of keys by at least 25%, and going to the middle consumes one character of the key.
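A minimal sketch of how the count and the balance check could look. It assumes the balance threshold is `max{|left|,|right|} < 3/4 * |self|`, as suggested by the 25% figure above and by the rebuild threshold used later. The node layout, `leaf`, `count_of`, `update_count` and `is_balanced` are hypothetical names for illustration, not necessarily what the crate uses; only `count` is taken from this description.

```rust
// Hypothetical, simplified node layout (an assumption, not the crate's actual struct).
struct Node<T> {
    label: char,
    value: Option<T>,
    left: Option<Box<Node<T>>>,
    middle: Option<Box<Node<T>>>,
    right: Option<Box<Node<T>>>,
    count: usize, // |self|: number of keys stored in the subtree rooted here
}

/// Number of keys behind a link; an empty link holds no keys.
fn count_of<T>(link: &Option<Box<Node<T>>>) -> usize {
    link.as_ref().map_or(0, |n| n.count)
}

impl<T> Node<T> {
    /// A node with a value and no children, so it holds exactly one key.
    fn leaf(label: char, value: T) -> Self {
        Node { label, value: Some(value), left: None, middle: None, right: None, count: 1 }
    }

    /// Recompute |self| = |left| + |right| + |middle| + (1 if a value is present).
    fn update_count(&mut self) {
        self.count = count_of(&self.left)
            + count_of(&self.right)
            + count_of(&self.middle)
            + if self.value.is_none() { 0 } else { 1 };
    }

    /// Assumed balance rule: max{|left|,|right|} < 3/4 * |self|,
    /// written in integer arithmetic to avoid rounding.
    fn is_balanced(&self) -> bool {
        let heavier = count_of(&self.left).max(count_of(&self.right));
        4 * heavier < 3 * self.count
    }
}
```

With this rule a node holding 4 keys, 3 of them in its left subtree, is reported as unbalanced, while a single leaf is balanced.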
In order to maintain the balance, a variant of the Scapegoat Tree technique is used: we view the whole ternary tree as a recursive binary tree of binary trees of binary trees..., i.e. we only look at the left and right links, and treat the middle link as the attachment point for the next level of this recursive structure. In such a binary tree the weight of a node itself (as opposed to the weight of its whole subtree) is defined to be `|middle|+(if value.is_none() {0} else {1})`, and we try to balance these weights according to the same criterion defined earlier, just interpreted as if there was no middle child, with the middle subtree's weight included in the node's own weight.

A single insert or remove operation can be viewed as recursively updating several binary trees, modifying the weight of one node by +1 or -1 in each. This change can lead to imbalance in one or more of these binary trees. We treat each binary tree separately.
To restore balance in a binary tree we identify the Scapegoat: the unbalanced node closest to the root. We then rebuild the binary subtree rooted at the Scapegoat from scratch: we list the nodes by in-order traversal, and reconstruct an ideally balanced tree from this list by choosing the "median" node (the one for which the total weight of the nodes before it in the list is at most half the total, and the same holds for the nodes after it) to be the root, and then subdividing the list recursively to build the left and right children. This takes `O(|T|)` time, where `|T|` is the total weight of this binary subtree, because:

- the in-order traversal takes `O(n)`, where `n` is the number of nodes, which cannot be larger than `O(|T|)`: even though an internal node can have weight 0, this is only permitted (by the balancing rule) if it has two children, so the number of nodes of weight zero is bounded by the number of leaves, and each leaf has weight at least 1;
- finding the median of a list by binary search takes `O(lg r)` time, where `r` is the number of nodes in the list. The list is pruned of nodes of weight 0, so it can't be longer than its total weight, and the top-most binary search takes `O(lg|T|)`. Because, by construction, the left and right children each get at most 50% of the total weight, on the `k`-th level of recursion there are at most `2^k` binary searches taking `O(lg|T|-k)` each. Together this sums to `sum_{k=0}^{lg|T|} 2^k(lg|T|-k)`, which is the same expression as the cost of the "heapify" construction of a binary heap, known to be `O(|T|)`.

This
`O(|T|)` cost of rebuild is amortized by the fact that, in order for the Scapegoat to become unbalanced, it must have participated in a lot of insert or remove operations since the last rebuild.

In particular, we can associate a potential `P(self)` with node `self`:

`P(self) = 4 * max{0, max{|left|,|right|} - |self|/2}`

Right after a rebuild, `P(self)` must be zero, because `max{|left|,|right|} <= |self|/2`.

Each insert or remove can increase or decrease `|left|`, `|right|` and `|self|` by just +-1, and thus increases the potential by at most `O(1)` for each of the nodes along the update path, so we can afford paying for these potential changes without increasing the overall `O(|key|+lg|T|)` (amortized) cost of the update.

Whenever we need to rebuild, we have `max{|left|,|right|} >= 3/4 * |self|`, and thus `P(self) >= |self|`, which is sufficient to pay for the rebuild, because the rebuild will "discharge" the Scapegoat's potential, bringing it back to 0.

In order to implement the above scheme, the
`Node<T>` must have a new field `count: usize`, which increases the size of the node from 48 to 56 bytes. That's unfortunate, and I really tried to figure out a scheme similar to AVL, but I see no simple way to use just rotations to restore balance in a weighted tree.

To make up for this cost, I've added a new feature, which is made possible thanks to this new field:
`get_nth(n)`, which returns the n-th key-value pair in lexicographic order.

This means we now have a data structure which can serve not only as a key-value map, but also as an indexable sorted array, and in particular: a heap/priority queue.
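To illustrate how the per-node count makes indexed access possible, here is a rough sketch of such a lookup. It is not the code from this pull request: the node layout, the `leaf` and `count_of` helpers, and the exact signature of `get_nth` are assumptions made for the example. The idea is that at each node the counts tell us whether the n-th key lies in the left subtree, ends at the node itself, continues through the middle subtree, or lies in the right subtree.

```rust
// Hypothetical, simplified node layout (an assumption, not the crate's actual struct).
struct Node<T> {
    label: char,
    value: Option<T>,
    left: Option<Box<Node<T>>>,
    middle: Option<Box<Node<T>>>,
    right: Option<Box<Node<T>>>,
    count: usize, // number of keys stored in the subtree rooted here
}

/// A node with a value and no children, so it holds exactly one key.
fn leaf<T>(label: char, value: T) -> Node<T> {
    Node { label, value: Some(value), left: None, middle: None, right: None, count: 1 }
}

/// Number of keys behind a link; an empty link holds no keys.
fn count_of<T>(link: &Option<Box<Node<T>>>) -> usize {
    link.as_ref().map_or(0, |n| n.count)
}

/// Returns the n-th (0-based) key with its value, in lexicographic order.
fn get_nth<'a, T>(mut link: &'a Option<Box<Node<T>>>, mut n: usize) -> Option<(String, &'a T)> {
    let mut key = String::new();
    while let Some(node) = link {
        // Keys in the left subtree sort before everything that uses this label.
        let left = count_of(&node.left);
        if n < left {
            link = &node.left;
            continue;
        }
        n -= left;
        // The key ending exactly at this node sorts before its extensions.
        if let Some(v) = &node.value {
            if n == 0 {
                key.push(node.label);
                return Some((key, v));
            }
            n -= 1;
        }
        // Longer keys continue through the middle link and consume the label.
        let middle = count_of(&node.middle);
        if n < middle {
            key.push(node.label);
            link = &node.middle;
        } else {
            n -= middle;
            link = &node.right;
        }
    }
    None // n was not smaller than the number of keys
}
```

Each step either moves left or right, which shrinks the candidate set, or moves to the middle, which appends one character to the result, so on a balanced tree the walk follows the same `O(|key|+lg N)` pattern as a regular lookup.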
I've tried to split the changes into self-contained commits:
- The first one reformats the code using `cargo fmt` and adds the missing `.gitignore` file.
- Then I simplify the `remove_r` logic a bit, although this might be subjective, I know. Unfortunately, I had to back off from this `remove_leftmost` idea later, because while it nicely ensures that there are no nodes without a value or a middle child, it may severely disrupt the balance when stealing the descendant.
- Finally, I implement the `count` and `get_nth`, and add the rebalancing logic.
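For readers who want a feel for the rebalancing logic of the last commit, the rebuild step described earlier (pick the weighted median by binary search, then recurse on both halves) could be sketched as follows. This is a simplified stand-alone model, not the code from the commits: `BNode`, `rebuild` and `build` are hypothetical names, and the input is just the list of node weights in in-order, already pruned of zero-weight nodes. The median is located with the standard library's `partition_point` over prefix sums.

```rust
// Hypothetical stand-in for one level of the "binary tree of binary trees".
struct BNode {
    id: usize,     // position of this node in the in-order list
    weight: usize, // own weight: |middle| + (1 if a value is present)
    total: usize,  // total weight of the binary subtree rooted here
    left: Option<Box<BNode>>,
    right: Option<Box<BNode>>,
}

/// Builds a balanced tree over items lo..hi; prefix[i] is the sum of weights[..i].
fn build(weights: &[usize], prefix: &[usize], lo: usize, hi: usize) -> Option<Box<BNode>> {
    if lo >= hi {
        return None;
    }
    let total = prefix[hi] - prefix[lo];
    // Weighted median: the first item at which the running weight reaches half
    // of the total. Everything before it weighs less than half, and everything
    // after it weighs at most half.
    let m = lo + prefix[lo + 1..=hi].partition_point(|&p| 2 * (p - prefix[lo]) < total);
    Some(Box::new(BNode {
        id: m,
        weight: weights[m],
        total,
        left: build(weights, prefix, lo, m),
        right: build(weights, prefix, m + 1, hi),
    }))
}

/// Rebuilds a balanced binary tree from positive node weights listed in in-order.
fn rebuild(weights: &[usize]) -> Option<Box<BNode>> {
    let mut prefix = Vec::with_capacity(weights.len() + 1);
    let mut sum = 0;
    prefix.push(sum);
    for &w in weights {
        sum += w;
        prefix.push(sum);
    }
    build(weights, &prefix, 0, weights.len())
}
```

For seven nodes of weight 1 this picks the fourth node as the root with three nodes on each side; for weights `[5, 1, 1, 1]` the heavy first node becomes the root and the remaining three form its right subtree.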