HIERARCHIES (LISTS)

Page  1  |  2  |- 3 -|  4  

   
Alternative Schemes

 

The adjacency list method is difficult only in that recursion is used to read the indent/level information from the list. Some would propose a more 'elegant' solution where the levels, in a way, are fully laid out and available for a SQL query without resort to recursion; or at least with resort to flat looping/iteration, rather than recursion.

 
Celko Nested Sets

 

Author Joe Celko proposed the idea of - nested sets - and where he originally referred to his index structure as a "modified pre-order traversal tree" (the 'pre-order traversal tree' being something of a construct in theory). He imagined something like a shoelace (say, a string, even with 'knots' on it) being threaded though the 'eyelets' on either side of a category. It would snake down through subcategories, be 'threaded' back out, and continue down through the list above, and down to subcats, again, in the same way. So there is an 'eyelet' to the left of a category, and one to the right; a pair of index values uniquely identifying each entry in the outline, and reflecting its order in a list, to boot. In other words, the complete hierarchy is sort of 'flattened', laid out to view from any given category, and each category on its own level, its own list, is sorted in the order one chooses. The 'string' does all that, and holds everything together.

 
A
a__
   | b__
        | e
        | f__
             | k
        | g
   | c__
        | h
        | i
   | d
                    
B
Category  Down  Up
a__019
| b__110
| e23
| f__47
| k56
| g89
| c__1116
| h1213
| i1415
| d1718

 
 

You can see an example full outline on the left, Table A. Table B shows the index values, for each category, that reflect the outline structure, and the sort order in each list, and which also are used for retrieval. Clearly the 'string' is 'threaded' down, increasing by one each time, on the 'down' or left index until there are no more subcategories. So here you go, a, b, e, and there are no more subcats at e. Then it is brought back up to the previous level, then down that list, following subcats, as before, until you finally work back out to the last category in the outline. So e is the bottom of that sequence, move over to Up and then resume at the category following e. You can note that this creates an interval under each node, which takes in not merely the immediate sublist, but all sublists under that, as well. The node, b, for example, of (1,10) contains the list below, and the keys for the list below that (which is only the node, k).

But there is an obvious problem, here. Any time a category is deleted or added, the entire set of indices, or at least one of each pair, must be recalculated for - each item - in the entire hierarchy. That's fine for small outlines. But as the data grows, so does the time it takes to update the table. It's a 'volatile', and constantly changing key structure, in other words. That is, it's not a bad method if changes to the outline are infrequent. But changes, most any change, could potentially locked things up on the server if the lists became very long.

 
Tropashko Nested Intervals

 

Celko suggested a fix for the problem might be to widen the step when running down and back out. Instead of adding one each time, one could add 10 or 100, and so on. Then you would not have to always recalculate the entire set of indices. It's dirty. But it actually could work, and reduce the frequency of complete index recalculation. Author and Oracle whiz, Vadim Tropashko, noted that you could still bunch up in a particular list and that this 'widening' only put off a potential problem. He suggested, instead, a more flexible - (fractional) nested interval, using integer fractions (rational numbers - technically, any integer divided by any non-zero natural number), and binary exponentiation. And, as above, this method of 'tagging' the categories would allow for a fast, and non-recursive, retrieval of any and all information from the entire tree, without the 'volatile' maintenance overhead of nested sets.

Category   Path  Numer  Denom
a__ .132
| b__ .1.174
| e .1.1.1158
| f__ .1.1.22716
| k .1.1.2.15532
| g .1.1.35132
| c__ .1.2118
| h .1.2.12316
| i .1.2.24332
| d .1.31916

 
 

For reference, the path corresponding to some or any category is specified with integers, separated by periods, as shown (you could use any sort of separator, though). The denominator is obviously just a binary shift, 2 to the power of x, for the sum of the digits in the 'path'. For the numerator, the same is done, except that at each step of the way, a further amount is added or subtracted, depending. At each level (l, if you like), the value x at that level (between the periods), becomes the exponent for 2, left to right:

Denominator(2 ^ x) * running total for denominator
Numerator(2 ^ x) * running total for numerator - ((2 ^ x) - 3)

 
 

Both numerator and denominator are 'seeded', or started, at 1. So with "1.1.2", for example, the denominator is 2 to the sum of the digits, in this case - 2 ^ 4. As for the numerator, the first "1" gives 3 (2-(-1)), the second is 3 shifted plus 1 (-(-1)), for 7, and the "2" give that 7 multiplied by (2^2), or 4, and then one (4-3) is subtracted. However, it should be noted that there's obviously some potential problems with this method, as well. First, as with nested sets, a lot of overhead is involved in moving things around. Perhaps not every category key has to be updated. But all subcat and subcat under that, and so on, recursively, keys have to be modified, if things get moved around. It would be unavoidable with such methods (then again, the sort field in the adjacency list had to be updated, too). Second, and more importantly, the binary shifting doesn't really use a lot of numbers out of the numbers available to the system before it crashes into an overflow condition. It's a geometric progression, that at a certain point, explodes into gigantic floating point numbers, even before overflow is reached.

 
Tropashko Triangular Navigation

 

The designer suggested an alternative, instead. Combine the 'numer' and 'denom' calculations into a single integer key.

Category   Path  N (key)
a__ .11
| b__ .1.12
| e .1.1.14
| f__ .1.1.29
| k .1.1.2.118
| g .1.1.319
| c__ .1.25
| h .1.2.110
| i .1.2.221
| d .1.311

 
 

One property is fairly apparent. If the category is the first in its list, that is if the last digit in the Path is, 1, then a) the key is not only an even number, but b) the key for the superior category is exactly half. Those in the same list are different. You can see that N for the next in the list is double that for the category above, plus 1. So where ".1.1.1" is, 4, the next item down, ".1.1.2", is double that, plus 1, or 9. But the first item in the list just below, ".1.1.2" is twice 9, or 18.

Conversely to find the 'parent', or superior category, for any given (sub)category, one would first subtract 1 and then divide by 2. Then the result would be checked to see if it's an even number. If not, then repeat the process. But if it is an even number, then divide by 2, once more, and the result will be the key for the superior category.

It turns out there's a simple transform of the previous method, into this one:

N = Denom - (Numer+1)/4

 
 

The problems of before remain, however. While it's true that neither denominator nor numerator are calculated, the same calculation is performed on this single integer key. The same doubling is performed just going down a list, at the same level. And 500, 1000, or more categories at the same level isn't necessarily unreasonable.

 
Conclusion

 

It seems reasonable to conclude that the desire to 'flatten', to make unique keys for every outline category, involves a lot of maintenance overhead, no matter how you slice it, for just any addition or deletion to the tree. In addition, just trying to explicitly specify both the level and list order, the structure and the sort, in some fashion, can introduce new problems. The advantage, on the other hand, of the adjacency method is that lists of subcategories are 'hidden', isolated from all but the immediately superior category. One doesn't need to know the structure to add or delete items, except right at the affected categories. There's no overall list to maintain, and no need to update every subcat key, each time. The trade-off for all that greatly reduced cost of time in list maintenance overhead, is that perhaps the time to query an adjacency list could be improved were it either, or simultaneously, represented by one of the schemes, briefly outlined, above.

CONTINUE


 

 
 More to read:

Nested Sets Celko article (also here)
Nested Sets Spread Suggested interval widening
Nested Intervals Tropashko article
Tri-Navigation Tropashko note posted to Usenet (and subsequent message, and the one after that)
Farey Intervals Tropashko's refinement using generated Farey sequences, rather than exponentiation
Dennis Forbes Versatile High Performance Hierarchies in SQL Server
Links More links on the subject