For one part of our system, I'm interested simply in faster inserts into the index. Having more indices on each table would help in other areas too, but I haven't started working on those yet. The table in question manages a many-to-one mapping of tokens (text) to internal database row IDs (integers); essentially, it's a simple key/value lookup table. Currently a single machine does 150 inserts per second, which is fairly abysmal for a table with only 200M records.
This system uses MySQL with the InnoDB storage engine, which indexes using B-trees. Since the lookups are simple key/value lookups there is really no need for a B-tree - a hash index would work - but unfortunately InnoDB does not support hash indexes. We may migrate to Tokyo Tyrant and Tokyo Cabinet, which support hash indexing and also seem to have better concurrency support (many non-blocking concurrent requests). The keys inserted are almost sequential, and from what I've read that should help insert performance. The bottom-up building of a B-tree can also be optimized by detecting that the key causing a page split is greater than all the values in the page being split, and splitting at the high end rather than the middle - I think Oracle does this.
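To make that split heuristic concrete, here's a minimal Python sketch. It's a toy, leaf-only structure - the page capacity, flat-list pages, and function name are my own illustration, not how InnoDB or Oracle actually implement page splits:

```python
import bisect

PAGE_CAPACITY = 4  # tiny for illustration; real pages hold hundreds of keys

def insert(leaf, key):
    """Insert key into a sorted leaf page, splitting it if full.
    Returns (page, new_right_page_or_None)."""
    if len(leaf) < PAGE_CAPACITY:
        bisect.insort(leaf, key)
        return leaf, None
    if key > leaf[-1]:
        # Ascending-key case: the new key is greater than everything in
        # the full page, so start a fresh page instead of moving half the
        # entries over. Pages stay nearly full under sequential inserts.
        return leaf, [key]
    # General case: split 50/50, then insert into the appropriate half.
    mid = PAGE_CAPACITY // 2
    left, right = leaf[:mid], leaf[mid:]
    bisect.insort(left if key < right[0] else right, key)
    return left, right
```

With ascending keys, insert([10, 20, 30, 40], 50) returns ([10, 20, 30, 40], [50]) rather than two half-empty pages, which is exactly the effect that optimization is after.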
While looking for hash-based indexing for MySQL I found TokuDB, which uses the sexy phrase 'fractal tree' for their indexing. Their site has a few introductory whitepapers, but for a deeper understanding I wanted to get to the theoretical foundation of their technology.
I soon found that there are several meanings of 'fractal tree' in data indexing:
- fpB+-tree: Fractal Prefetching B+-Trees, which embed cache-optimized trees within disk-optimized trees. Unrelated to TokuDB.
- Fractal Trees (TM) from Tokutek, which use 'cache-oblivious' algorithms to improve B-tree performance (a sketch of the layout idea these build on follows the links):
- http://supertech.csail.mit.edu/cacheObliviousBTree.html
- http://www.cs.sunysb.edu/~bender/pub/sicomp05-BenderDeFa.ps
- http://www.cs.sunysb.edu/~bender/pub/FOCS03-co-searching.ps
- http://www.cs.sunysb.edu/~bender/pub/locality-full.ps
- http://www.cs.sunysb.edu/~bender/pub/BenderHu-TODS07.pdf
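To give a flavor of what 'cache-oblivious' means in those papers, here's a toy sketch of the recursive van Emde Boas layout the cache-oblivious search trees above are built on: cut a complete binary tree at half its height, store the top subtree first, then each bottom subtree, recursively. The (depth, index) node representation is my own illustration, not code from the papers:

```python
def veb_order(depth, index, height):
    """List the nodes of a complete binary tree of the given height,
    rooted at node (depth, index), in van Emde Boas layout order.
    Nodes are identified as (depth, index-within-that-depth)."""
    if height == 1:
        return [(depth, index)]
    top_h = height // 2                     # cut the tree at half height
    bottom_h = height - top_h
    order = veb_order(depth, index, top_h)  # top subtree first
    fanout = 2 ** top_h                     # bottom subtrees below it
    for i in range(fanout):
        order += veb_order(depth + top_h, index * fanout + i, bottom_h)
    return order

# Whole subtrees land in contiguous runs, so a block transfer of any
# size fetches a subtree's worth of nodes - without the layout ever
# knowing the block size, which is the 'cache-oblivious' trick.
print(veb_order(0, 0, 4))
```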
Although I haven't finished reading all the papers, what I like about this approach is that it models algorithm performance as the cost of transferring blocks of data between layers of storage, and that it considers multiple layers. The most important transfer cost is the one between disk and memory, but including layers like the OS-managed disk cache is also valuable: it matches the multi-layer caching that nearly all large-scale systems use. We are big fans of measuring algorithm performance and selecting the right tool for the right job.
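As a back-of-the-envelope illustration of that cost model - the disk-access-machine view, where cost is counted in block transfers between disk and memory - here's a toy comparison using the standard asymptotic insert bounds: roughly log_B(N) transfers per insert for a B-tree versus an amortized log2(N)/B for buffered, fractal-tree-style inserts. The 100-keys-per-block figure is made up, and constants are ignored:

```python
import math

N = 200_000_000  # the 200M-row table from above
B = 100          # hypothetical keys per disk block

# B-tree: a random insert touches one node per level, ~log_B(N) transfers.
btree = math.log(N, B)

# Fractal-tree / COLA-style: inserts are buffered and flushed in bulk,
# amortizing to ~log2(N) / B transfers per insert.
fractal = math.log2(N) / B

print(f"B-tree:  ~{btree:.2f} block transfers per insert")
print(f"Fractal: ~{fractal:.3f} block transfers per insert")
```

Even as a crude estimate, that's an order-of-magnitude gap in favor of buffered inserts, which is where the fast-insert claims come from.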
As an aside, one of the comments on the MySQL Performance Blog pointed me to Infobright, a column-oriented store, which might be useful in some of our analytics systems for ad-hoc reporting.