The most recent authoritative references are the official Tokyo Cabinet and Tokyo Tyrant documentation.
Tokyo Tyrant (actually Tokyo Cabinet – the underlying storage engine) supports several types of storage – B+ tree indexing, hash indexing, and others. The storage type is selected by setting the database filename or file extension to a particular value:
- If the name is "*", the database will be an in-memory hash database.
- If it is "+", the database will be an in-memory tree database.
- If its suffix is ".tch", the database will be a hash database.
- If its suffix is ".tcb", the database will be a B+ tree database.
- If its suffix is ".tcf", the database will be a fixed-length database.
- If its suffix is ".tct", the database will be a table database.
Tuning parameters can trail the filename, separated by "#". Each parameter is composed of a name and a value, separated by "=". For example, "casket.tch#bnum=1000000#opts=ld" means the database file is named "casket.tch", the bucket array size is 1000000, and the large and Deflate options are enabled.
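As a concrete illustration (a sketch of my own, not from the docs): the abstract database API accepts exactly this naming convention, so with the tokyocabinet Java binding (assuming its ADB class wraps the abstract tcadb API) you could do something like:

import tokyocabinet.ADB;

public class NameTuningSketch {
  public static void main(String[] args) {
    // ".tch" selects the on-disk hash engine; the "#" parameters request a
    // bucket array of 1,000,000 and the large + Deflate options.
    ADB adb = new ADB();
    if (!adb.open("casket.tch#bnum=1000000#opts=ld")) {
      System.err.println("open failed");
      return;
    }
    adb.put("key-1", "value-1");
    System.out.println(adb.get("key-1"));
    adb.close();
  }
}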
For disk-based storage, some tuning parameters control the on-disk layout while others control memory and caching. Changing the on-disk layout requires scanning and rewriting the database file, which needs exclusive access to the file – in other words, taking the database offline. The rewrite is done with tools provided in the distribution (e.g. tchmgr and tcbmgr). Changing the memory and caching settings only requires a restart of Tokyo Tyrant.
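If you'd rather go through the native API than the command-line tools, the same offline rewrite is exposed as an optimize call. A rough sketch with the tokyocabinet Java binding, assuming HDB.optimize takes the same arguments as the C tchdboptimize function:

import tokyocabinet.HDB;

public class OfflineRetuneSketch {
  public static void main(String[] args) {
    HDB hdb = new HDB();
    // Needs exclusive writer access, so ttserver must be stopped first.
    if (!hdb.open("casket.tch", HDB.OWRITER)) {
      System.err.println("open error: " + hdb.errmsg(hdb.ecode()));
      return;
    }
    // Rewrite the file with a bigger bucket array and the large + Deflate
    // options; the -1 values are assumed to leave apow/fpow unchanged.
    if (!hdb.optimize(100000000L, -1, -1, HDB.TLARGE | HDB.TDEFLATE)) {
      System.err.println("optimize error: " + hdb.errmsg(hdb.ecode()));
    }
    hdb.close();
  }
}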
We've been working only with on-disk storage via the hash and B+ tree database engines. For a hash database, the tuning parameters for the on-disk layout are limited to the size of the bucket array and the size of an element in the bucket array (choosing 'large' gets you 64-bit addressing and addressable data greater than 2GB). When a hash database file is first created, space is allocated on disk for the full bucket array; for example, a database with a 100M bucket size and the 'large' option would start out at around 800MB. This region of the data file is accessed via memory-mapped IO. There is an additional 'extra mapped memory' setting which defaults to 64MB – I'm not sure what this is used for, but for performance more memory is always better.
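In the native API these settings are separate calls made before the file is opened. A minimal sketch with the tokyocabinet Java binding, assuming HDB.tune and HDB.setxmsiz mirror the C tchdbtune and tchdbsetxmsiz functions:

import tokyocabinet.HDB;

public class HashTuningSketch {
  public static void main(String[] args) {
    HDB hdb = new HDB();
    // On-disk layout: 100M buckets plus the 'large' option (64-bit addressing).
    // These only take effect when the file is created; -1 keeps the defaults
    // for record alignment (apow) and the free block pool (fpow).
    hdb.tune(100000000L, -1, -1, HDB.TLARGE);
    // Memory/caching: 512MB of extra mapped memory; this does not change the
    // on-disk layout and can differ from run to run.
    hdb.setxmsiz(512L * 1024 * 1024);
    if (!hdb.open("casket.tch", HDB.OWRITER | HDB.OCREAT)) {
      System.err.println("open error: " + hdb.errmsg(hdb.ecode()));
      return;
    }
    hdb.close();
  }
}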
For a B+ tree database, there are additional tuning parameters for the structure of the tree – how many members (links to child nodes) are in an interior non-leaf node and how many members are in a leaf node. Records are not stored in the leaf nodes themselves, but within 'pages'. The leaf nodes point to these pages; each page holds multiple records and is accessed via an internal hash database (and since this is a B+ tree, the records within a page are of course stored in sorted order). There is also a parameter for the bucket size of this internal hash database. One subtle detail is that the bucket size for a B+ tree database is the number of pages, not the number of elements (records) being stored – so this would likely be a smaller number than for a hash database holding the same number of records.
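A corresponding sketch for the B+ tree engine, again assuming the Java binding's BDB.tune and BDB.setcache mirror the C tcbdbtune and tcbdbsetcache (the numbers are only illustrative):

import tokyocabinet.BDB;

public class BtreeTuningSketch {
  public static void main(String[] args) {
    BDB bdb = new BDB();
    // Tree structure: 512 records per leaf page, 256 links per non-leaf page,
    // and a bucket array sized for the number of pages (not records).
    bdb.tune(512, 256, 1000000L, -1, -1, BDB.TLARGE);
    // Caching: how many leaf and non-leaf nodes are held in memory.
    bdb.setcache(65536, 10240);
    if (!bdb.open("casket.tcb", BDB.OWRITER | BDB.OCREAT)) {
      System.err.println("open error: " + bdb.errmsg(bdb.ecode()));
      return;
    }
    bdb.close();
  }
}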
I've not yet figured out how the dfunit tuning parameter works or what impact that has on a running server, but it looks interesting.
The tuning parameters for each storage type are listed below; a combined usage sketch follows the list.

In memory hash
- bnum: the number of buckets
- capnum: the capacity number of records
- capsiz: the capacity size of memory usage; records that spill over the capacity are removed in storing order

In memory tree
- capnum: the capacity number of records
- capsiz: the capacity size of memory usage; records that spill over the capacity are removed in storing order
Hash
- opts: "l" for the large option (the database can be larger than 2GB, using a 64-bit bucket array), "d" for the Deflate option (each record is compressed with Deflate encoding), "b" for the BZIP2 option, "t" for the TCBS option
- bnum: the number of elements in the bucket array. If not more than 0, the default of 131071 (128K) is used. The suggested size is about 0.5 to 4 times the number of records to be stored.
- rcnum: the maximum number of records to be cached. If not more than 0, the record cache is disabled (the default).
- xmsiz: the size of the extra mapped memory. If not more than 0, the extra mapped memory is disabled. The default is 67108864 (64MB).
- apow: the record alignment as a power of 2. If negative, the default of 4 (2^4 = 16) is used.
- fpow: the maximum number of elements in the free block pool as a power of 2. If negative, the default of 10 (2^10 = 1024) is used.
- dfunit: the unit step number of auto defragmentation. If not more than 0, auto defragmentation is disabled (the default).
- mode: "w" for writer, "r" for reader, "c" for create, "t" for truncate, "e" for no locking, "f" for non-blocking lock
B+ tree
- opts: "l" for the large option, "d" for the Deflate option, "b" for the BZIP2 option, "t" for the TCBS option
- bnum: the number of elements in the bucket array. If not more than 0, the default of 32749 (32K) is used. The suggested size is about 1 to 4 times the number of pages to be stored.
- nmemb: the number of members in each non-leaf page. If not more than 0, the default of 256 is used.
- ncnum: the maximum number of non-leaf nodes to be cached. If not more than 0, the default of 512 is used.
- lmemb: the number of members in each leaf page. If not more than 0, the default of 128 is used.
- lcnum: the maximum number of leaf nodes to be cached. If not more than 0, the default of 1024 is used.
- apow: the record alignment as a power of 2. If negative, the default of 8 (2^8 = 256) is used.
- fpow: the maximum number of elements in the free block pool as a power of 2. If negative, the default of 10 (2^10 = 1024) is used.
- xmsiz: the size of the extra mapped memory. If not more than 0, the extra mapped memory is disabled (the default).
- dfunit: the unit step number of auto defragmentation. If not more than 0, auto defragmentation is disabled (the default).
- mode: "w" for writer, "r" for reader, "c" for create, "t" for truncate, "e" for no locking, "f" for non-blocking lock
Fixed-length
- width: the width of the value of each record. If not more than 0, the default of 255 is used.
- limsiz: the limit size of the database file. If not more than 0, the default of 268435456 (256MB) is used.
- mode: "w" for writer, "r" for reader, "c" for create, "t" for truncate, "e" for no locking, "f" for non-blocking lock
Table
- opts: "l" for the large option, "d" for the Deflate option, "b" for the BZIP2 option, "t" for the TCBS option
- idx: the column name of an index and its type, separated by ":"
- bnum: the number of elements in the bucket array. If not more than 0, the default of 131071 is used. The suggested size is about 0.5 to 4 times the number of records to be stored.
- rcnum: the maximum number of records to be cached. If not more than 0, the record cache is disabled (the default).
- lcnum: the maximum number of leaf nodes to be cached. If not more than 0, the default of 4096 is used.
- ncnum: the maximum number of non-leaf nodes to be cached. If not more than 0, the default of 512 is used.
- xmsiz: the size of the extra mapped memory. If not more than 0, the extra mapped memory is disabled. The default is 67108864 (64MB).
- apow: the record alignment as a power of 2. If negative, the default of 4 (2^4 = 16) is used.
- fpow: the maximum number of elements in the free block pool as a power of 2. If negative, the default of 10 (2^10 = 1024) is used.
- dfunit: the unit step number of auto defragmentation. If not more than 0, auto defragmentation is disabled (the default).
- mode: "w" for writer, "r" for reader, "c" for create, "t" for truncate, "e" for no locking, "f" for non-blocking lock
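To tie the reference back to the naming convention, each of these parameters can trail the database name after "#". Here is a sketch of the different engines opened through the abstract Java API – the specific values, file names, and the "lex" index type for the table database are illustrative assumptions:

import tokyocabinet.ADB;

public class EngineNamesSketch {
  public static void main(String[] args) {
    String[] names = {
      "*#bnum=1000000#capnum=500000",                       // in-memory hash, capped at 500K records
      "+#capsiz=268435456",                                 // in-memory tree, capped at 256MB of memory
      "casket.tch#bnum=100000000#opts=l#xmsiz=268435456",   // on-disk hash
      "casket.tcb#lmemb=512#nmemb=256#bnum=1000000",        // on-disk B+ tree
      "casket.tcf#width=255#limsiz=536870912",              // fixed-length
      "casket.tct#bnum=1000000#idx=name:lex"                // table, lexical index on the "name" column
    };
    for (String name : names) {
      ADB adb = new ADB();
      if (adb.open(name)) {   // each open creates/loads the matching engine
        System.out.println("opened " + name);
        adb.close();
      } else {
        System.err.println("could not open " + name);
      }
    }
  }
}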
6 comments:
Hello,
Very interesting post, thanks.
But I have one question. Right now we are working with the Tokyo Cabinet API. Everything is working wonderfully, very fast (200,000 entries fetched in less than 1 second).
But when switching to Tokyo Tyrant, using the same data file, performance is about 200,000 entries inserted in 20 seconds and fetched in 20 seconds... very, very bad performance, and the client is on the same machine as the server.
Could you provide some more details on how you tuned your TC database, and on the performance you see?
Thanks
Ziied,
I apologize for not noticing your post sooner. Be sure to post your question on the Google Groups discussion forum as well - http://groups.google.com/group/tokyocabinet-users.
We have a few use cases for a key-value store like Tokyo Tyrant. For one, we expect to have 50M-100M records and do mostly inserts, occasional reads, and very few updates. For this we've chosen the B+ tree storage, with parameters bnum=100000000, nmemb=256, and lmemb=512. The ttserver runtime parameters for caching are ncnum=10240 and lcnum=65536. We also use xmsiz=536870912.
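(Rolled into a single database name with the "#" syntax described in the post, that configuration would look roughly like "casket.tcb#bnum=100000000#lmemb=512#nmemb=256#lcnum=65536#ncnum=10240#xmsiz=536870912" – the file name itself is just an example.)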
Thanks for your reply.
We will play with those parameters to see if performance improves.
I've been looking for these params. Thanks. BTW, where did you find them? Did you just deep-dive the code and figure that out?
Hi Mike,
Thanks for your nice post. However, you said "If it is "+", the database will be an in-memory tree database." – does that apply to BDB? Because I tried:
BDB bdb = new BDB();
bdb.open("casket.tcb", BDB.OWRITER | BDB.OCREAT);
In place of "casket.tcb" I put "+", but it still creates a file named "+". Can you kindly help me with this?
Regards,
Arup