mike watkins dot ca : March 12 2009 Archives

March 12 2009

Job Losses in Canada

My own belief is if we were going to have some sort of big crash or recession, we probably would have had it by now. Stephen Harper, September 15 2008

Is it a good time now to buy stocks, Stephen Harper?

We have a cyclical downturn but nothing that requires major government intervention. Stephen Harper, March 1 2009

Tyrannical Databases

Inspired by a series of slides Michael Schurter published on Tokyo Cabinet and PyTyrant, I thought I'd code up his examples using another database which can use a key-value approach, Durus.

Durus is a ZODB work-alike which allows for easy persistence of Python objects, not just values. It's simple, fast, and useful.

Here's the baseline Tokyo Cabinet db example Michael published, using the pytc interface:

import pytc
db = pytc.HDB()
# HDB (hash database) takes the HDBO* open flags; the BDBO* flags belong to BDB
db.open('test.tch', pytc.HDBOWRITER | pytc.HDBOREADER | pytc.HDBOCREAT)
for i in range(256):
    v = chr(i)
    for x in range(256):
        db.put(chr(x), v)
        db.get(chr(x))

Running it:

$ time python test.py
real    0m0.168s
user    0m0.157s
sys     0m0.010s

And here is a Durus example, accessing a local file-based storage:

# Durus example 1 - File-based persistent dictionary
from durus.file_storage import FileStorage
from durus.connection import Connection

conn = Connection(FileStorage('test.durus'))
db = conn.get_root()
for i in range(256):
    v = chr(i)
    for x in range(256):
        db[chr(x)] = v
        db[chr(x)]
conn.commit()

Running it:

$ time python durus-test.py
real        0m0.197s
user        0m0.187s
sys         0m0.008s
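
The `time` figures above include interpreter start-up and module imports as well as the loop itself. As a rough sketch of how the loop alone could be timed in-process, here is the same 256 x 256 workload run against a plain dict as a baseline. The names `run_workload` and `time_workload` are made up for illustration, and the sketch assumes the store supports item assignment the way the Durus root does (pytc would need `put`/`get` instead):

```python
import time

def run_workload(db):
    """Run the same 256 x 256 write-then-read loop against any mapping."""
    for i in range(256):
        v = chr(i)
        for x in range(256):
            db[chr(x)] = v
            db[chr(x)]
    return db

def time_workload(db):
    """Return the seconds spent in the loop only, excluding start-up."""
    start = time.perf_counter()
    run_workload(db)
    return time.perf_counter() - start

# Plain dict baseline: no persistence at all, just the loop overhead.
baseline = {}
elapsed = time_workload(baseline)
print("plain dict: %.4fs for %d keys" % (elapsed, len(baseline)))
```

Subtracting a baseline like this from a store's timing gives a fairer idea of what the storage layer itself costs.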

Now let's change to client-server operation, delivering more or less the same abilities as PyTyrant/Tokyo Cabinet. A minor change to durus-test.py gives us a client:

# Durus example 2 - Remote access to a File-based persistent dictionary
from durus.client_storage import ClientStorage
from durus.connection import Connection

conn = Connection(ClientStorage())
db = conn.get_root()
for i in range(256):
    v = chr(i)
    for x in range(256):
        db[chr(x)] = v
        db[chr(x)]
conn.commit()

Between runs we'll remove the database file. We'll need a server running, so in another terminal let's fire one up:

$ rm test.durus
$ durus -s --file test.durus

Run the second example:

$ time python durus-remote-test.py
real        0m0.204s
user        0m0.189s
sys         0m0.013s

Let's use a more advanced container than a persistent dictionary, a BTree. First Tokyo Cabinet/pytc:

import pytc
db = pytc.BDB()
db.open('test.db', pytc.BDBOWRITER | pytc.BDBOREADER | pytc.BDBOCREAT)
for i in range(256):
    v = chr(i)
    for x in range(256):
        db.put(chr(x), v)
        db.get(chr(x))

Running pytc with the BTree:

$ time python test.py

real    0m0.169s
user    0m0.157s
sys     0m0.011s

Nice and fast - it's all C-based.

Now the Durus BTree code:

# Durus example 3 - File-based persistent BTree
from durus.file_storage import FileStorage
from durus.connection import Connection
from durus.btree import BTree

conn = Connection(FileStorage('test.durus'))
root = conn.get_root()
db = BTree()
root['db'] = db
for i in range(256):
    v = chr(i)
    for x in range(256):
        db[chr(x)] = v
        db[chr(x)]
conn.commit()

Running this we see a significant performance delta compared to the C-based pytc/Tokyo Cabinet, roughly eight times slower (1.3 seconds against 0.17):

$ time python durus-btree.py
real        0m1.319s
user        0m1.308s
sys         0m0.011s

The delta will tip back into Durus's favour in the next two examples.

# Durus example 4 - client-server access to a persistent BTree
from durus.client_storage import ClientStorage
from durus.connection import Connection
from durus.btree import BTree

conn = Connection(ClientStorage())
root = conn.get_root()
db = BTree()
root['db'] = db
for i in range(256):
    v = chr(i)
    for x in range(256):
        db[chr(x)] = v
        db[chr(x)]
conn.commit()

First, access to the BTree-based "db" via client-server:

$ time python durus-remote-btree-adding.py
real        0m1.691s
user        0m1.681s
sys         0m0.010s

Next we see that read-only access, remote or local, remains fast, even with the BTree structure:

$ time python durus-remote-btree-ro.py
real        0m0.054s
user        0m0.040s
sys         0m0.012s
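
The read-only script itself isn't shown here. As a guess at its shape, and only a guess, a read-only pass would look up keys already stored by example 4 and never assign or commit. The function name `read_all` and the number of passes are assumptions; a plain dict stands in for the BTree so the sketch runs without a server:

```python
# Hypothetical reconstruction of a read-only pass; the original script is not shown.
def read_all(db, passes=256):
    """Look up every single-character key `passes` times and never write."""
    reads = 0
    for _ in range(passes):
        for x in range(256):
            db[chr(x)]
            reads += 1
    return reads

# Stand-in mapping so the sketch runs without a Durus server.
stand_in = {chr(x): chr(255) for x in range(256)}
total_reads = read_all(stand_in)

# Against the server from example 4, the mapping would instead come from:
#   conn = Connection(ClientStorage())
#   db = conn.get_root()['db']
```

With no assignments there is nothing to commit, so the only work left is the lookups themselves.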

PyTyrant / Tokyo Cabinet has a nice simple API for accessing the remote server:

import pytyrant

t = pytyrant.PyTyrant.open('127.0.0.1', 1978)
for i in range(256):
    v = chr(i)
    for x in range(256):
        t[chr(x)] = v
        t[chr(x)]

PyTyrant client-server access to a BTree structure suggests room for future improvement, at about 11 seconds of wall-clock time against 1.7 for the Durus client:

$ time python pyt-test.py

real    0m11.151s
user    0m1.317s
sys     0m1.653s

Of course raw throughput isn't everything. Durus has persistent container types including Dictionary, BTree, Set and List. Keys in mappings can be any hashable object; values can be any pickleable object. Durus objects are Python objects, not merely strings or values.

Consider the following:

$ durus -c
Durus 127.0.0.1:2972
    connection -> the Connection
    root       -> the root instance
>>> from durus.persistent_dict import PersistentDict
>>> names = PersistentDict()
>>> root['names'] = names
>>> connection.commit()
>>> mike = 'Mike Watkins'
>>> fred = 'Fred Astaire'
>>> ringo = 'Ringo Starr'
>>> names[1] = mike
>>> names[2] = fred
>>> names[3] = ringo
>>> names[22] = fred
>>> id(names[2])
3082202976
>>> id(names[22])
3082202976
>>> connection.commit()

When we reconnect, we should expect the values within the mapping at keys 2 and 22 to be the same object:

$ durus -c
Durus 127.0.0.1:2972
    connection -> the Connection
    root       -> the root instance
>>> names = root['names']
>>> id(names[2])
3081790720
>>> id(names[22])
3081790720
>>> id(names[2]) == id(names[22])
True
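
Why would identity survive a reconnect? My understanding, offered as an assumption rather than a statement about Durus internals, is that the values of a persistent dictionary are pickled together with it, and pickle records a repeated reference once and restores it as a single shared object. The standard library alone shows the effect; a list is used for the shared value here because identity matters most for mutable objects:

```python
import pickle

# One value referenced from two keys, as with names[2] and names[22] above.
fred = ['Fred', 'Astaire']
names = {2: fred, 22: fred}

# Pickled together in one graph: the shared reference is preserved.
restored = pickle.loads(pickle.dumps(names))
same_after_reload = restored[2] is restored[22]

# Pickled separately: each load builds its own independent copy.
a = pickle.loads(pickle.dumps(fred))
b = pickle.loads(pickle.dumps(fred))
separate_copies_share = a is b

print(same_after_reload, separate_copies_share)
```

A store that serialises each value to its own string loses this: two keys that held the same object come back holding two equal but independent copies.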

Of late there seems to be plenty of interest in non-SQL database architectures, with CouchDB and Tokyo Cabinet among those getting attention, in part because they offer a language-agnostic solution.

For those many other times when a project will benefit from a persistence layer tightly coupled with the language, object databases like Durus or ZODB are worthy of consideration.