mike watkins dot ca : June 29 2004 Archives

June 29 2004

Sometimes too much open source...

Noted in Daily Python-URL : An acquisition, search and retrieval system based on Zope/Plone (PDF).

The case study follows a group as they examined, prototyped / discarded approaches, and eventually implemented a relatively high volume image capture solution. I say relatively since 380,000 “documents” is fairly trivial for any document management system worth its salt.

Nevertheless, I found this case study quite interesting since it’s in the field I spent most of the last decade in.

My quick conclusion is that they over-relied on open source tools to get the job done; I also wonder about the choice of Zope for such a first application. It’s almost as if they were trying to force-fit a solution.

While it’s true that the client may well inherit all the additional fab features of Zope down the road for free, in my experience many purpose-built image capture solutions are either best left that way, or are doomed to remain that way by financial or implementation constraints.

In either case, my approach would have been to identify quick-hit areas where the task could be done without turning it into a bigger project than required.

First off, there is no way ZODB on its own would be able to manage millions of objects. The images have to be pulled out of the ZODB component of Zope. That’s a given, and the case study underlines this. Years of experience with object databases in the imaging world have taught us this – file systems are better for these objects. In particular since the base image object itself will never change there is no point putting it in a data store which is designed more for dynamic objects rather than raw scalability.

Commercial product vendors tried using the “blob” object in a database approach in years past and failed. It made storage management tricky then and it’s no less tricky now. A filesystem is much more robust than a huge ZODB file sitting on a filesystem.
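To make the file system argument concrete, here is a minimal sketch of what I mean, and not what the case study authors built. It assumes each scanned image is an immutable blob of bytes, and the function names (store_image, load_image) and the two-level directory layout are my own illustration. The image goes on disk under a name derived from its content; only the short key needs to live in whatever database holds the metadata.

```python
import hashlib
import os
import tempfile


def store_image(root, data):
    """Write immutable image bytes under root, keyed by their SHA-256 digest.

    Hypothetical helper for illustration; returns the key to keep in the
    metadata store alongside the index fields.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Shard into two directory levels so no single directory grows huge.
    directory = os.path.join(root, digest[:2], digest[2:4])
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, digest)
    if not os.path.exists(path):
        # Write to a temp file in the same directory, then rename it into
        # place, so a reader never sees a partially written image.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_path, path)
    return digest


def load_image(root, digest):
    """Read back the bytes stored under the given key."""
    path = os.path.join(root, digest[:2], digest[2:4], digest)
    with open(path, "rb") as handle:
        return handle.read()
```

Since the image never changes once captured, ordinary file system backup tools handle the bulk of the data, and the database stays small.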

My own tests with relatively small objects (stock transaction data) showed that ZODB, with or without ZEO, just wasn’t suitable for millions of records. And I have a hard time conceiving of any content system using ZODB as a result, since the base architecture of any content management system should scale from very small to very large in my opinion.

What Changes Would I Make

Certainly I would consider swapping out the open source components of the image capture system and use instead a commercial application such as Kofax Document Capture. It’s a robust, batch-oriented capture solution with plenty of hooks for systems integrators to bolt on data lookup and validation routines.

In the case study the authors state that the open source solution delivered huge productivity gains over the proprietary standalone application (for Windows). I’m not satisfied by that statement at all – I want to know what the specific weaknesses of the Windows application were.

In volume image capture the key areas are:

To put some context to this, I was once involved in a project where we had to rapidly deliver a very high volume, scalable solution for a securities commission. The basic business need was simple: it was February and the commission was moving offices in September. Their existing records storage room occupied almost an entire floor of their existing building, and the new building could not accommodate a records centre of the same magnitude without a million dollar investment in structural change and tenant relocation.

Estimation of the document conversion project came to many millions of documents to convert. Tight deadline, big project.

Document throughput per day is entirely dependent on the amount of data capture that has to be done in order to index the document. My guess is this application required a fairly minor amount of data capture – probably most or all of it would be obtained through performing a lookup against existing data stored in other business systems, and simply highlighting and accepting the related metadata. Most frequently, a meaningful document date might be needed.
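As a rough illustration of that kind of lookup-driven indexing, assuming the operator keys only an account number and a document date: everything in this sketch is hypothetical, including the EXISTING_ACCOUNTS table standing in for another business system and the field names on the index record.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IndexRecord:
    """Metadata stored against one captured document (illustrative fields)."""
    account_id: str
    client_name: str
    document_date: date


# Hypothetical stand-in for data already held in another business system.
EXISTING_ACCOUNTS = {
    "A-1001": "Example Holdings Ltd.",
    "A-1002": "Sample Trading Co.",
}


def index_document(account_id, document_date, accounts=EXISTING_ACCOUNTS):
    """Build the index record from one keyed value plus a document date.

    The operator types the account number; the rest of the metadata is
    pulled from existing data rather than keyed by hand.
    """
    try:
        client_name = accounts[account_id]
    except KeyError:
        # No match: the document needs manual attention instead of a guess.
        raise LookupError(f"unknown account {account_id!r}") from None
    return IndexRecord(account_id, client_name, document_date)
```

The fewer fields an operator has to key per document, the more documents get through per day, which is why the lookup matters more than the scanner.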

The amount of metadata required also depends on where in the business process lifecycle the documents are captured. The case study does not go into this, but it does read as if this is an after-the-fact type of system where correspondence and/or forms are being stored as backup to business processes transacted on arrival of these documents.

If this is true, the type of solution should most definitely be geared to the lowest cost, fastest to implement solution possible, since these documents are not nearly as high value as they are at the start of the business process. Simple archival solutions rarely have a high business payback. While sometimes necessary or desirable, they don’t deliver big value.

Better yet, a workflow-driven solution where the document images themselves or in part spawn business processes is more likely to deliver higher value to the organization and its clients.

If I were to do a project like this today, I would most certainly avail myself of open source solutions for the storage and presentation layers – a high reliance on UNIX or Windows file systems and backup technology; Apache webserver; a Python-based web framework – Quixote would be ideal in my view. I would use Kofax Capture for the imaging front end. One scanner, and a backup, would suffice, with multiple operators performing indexing duties.

The entire solution could be put together – end to end – in a couple of weeks. Productivity would be extremely high, and the investment in custom Python development would be fairly minimal. An integrator with their own toolkit built around a simple web application framework, like Quixote, would probably have most of the needed components ready to piece together rather rapidly.

Keeping the initial cost and complexity low is key in these types of projects – I’ve been there, done that. Once you get one done, if you’ve not overburdened the organization in a financial or time commitment sense, they will be keen to look at additional areas to extend the first steps.