I developed software to find a maximum common subgraph (MCS) given a
set of molecules represented as a chemical graph. It's called fmcs. My previous three
essays were about the background
of the MCS problem, introducing fmcs, and an example of when MCS is used.
What I didn't describe was the mental effort it took to develop this
program. This is the second time I've written code to find the
multiple structure MCS, and both times it took a couple of months and
put my head in a very strange place. You would think the second time
is easier, but it means that I spent more time adding features and
doing things that my first version couldn't begin to handle. (Did
someone just mutter "second system effect"? Pshaw!)
This time too I had a better understanding of the development process.
I think I know why it's so much harder than most of the software I
develop.
In this essay, I reflect on some of the reasons why this is a hard
problem to test and I consider how unit tests, or any other
incremental-based testing approach, are not well-suited to a certain
class of complex algorithm development. While unit tests can provide a
basic sanity check during development, they fail to provide the test
coverage which one might expect from applying them to algorithmically
simpler modules. Further, development methodologies built primarily on
unit tests, like test-first style test-driven development, aren't that
applicable. Instead, other methods, like system testing and a deep
understanding of the algorithm and possible problems, are
required. This strengthens my view, explored in
Problems with TDD
that "TDD is a weak method for developing those tests of confidence."
How would you implement an MCS algorithm?
Think about the problem for a moment. How would you write an algorithm
to find the largest common subgraph between two graphs?
Take your time. This really is a hard problem. I dug up some of the
early papers from the Journal of Computer Documentation. People ended
up simplifying the problem by, for example, not supporting rings.
Finished thinking?
You probably came up with a graph walking algorithm which tries
different match pairings and uses backtracking or lazy evaluation to
search all of graph space. The more mathemtically inlined might have
converted this into a maximum clique algorithm.
In both cases there are a lot of arbitrary decisions to make. Which
pairings should you investigate first? Are there times when you cann
prune the search space? Did you make any assumptions about the nature
of the chemical graph (eg, that it's topologically a planar graph)?
How would you test your MCS algorithm?
Back in the late 1990s, I almost never wrote automated tests,
and never wrote an extensive test suite. Nowadays I'm pretty thorough,
and use coverage-guided techniques to get good, extensive tests
through some semi-permanent API layer that will allow me to refactor
the internals without changing the tests.
I couldn't do that here.
There are a lot of heuristics, and some of them are only triggered
under unusual circumstances, after a bunch of combinitorial
possibilities. Outside of a few minor components, I couldn't figure
out how to write unit tests for the code.
I ended up pushing most of the testing into validation testing of the
complete system, which meant I wrote some 1,000+ lines of code
without strong testing. Moreover, I used a new approach to the MCS
problem, so the algorithm I was working on doesn't have the track
record of the standard clique-based or backtracking approaches.
So stress factors included not knowing if the algorithm would work,
and not being able to develop enough test cases to provide good
validation during development.
As a rule of thumb, it's easiest to fix bugs which are caught
early. Unit tests and evolutionary prototyping are two of the
techniques that people use to tighten the feedback loop between
specification, implementation, and valiation. I think another stress
factor is propotional to the size of the feedback loop.
What testing did I do?
I mean, I did have tests during development. I came up with a few
examples by hand, I did a substructure search of a large data set and
I verified that the MCS code found that substructure, but I know
that's not enough tests. I know this because after six weeks of
development and over 1000 lines of code, I spent another three weeks
doing a large amount of post-development testing, and found several
bugs. In the process of writing this essay I also found that four days
of that development work ended up making things slower, so I'll have
to remove it.
I did most of my tests based on the ChEMBL data: 10,000 random
pairs of structures, the k=2, k=10, and k=100 nearest neighbors, and
the k<=100 neighbors with similarities of at least 0.95, 0.9, and
0.8. I also did, at the end, tests based on the ChEBI structure
ontology. There were easily 20,000 different machine-generated test
cases, although in most cases I didn't know the expected results
beforehand.
What bugs did I find?
What bugs did I find? I think it's educational to characterize a few of them.
Typo caught by an assertion check
One of the simplest bugs was a poorly formatted string. I used "%d"
when I should have used ""%%%d". A bit of jargon for those in the
know; I generate SMARTS strings for the intermedate subgraphs. If
there are more than 9 open rings then the syntax goes from a single
digit closure number to a closure number like "%10". I forgot to
include the '%' for that case, and probably because the '%' was
already there for the format string.
This wasn't triggered by my random-pairs test nor my various
similarity-search based tests. Only when I ran through the ChEBI data,
did I get an assertion failure when RDKit refused to accept my SMARTS
string. That was the first time where I had a SMARTS with 10 unclosed
rings.
As it happens, this error could have been caught by a unit testing, as
some people practice unit tests. It's a four line function which takes
a string and a number. Testing it is trivial. I didn't test it because
I feel that testing directly against internal functions inhibits
refactoring. I prefer to test against higher-level, more stable APIs.
My view is that it's usually easy to come up with high-level test
cases which trigger a certain function call. But not in this case. The
MCS search algorithm, while deterministic, uses so many arbitrarily
defined values that I couldn't on my own come up with a test case with
at least 10 open ring closures. And even if I did, a new search
heuristic might change things so that only, say, 7 open ring closures
were needed.
I felt that my system testing and the assertion check would be
enough to identify if there was a problem, and it did. A low-level
unit test might have helped, especially as I still don't have a
specific test for that failure.
I think the right thing to do is add that failure as a specific test
case, and use monkey-patching to insert wrong code for a repeat of
that test case. The first one tests that the code is correct, and the
second tests that the test case is still exercising the code under
test.
Cross-validation testing
The first MCS papers are about as old as I am. Many people have
written implementions, although relatively few are are available to me
both at no cost and with no prohibitions on using it to develop a new
MCS algorithm. (Some commercial companies don't like people using
their software to write new software which is competitive to it, or
even to use their software for benchmark comparisons.)
I tested pairs of structures against SMSD and I
did more extenstive tests against Indigo's MCS
implementation.
This is cross-validation testing. It's a relatively rare technique
because the cost of producing multiple identical implementations
usually isn't worth the benefits. Even here the results aren't exactly
identical because of differences in how the toolkits perceive
chemistry, and more specifically, aromaticity. I ended up spending a
lot of investigation time staring at cases with different answers and
trying to figure out if it was a chemistry problem or an MCS algorithm
implementation problem.
I found the SMSD had a bug in one of its options, which I reported
back to the author. The code had been fixed internally but not pushed
to the outside world. Its default mode and fmcs matched quite well,
except for a couple of chemistry differences. The new version is out
now - I need to test it again.
The only problem I found in the Indigo code was a part of their setup
code which didn't check the timeout. That's also been fixed after I
reported it.
The cross-validation with Indigo found problems in my code. For
example, I was often getting smaller MCSes then Indigo. After looking
at them, I figured out that my code didn't correctly handle the case
when a molecule was fragmented after a bond was removed because its
bondtype wasn't in all of the structures.
Why didn't my hand-written test cases find it? None of them had a case
where there was a bond in the "middle" of a structure chosen as the
reference structure, and where the MCS was not in the first fragment.
My code usually got the right answer when using highly similar
structures, for obvious reasons. It was only the random pair testing
where the problem really stood out.
Could I create a simple unit test for this error? Perhaps, but it's
not easy. I don't know which of the two fragments will be first - it
depends on so many arbitrary decisions which could change as the
algorithm is improved. The only test I can think of for this was to
generate a diverse set of tests, make sure some fail if the code isn't
implemented correctly, record the results, and make sure that future
tests never find a worse (or better?) test case.
Bad heuristic to determine the maximum possible subgraph growth
Most of the MCS implementation is heuristics. There's a branch and
bound search, there's subgraph canonicalization to reduce the number
of substructure tests, and so on. Each of these is supposed to help
make the code faster.
One of the tests takes the current subgraph and the list of "outgoing"
bonds to see how much of the remainder of the graph is accessible for
growth from the current subgraph. The rest of the molecule might not
be accessible because it's on another fragment, but it also might not
be accessible due to an earlier decision to exclude certain parts of
the molecule from future growth. (My algorithm tests all subgraphs
which include a given bond, then all subgraph which don't include the
given bond.)
It took a couple of days to think of the algorithm, write the code,
and get it working. I then did the timing tests to find out it was 1%
faster, to get the same answers.
Does that mean my code worked? I only had a suspicion that the new
algorithm should be faster. Perhaps the overhead of searching for the
accessible atoms was too costly?
After looking at my implementation - a lot - I finally realized that I
told the algorithm to exclude the subgraph from consideration for
growth but had forgotten to also exclude the set of previously
excluded bonds from consideration. With two changed lines, the overall
performance doubled for the random-pairs case. Most of the time it's
faster and sometimes its slower, so it takes an aggregate of tests to
measure this correctly.
I don't think this heuristic could have been written as a unit
test. No, I take that back. It could have, but it would have required
some careful thought to set up. Not only was this part of the code not
"built for test", but setting up the right conditions requires a
mental understanding of the problem which I know I didn't have when I
was doing the development.
BTW, I am not saying that unit tests can't be used to measure
performance. Some algorithms have simple timing characteristics where
it's easy to say that the code must complete by a given time, or that
it must be at least twice as fast as a reference
implementation. Occasionally these tests will hiccup due to unusual
loads on the test machine, but not usually enough to be a problem.
Indeed, the fmcs code has some unit tests for the timeout code. I
found a pair of structures which takes over 10 seconds to find the
MCS, I set the timeout to 0.1 second, and assert that no more than 0.5
seconds elapsed during the function call. (I don't require a high
precision for this code.) I do worry though that future changes to the
MCS search code might speed things up by a faster of 20. On the other
hand, the effect of broken timeout code is very obvious when running
the validation suite, so I'm not going to worry about it.
As is widely acknowledged, this stretches the idea of a "unit test."
Unit tests are supposed to be fast; preferably several hundred or more
per second. The goal is that these tests run often, perhaps even after
every save. Stick in a few 0.1 second timeouts into the system and it
bogs everything down, which discourages people from running the tests
so often.
But let's go back to this new code. On average, over a large number of
tests, the performance is 50% faster. What's a good test case? Can I
pick one structure or a small set of structures? Does the speedup
occur only when there are large molecules? Only for molecules with
lots of internal symmetry? Only for those which are easily
fragmented?
Only now, when writing this essay, did I find good test cases. When I
use a 10 second timeout using the k=10 nearest neighbors tests, I get
8 timeouts in the first 32 records using the old algorithm, and only 1
timeout using the new algorithm. The time for the very first test goes
from over a minute (I finally gave up) using the old algorithm to 0.08
seconds using the new one.
Obviously (in retrospect) the right solution is to pull out a couple
of those from my validation suite and turn them into test cases.
I never would have come up with these test cases before implementing
the new code - up until ten minutes ago I thought that the new code
only added a factor of two the code, not possible factors of 1,000!
Canonicalization: an in-depth analysis
I'm not finished with testing. I haven't fully characterized all the
parts of the implementation. For that matter, there are heuristics I
can still add. Here I'll describe what I did to analyze subgraph
SMARTS canonicalization. I conclude by finding that while it works, it
slows down the code and several days of development work ought to be
removed.
My MCS algorithm enumerates subgraph of one structure, converts that
into a SMARTS string, and tests if the pattern exists in other
structures.
There are many alternate ways to make a SMARTS string from a
subgraph. Given the linear chain of C O N, there's the reasonable CON
or NOC variations, the more confusing O(C)N and O(N)C, and crazy
variations like C1.N1O and O%987.N7.C%98. It's easy to make the
non-insane versions be generating a spanning tree. The question is
where to start the search and how to handle branches.
I believe that there are many duplicate patterns in a query. For
example, a benzene ring will generate many "ccc" queries. I can
minimize the number of substructure matches by caching earlier SMARTS
match results. But caching doesn't work if sometimes I get a CON and
sometimes I get a NOC. The solution is to generate a "canonical"
SMARTS for the subgraph - a unique representation for all of the
variations.
I implemented the CANGEN algorithm from
SMILES. 2. Algorithm
for generation of unique SMILES notation by Weininger, Weininger, and Weininger
J. Chem. Inf. Comput. Sci., 1989, 29 (2), pp 97-101 in order to choose
a canonical SMARTS. (But see
Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples, Neglur, Grossman, and Liu, 2nd International Workshop on Data Integration in the Life Sciences (DILS 2005), La Jolla.)
Canonicalization is notoriously hard to get right. It too has a lot of
tricky corner cases. For example, in the late 1990s Daylight tracked
down a reported bug in their canonicalization implementation. They
algorithm requires a stable sort algorithm, but they used an unstable
sort. Their own internal testing never found the problem - it was a
customer who found the Solaris and IRIX boxes occasionaly returned
different values.
There could be similar bugs in my canonicalization algorithm. I don't
know. The usual solution is to generate large numbers of test cases,
randomize the order of the atoms and bonds, and verify that all of
them are the same. Here too it may be that only certain rare
topologies, or rare bond patterns, triggers an error. I'm not
convinced that I could come up with the right set of test cases for
it.
What makes it harder is that the underlying search algorithm doesn't
need a canonical SMARTS, so a canonicalization failure doesn't ever
show up in the output. I could have an infrequent bug and not notice
it! Canonicalization is only there for performance reasons, but my
implementation is in Python, while the substructure matching is in
C++, so it might even be that the canonicalization time might be more
than the time saved.
I can make that a bit more nuanced. Canonical SMARTS generation has
three phases: 1) initial ranking, 2) canonicalization/tie-breaking,
and 3) SMARTS generation. I can disable the canonicalization step and
get a "semi-canonical" SMARTS (my initial ranking is more like a
circular fingerprint of diameter 3 than the atom characteristic from
Weininger et al.). I can also disable caching. Both of these are
doable with only a couple of changes to the code.
This leads to three cases: canonical SMARTS with caching,
semi-canonical SMARTS with caching, and semi-canonical SMARTS without
caching. (There should be a fourth step, which is arbitrary SMARTS,
but that requires more extensive changes.)
What I want to understand most is the effect of canonicalization on my
code. Again, I have to resort to aggregate timings and statistics
across a set of benchmarks. I used the ChEMBL-13 data set and
generated various benchmarks. One is based on computing the MCS
between 10,000 pairs selected at random, another contains the k=2,
k=10, and k=100 nearest neighbors of randomly chosen fingerprints
(timing 500 data sets each time), and the last is the total of 500
tests of the k<=100 compounds with Tanimoto score of at least 0.95
to randomly selected fingerprints. My results (times are reproducible
to within a few percent) include the number of unique SMARTS
(canonical or non-canonical) generated and the total number of
substructure tests ("SS") which were carried out.
| Random pair timings | # unique SMARTS | # SS tests |
| no cache | Total: 10000/467.3s (21.4/s) Complete: 9997/437.3s (22.9/s) Incomplete: 3/30.0s | 921666 | 1339292 |
| semi-canonical | Total: 10000/421.3s (23.7/s) Complete: 9997/391.3s (25.6/s) Incomplete: 3/30.0s | 921666 | 921666 |
| canonical | Total: 10000/442.1s (22.6/s) Complete: 9997/412.0s (24.3/s) Incomplete: 3/30.0s | 709636 | 709636 |
| k=2 nearest neigbhors | # unique SMARTS | # SS tests |
| no cache | Total: 500/287.3s (1.7/s) Complete: 484/127.1s (3.8/s) Incomplete: 16/160.2s | 264057 | 320238 |
| semi-canonical | Total: 500/276.7s (1.8/s) Complete: 484/116.6s (4.2/s) Incomplete: 16/160.2s | 264057 | 264057 |
| canonical | Total: 500/298.6s (1.7/s) Complete: 483/128.4s (3.8/s) Incomplete: 17/170.2s | 186682 | 186682 |
| k=10 nearest neigbhors | # unique SMARTS | # SS tests |
| no cache | Total: 500/520.5s (1.0/s) Complete: 471/230.0s (2.0/s) Incomplete: 29/290.5s | 447648 | 3525563 |
| semi-canonical | Total: 500/490.1s (1.0/s) Complete: 472/209.5s (2.3/s) Incomplete: 28/280.6s | 447648 | 2872040 |
| canonical | Total: 500/509.3s (1.0/s) Complete: 471/218.7s (2.2/s) Incomplete: 29/290.6s | 330010 | 2109332 |
| k=100 nearest neigbhors | # unique SMARTS | # SS tests |
| no cache | Total: 500/414.4s (1.2/s) Complete: 486/271.4s (1.8/s) Incomplete: 14/143.0s | 128781 | 9932263 |
| semi-canonical | Total: 500/363.5s (1.4/s) Complete: 487/230.8s (2.1/s) Incomplete: 13/132.7s | 128781 | 7456877 |
| canonical | Total: 500/361.8s (1.4/s) Complete: 488/239.5s (2.0/s) Incomplete: 12/122.4s | 107648 | 6196764 |
| k<=100 at or above Tanimoto threshold of 0.95 | # unique SMARTS | # SS tests |
| no cache | Total: 500/642.1s (0.8/s) Complete: 458/220.3s (2.1/s) Incomplete: 42/421.9s | 468661 | 4388074 |
| semi-canonical | Total: 500/624.5s (0.8/s) Complete: 461/232.8s (2.0/s) Incomplete: 39/391.7s | 468661 | 3828813 |
| canonical | Total: 500/640.9s (0.8/s) Complete: 460/239.2s (1.9/s) Incomplete: 40/401.8s | 364802 | 3222366 |
Whew! That's a lot of data to throw at you. Please believe me when I
say that it took two days to generate correctly. BTW, "complete" means
that the MCS search algorithm went to completion, the "incomplete"
means that it timed out - here after 10 seconds - and gave a partial
solution.
What does this tell me? Obviously, caching is good. As expected, the
"semi-canonical" solution is always better than the "no cache" case,
and the canonicalization always reduces the number of substructures
and substructure tests. This means the canonicalization code is doing
something right.
Unfortunately, it seems like canonicalization has a big performance
impact. In the pairwise tests I do 922,000 canonicalizations in Python
to save 212,000 substructure tests, and I lose 21 seconds in doing
so. For the k=100 nearest neighbor benchmark, which is the only one
where the canonicalization code saves time, I do 129,000
canonicalizations to save 1,260,000 comparisons, and gain about 2
seconds. This suggests that each canonicalization takes as long as 10
substructure matches, and that RDKit can do 60,000 SMARTS matches per
second. A quick checks using one SMARTS gives 100,000 SMARTS matches
per second, which helps validate this estimate.
This means I should take out the canonicalization until it can be
implemented in C++. Granted, there may be a bug in the code, like
there was with an earlier hueristic. But there would have to be an
order of magnitude performance increase for this to be effective, and
I don't think that's likely.
Unit tests aren't enough to drive development
Which leads me back to my thesis. It would have been possible to
develop a few more unit tests for the canonicalization code. There
would have been some extra scaffolding in order to do that, but that's
a minor cost. I suspect that many of the tests would be very
implementation-dependent, and tied to specific internal function
calls. I don't like this because it means that a re-implementation of
this internal component in C++, which submerges many of the internal
helper functions and places them out of reach of Python-based unit
tests, would cause the unit tests to be un-runnable. And I am certain
that those unit tests would not be rigorous enough to be confident
that the code was working correctly.
Think about the question "should I write this code in Python or C++?".
It's a development-time question. Test Driven Development (TDD) is a
development methodology which uses unit tests to help make
development-time decisions. I think TDD is most helpful when
used to establish the minimum requirements for a project.
I don't see how TDD is helpful here. I can't think of any unit tests
which would guide this development decision. Sometimes you can write a
"spike solution" (also called a "prototype"):
A spike
solution is a very simple program to explore potential
solutions. Build the spike to only addresses the problem under
examination and ignore all other concerns.
This sounds good, but what development style should you use to develop
the spike solution? Spikes are supposed to be throw-away code, but I
can't figure out how something other than a complete, tested
implementation of the canonical algorithm or the remaining growth
heuristics would have been useful. After all, a two line change in the
latter went from "working but 1% faster" to "working and 50% faster",
and I didn't even know if there was going to give a speedup in the
first place.
If not TDD, what methodology do I use?
Read Peter Seibel's Unit
testing in Coders at Work. The author interviewed various famous
programmers and learned more about how they tested their software. You
may also want to read the related comments on Hacker
News.
Seibel recounts that Joshua Bloch "wrote a monstrous 'basher'" which
found "that occasionally, just occasionally, the basher would fail its
consistency check." Bloch then wrote tests to pin down the failure,
eventually finding the fault in a system mutex. Note that under normal
unit test development you assume that your underlying components are
correct, so this isn't something you would normally code for.
Seibel also reports that others assemble a large amount of test cases
and uses that to check their code. This is of course the technique I
used for the MCS problem.
It feels as trite as saying that the secret to losing weight is diet
and exercise, but my method for programming is to understand the
problem, implement it, and test it until you are confident that it's
good enough. That knowledge of how to do that comes from experience,
practice, reflection, and discussion.
For those who scoff and call this "big design up front," I merely
point you to earlier parts of this essay. When I started this code I
had a rough idea that it would work. I had previously implemented substructure
enumeration, and the Weininger et al. canonicalization algorithm
and SMARTS generation, so I knew that the components were possible,
but I didn't know how they went together. Instead, I let the code
development itself guide me. I thought some, I implemented code, I
"ran the code in my head", I reflected on what the tricky parts were,
I thought about how to make them more clear, and I tried various
was to improve the code.
That was a lot to keep in my head, I wasn't sure if the result would
be fast enough, and I had no way good way to test its usefulness until
most of the code was in place. No wonder it was mentally taxing!
I believe tests - including unit tests implemented during development
- are important. I don't believe that unit tests are good enough to
guide complex algorithm development. However, most programming (over
95%?) is not complex algorithm development; complicated? yes, but not
complex.
I once jokingly said that TDD is not useful if there's two or more
embedded for loops. That's a bit of a simplification, but not far from
my feelings. Some classes of problems, like complex algorithm
development and security analysis, require a different attitude
towards programming, emphasizing "what can go wrong?" and "how can I
improve my confidence that the code is working correctly?" It's my
view that this doubt-based philosophical attitude is missing from most
discussions of software development practices.
And perhaps as fundamental, when I have to be that critical of my own
work and probe it for failures and be open to the possibility of nasty
gotchas lurking in the deep dark corners, then some of that doubt
passes over into my personal life. I empathize with the idea of not
living in that "strange place" all the time, and I can see why this
isn't a common attitude.
Comments
This essay is meant to be a thoughtful reflection on the difficulties
of programming, using the MCS problem to provide specific
structure. If you want to leave
comments about the MCS portion then use the MCS comment site.
Otherwise, leave
a comment on testing hard algorithms.