Wed, 13 Mar 2002 10:54:19 -0800 (PST)
On Wed, 13 Mar 2002, Elia Stupka wrote:
> I've been thinking that if we want to start using bioperl-db seriously
> over here we will be loading a lot of data, and data loading still seems
> relatively slow (talking about annotated files). Do you guys think there
> is room for improvement somewhere? The bioperl side of it? (parsers?) The
> biosql part of it (for fresh loads of data is it maybe too paranoid at
> the moment?)
1) The entry-centric approach
One problem is that updating an entire database takes as long as - or
longer than - a fresh load. The loader script doesn't know if an entry
needs updating or not, so it goes ahead and updates anyway. BioSQL is very
entry-centric so we could take advantage of that.
What if we used the Bio::Index/Fetch stuff rather than a Bio::SeqIO loop,
cached the md5 checksum of each entry either in the db or a separate file,
and only updated if it's different? Of course the fresh load will still be
slow, but once on the go it should be a lot faster, although I don't know
for sure; it depends how much data churn is going on in the source dataset.
md5 is pretty fast, and we could speed it up more by only computing md5
when the entry byte lengths match, since a changed length already means
the entry needs updating.
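
A minimal sketch of that check, in Python rather than perl for brevity. The
cache layout (accession -> (byte length, md5 hex)) and the names
needs_update, refresh and load_entry are made up for illustration and are
not part of bioperl-db. A length mismatch is taken as proof of change, so
md5 is only computed when the lengths match.

```python
import hashlib

def entry_fingerprint(raw):
    """Return (byte_length, md5_hex) for one raw flat-file entry."""
    return len(raw), hashlib.md5(raw).hexdigest()

def needs_update(accession, raw, cache):
    """cache is a hypothetical dict: accession -> (byte_length, md5_hex)."""
    cached = cache.get(accession)
    if cached is None:
        return True                      # never loaded before
    length, digest = cached
    if len(raw) != length:
        return True                      # cheap check: length changed
    # Same length, so fall back to the checksum.
    return hashlib.md5(raw).hexdigest() != digest

def refresh(entries, cache, load_entry):
    """Load only new or changed entries; return their accessions."""
    updated = []
    for accession, raw in entries:
        if needs_update(accession, raw, cache):
            load_entry(accession, raw)   # stand-in for the real db load
            cache[accession] = entry_fingerprint(raw)
            updated.append(accession)
    return updated
```

On a second pass over an unchanged source, refresh() should call load_entry
for nothing at all, which is where the saving would come from.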
Another option is to slice the input fasta up across a cluster and have
every node bang on the database in parallel. Postgresophiles say pg scales
better this way but the ensembl pipeline seems to say otherwise.
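
Slicing the fasta up could look roughly like this (a Python sketch; the
function names are invented here, and a real run would write each chunk to
a file for its node instead of keeping it in memory). Records are dealt out
round-robin so every node gets a similar number of entries.

```python
def fasta_records(lines):
    """Group an iterable of fasta lines into whole records (header + sequence)."""
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():                 # skip blank lines
            record.append(line)
    if record:
        yield "".join(record)

def partition(records, n_nodes):
    """Deal records out round-robin into n_nodes chunks."""
    chunks = [[] for _ in range(n_nodes)]
    for i, rec in enumerate(records):
        chunks[i % n_nodes].append(rec)
    return chunks
```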
If that's too slow, what about a local "bulkload" mysql db on every farm
node: split the flatfiles, load up a bunch of independent dbs, and when
that's done, load the main central one with some fast "INSERT INTO
main.tbl SELECT * FROM nodeX.tbl" statements. Of course, now that biosql
is more normalised you need some semi-clever SQL to collapse duplicates
and redirect foreign keys. Perhaps tables like ontology_term could live on
the central db, with most of the others (which are rarely shared across
entries) getting loaded on the farm node db. It should actually be fairly
easy to manage this split by faking views in the BaseAdaptor layer.
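
Here is a rough idea of the "semi-clever SQL", sketched in Python with
sqlite standing in for mysql so it runs anywhere. The two tables and their
columns are a drastically simplified, hypothetical stand-in for the real
biosql schema, and make_cluster / merge_node are invented names. Shared
terms are collapsed on their name, and each feature's foreign key is
redirected to the central term_id by joining through that name.

```python
import sqlite3

# Simplified, hypothetical schema (NOT the real biosql tables).
SCHEMA = """
CREATE TABLE {db}.ontology_term (
    term_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE {db}.seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    entry_acc     TEXT NOT NULL,
    term_id       INTEGER NOT NULL
);
"""

def make_cluster(node_rows):
    """Build a central db plus one attached 'node' db per list of (acc, term) rows."""
    conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
    conn.executescript(SCHEMA.format(db="main"))
    for i, rows in enumerate(node_rows):
        node = f"node{i}"
        conn.execute(f"ATTACH DATABASE ':memory:' AS {node}")
        conn.executescript(SCHEMA.format(db=node))
        for acc, term in rows:
            conn.execute(
                f"INSERT OR IGNORE INTO {node}.ontology_term (name) VALUES (?)",
                (term,))
            conn.execute(
                f"INSERT INTO {node}.seqfeature (entry_acc, term_id) "
                f"SELECT ?, term_id FROM {node}.ontology_term WHERE name = ?",
                (acc, term))
    return conn

def merge_node(conn, node):
    """Copy one node db into the central db, collapsing duplicate terms."""
    # 1. Collapse duplicates: only terms not already known centrally are added.
    conn.execute(
        f"INSERT OR IGNORE INTO main.ontology_term (name) "
        f"SELECT name FROM {node}.ontology_term")
    # 2. Redirect foreign keys: map node-local term_id -> central term_id by name.
    conn.execute(
        f"INSERT INTO main.seqfeature (entry_acc, term_id) "
        f"SELECT f.entry_acc, m.term_id "
        f"FROM {node}.seqfeature AS f "
        f"JOIN {node}.ontology_term AS n ON n.term_id = f.term_id "
        f"JOIN main.ontology_term AS m ON m.name = n.name")
```

If ontology_term lived only on the central db, as suggested above, step 1
would go away and the nodes would just look terms up centrally.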
Of course this only works so long as bioSQL is being used in an
entry-centric fashion. I think an embl/SPTR mirror will always be
entry-centric, so that's fine, but it's worth bearing in mind that the
above approaches wouldn't scale if you do something less entry-centric;
e.g. collapse all the different entries in which a certain gene appears
(protein, cDNA clone, genomic dna of varying strains, EST...) into a
single gene entity.
2) Parser changes
Maybe RecDescent is too slow, but I think we could still design the new
SeqIO system with the same philosophy in mind. I think the best thing
about the RecDescent stuff is it allows a decoupling of the parsing, event
generation and object building. I hacked up Heikki's example to give an
example of this, where events are fired instead of objects created. You
can catch the events and build objects to give the exact same SeqIO
behaviour; but you could also catch the events and pass them directly to
database stored procedures.
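
To make the decoupling concrete, here is a toy sketch in Python (not the
RecDescent code, and not Heikki's example). The line format is a cut-down
EMBL-like one invented for the sketch: "ID" starts an entry, "//" ends it,
and anything else is a field. The handler interface (start_entry, field,
end_entry) and both handler classes are likewise hypothetical. The parser
only fires events; one handler builds objects, the other passes rows
straight to a sink that could wrap a stored procedure call.

```python
class ObjectBuilder:
    """Catches events and builds one dict per entry (the SeqIO-like behaviour)."""
    def __init__(self):
        self.entries = []
        self._current = None
    def start_entry(self, entry_id):
        self._current = {"id": entry_id, "fields": []}
    def field(self, code, value):
        self._current["fields"].append((code, value))
    def end_entry(self):
        self.entries.append(self._current)
        self._current = None

class RowEmitter:
    """Catches the same events but hands rows to a sink, building no objects."""
    def __init__(self, sink):
        self.sink = sink                 # e.g. a wrapper around a db call
        self._entry_id = None
    def start_entry(self, entry_id):
        self._entry_id = entry_id
    def field(self, code, value):
        self.sink(self._entry_id, code, value)
    def end_entry(self):
        self._entry_id = None

def parse(lines, handler):
    """Fire start_entry / field / end_entry events; knows nothing about objects."""
    in_entry = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if in_entry:
                handler.end_entry()
                in_entry = False
            continue
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "ID":
            handler.start_entry(value)
            in_entry = True
        elif in_entry and code:
            handler.field(code, value)
```

Swapping ObjectBuilder for RowEmitter changes what gets done with the
events without touching parse() at all.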
Now we don't specifically need RecDescent to do this, we just need to
stick with the philosophy, and try and make the perl magic correspond to
platonic production rules. The decoupling is the important part. We could
even do the parsing / event firing in C. Parsing in C may seem a bit
mental when you're used to perl regexp magic, but maybe being forced to
use less magic would be a good thing.
In fact, check out how fast 100,000 lines of perl parse. Is genbank
syntax really gnarlier than perl syntax? Does anyone know how the perl
compiler works - does it use the perl regexp engine? It may be an unfair
comparison, after all perl has had millions of crazy hackers optimising it
over the years (sounds familiar? maybe not such an unfair comparison).
Also, now that all these hardcore perl gurus are trying to get in on the
bioinformatics game maybe we could turn that to our advantage....?
How does the NCBI toolkit fare here?
3) Last resort
Try the biojava loader.
Language wars aside, it seems biojava folks tend not to care about
embl/genbank parsing so much, casually tossing that legacy aside and just
getting everything through pristine DAS or SOAP or J2EE Enterprising
Business Beans or WIDDL or whatever the latest buzz is (Sorry, didn't mean
to lump DAS in with all that nonsense). Which is fine; perl will probably
always be the workhorse when it comes to the messier older legacy format
type stuff, and I think the biojava chaps like it that way.
So you'll have all that done and the bioSQL mirror ready by, ooooh, next
Tuesday, yeh Elia?