[Bioperl-l] Bio::SeqIO::tigr

Jason Stajich jason at cgt.duhs.duke.edu
Mon Nov 3 10:35:35 EST 2003


Josh - would love to see it one way or another. I assume XML::Twig
isn't any faster or lighter on memory?  I have asked in the past whether
the XML/Perl gurus could give us some hard and fast rules about the best
set of tools to use.

I think we are okay with ugly and uncommented code iff you are willing to
contribute it and then work on cleaning it up.  Since you don't show the
code, I'm not clear what is so DIFFERENT about your coding style or
whether it is truly incompatible.  The requirement for a SeqIO module is
that it follows the structure of SeqIO drivers: mainly, it implements
next_seq and write_seq, inherits from Bio::SeqIO, and uses the inherited
_readline and _print for IO rather than <$fh> and print $fh.
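
As a rough illustration of that shape (a sketch, not actual BioPerl
code): the block below reads and writes a made-up "id<TAB>sequence" line
format.  So that it runs without BioPerl installed, it uses a small
stand-in base class, StandIn::IO, in place of Bio::SeqIO, and returns
plain hashes instead of sequence objects.  Both of those, and the "toy"
format itself, are assumptions made for the example.  Only the method
names next_seq, write_seq, _readline and _print come from the
requirements above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for Bio::SeqIO so this sketch runs without BioPerl installed.
# ASSUMPTION: a real driver would instead say `use base qw(Bio::SeqIO);`
# and inherit _readline/_print from the BioPerl IO layer.
package StandIn::IO;

sub new {
    my ($class, %args) = @_;
    return bless { fh => $args{-fh} }, $class;
}

# Read one line through the object, so the driver never touches <$fh>.
sub _readline {
    my ($self) = @_;
    my $fh = $self->{fh};
    return scalar <$fh>;
}

# Write through the object, so the driver never calls `print $fh`.
sub _print {
    my ($self, @text) = @_;
    my $fh = $self->{fh};
    return print {$fh} @text;
}

# Hypothetical driver for a made-up "id<TAB>sequence" line format.
package StandIn::IO::toy;
our @ISA = ('StandIn::IO');

# Return the next record, or undef at end of input.
sub next_seq {
    my ($self) = @_;
    while (defined(my $line = $self->_readline)) {
        chomp $line;
        next unless $line =~ /^(\S+)\t([A-Za-z]+)$/;   # skip other lines
        # A real driver would build a sequence object here.
        return { id => $1, seq => $2 };
    }
    return;
}

# Write each record back out in the same format.
sub write_seq {
    my ($self, @seqs) = @_;
    $self->_print("$_->{id}\t$_->{seq}\n") for @seqs;
    return 1;
}

package main;

# Round trip: read records from an in-memory handle and write them back.
my $input = "seq1\tACGT\n# comment\nseq2\tGGCC\n";
open my $in, '<', \$input or die "cannot open input: $!";
my $output = '';
open my $out, '>', \$output or die "cannot open output: $!";

my $reader = StandIn::IO::toy->new(-fh => $in);
my $writer = StandIn::IO::toy->new(-fh => $out);
while (my $seq = $reader->next_seq) {
    $writer->write_seq($seq);
}
close $out;
print $output;
```

Because all IO goes through _readline and _print, the driver itself
never needs to know whether it is reading a file, a pipe, or a string.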

You can contribute it by posting it to the list, asking nicely for a CVS
r/w account, or submitting it as an enhancement to bugzilla.open-bio.org.
Looking forward to it.

-jason

On Fri, 31 Oct 2003, Josh Lauricha wrote:

> I've written a SeqIO parser for the tigr xml data format, and would like
> to contribute it to BioPerl. However, there are a couple things I don't
> really like about it but don't have the time to fix right now. Could I
> get some feedback from the list regarding each?
>
> First, some background. Since each XML file is roughly 60MB, using the
> XML parsers provided by TIGR (using XML::Simple and XML::Sax, IIRC)
> takes around 7-10 minutes to parse (not including BioPerl object
> creation) and occasionally uses more than ~2.5GB of memory, which an x86
> can't handle.
>
> To get around this, I took advantage of the fact that these are machine
> generated and parsed the entire file using regexps, only storing what is
> "relevant" to retrieve a sequence. This means the ~75 lines of code
> TIGR used become around 1280. However, it uses around 250MB of memory and
> (converting from TIGR to GenBank) runs in around two to three and a half
> minutes, 30-60% slower than GenBank -> GenBank conversion.
>
> 1) The code is pretty ugly. It was one of my first "large" Perl projects
>    and reflects that. The ugliness is partially due to my inexperience
>    at the time, and partially due to the ugliness of the problem.
>
> 2) It's not very well commented; OK, it's not commented. This isn't too
>    big a problem, as everything acts basically the same way, and once
>    someone understands that, the rest is easy. (It's really just the same
>    thing over and over.) It's just fairly bad form.
>
> 3) The memory usage (and runtime) could be improved by one or more of:
>    a) Storing everything directly into objects rather than a tree
>    b) Using arrays to store everything rather than hashes
>    c) Ignoring any tags that aren't actually used.
>
> 4) The coding style is nothing like the rest of BioPerl's, mainly
>    because I prefer this style (PERSONAL preference, no flames,
>    everyone gets their own opinion). This is bad for a project,
>    but in all honesty, if I need to drastically change my coding
>    style I will probably never get around to fixing up this code.
>
> 5) There is quite a long delay before anything is actually accessible
>    because the nucleotide data is given at the end of the files
>    (actually, at the end of an ASSEMBLY tag) so everything before it
>    needs to be parsed. This leads to the first ->next_seq() call taking
>    a significant time.
>
> Since I can't show you what the object looks like, I'll show you what
> the GenBank file looks like. An example of the GenBank file is at:
>
> http://bioinfo.ucr.edu/cgi-bin/seqfetch.pl?database=all&accession=At1g03870
>
> Thanks for your time,
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

