[Bioperl-l] Bio::SeqIO::tigr
Aaron J.Mackey
amackey at pcbi.upenn.edu
Mon Nov 3 11:35:53 EST 2003
IANAXPG[1], but:
Leanest and meanest: XML::SAX event-based "push" parsing
Leaner and potentially very friendly: XML::SAX::PullParser
(unfortunately, not yet implemented)
Potentially lean and slightly friendlier to non-event-based
programming: XML::Twig (sub-DOM)
Fat and friendly: XML::Simple (full-DOM)
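
For what it's worth, here is a minimal sketch of the XML::Twig style, assuming XML::Twig is installed. The element and attribute names (GENES, GENE, id, NAME) are made up for illustration and are not the TIGR schema. Each handler fires once its element has been closed, and purge() releases what has been parsed so far, which is where the "potentially lean" comes from.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input; NOT the real TIGR format.
my $xml = <<'XML';
<GENES>
  <GENE id="g1"><NAME>alpha</NAME></GENE>
  <GENE id="g2"><NAME>beta</NAME></GENE>
</GENES>
XML

my @genes;    # keep only the fields we care about

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per GENE, after its closing tag has been parsed.
        GENE => sub {
            my ($t, $gene) = @_;
            push @genes, [ $gene->att('id'), $gene->first_child_text('NAME') ];
            $t->purge;    # free everything parsed up to this point
        },
    },
);
$twig->parse($xml);    # parsefile($path) would read a file on disk instead

print join("\t", @$_), "\n" for @genes;
```

Whether this actually beats a hand-rolled parser on 60MB files is something only a benchmark on the real data can answer.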
-Aaron
[1] I Am Not An XML-Perl Guru
On Nov 3, 2003, at 10:35 AM, Jason Stajich wrote:
> Josh - would love to see it one way or another - I assume XML::Twig
> doesn't give you anything faster or lighter on memory? I have asked in
> the past if the XML-Perl gurus could give us some hard and fast rules
> as to what is the best set of tools to use.
>
> I think we are okay with ugly and uncommented code iff you are willing
> to contribute it and then work on cleaning it up. Since you don't show
> the code I'm not clear what is so DIFFERENT about your coding style and
> whether or not it is truly incompatible. The requirements we have for
> a SeqIO module are that it follows the structure of SeqIO drivers:
> mainly, it implements next_seq and write_seq, inherits from
> Bio::SeqIO, and uses the inherited _readline or _print for IO rather
> than <$fh> and print $fh.
>
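A minimal sketch of a driver meeting those requirements, assuming BioPerl is installed. The format name "toyfmt" and its layout (one record per line, ID and residues separated by a tab) are hypothetical, chosen only to keep the example short; a real driver would live in its own Bio/SeqIO/toyfmt.pm file so that Bio::SeqIO->new(-format => 'toyfmt') could find it.

```perl
package Bio::SeqIO::toyfmt;    # hypothetical format name
use strict;
use warnings;
use base qw(Bio::SeqIO);       # inherit the IO plumbing
use Bio::Seq;

# Return the next sequence, or nothing at end of input.
sub next_seq {
    my ($self) = @_;
    while (defined(my $line = $self->_readline)) {    # not <$fh>
        chomp $line;
        next unless length $line;
        my ($id, $residues) = split /\t/, $line, 2;
        return Bio::Seq->new(-display_id => $id, -seq => $residues);
    }
    return;
}

# Write one or more sequences in the same toy layout.
sub write_seq {
    my ($self, @seqs) = @_;
    for my $seq (@seqs) {
        $self->_print(join("\t", $seq->display_id, $seq->seq), "\n");    # not print $fh
    }
    return 1;
}

package main;

# Round-trip two records through in-memory filehandles.
my $in_data = "seq1\tACGT\nseq2\tGGCC\n";
open my $in_fh, '<', \$in_data or die "cannot open in-memory input: $!";
my $out_data = '';
open my $out_fh, '>', \$out_data or die "cannot open in-memory output: $!";

# Instantiated directly because the package is defined in this file.
my $reader = Bio::SeqIO::toyfmt->new(-fh => $in_fh);
my $writer = Bio::SeqIO::toyfmt->new(-fh => $out_fh);

my @ids;
while (my $seq = $reader->next_seq) {
    push @ids, $seq->display_id;
    $writer->write_seq($seq);
}
$writer->close;
```

The point of going through _readline and _print is that the driver then works unchanged with files, pipes, and filehandles passed in by the caller.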
> You can contribute it by posting it to the list, asking nicely for a
> CVS r/w account, or submitting it as an enhancement to
> bugzilla.open-bio.org.
> Looking forward to it.
>
> -jason
>
> On Fri, 31 Oct 2003, Josh Lauricha wrote:
>
>> I've written a SeqIO parser for the TIGR XML data format, and would
>> like to contribute it to BioPerl. However, there are a couple of
>> things I don't really like about it but don't have the time to fix
>> right now. Could I get some feedback from the list regarding each?
>>
>> First, some background. Since each XML file is roughly 60MB, using the
>> XML parsers provided by TIGR (using XML::Simple and XML::SAX, IIRC)
>> takes around 7-10 minutes to parse (not including BioPerl object
>> creation) and occasionally uses more than ~2.5GB of memory, which an
>> x86 can't handle.
>>
>> To get around this, I took advantage of the fact that these files are
>> machine generated and parsed the entire file using regexps, only
>> storing what is "relevant" to retrieving a sequence. This means the
>> ~75 lines of code TIGR used becomes around 1280. However, it uses
>> around 250MB of memory and (converting from TIGR to GenBank) runs in
>> around two to three and a half minutes, 30-60% slower than GenBank ->
>> GenBank conversion.
>>
>> 1) The code is pretty ugly. It was one of my first "large" Perl
>> projects and reflects that. The ugliness is partially due to my
>> inexperience at the time, and partially due to the ugliness of the
>> problem.
>>
>> 2) It's not very well commented; OK, it's not commented. This isn't
>> too big a problem, as everything acts basically the same way, and once
>> someone understands that, the rest is easy. (It's really just the same
>> thing over and over.) It's just fairly bad form.
>>
>> 3) The memory usage (and runtime) could be improved by one or more of:
>> a) Storing everything directly into objects rather than a tree
>> b) Using arrays to store everything rather than hashes
>> c) Ignoring any tags that aren't actually used.
>>
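Points (b) and (c), combined with the regexp approach described above, might look roughly like the sketch below. It uses only core Perl. The tag names (GENE, ID, SEQ, COMMENT) are hypothetical and are not the TIGR schema, and the code relies on the input being machine generated with one element per line, which XML in general does not guarantee.

```perl
use strict;
use warnings;

# Array slots instead of hash keys (point b).
use constant { ID => 0, SEQ => 1 };

# Hypothetical machine-generated input, one element per line.
my $xml = <<'XML';
<GENE>
<ID>g1</ID>
<COMMENT>ignored</COMMENT>
<SEQ>ACGT</SEQ>
</GENE>
<GENE>
<ID>g2</ID>
<SEQ>GGCC</SEQ>
</GENE>
XML

open my $fh, '<', \$xml or die "cannot open in-memory input: $!";

my (@records, $current);
while (my $line = <$fh>) {
    if ($line =~ m{^<GENE>}) {
        $current = [];                 # start a new record
    }
    elsif ($line =~ m{^<ID>([^<]*)</ID>}) {
        $current->[ID] = $1;
    }
    elsif ($line =~ m{^<SEQ>([^<]*)</SEQ>}) {
        $current->[SEQ] = $1;
    }
    elsif ($line =~ m{^</GENE>}) {
        push @records, $current;       # record is complete
        undef $current;
    }
    # Any other tag falls through and is never stored (point c).
}
close $fh;

print join("\t", @{$_}[ID, SEQ]), "\n" for @records;
```

The trade-off is the usual one: this is fast and small, but it breaks as soon as the generator changes its line layout, which a real XML parser would shrug off.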
>> 4) The coding style is nothing like the rest of BioPerl's, mainly
>> because I prefer this style (PERSONAL preference, no flames,
>> everyone gets their own opinion). This is bad for a project,
>> but in all honesty if I need to drastically change my coding
>> style I will probably never get around to fixing up this code.
>>
>> 5) There is quite a long delay before anything is actually accessible
>> because the nucleotide data is given at the end of the files
>> (actually, at the end of an ASSEMBLY tag), so everything before it
>> needs to be parsed. This leads to the first ->next_seq() call taking
>> a significant time.
>>
>> Since I can't show you what the object looks like, I'll show you what
>> the GenBank file looks like. An example of the GenBank file is at:
>>
>> http://bioinfo.ucr.edu/cgi-bin/seqfetch.pl?database=all&accession=At1g03870
>>
>> Thanks for your time,
>>
>>
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>