[Bioperl-l] Bio::SeqIO::tigr

Mon Nov 3 11:35:53 EST 2003

IANAXPG[1], but:

Leanest and meanest: XML::SAX event-based "push" parsing
Leaner and potentially very friendly: XML::SAX::PullParser (unfortunately, not yet implemented)
Potentially lean and slightly friendlier to non-event-based programming: XML::Twig (sub-DOM)
Fat and friendly: XML::Simple (full-DOM)

-Aaron

[1] I Am Not An XML-Perl Guru
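
To make the "sub-DOM" option concrete, here is a minimal XML::Twig sketch. It is only an illustration under stated assumptions: the document layout and the NAME and SEQUENCE child elements are hypothetical and are not taken from the real TIGR XML files. The point is that a handler sees one complete ASSEMBLY subtree at a time, and purging after each one keeps memory use roughly proportional to a single record rather than to the whole file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input: the element layout below is invented for illustration.
my $xml = <<'XML';
<ASSEMBLIES>
  <ASSEMBLY id="asm1"><NAME>first</NAME><SEQUENCE>ACGT</SEQUENCE></ASSEMBLY>
  <ASSEMBLY id="asm2"><NAME>second</NAME><SEQUENCE>GGCC</SEQUENCE></ASSEMBLY>
</ASSEMBLIES>
XML

my @records;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per ASSEMBLY element, after it has been fully parsed.
        ASSEMBLY => sub {
            my ($t, $elt) = @_;
            push @records, {
                id   => $elt->att('id'),
                name => $elt->first_child_text('NAME'),
                seq  => $elt->first_child_text('SEQUENCE'),
            };
            # Discard everything parsed so far to keep memory flat.
            $t->purge;
        },
    },
);

# For a file on disk, parsefile($path) would be used instead of parse().
$twig->parse($xml);

printf "%s\t%s\t%s\n", @{$_}{qw(id name seq)} for @records;
```

Whether this ends up faster or leaner than a hand-written regexp parser on a 60MB file would need to be measured; the sketch only shows the shape of the approach.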

On Nov 3, 2003, at 10:35 AM, Jason Stajich wrote:

> Josh - would love to see it one way or another - I assume XML::Twig
> doesn't give anything faster/less memory?  I have asked in the past if
> the XML - Perl gurus could give us some hard and fast rules as to what
> is the best set of tools to use.
>
> I think we are okay with ugly and uncommented code iff you are willing
> to contribute it and then work on cleaning it up.  Since you don't show
> the code I'm not clear what is so DIFFERENT about your coding style and
> whether or not it is truly incompatible.  The requirements we have for
> something that would be a SeqIO module are that it has to follow the
> structure of SeqIO drivers: mainly, it must implement next_seq and
> write_seq, inherit from Bio::SeqIO, and use the inherited _readline or
> _print for IO rather than <$fh> and print $fh.
>
> You can contribute it by posting it to the list, asking nicely for a CVS
> r/w account, or submitting it as an enhancement to bugzilla.open-bio.org.
> Looking forward to it.
>
> -jason
>
> On Fri, 31 Oct 2003, Josh Lauricha wrote:
>
>> I've written a SeqIO parser for the TIGR XML data format, and would
>> like to contribute it to BioPerl. However, there are a couple of things
>> I don't really like about it but don't have the time to fix right now.
>> Could I get some feedback from the list regarding each?
>>
>> First, some background. Since each XML file is roughly 60MB, using the
>> XML parsers provided by TIGR (using XML::Simple and XML::SAX, IIRC)
>> takes around 7-10 minutes to parse (not including BioPerl object
>> creation) and occasionally uses more than ~2.5GB of memory, which an
>> x86 can't handle.
>>
>> To get around this, I took advantage of the fact that these files are
>> machine generated and parsed the entire file using regexps, only
>> storing what is "relevant" to retrieve a sequence. This means the ~75
>> lines of code TIGR used became around 1280. However, it uses around
>> 250MB of memory and (converting from TIGR to GenBank) runs in around
>> two to three and a half minutes, 30-60% slower than GenBank -> GenBank
>> conversion.
>>
>> 1) The code is pretty ugly. It was one of my first "large" Perl
>>    projects and reflects that. The ugliness is partially due to my
>>    inexperience at the time, and partially due to the ugliness of the
>>    problem.
>>
>> 2) It's not very well commented; ok, it's not commented. This isn't too
>>    big a problem, as everything acts basically the same way, and once
>>    someone understands that the rest is easy. (It's really just the
>>    same thing over and over.) It's just fairly bad form.
>>
>> 3) The memory usage (and runtime) could be improved by one or more of:
>>    a) Storing everything directly into objects rather than a tree
>>    b) Using arrays to store everything rather than hashes
>>    c) Ignoring any tags that aren't actually used.
>>
>> 4) The coding style is nothing like the rest of BioPerl's. Mainly
>>    because I prefer this style (PERSONAL preference, no flames,
>>    everyone gets their own opinion). This is bad for a project,
>>    but in all honesty if I need to drastically change my coding
>>    style I will probably never get around to fixing up this code.
>>
>> 5) There is quite a long delay before anything is actually accessible
>>    because the nucleotide data is given at the end of the files
>>    (actually, at the end of an ASSEMBLY tag) so everything before it
>>    needs to be parsed. This leads to the first ->next_seq() call
>>    taking a significant time.
>>
>> Since I can't show you what the object looks like, I'll show you what
>> the GenBank file looks like. An example of the GenBank file is at:
>>
>> http://bioinfo.ucr.edu/cgi-bin/seqfetch.pl?database=all&accession=At1g03870
>>
>> Thanks for your time,
>>
>>
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
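
The SeqIO driver requirements Jason lists in the quoted message (implement next_seq and write_seq, inherit from Bio::SeqIO, and do all IO through the inherited _readline and _print) can be sketched roughly as follows. This is a hedged illustration, not Josh's parser: the driver name "toyxml" and its one-record-per-line format are invented for the example, and the regexp assumes machine-generated input with exactly that shape.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical driver; "toyxml" is not a real BioPerl format.
package Bio::SeqIO::toyxml;
use base qw(Bio::SeqIO);

# Return the next sequence object, or nothing at end of input.
sub next_seq {
    my ($self) = @_;
    # Use the inherited _readline instead of reading <$fh> directly.
    while (defined(my $line = $self->_readline)) {
        if ($line =~ m{<SEQ\s+id="([^"]+)">([A-Za-z]+)</SEQ>}) {
            my ($id, $residues) = ($1, $2);
            return $self->sequence_factory->create(
                -id  => $id,
                -seq => $residues,
            );
        }
    }
    return;
}

# Write one or more sequence objects in the same toy format.
sub write_seq {
    my ($self, @seqs) = @_;
    for my $seq (@seqs) {
        # Use the inherited _print instead of print $fh.
        $self->_print(
            sprintf(qq{<SEQ id="%s">%s</SEQ>\n}, $seq->display_id, $seq->seq)
        );
    }
    return 1;
}

package main;

my $data = qq{<SEQ id="s1">ACGT</SEQ>\n<SEQ id="s2">GGCCTT</SEQ>\n};
open my $in, '<', \$data or die "cannot open in-memory handle: $!";

my $reader = Bio::SeqIO::toyxml->new(-fh => $in);
my @seen;
while (my $seq = $reader->next_seq) {
    push @seen, [ $seq->display_id, $seq->seq ];
    print $seq->display_id, "\t", $seq->length, "\n";
}
```

Because next_seq returns as soon as one record is complete, a driver written this way only delays the first call when the format itself puts the sequence data last, as Josh describes for the ASSEMBLY tag.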