[Bioperl-l] Bio::SeqIO::tigr

Mon Nov 3 15:57:23 EST 2003

On Mon 11/03/03 10:35, Jason Stajich wrote:
> Josh - would love to see it one way or another - I assume XML::Twig
> doesn't give anything faster/less memory?  I have asked in the past if the
> XML - Perl gurus could give us some hard and fast rules as to what is the
> best set of tools to use.

I'm not an XML guru, but Twig seems to work really well in basically
every case I've come across.

Twig would work... if TIGR wasn't stupid. The XML file is roughly:
<ASSEMBLY>
    <TU>
        <MODEL>
            ...
        </MODEL>
        ...
    </TU>
    ..
</ASSEMBLY>

The ASSEMBLY tag is basically the entire file and, unfortunately,
carries needed information. So Twig doesn't quite work. I guess if there
is a way to tell it to mix the SAX way and the Twig way, that might
work, but if there is, I don't know about it. Actually, while writing
this email, I just dug up a Twig-based XML parser for TIGR. However, all
it does is spit out the IDs, descriptions and TU coords (no sequences),
and it takes ~110MB of RAM and almost 4 minutes. On the same file, mine
takes ~246MB of RAM and an extra minute, but it's a full GenBank dump.
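
A rough sketch of what mixing the two styles could look like, written
in Python with only the standard library so it can be run as-is (it is
not the Perl module discussed here, and not XML::Twig). Only the
ASSEMBLY, TU and MODEL tag names come from the layout above; the ID
attributes and the SEQUENCE tag are invented placeholders.

```python
import io
import xml.etree.ElementTree as ET

# Toy input. ASSEMBLY/TU/MODEL follow the layout above; the ID attributes
# and the SEQUENCE tag are hypothetical stand-ins for the real content.
SAMPLE = b"""<ASSEMBLY ID="asm1">
    <TU ID="tu1"><MODEL ID="m1"/></TU>
    <TU ID="tu2"><MODEL ID="m2"/><MODEL ID="m3"/></TU>
    <SEQUENCE>ACGTACGT</SEQUENCE>
</ASSEMBLY>"""


def stream_assembly(source):
    """Summarise each TU while holding only one TU subtree at a time."""
    assembly_info = {}
    tus = []
    sequence = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "ASSEMBLY":
                # SAX-like part: attributes are known once the tag opens,
                # long before the closing </ASSEMBLY> is reached.
                assembly_info = dict(elem.attrib)
            continue
        # From here on we are handling "end" events only.
        if elem.tag == "TU":
            # Twig-like part: the whole TU subtree is complete here.
            models = [m.get("ID") for m in elem.iter("MODEL")]
            tus.append({"id": elem.get("ID"), "models": models})
            elem.clear()  # drop the TU's children and attributes
        elif elem.tag == "SEQUENCE":
            sequence = (elem.text or "").strip()
    return assembly_info, tus, sequence
```

Calling stream_assembly(io.BytesIO(SAMPLE)) returns the ASSEMBLY
attributes, a list of per-TU summaries, and the sequence text. The
emptied TU elements still hang off the root, so memory is reduced
rather than constant.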

> I think we are okay with ugly and uncommented code iff you are willing to
> contribute it and then work on cleaning it up.  Since you don't show the
> code I'm not clear what is so DIFFERENT about your coding style and
> whether or not it is truly incompatible.

The reason it wasn't attached is that I've had issues posting
attachments before. The style difference is more or less just
indentation, so stuff I've seen like:

if() {
foreach () {
if() {
do something
}
}
}

in my code is:

if() {
    foreach () {
        if() {
            do something
        }
    }
}

Basically, just because I can't read the unindented version.

> The requirements we have for
> something that would be a SeqIO module is they have to follow the
> structure of SeqIO drivers, mainly they implement next_seq and write_seq
> and inherit from Bio::SeqIO and use the inherited _readline or _print for
> IO rather than <$fh> and print $fh.

My module uses the _readline interface. write_seq isn't implemented
because it doesn't make any sense to do so.
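
For readers who haven't written a driver, the read-only shape can be
sketched as follows. This is a Python illustration of the idea only:
next_seq and _readline are named after the BioPerl methods mentioned
above, but the class, the regexes and the one-tag-per-line input are
all invented for the example and are not the Bio::SeqIO API.

```python
import io
import re

# Invented layout: one tag per line, TU and MODEL carrying an ID attribute.
TU_OPEN = re.compile(r'<TU\s+ID="([^"]+)"')
TU_CLOSE = re.compile(r"</TU>")
MODEL_OPEN = re.compile(r'<MODEL\s+ID="([^"]+)"')

SAMPLE = """<ASSEMBLY>
<TU ID="tu1">
<MODEL ID="m1">
</MODEL>
</TU>
<TU ID="tu2">
</TU>
</ASSEMBLY>
"""


class TuReader:
    """Pull-style reader: each next_seq() call reads just enough input."""

    def __init__(self, handle):
        self._handle = handle

    def _readline(self):
        # Every read goes through this one method instead of touching
        # the file handle directly.
        line = self._handle.readline()
        return line if line else None

    def next_seq(self):
        """Return the next TU as a dict, or None at end of input."""
        record = None
        while True:
            line = self._readline()
            if line is None:
                return None
            match = TU_OPEN.search(line)
            if match:
                record = {"id": match.group(1), "models": []}
                continue
            if record is None:
                continue  # still outside any TU
            match = MODEL_OPEN.search(line)
            if match:
                record["models"].append(match.group(1))
            elif TU_CLOSE.search(line):
                return record
```

A caller would build TuReader(io.StringIO(SAMPLE)) and loop on
next_seq() until it returns None. There is no write_seq counterpart
here, matching the read-only design described above.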

> You can contribute it by posting it to the list, asking nicely for CVS
> r/w account, or submitting it as an enhancement to bugzilla.open-bio.org.
> Looking forward to it.

I'll post another e-mail with it attached.

> 
> -jason
> 
> On Fri, 31 Oct 2003, Josh Lauricha wrote:
> 
> > I've written a SeqIO parser for the tigr xml data format, and would like
> > to contribute it to BioPerl. However, there are a couple things I don't
> > really like about it but don't have the time to fix right now. Could I
> > get some feedback from the list regarding each?
> >
> > First, some background. Since each XML file is roughly 60MB, using the
> > XML parsers provided by TIGR (using XML::Simple and XML::Sax, IIRC)
> > takes around 7-10 minutes to parse (not including BioPerl object
> > creation) and occasionally uses more than ~2.5GB of memory, which an x86
> > can't handle.
> >
> > To get around this, I took advantage of the fact that these are machine
> > generated and parsed the entire file using regexps, only storing what is
> > "relevant" to retrieving a sequence. This means the ~75 lines of code
> > TIGR used became around 1280. However, it uses around 250MB of memory and
> > (converting from TIGR to GenBank) runs in around two to three and a half
> > minutes, 30-60% slower than GenBank -> GenBank conversion.
> >
> > 1) The code is pretty ugly. It was one of my first "large" Perl projects
> >    and reflects that. The ugliness is partially due to my inexperience
> >    at the time, and partially due to the ugliness of the problem.
> >
> > 2) It's not very well commented; OK, it's not commented. This isn't too big
> >    a problem, as everything acts basically the same way, and once
> >    someone understands that, the rest is easy. (It's really just the same
> >    thing over and over.) It's just fairly bad form.
> >
> > 3) The memory usage (and runtime) could be improved by one or more of:
> >    a) Storing everything directly into objects rather than a tree
> >    b) Using arrays to store everything rather than hashes
> >    c) Ignoring any tags that aren't actually used.
> >
> > 4) The coding style is nothing like the rest of BioPerl's, mainly
> >    because I prefer this style (PERSONAL preference, no flames,
> >    everyone gets their own opinion). This is bad for a project,
> >    but in all honesty if I need to drastically change my coding
> >    style I will probably never get around to fixing up this code.
> >
> > 5) There is quite a long delay before anything is actually accessible
> >    because the nucleotide data is given at the end of the files
> >    (actually, at the end of an ASSEMBLY tag) so everything before it
> >    needs to be parsed. This leads to the first ->next_seq() call taking
> >    a significant time.
> >
> > Since I can't show you what the object looks like, I'll show you what
> > the GenBank file looks like. An example of the GenBank file is at:
> >
> > http://bioinfo.ucr.edu/cgi-bin/seqfetch.pl?database=all&accession=At1g03870
> >
> > Thanks for your time,
> >
> >
> 
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> 

-- 

----------------------------
| Josh Lauricha            |
| laurichj at bioinfo.ucr.edu |
| Bioinformatics, UCR      |
|--------------------------|