[Bioperl-l] LargeSeq performance

Jason Stajich jason at cgt.duhs.duke.edu
Wed Oct 29 12:32:30 EST 2003


Yeah, I think LargeSeq is pretty much not a good solution in the end.
My strategy would be to separate the annotation from the actual sequence.
Put the sequence in fasta files and use Lincoln's Bio::DB::Fasta, which
allows random access and subsequence retrieval quite nicely for large
files or lots of sequences.  Perhaps we should just think about abandoning
LargeSeq for the indexed approach that Lincoln uses.
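
To make the indexed idea concrete, here is a minimal sketch in Python. It is
not Bio::DB::Fasta's actual implementation or API, just an illustration of
the general technique: scan the file once to record where each record's
sequence starts, then seek directly to the bytes you need. The names
build_index and fetch are made up for this example, and it assumes every
sequence line within a record has the same width (except possibly the last).

```python
import os
import tempfile


def build_index(path):
    """Scan a FASTA file once and record, per sequence id, where the
    sequence data starts and how its lines are laid out."""
    index = {}
    current = None
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                seq_id = fields[0].decode() if fields else ""
                current = {
                    "offset": fh.tell(),   # first byte of sequence data
                    "length": 0,           # total number of bases
                    "bases_per_line": 0,   # bases on a full line
                    "bytes_per_line": 0,   # bases plus line terminator
                }
                index[seq_id] = current
            elif current is not None:
                bases = len(line.rstrip(b"\r\n"))
                if current["bases_per_line"] == 0:
                    current["bases_per_line"] = bases
                    current["bytes_per_line"] = len(line)
                current["length"] += bases
    return index


def fetch(path, index, seq_id, start, stop):
    """Return bases start..stop (1-based, inclusive) without reading
    the rest of the record. Assumes fixed-width sequence lines."""
    rec = index[seq_id]
    if not (1 <= start <= stop <= rec["length"]):
        raise ValueError("coordinates out of range")
    bpl, byl = rec["bases_per_line"], rec["bytes_per_line"]

    def byte_pos(base0):
        # Map a 0-based base position to its byte offset in the file.
        return rec["offset"] + (base0 // bpl) * byl + base0 % bpl

    first, last = byte_pos(start - 1), byte_pos(stop - 1)
    with open(path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first + 1)
    # Drop any line terminators that fell inside the requested range.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    # Toy two-record file standing in for a multi-chromosome FASTA.
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
        tmp.write(">chr1 toy\nACGTACGTAC\nGGGGCCCCTT\nAT\n"
                  ">chr2\nTTTTAAAACC\nCCGG\n")
    idx = build_index(tmp.name)
    print(idx["chr1"]["length"], fetch(tmp.name, idx, "chr1", 9, 12))
    os.unlink(tmp.name)
```

The index costs one pass over the file; after that, the work to pull a
subsequence depends on the size of the requested range rather than the size
of the chromosome. A real index would normally be saved to disk so the scan
is not repeated on every run.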

In addition, to parse large chromosome files with annotations we would need
to fix up/check that SeqIO::genbank/embl can still parse a file which is a
CONTIG file or a genbank record which doesn't have any sequence (all you
want is the annotations/features anyway).  This might need a little more
tweaking to make sure it works.  I feel like the SeqIO parsing right now
is pretty fragile to certain types of changes, so I can't say that it would
work right now out of the box.

That's the way I'd go, personally, but there may be a more proper
engineering solution.

-jason

On Wed, 29 Oct 2003, Stefan Kirov wrote:

> I have a problem with the performance of LargeSeq. I am working with
> whole chromosomes (mouse, human) and next_seq takes forever.
> I do not know if it is worth it, since any portion can be read with random
> access, but I am still curious to know if people think it might be a
> good idea to create an object that handles extremely large sequences
> (whole chromosomes, for example) without impact on the performance?
> If you think it's worth it I can try to do it. What I have in mind is to
> use grep to map the record separators ">" (in case you are mad enough to
> put more than one chromosome in a single file). Thus next_seq will know
> where to look for the next sequence, parse the id line and calculate the
> length. And I doubt anyone will use this under Windows (anyway, the OS can
> be checked to avoid problems). Also the object will use random
> access instead of Bio::Root::IO to get sequence data.
> Let me know what you think...
> Stefan Kirov

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu
