[Bioperl-l] Bio::Index::Fasta vs Bio::DB::Fasta
Lincoln Stein
lstein@cshl.org
Mon, 21 Jan 2002 10:01:08 -0500
Hi,
Just getting back from the Perl Whirl Geek Cruise, all relaxed,
suntanned, and ready to answer 1000 e-mail messages!
Ewan Birney writes:
> On Fri, 11 Jan 2002, Lincoln Stein wrote:
>
> > Hi Folks,
> >
> > I've just recently become aware that Bio::Index::Fasta has very heavy
> > overlapping functionality with Bio::DB::Fasta, and this is likely to lead to
> > some user confusion down the road.
> >
> > I would remove Bio::DB::Fasta in favor of the Bio::Index version, except that
> > I don't think that Bio::Index::Fasta does the thing that first motivated
> > Bio::DB::Fasta, which was the ability to retrieve subsequences efficiently.
> > I have big (tens of megabytes) fasta files that contain
> > whole C. elegans chromosomes, and want to fetch a few base pairs from the
> > middle of them without reading the whole record into memory. Can
> > Bio::Index::Fasta do this?
>
>
> I am pretty sure it can't do this (which is why I believe you checked in
> DB::Fasta in the first place). Does DB::Fasta make assumptions about line
> length so it can SEEK to the right place?
It doesn't assume a fixed length. As DB::Fasta reads the FASTA files,
it records the line lengths it encounters, so it can compute the byte
offset of any base and seek straight to it. Each FASTA file can have
a different line length, and indeed each entry within a FASTA file
can have a different line length (but line lengths must be uniform
within an entry).
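
To make the arithmetic concrete, here is a minimal sketch of the idea in
Python. It is not the Bio::DB::Fasta code or its index format: the
function names (build_index, fetch) and the dictionary layout are made
up for illustration. It assumes each entry has uniform line lengths
(except possibly the last line), no blank lines inside an entry, and a
non-empty ID on every header line.

```python
import os
import tempfile


def build_index(path):
    """Scan a FASTA file once and record each entry's layout (hypothetical format)."""
    index = {}
    current = None
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # "offset" is the byte position of the first base of this entry.
                current = {"offset": handle.tell(), "bases": 0,
                           "line_bases": None, "line_bytes": None}
                index[name] = current
            elif current is not None:
                bases = len(line.rstrip(b"\r\n"))
                if current["line_bases"] is None:
                    # The first sequence line fixes the layout for this entry.
                    current["line_bases"] = bases
                    current["line_bytes"] = len(line)  # includes line terminator
                current["bases"] += bases
    return index


def fetch(path, index, name, start, stop):
    """Return bases start..stop (1-based, inclusive) without reading the whole entry."""
    entry = index[name]
    if not 1 <= start <= stop <= entry["bases"]:
        raise ValueError("range outside sequence")

    def byte_pos(base0):
        # Whole lines before this base, plus its position within its own line.
        full_lines, within = divmod(base0, entry["line_bases"])
        return entry["offset"] + full_lines * entry["line_bytes"] + within

    first = byte_pos(start - 1)
    last = byte_pos(stop - 1)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # The slice may span line breaks; drop the terminators.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    # Two entries with different line widths in the same file.
    text = ">chrI demo\nACGTACGTAC\nGTACGTACGT\nACGT\n>chrII\nTTTT\nGGGG\nCC\n"
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False,
                                     newline="") as tmp:
        tmp.write(text)
    idx = build_index(tmp.name)
    print(fetch(tmp.name, idx, "chrI", 9, 12))
    print(fetch(tmp.name, idx, "chrII", 3, 6))
    os.remove(tmp.name)
```

Because the line width is stored per entry rather than per file, the two
demo entries can be wrapped at 10 and 4 columns and both are still
addressable with a single seek and one short read.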
> Clearly merging the two pieces would be great. It is not something I am
> overly worried about but it would be nice.
>
>
> Two routes:
>
> (I am assuming that we are still calling it Bio::Index::Fasta...)
>
> (a)
>
> Bio::Index::Fasta gives back a Bio::SeqI compliant object which is
> actually a new thing called Bio::Seq::LargeFastaFixedLineLength (silly
> name...). This object does not load the sequence into memory but executes
>
> $seq->subseq(100000,1000020);
>
> with a SEEK.
>
>
> (b) Bio::Index::Fasta will accept gets on slices
>
>
> Reading the documentation of Bio::DB::Fasta I notice that you have put
> nearly every access in (!) ---- I am always *so* impressed by your modules
> Lincoln, they nearly always have every route into them first off.
>
>
>
> So --- you have carte blanche to rearrange this area. As long as you are
> convinced that you won't be affecting existing FASTA indexes you can do
> what you like with Bio::Index::Fasta before 1.0 ---- it should work
> however with existing indexes - (i.e., don't change the hash key
> representations etc).
Scary. For the time being I've removed Bio::DB::Fasta dependencies
from Bio::DB::GFF and LDAS. I think I'll leave the big reorganization
until after 1.0. So much to do before the conference....
Lincoln
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
Positions available at my lab: see http://stein.cshl.org/#hire
========================================================================