[Bioperl-l] Bio::Index::Fasta vs Bio::DB::Fasta

Ewan Birney birney@ebi.ac.uk
Mon, 14 Jan 2002 08:29:51 +0000 (GMT)


On Mon, 14 Jan 2002, Tony Cox wrote:

> On Sat, 12 Jan 2002, Ewan Birney wrote:
> 
> Just a note that I have a _lot_ of code and time invested internally in the
> Bio::Index::Fasta modules. It forms a fairly major plank of our internal
> sequence fetching architecture here in Sanger (along with the more complex
> functionality of SRS). Most of the time it is used for "normal" sequence
> fetching (EMBL clones etc) and not for chr-sized DNA chunks where the DB::Fasta
> really wins. It also complements the Fastq modules that can be used to get
> matching quality data if it exists. 
> 
> In short, does Index::Fasta _have_ to go?


Definitely NOT!

We only want to add functionality to Bio::Index::Fasta, not take
functionality away (right Lincoln?). This was my comment that monkeying
around with the code before 1.0 is fine as long as Lincoln is
super-confident that we won't break any existing installations.


I am tempted to say that this is a post 1.0 thing, but the additions may be
very simple, in which case I think it is ok --- one can't get the fear for
not doing something...





> 
> Tony
> 
> 
> +>On Fri, 11 Jan 2002, Lincoln Stein wrote:
> +>
> +>> Hi Folks,
> +>> 
> +>> I've just recently become aware that Bio::Index::Fasta has very heavy 
> +>> overlapping functionality with Bio::DB::Fasta, and this is likely to lead to 
> +>> some user confusion down the road.
> +>> 
> +>> I would remove Bio::DB::Fasta in favor of the Bio::Index version, except that 
> +>> I don't think that Bio::Index::Fasta does the thing that first motivated 
> +>> Bio::DB::Fasta, which was the ability to retrieve subsequences efficiently.  
> +>> I have big (tens of megabyte) fasta files that contain 
> +>> whole C. elegans chromosomes, and want to fetch a few base pairs from the 
> +>> middle of them without reading the whole record into memory.  Can 
> +>> Bio::Index::Fasta do this?
> +>
> +>
> +>I am pretty sure it can't do this (which is why I believe you checked in
> +>DB::Fasta in the first place). Does DB::Fasta make assumptions about line
> +>length so it can SEEK to the right place?
> +>
> +>
> +>Clearly merging the two pieces would be great. It is not something I am
> +>overly worried about but it would be nice. 
> +>
> +>
> +>Two routes:
> +>
> +>(I am assuming that we are still calling it Bio::Index::Fasta...)
> +>
> +>  (a)
> +>
> +>     Bio::Index::Fasta gives back a Bio::SeqI compliant object which is
> +>actually a new thing called Bio::Seq::LargeFastaFixedLineLength (silly
> +>name...). This object does not load the sequence into memory but executes
> +>
> +>     $seq->subseq(100000,1000020);
> +>
> +>     with a SEEK.
> +>
> +>
> +>  (b) Bio::Index::Fasta will accept gets on slices
> +>
> +>
> +>Reading the documentation of Bio::DB::Fasta I notice that you have put
> +>nearly every access in (!) ---- I am always *so* impressed by your modules
> +>Lincoln, they nearly always have every route into them first off.
> +>
> +>
> +>
> +>So --- you have carte blanche to rearrange this area. As long as you are
> +>convinced that you won't be affecting existing FASTA indexes you can do
> +>what you like with Bio::Index::Fasta before 1.0 ---- it should work
> +>however with existing indexes - (ie, don't change the hash key
> +>representations etc).
> +>
> +>
> +>If you want to do a more serious reorganisation then it has got to be post
> +>1.0.
> +>
> +>
> +>
> +>Your choice of options and code.
> +>
> +>
> +>> 
> +>> Lincoln
> +>> 
> +>> 
> +>
> +>
> 
> ******************************************************
> Tony Cox			Email:avc@sanger.ac.uk
> Sanger Institute		WWW:www.sanger.ac.uk
> Wellcome Trust Genome Campus	Webmaster
> Hinxton				Tel: +44 1223 834244
> Cambs. CB10 1SA			Fax: +44 1223 494919
> ******************************************************
> 
>
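
The question raised in the quoted thread, whether DB::Fasta assumes a fixed
line length so that it can SEEK to the right place, comes down to simple
offset arithmetic. A minimal sketch follows, written in Python purely for
illustration. It is not the Bio::DB::Fasta implementation; the names
byte_offset, subseq and demo, the record header, and the 10-residue line
length are all assumptions made for the example.

```python
import os
import tempfile


def byte_offset(data_start, line_len, newline_len, pos):
    """Map a 1-based residue position to a byte offset in the file.

    data_start  -- byte offset of the first residue of the record
    line_len    -- residues per line (assumed identical for every full line)
    newline_len -- bytes in the line terminator (1 for LF, 2 for CR LF)
    """
    zero_based = pos - 1
    full_lines = zero_based // line_len  # terminators that precede this residue
    return data_start + zero_based + full_lines * newline_len


def subseq(path, data_start, line_len, start, end, newline_len=1):
    """Return residues start..end (1-based, inclusive) using a single seek."""
    first = byte_offset(data_start, line_len, newline_len, start)
    last = byte_offset(data_start, line_len, newline_len, end)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # The raw chunk still contains any line terminators it spans.
    return chunk.replace(b"\r", b"").replace(b"\n", b"").decode("ascii")


def demo():
    """Write a tiny FASTA file (10 residues per line) and fetch a slice."""
    header = b">chr_demo hypothetical record\n"
    body = b"ACGTACGTAC\nGGGGCCCCTT\nTTAACC\n"
    with tempfile.NamedTemporaryFile(delete=False, suffix=".fa") as tmp:
        tmp.write(header + body)
        path = tmp.name
    try:
        # Residues 9..13 straddle the first line break.
        return subseq(path, data_start=len(header), line_len=10, start=9, end=13)
    finally:
        os.remove(path)
```

The sketch only works if every line of a record except the last has the same
length, which is exactly the assumption asked about in the thread. An index
would need to store data_start, line_len and the terminator width per record.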