NCBI fasta format [was: Re: [Bioperl-l] loading data into bioperl-db]

Fri Jun 6 20:29:14 EDT 2003

On Friday, June 6, 2003, at 06:37  PM, Aaron J Mackey wrote:

>
> On Fri, 6 Jun 2003, Hilmar Lapp wrote:
>
>> The advantage is you can modify and tweak it easily at any time and plug
>> it back in (no make / install or messing with perl libraries), and you
>> can use it for any format, not just fasta.
>
> Right; the SeqProcessor route is very powerful, and can easily handle this
> task.  But isn't that overkill for this?  I thought that this "bundled id"
> NCBI uses only shows up in the fasta files they generate (so using it for
> other formats is a moot point).

Well, this has meanwhile been adopted by many other providers (Celera,
Affy, you name it) too. It is also very common to pack several things
into the description line (the dreaded /xx=yy syntax) that you may want
to take apart, e.g., before you load this into biosql. The differences
aren't necessarily big; what I'm saying is basically that even a tiny
difference may make you want to do the parsing in a slightly different
way. At least I find myself frequently in this position, so I have a
whole bunch of those SeqProcessors lying around. Would you want to have
SeqIO::fasta_ncbi.pm, SeqIO::fasta_affx.pm, SeqIO::fasta_ensembl, etc.?
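
To make the kind of taking-apart concrete, here is a minimal sketch of
the logic, written in Python only to keep it self-contained. It is not
bioperl code, and the function name, the field names, and the single
defline layout it recognizes (gi|number|db|accession.version|locus) are
assumptions for illustration. Any other provider's flavor would need
its own pattern, which is exactly the point.

```python
import re

# Assumption: the id looks like "gi|NUMBER|DB|ACCESSION.VERSION|LOCUS".
# Only this one layout is handled; any other id is passed through as-is.
BUNDLED_ID = re.compile(r"^gi\|(\d+)\|(\w+)\|([^|.\s]+)(?:\.(\d+))?\|(\S*)$")
# /key=value tokens; the value is either "quoted text" or a run of non-blanks.
SLASH_TAG = re.compile(r'/(\w+)=("[^"]*"|\S+)')


def parse_defline(defline):
    """Split a fasta defline into id parts, /key=value tags, and free text."""
    primary, _, desc = defline.lstrip(">").partition(" ")
    record = {"display_id": primary, "tags": {}}
    m = BUNDLED_ID.match(primary)
    if m:
        gi, db, acc, version, locus = m.groups()
        record.update(
            gi=gi,
            db=db,
            accession=acc,
            version=int(version) if version else None,
            # Fall back to the accession when the locus field is empty.
            display_id=locus or acc,
        )
    for key, value in SLASH_TAG.findall(desc):
        record["tags"][key] = value.strip('"')
    # Whatever is left after removing the tags is the plain description.
    record["description"] = " ".join(SLASH_TAG.sub(" ", desc).split())
    return record
```

Calling parse_defline() on a made-up line such as
">gi|12345|gb|AAA00001.2|HUMXYZ Homo sapiens /gene=abc" would give
separate gi, accession, and version fields plus a tags dictionary,
which is roughly what you want in hand before loading into biosql.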

I guess it depends on how you look at it and on your taste. To me,
writing, maintaining, and distributing in bioperl, and installing on
the user's end, a SeqIO parser that can process the NCBI flavor but no
other is overkill. Do you really want to maintain this for every
change that NCBI decides to make? A little SeqProcessor module, or a
whole collection of them, would basically have the same status as a
contributed script: use at your own risk, consider it a starting point
only, you may need to tweak it, but maybe it works out of the box as it
did for me.

>   A Bio::SeqIO::fasta_ncbi (or some other,
> better name) would be simple for beginners, and would be useful for
> Bio::DB::GenBank as well (that is, if Bio::DB::GenBank still requests
> fasta-formatted data from Entrez).

You can plug in a SeqProcessor there just as well.
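
The plug-in idea itself is small enough to sketch. What follows is a
Python illustration of the pattern only, not the bioperl SeqProcessor
interface: read_fasta(), processed(), and strip_bundled_id() are
hypothetical names, and a record is assumed to be a plain
(defline, sequence) pair. The generic reader stays untouched, and the
client-specific logic lives in small functions that sit between the
reader and whatever consumes the records.

```python
def read_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of fasta lines."""
    defline, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def processed(records, processors):
    """Run every record through each processor in turn.

    Each processor takes one record and returns a list of records, so it
    can modify a record, drop it (empty list), or split it into several.
    """
    for record in records:
        batch = [record]
        for process in processors:
            batch = [out for rec in batch for out in process(rec)]
        yield from batch


# Hypothetical client-side tweak: keep only the accession of a bundled id.
def strip_bundled_id(record):
    defline, seq = record
    ident, _, desc = defline.partition(" ")
    parts = ident.split("|")
    if len(parts) >= 4 and parts[0] == "gi":
        ident = parts[3]
    return [(f"{ident} {desc}".strip(), seq)]
```

Usage would be along the lines of
processed(read_fasta(handle), [strip_bundled_id]); swapping in a
different tweak for Affy or Celera files means changing the list, not
the reader.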

I'm not against having SeqIO::fasta_ncbi, if someone wants to do that. 
I'm just not very enthusiastic about adopting modules that do something 
in a certain way when other people may want to do it differently, but 
that are then difficult to extend to be flexible and configurable. 
I.e., I'm hesitant to can client logic into a module that is then 
supposed to be maintained in bioperl.

	-hilmar

>
> -Aaron
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------