NCBI fasta format [was: Re: [Bioperl-l] loading data into bioperl-db]
Hilmar Lapp
hlapp at gnf.org
Fri Jun 6 20:29:14 EDT 2003
On Friday, June 6, 2003, at 06:37 PM, Aaron J Mackey wrote:
>
> On Fri, 6 Jun 2003, Hilmar Lapp wrote:
>
>> The advantage is you can modify and tweak it easily at any time and plug
>> it back in (no make / install or messing with perl libraries), and you
>> can use it for any format, not just fasta.
>
> Right; the SeqProcessor route is very powerful, and can easily handle this
> task. But isn't that overkill for this? I thought that this "bundled id"
> NCBI uses only shows up in the fasta files they generate (so using it for
> other formats is a moot point).
Well, this has meanwhile been adopted by many other providers (Celera,
Affy, you name it) too. It is also very common to pack things into the
description line (the dreaded /xx=yy syntax) that you may want to take
apart, e.g., before you load this into biosql. The differences aren't
necessarily big; what I'm saying is that even a tiny difference
may make you want to do the parsing in a slightly different way. At least I
find myself in this position frequently, so I have a whole bunch of
those SeqProcessors lying around. Would you want to have
SeqIO::fasta_ncbi.pm, SeqIO::fasta_affx.pm, SeqIO::fasta_ensembl.pm, etc.?
I guess it depends on how you look at it and on your taste. To me,
writing, maintaining, and distributing in bioperl, and installing on
the user's end, a SeqIO parser that can process the NCBI flavor but no
other is overkill. Do you really want to maintain this for every
change that NCBI decides to make? A little SeqProcessor module, or a
whole collection of them, would basically have the same status as a
contributed script: use at your own risk, consider it a starter
only, you may need to tweak it, but maybe it works out of the box as it
did for me.
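
To make the "take apart" step concrete, here is a minimal, language-neutral sketch in Python (not BioPerl code, and not any module's actual API). It assumes a bundled id made of '|'-separated tag/value pairs and a description carrying /key=value tokens. The function names and the sample defline are made up for illustration, and real files will deviate from this layout in small ways, which is exactly the point above.

```python
import re

# Assumed shape of the /xx=yy tokens: "/key=value", where the value runs
# until the next " /key=" token or the end of the line.
KEYVAL = re.compile(r"/(\w+)=(.*?)(?=\s+/\w+=|$)")


def split_bundled_id(bundled):
    """Split 'gi|12345|ref|NM_000001.1|' into tag/value pairs.

    Returns (pairs, leftover); leftover holds any trailing field that
    does not pair up (some providers add a third field per tag).
    """
    fields = bundled.strip("|").split("|")
    pairs = dict(zip(fields[0::2], fields[1::2]))
    leftover = fields[len(fields) - len(fields) % 2:]
    return pairs, leftover


def split_description(desc):
    """Separate free text from /key=value attributes."""
    attrs = {key: value.strip() for key, value in KEYVAL.findall(desc)}
    free_text = KEYVAL.sub("", desc).strip()
    return free_text, attrs


def parse_defline(line):
    """Parse one fasta header line into ids, description, and attributes."""
    head, _, desc = line.lstrip(">").partition(" ")
    ids, leftover = split_bundled_id(head)
    text, attrs = split_description(desc)
    return {
        "ids": ids,
        "extra_id_fields": leftover,
        "description": text,
        "attributes": attrs,
    }


if __name__ == "__main__":
    # Hypothetical defline, invented for this example.
    print(parse_defline(
        ">gi|12345|ref|NM_000001.1| Homo sapiens mRNA /gb=AA123456 /len=500"
    ))
```

A provider that uses a different separator or a different attribute syntax only requires changing one of the two small functions, not the whole parser.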
> A Bio::SeqIO::fasta_ncbi (or some other,
> better name) would be simple for beginners, and would be useful for
> Bio::DB::GenBank as well (that is, if Bio::DB::GenBank still requests
> fasta-formatted data from Entrez).
You can plug in a SeqProcessor there just as well.
I'm not against having SeqIO::fasta_ncbi, if someone wants to do that.
I'm just not very enthusiastic about adopting modules that do something
in one particular way when other people may want to do it differently, and
that are then difficult to make flexible and configurable.
I.e., I'm hesitant to can client logic into a module that is then
supposed to be maintained in bioperl.
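
The design trade-off can be sketched as follows, again in Python and only as an illustration of the general idea, not of the BioPerl SeqProcessor interface. The format reader stays generic, and the provider-specific logic lives in small pluggable callables that each take one record and return a list of zero or more records. All names here (read_fasta, apply_processors, use_accession_as_id, drop_short) are hypothetical, and the position of the accession in the bundled id is an assumption.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Processor = Callable[[Record], List[Record]]


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Generic fasta reader: knows nothing about any provider's header flavor."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield {"header": header, "seq": "".join(chunks)}
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield {"header": header, "seq": "".join(chunks)}


def apply_processors(records: Iterable[Record],
                     processors: List[Processor]) -> Iterator[Record]:
    """Run every record through the chain; a processor may drop or add records."""
    for record in records:
        batch = [record]
        for proc in processors:
            batch = [out for rec in batch for out in proc(rec)]
        yield from batch


def use_accession_as_id(record: Record) -> List[Record]:
    """Client logic: assume the 4th '|' field of the bundled id is the accession."""
    fields = record["header"].split(" ", 1)[0].strip("|").split("|")
    updated = dict(record)
    updated["id"] = fields[3] if len(fields) >= 4 else fields[0]
    return [updated]


def drop_short(min_len: int) -> Processor:
    """Client logic: build a filter that discards sequences shorter than min_len."""
    def proc(record: Record) -> List[Record]:
        return [record] if len(record["seq"]) >= min_len else []
    return proc


if __name__ == "__main__":
    data = [">gi|1|ref|NM_1.1| first", "ACGT", "ACGT",
            ">gi|2|ref|NM_2.1| second", "AC"]
    for rec in apply_processors(read_fasta(data),
                                [use_accession_as_id, drop_short(4)]):
        print(rec["id"], rec["seq"])
```

Swapping use_accession_as_id for a variant that handles another provider's headers leaves the reader untouched, which is the flexibility a hard-wired format module would give up.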
-hilmar
>
> -Aaron
>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------