[BioSQL-l] loading fasta records with load_seqdatabase.pl -
correct fasta headers
Hilmar Lapp
hlapp at gnf.org
Mon Aug 22 16:48:14 EDT 2005
Yes this is correct. The purpose of a SeqProcessor is exactly to
massage your data so that they are in the form you want them when they
enter the database.
-hilmar
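
For readers who want to check the identifier clean-up before loading anything, here is a minimal standalone sketch of the parsing step used by the processor quoted further down. It needs no BioPerl. The name parse_id and the sample string are illustrative only; the sample assumes the display_id handed to the processor is the first whitespace-delimited token of the RefSeq header from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same three-step fallback as the _id_parser quoted in this thread,
# renamed parse_id here (hypothetical name) so it can run on its own.
sub parse_id {
    my ($id) = @_;
    return $id =~ /gb\|([^|]+)/      ? $1   # GenBank style: gb|ACC|
         : $id =~ /^\s*\S+\|([^|]+)/ ? $1   # text after the last '|' that has text after it
         : $id =~ /^\s*>*(\S+)/      ? $1   # plain token, optional leading '>'
         :                             $id; # nothing matched: return input unchanged
}

# Assumed display_id for the first record in refseq.fa (see the header quoted below).
my $display_id = 'gi|6912649|ref|NM_012431.1|';
print parse_id($display_id), "\n";   # prints NM_012431.1
```

Identifiers without any '|' fall through to the third pattern and come back unchanged apart from a leading '>', so in-house sequences without accessions would still need names assigned separately.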
On Aug 22, 2005, at 1:28 PM, Amit Indap wrote:
> Marc and Hilmar,
>
> Thanks for your responses. From my understanding, I can write my own
> SeqProcessor and override process_seq to munge my data so that it is
> acceptable when loading my sequences into biosql. I have a whole
> bunch of other sequences from the lab which don't have accessions, etc.,
> but I can write another pipeline to deal with them and give them
> appropriate names and accessions. (If I am misunderstanding what
> SeqProcessor is doing, please correct me.)
>
> Thanks,
> Amit
>
>
>
>
>
> On 8/22/05, Marc Logghe <MarcL at devgen.com> wrote:
>>> I think my fasta headers are incorrect since it says it
>>> cannot store unknown. The first fasta record in my refseq.fa is this:
>>>
>>>> gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
>>> domain (Ig), short basic domain, secreted, (semaphorin) 3E
>>> (SEMA3E), mRNA
>>>
>>> Do I need to reformat that header? I downloaded the NM series
>>> of Refseqs in fasta form from NCBI's ftp site and wanted to
>>> load them into the biosql schema.
>>
>> You'd definitely better change the display_name to NM_012431.1
>> You could first run the sequences through EMBOSS's seqret cleaning the
>> identifier.
>> Or you handle this in a seq processor. I'd opt for the latter.
>> Because you have to set your accession_number anyway. The thing is that a
>> sequence object parsed from fasta has no accession_number (it is set to
>> the well-known default 'unknown' ;-), only a display_name.
>> In the processor you can do it all: clean up the display_name and pass
>> that value to the accession_number() call.
>> The processor looks like this (save it as Accession.pm and put it
>> somewhere where perl can find it):
>>
>>
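>> # $Id: Accession.pm,v 1.2 2004/03/02 08:15:48 marcl Exp $
>> package Accession;
>> use vars qw(@ISA);
>> use strict;
>>
>> use Bio::Seq::BaseSeqProcessor;
>>
>> @ISA = qw(Bio::Seq::BaseSeqProcessor);
>>
>> # Extract an accession-like token from a fasta identifier.
>> # Patterns are tried in order; the first one that matches wins.
>> sub _id_parser
>> {
>>     return $_[0] =~ /gb\|([^|]+)/      ? $1 :    # gb|ACC|
>>            $_[0] =~ /^\s*\S+\|([^|]+)/ ? $1 :    # e.g. gi|NNN|ref|ACC|
>>            $_[0] =~ /^\s*>*(\S+)/      ? $1 : $_[0];  # plain identifier
>> }
>>
>> # Copy the cleaned identifier into accession_number.
>> # Note: display_id itself is left unchanged here.
>> sub process_seq {
>>     my ($self, $seq) = @_;
>>     my $display_id = _id_parser($seq->display_id);
>>     $seq->accession_number($display_id);
>>     return ($seq);
>> }
>>
>> 1;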
>>
>>
>> Then you can add to your load_seqdatabase.pl command the option:
>> --pipeline "Accession"
>>
>> HTH,
>>
>> Marc
>>
>>
>
>
> --
> Real patriots ask questions.
> Carl Sagan
> http://aindap.blogspot.com/
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------