[BioSQL-l] loading fasta records with load_seqdatabase.pl -
correct fasta headers
Marc Logghe
MarcL at DEVGEN.com
Mon Aug 22 11:51:42 EDT 2005
> I think my fasta headers are incorrect since it says it
> cannot store unknown. The first fasta record in my refseq.fa is this:
>
> >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
> domain (Ig), short basic domain, secreted, (semaphorin) 3E
> (SEMA3E), mRNA
>
> Do I need to reformat that header? I downloaded the NM series
> of Refseqs in fasta form from NCBI's ftp site and wanted to
> load them into the biosql schema.
You'd definitely better change the display_name to NM_012431.1.
You could first run the sequences through EMBOSS's seqret to clean up
the identifier, or you could handle this in a seq processor. I'd opt for
the latter, because you have to set your accession_number anyway. The
thing is that a sequence object parsed from fasta has no
accession_number (it is set to the default, the well-known
'unknown' ;-), only a display_name.
In the processor you can do it all: clean up the display_name and pass
that value to the accession_number() call.
The processor looks like this (save it as Accession.pm and put it
somewhere Perl can find it, i.e. in a directory on @INC):
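# $Id: Accession.pm,v 1.2 2004/03/02 08:15:48 marcl Exp $
package Accession;
use vars qw(@ISA);
use strict;
use Bio::Seq::BaseSeqProcessor;
@ISA = qw(Bio::Seq::BaseSeqProcessor);

# Extract an accession from an NCBI-style identifier: prefer the field
# after 'gb|', else the last non-empty '|'-separated field, else the
# first whitespace-free token.
sub _id_parser
{
    return $_[0] =~ /gb\|([^|]+)/      ? $1 :
           $_[0] =~ /^\s*\S+\|([^|]+)/ ? $1 :
           $_[0] =~ /^\s*>*(\S+)/      ? $1 : $_[0];
}

# Called for each sequence in the pipeline: derive the accession from
# the display_id and store it as the accession_number.
sub process_seq{
    my ($self,$seq) = @_;
    my $display_id = _id_parser($seq->display_id);
    $seq->accession_number($display_id);
    return ($seq);
}
1;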
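To check what the regex cascade in _id_parser returns before loading
anything, here is a small standalone sketch in plain Perl (no BioPerl
needed). The function name id_parser and the second and third sample
identifiers are made up for illustration; only the first identifier
comes from the header quoted above.

```perl
use strict;
use warnings;

# Same regex cascade as _id_parser in Accession.pm, copied here so the
# script runs on its own.
sub id_parser {
    my ($id) = @_;
    return $id =~ /gb\|([^|]+)/      ? $1
         : $id =~ /^\s*\S+\|([^|]+)/ ? $1
         : $id =~ /^\s*>*(\S+)/      ? $1
         :                             $id;
}

my @ids = (
    'gi|6912649|ref|NM_012431.1|',  # display_id from the RefSeq header in the question
    'gi|000|gb|X00000.1|',          # hypothetical GenBank-style identifier
    'myseq1',                       # hypothetical plain identifier
);

# Print each identifier next to the accession that would be stored.
printf "%-30s => %s\n", $_, id_parser($_) for @ids;
```

For the RefSeq identifier this yields NM_012431.1, which is the value
that process_seq would hand to accession_number().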
Then you can add to your load_seqdatabase.pl command the option:
--pipeline "Accession"
HTH,
Marc