[BioSQL-l] loading fasta records with load_seqdatabase.pl - correctfasta headers

Mon Aug 22 11:51:42 EDT 2005

> I think my fasta headers are incorrect since it says it 
> cannot store unknown. The first fasta record in my refseq.fa is this:
> 
> >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
> domain (Ig), short basic domain, secreted, (semaphorin) 3E 
> (SEMA3E), mRNA
> 
> Do I need to reformat that header? I downloaded the NM series 
> of Refseqs in fasta form from NCBI's ftp site and wanted to 
> load them into the biosql schema.

You'd definitely better change the display_name to NM_012431.1
You could first run the sequences through EMBOSS's seqret cleaning the
identifier.
Or you handle this in a seq processor. I'd opt for the latter.
Because you have to set your accession_number anyway. Thing is that a
sequence object from parsed fasta has no accession_number (set to the
default the well known 'unknown' ;-), only a display_name.
In the processor you can do all: clean up the display_name and pass that
value to the accession_number() call.
The processor looks like this (save it as Accession.pm and put it
somewhere where perl can find it):

# $Id: Accession.pm,v 1.2 2004/03/02 08:15:48 marcl Exp $
package Accession;
use vars qw(@ISA);
use strict;

use Bio::Seq::BaseSeqProcessor;

@ISA = qw(Bio::Seq::BaseSeqProcessor);

sub _id_parser
{
  return $_[0] =~ /gb\|([^|]+)/       ? $1 :
         $_[0] =~ /^\s*\S+\|([^|]+)/  ? $1 :
         $_[0] =~ /^\s*>*(\S+)/       ? $1 : $_[0];
}

sub process_seq{
    my ($self,$seq) = @_;
    my $display_id = _id_parser($seq->display_id);
    $seq->accession_number($display_id);
    return ($seq);
}

1;

Then you can add to your load_seqdatabase.pl command the option:
--pipeline "Accession"

HTH,

Marc