[BioSQL-l] loading fasta records with load_seqdatabase.pl -
correctfasta headers
Amit Indap
indapa at gmail.com
Mon Aug 22 16:28:04 EDT 2005
Marc and Hilmar,
Thanks for your responses. From my understanding I can write my own
SequenceProcessor and override the process_seq to munge my data so that is
is acceptable when loading my sequences in to biosql. I have a whole
bunch of other sequences from the lab which don't have accessions, etc
but I can write another pipeline to deal with and give them
appropriate names and accessions. (If am mis-understanding what
SeqProcessor is doing, please correct)
Thanks,
Amit
On 8/22/05, Marc Logghe <MarcL at devgen.com> wrote:
> > I think my fasta headers are incorrect since it says it
> > cannot store unknown. The first fasta record in my refseq.fa is this:
> >
> > >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
> > domain (Ig), short basic domain, secreted, (semaphorin) 3E
> > (SEMA3E), mRNA
> >
> > Do I need to reformat that header? I downloaded the NM series
> > of Refseqs in fasta form from NCBI's ftp site and wanted to
> > load them into the biosql schema.
>
> You'd definitely better change the display_name to NM_012431.1
> You could first run the sequences through EMBOSS's seqret cleaning the
> identifier.
> Or you handle this in a seq processor. I'd opt for the latter.
> Because you have to set your accession_number anyway. Thing is that a
> sequence object from parsed fasta has no accession_number (set to the
> default the well known 'unknown' ;-), only a display_name.
> In the processor you can do all: clean up the display_name and pass that
> value to the accession_number() call.
> The processor looks like this (save it as Accession.pm and put it
> somewhere where perl can find it):
>
>
> # $Id: Accession.pm,v 1.2 2004/03/02 08:15:48 marcl Exp $
> package Accession;
> use vars qw(@ISA);
> use strict;
>
> use Bio::Seq::BaseSeqProcessor;
>
> @ISA = qw(Bio::Seq::BaseSeqProcessor);
>
> sub _id_parser
> {
> return $_[0] =~ /gb\|([^|]+)/ ? $1 :
> $_[0] =~ /^\s*\S+\|([^|]+)/ ? $1 :
> $_[0] =~ /^\s*>*(\S+)/ ? $1 : $_[0];
> }
>
>
> sub process_seq{
> my ($self,$seq) = @_;
> my $display_id = _id_parser($seq->display_id);
> $seq->accession_number($display_id);
> return ($seq);
> }
>
> 1;
>
>
> Then you can add to your load_seqdatabase.pl command the option:
> --pipeline "Accession"
>
> HTH,
>
> Marc
>
>
--
Real patriots ask questions.
Carl Sagan
http://aindap.blogspot.com/
More information about the BioSQL-l
mailing list