[Bioperl-l] load_seqdatabase.pl does not like fasta format

Marc Logghe Marc.Logghe at devgen.com
Mon Jun 14 03:43:53 EDT 2004


Hi Andy,

> your fasta sequence was 'unknown'. Since the triple of  
> (accession,version,namespace) is constrained by and used as a unique  
> key, and given that fasta doesn't provide version numbers, your  
> sequences will all be considered identical if the accession is  
> 'unknown' for all of them. I.e., after the first one is inserted, the
> second one and all others will fail to insert.
That is because when you load from fasta, the seqID goes into the bioperl display_name slot and finally into the biosql name field.
The accession number (bioperl accession_number slot) is empty and is set to 'unknown' by default. As this slot ends up in the accession field of the biosql schema, you run into trouble because EVERY accession will be 'unknown'.
I solved this by adding a --pipeline argument (e.g. Bio::SeqProcessor::Accession) with a really simple SeqProcessor that copies the display_name into the accession_number slot:

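package Bio::SeqProcessor::Accession;
use strict;
use vars qw(@ISA);
use Bio::Seq::BaseSeqProcessor;

# Inherit the SeqProcessor plumbing from the base class.
@ISA = qw(Bio::Seq::BaseSeqProcessor);

# Copy the display ID (the fasta seqID) into the accession number slot.
sub process_seq {
    my ($self, $seq) = @_;
    my $display_id = $seq->display_id;
    $seq->accession_number($display_id);
    return ($seq);
}

1;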
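
To see why this helps, here is a small stand-alone sketch of the uniqueness problem. It does not use BioPerl or BioSQL at all: the hash of seen keys is only a stand-in for the unique constraint on (accession, version, namespace), and the helper names (try_insert, count_inserted), the version value 0, and the namespace 'test' are assumptions made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the unique key on
# (accession, version, namespace); not the real schema or bioperl-db code.
sub try_insert {
    my ($seen, $entry) = @_;
    my $key = join "\t", @{$entry}{qw(accession version namespace)};
    return 0 if $seen->{$key}++;   # key already present: insert would fail
    return 1;
}

# Count how many entries would be inserted into an empty "table".
sub count_inserted {
    my @entries = @_;
    my %seen;
    return scalar grep { try_insert(\%seen, $_) } @entries;
}

# Three fasta records as loaded: only the name is set, and the
# accession falls back to 'unknown' (version and namespace are assumed).
my @raw = map {
    +{ name => $_, accession => 'unknown', version => 0, namespace => 'test' }
} qw(seqA seqB seqC);

# The same records after copying the name into the accession,
# which is what the SeqProcessor above does.
my @fixed = map { +{ %$_, accession => $_->{name} } } @raw;

printf "without fix: %d of %d inserted\n", count_inserted(@raw),   scalar @raw;
printf "with fix:    %d of %d inserted\n", count_inserted(@fixed), scalar @fixed;
```

With all accessions left at 'unknown' only the first record gets through; once each record carries its own seqID as accession, all three do. Note that this only works if the fasta seqIDs themselves are unique within the namespace.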


HTH,
Marc



