[Bioperl-l] load_seqdatabase.pl does not like fasta format
Marc Logghe
Marc.Logghe at devgen.com
Mon Jun 14 03:43:53 EDT 2004
Hi Andy,
> your fasta sequence was 'unknown'. Since the triple of
> (accession,version,namespace) is constrained by and used as a unique
> key, and given that fasta doesn't provide version numbers, your
> sequences will all be considered identical if the accession is
> 'unknown' for all of them. I.e., after the first one is inserted, the
> second one and all others will fail to insert.
That happens because, when you load from fasta, the sequence ID goes into the BioPerl display_name slot and from there into the BioSQL name field.
The accession number (BioPerl accession_number slot) is empty and is set to 'unknown' by default. Since this slot ends up in the accession field of the BioSQL schema, you run into trouble because EVERY accession will be 'unknown'.
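To make the collision concrete, here is a minimal standalone sketch in plain Perl (no BioPerl or database needed) that mimics the unique key with a hash. The helper name rejected_inserts, the namespace 'mydb', and the version value 0 are assumptions made up for illustration; none of them come from BioPerl or BioSQL.

```perl
use strict;
use warnings;

# Hypothetical helper: simulates the unique key (accession, version,
# namespace) with a hash and returns the display IDs of the records
# that would be rejected as duplicates.
sub rejected_inserts {
    my ($namespace, $copy_id, @display_ids) = @_;
    my (%seen, @rejected);
    for my $display_id (@display_ids) {
        # Fasta supplies no accession, so without the fix every record
        # gets 'unknown'. With the fix the display ID is copied over.
        my $accession = $copy_id ? $display_id : 'unknown';
        # Version 0 is an assumed placeholder, since fasta has no version.
        my $key = join "\t", $accession, 0, $namespace;
        push @rejected, $display_id if $seen{$key}++;
    }
    return @rejected;
}

my @ids = qw(seqA seqB seqC);
print "rejected without fix: @{[ rejected_inserts('mydb', 0, @ids) ]}\n";
print "rejected with fix:    @{[ rejected_inserts('mydb', 1, @ids) ]}\n";
```

Without the copy, every record after the first shares the same key and is rejected. With the copy, the keys stay distinct as long as the fasta IDs themselves are unique.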
I solved this by adding a --pipeline argument (e.g. Bio::SeqProcessor::Accession) with a really simple SeqProcessor that copies the display name into the accession_number slot:
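package Bio::SeqProcessor::Accession;
use strict;
use vars qw(@ISA);
use Bio::Seq::BaseSeqProcessor;

# Inherit the pipeline plumbing from the base processor.
@ISA = qw(Bio::Seq::BaseSeqProcessor);

# Copy the display ID into the accession_number slot so that each
# fasta record gets its own accession instead of 'unknown'.
sub process_seq {
    my ($self, $seq) = @_;
    my $display_id = $seq->display_id;
    $seq->accession_number($display_id);
    return ($seq);
}

1;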
HTH,
Marc