[Bioperl-l] Re: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers

Mon Aug 22 14:18:30 EDT 2005

Amit,

This is a problem inherent in the FASTA format: there is no precise
definition of what goes into the header as identifier and/or accession.
The Bioperl fasta parser doesn't set the accession, so it defaults to
"unknown" (it cannot be undef). Since the FASTA format doesn't have the
version in a defined place either, the version will be undef (i.e., zero
for BioSQL) for every entry. As a result, all your sequences end up with
the same unique key of (accession,version,namespace), which violates the
unique constraint as soon as the second sequence is stored.
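
To make the collision concrete, here is a small standalone sketch in
plain Perl (no Bioperl needed). The default_key helper is hypothetical
and only mimics the defaults described above; it is not how bioperl-db
builds the key internally. The two IDs are the ones quoted in this
thread. The 'unknown-1-0' in the MySQL message appears to be the same
triple, with 1 presumably being the database id of the namespace.

```perl
use strict;
use warnings;

# Display IDs taken from the headers quoted in this thread.
my @display_ids = (
    'gi|6912649|ref|NM_012431.1|',
    'gi|51459331|ref|XM_498785.1|',
);

# Hypothetical helper: build the key from the defaults described above.
# A missing accession falls back to "unknown", a missing version to 0.
sub default_key {
    my ($namespace, $accession, $version) = @_;
    $accession = 'unknown' unless defined $accession;
    $version   = 0         unless defined $version;
    return join('/', $accession, $version, $namespace);
}

my %seen;
my @collisions;
for my $id (@display_ids) {
    # The fasta parser supplies neither accession nor version.
    my $key = default_key('refseq', undef, undef);
    push @collisions, $id if $seen{$key}++;
}

print "collides with an earlier entry: $_\n" for @collisions;
```

Every record maps to the same key, so only the first one can be stored.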

The easiest way to deal with this is to write your own  
SequenceProcessor (see Bio::Factory::SequenceProcessorI and  
Bio::Seq::BaseSeqProcessor) and then pipeline it using the --pipeline  
argument to load_seqdatabase.pl.

Simple examples of how to write your own SeqProcessor have been posted
before, e.g., by Marc Logghe:

http://portal.open-bio.org/pipermail/bioperl-l/2005-February/018158.html

and by me:

http://portal.open-bio.org/pipermail/bioperl-l/2003-June/012369.html
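
For orientation, a minimal sketch of such a processor could look like
the code below. This is not the code from either post. The package name
RefSeqAccessionProcessor is made up, and the regular expression assumes
NCBI-style IDs of the form gi|<number>|ref|<accession>.<version>| as in
your file; other header layouts would pass through unchanged.

```perl
package RefSeqAccessionProcessor;   # hypothetical name
use strict;
use warnings;
use base qw(Bio::Seq::BaseSeqProcessor);

# Called for each sequence in the pipeline. Returns the (possibly
# modified) sequence object(s) to hand on to the next stage.
sub process_seq {
    my ($self, $seq) = @_;

    # Assumption: IDs look like gi|6912649|ref|NM_012431.1|
    if ($seq->display_id =~ /^gi\|\d+\|ref\|([^|.]+)(?:\.(\d+))?\|/) {
        my ($acc, $ver) = ($1, $2);
        $seq->accession_number($acc);
        $seq->version($ver) if defined $ver;
    }
    return ($seq);
}

1;
```

Saved as RefSeqAccessionProcessor.pm somewhere on Perl's @INC (for
example via PERL5LIB), it should be usable by passing the package name
to the --pipeline argument of load_seqdatabase.pl.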

	-hilmar

On Aug 22, 2005, at 7:57 AM, Amit Indap wrote:

> Hi,
>
> I am new to using the biosql. I am trying to load fasta formatted
> RefSeq records into the biosql schema. When I try to use the
> load_seqdatabase.pl script I get the following error
>
> load_seqdatabase.pl --host 127.0.0.1 --port 2022 --dbname testbiosql
> --namespace refseq --format fasta refseq.fa
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
> ("gi|51459331|ref|XM_498785.1|","gi|51459331|ref|XM_498785.1|","unknown","PREDICTED: Homo sapiens LOC440641 (LOC440641), mRNA","0","") FKs (1,<NULL>)
> Duplicate entry 'unknown-1-0' for key 2
> ---------------------------------------------------
> Could not store unknown:
> ------------- EXCEPTION  -------------
> MSG: You're trying to lie about the length: is 1316 but you say 6474
> STACK Bio::PrimarySeq::length /usr/lib/perl5/site_perl/5.8.5/Bio/PrimarySeq.pm:418
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
> STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.5/Bio/Seq.pm:612
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
> STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BiosequenceAdaptor.pm:236
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1310
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
> STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
> STACK Bio::DB::BioSQL::SeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/SeqAdaptor.pm:279
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1341
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:205
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
> STACK (eval) ./load_seqdatabase.pl:542
> STACK toplevel ./load_seqdatabase.pl:525
>
> --------------------------------------
>  at ./load_seqdatabase.pl line 555
>
> I think my fasta headers are incorrect since it says it cannot store
> unknown. The first fasta record in my refseq.fa is this:
>
>> gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
> domain (Ig), short basic domain, secreted, (semaphorin) 3E (SEMA3E),
> mRNA
>
> Do I need to reformat that header? I downloaded the NM series of
> Refseqs in fasta form from NCBI's ftp site and wanted to load them
> into the biosql schema.
>
> Thanks,
>
> Amit Indap
> Dept. of Biological Statistics and Computational Biology
> Cornell University
>
>
> (error message)
> Loading refseq.fa ...
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------