[Bioperl-l] Re: [BioSQL-l] loading fasta records with
load_seqdatabase.pl - correct fasta headers
Hilmar Lapp
hlapp at gnf.org
Mon Aug 22 14:18:30 EDT 2005
Amit,
this is a problem inherent in the fasta format, which has no precise
definition of what to put as identifier and/or accession. The Bioperl
fasta parser doesn't set the accession, so it defaults to "unknown"
(it cannot be undef). Fasta format also has no defined place for the
version, so the version will be undef (i.e., zero in biosql) for every
entry. As a result, all your sequences end up with the same unique key
of (accession,version,namespace), which violates the constraint as soon
as you try to store the second sequence.
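To make the collision concrete, here is a small illustration. It is written in Python purely for brevity and does not use Bioperl or BioSQL; the helper name unique_key and the defaulting rules in it are assumptions that only mirror the behaviour described above.

```python
# Illustration only: mimic the (accession, version, namespace) unique key.
def unique_key(accession, version, namespace):
    # Assumption mirroring the text: a missing accession becomes "unknown"
    # and an undefined version is stored as 0.
    return (accession or "unknown", version or 0, namespace)

seen = set()
collisions = []
for display_id in ["gi|6912649|ref|NM_012431.1|", "gi|51459331|ref|XM_498785.1|"]:
    # The fasta parser fills in neither accession nor version,
    # so every record produces the same key.
    key = unique_key(None, None, "refseq")
    if key in seen:
        collisions.append(display_id)
    seen.add(key)
```

Both records map to the same key, so the second one is the first to be rejected, no matter how different the display IDs are.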
The easiest way to deal with this is to write your own
SequenceProcessor (see Bio::Factory::SequenceProcessorI and
Bio::Seq::BaseSeqProcessor) and then pipeline it using the --pipeline
argument to load_seqdatabase.pl.
Simple examples of how to write your own SeqProcessor have been posted
before, e.g., by Marc Logghe:
http://portal.open-bio.org/pipermail/bioperl-l/2005-February/018158.html
and by me:
http://portal.open-bio.org/pipermail/bioperl-l/2003-June/012369.html
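For headers like the ones in your file, the parsing step that such a processor would need can be sketched as follows. This is only the logic, written in Python for illustration; it is not a SeqProcessor and not Bioperl code. The function name parse_refseq_id and the regular expression are assumptions based on the two IDs shown in your message (gi|<number>|ref|<accession>.<version>|), so other header layouts would need their own pattern.

```python
import re

# Assumed layout, taken from the two example IDs in this thread:
#   gi|<gi number>|ref|<accession>.<version>|
NCBI_REFSEQ_ID = re.compile(r"^gi\|(\d+)\|ref\|([A-Z]{2}_\d+)\.(\d+)\|?$")

def parse_refseq_id(display_id):
    """Return (accession, version, gi), or None if the ID does not match."""
    match = NCBI_REFSEQ_ID.match(display_id.strip())
    if match is None:
        return None
    gi, accession, version = match.groups()
    return accession, int(version), gi

ids = ["gi|6912649|ref|NM_012431.1|", "gi|51459331|ref|XM_498785.1|"]
parsed = [parse_refseq_id(i) for i in ids]
```

With accession and version taken from the header, each record gets its own (accession,version,namespace) key. In a real processor you would set these values on the sequence object before it is stored.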
-hilmar
On Aug 22, 2005, at 7:57 AM, Amit Indap wrote:
> Hi,
>
> I am new to using the biosql. I am trying to load fasta formatted
> RefSeq records into the biosql schema. When I try to use the
> load_seqdatabase.pl script I get the following error
>
> load_seqdatabase.pl --host 127.0.0.1 --port 2022 --dbname testbiosql
> --namespace refseq --format fasta refseq.fa
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values
> were
> ("gi|51459331|ref|XM_498785.1|","gi|51459331|ref|XM_498785.1|","unknown
> ","PREDICTED:
> Homo sapiens LOC440641 (LOC440641), mRNA","0","") FKs (1,<NULL>)
> Duplicate entry 'unknown-1-0' for key 2
> ---------------------------------------------------
> Could not store unknown:
> ------------- EXCEPTION -------------
> MSG: You're trying to lie about the length: is 1316 but you say 6474
> STACK Bio::PrimarySeq::length
> /usr/lib/perl5/site_perl/5.8.5/Bio/PrimarySeq.pm:418
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:
> 553
> STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.5/Bio/Seq.pm:612
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:
> 553
> STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BiosequenceAdaptor.pm:236
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:1310
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:976
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:855
> STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
> STACK Bio::DB::BioSQL::SeqAdaptor::attach_children
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/SeqAdaptor.pm:279
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:1341
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:976
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:855
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:205
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:254
> STACK Bio::DB::Persistent::PersistentObject::store
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:
> 272
> STACK (eval) ./load_seqdatabase.pl:542
> STACK toplevel ./load_seqdatabase.pl:525
>
> --------------------------------------
> at ./load_seqdatabase.pl line 555
>
> I think my fasta headers are incorrect since it says it cannot store
> unknown. The first fasta record in my refseq.fa is this:
>
> >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
> domain (Ig), short basic domain, secreted, (semaphorin) 3E (SEMA3E),
> mRNA
>
> Do I need to reformat that header? I downloaded the NM series of
> Refseqs in fasta form from NCBI's ftp site and wanted to load them
> into the biosql schema.
>
> Thanks,
>
> Amit Indap
> Dept. of Biological Statistics and Computational Biology
> Cornell University
>
>
> (error message)
> Loading refseq.fa ...
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------