[BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers

mark.schreiber at novartis.com
Tue Aug 23 04:53:21 EDT 2005


The NCBI 'standard' is to format the header like this:

>gi|{identifier}|{namespace}|{accession}.{version}|{accession} description

e.g.

>gi|123456|gb|AE657483.3|AE657483.3 Gene of interest from Flying Spaghetti Monster.

BioJava will adopt this format whenever the appropriate information is
available.
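
For illustration, here is a minimal Python sketch of how a header in this
layout could be split into its fields. It is not BioJava or BioPerl code;
the names NcbiHeader and parse_ncbi_header are hypothetical, and the
pattern only covers the single gi|...|namespace|accession.version|...
layout shown above. The field after the last pipe is often empty in
practice (for example in gi|6912649|ref|NM_012431.1|), so the sketch
allows that.

```python
import re
from typing import NamedTuple, Optional


class NcbiHeader(NamedTuple):
    """Fields of a gi-style FASTA header (hypothetical container)."""
    gi: str
    namespace: str
    accession: str
    version: Optional[int]
    trailing: str        # field after the last '|', may be empty
    description: str


# Assumed layout: >gi|<gi>|<namespace>|<accession>.<version>|<trailing> <description>
_HEADER_RE = re.compile(
    r"^>?\s*gi\|(?P<gi>\d+)"
    r"\|(?P<ns>[^|\s]+)"
    r"\|(?P<acc>[^|.\s]+)(?:\.(?P<ver>\d+))?"
    r"\|(?P<trail>[^|\s]*)"
    r"\s*(?P<desc>.*)$"
)


def parse_ncbi_header(line: str) -> NcbiHeader:
    """Split one header line into its fields; raise ValueError if it does not match."""
    m = _HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError("not a gi-style header: %r" % line)
    ver = m.group("ver")
    return NcbiHeader(
        gi=m.group("gi"),
        namespace=m.group("ns"),
        accession=m.group("acc"),
        version=int(ver) if ver is not None else None,
        trailing=m.group("trail"),
        description=m.group("desc"),
    )
```

Headers that do not follow this layout raise ValueError rather than being
guessed at, since FASTA itself does not define where these fields live.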

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910





Hilmar Lapp <hlapp at gnf.org>
Sent by: biosql-l-bounces at portal.open-bio.org
08/23/2005 02:18 AM

 
        To:     Amit Indap <indapa at gmail.com>
        cc:     Bioperl <bioperl-l at bioperl.org>, Biosql <biosql-l at open-bio.org>, (bcc: Mark Schreiber/GP/Novartis)
        Subject:        Re: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers


Amit,

this problem is inherent in the fasta format, which has no precise
definition of what goes into the identifier or the accession. The
Bioperl fasta parser doesn't set the accession, so it defaults to
"unknown" (it cannot be undef). Fasta format also has no defined place
for the version, so the version will be undef (i.e., zero for biosql)
for every entry. As a result, all your sequences end up with the same
unique key of (accession,version,namespace), which violates the
constraint as soon as a second sequence is stored.
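
To make the collision concrete, here is a minimal Python sketch. It is not
BioPerl or BioSQL code: the function names are hypothetical, and a plain
set stands in for the database's unique constraint on
(accession,version,namespace). The defaults mirror the ones described
above, accession "unknown" and version zero, and the fix assumes the
display id has the gi|<gi>|<db>|<accession>.<version>| layout.

```python
from typing import Optional, Set, Tuple

Key = Tuple[str, int, str]  # (accession, version, namespace)


def unique_key(accession: Optional[str], version: Optional[int], namespace: str) -> Key:
    """Apply the defaults described above: 'unknown' accession, version 0."""
    return (accession or "unknown", version or 0, namespace)


def store(keys: Set[Key], key: Key) -> bool:
    """Stand-in for an insert under a unique constraint; False means duplicate."""
    if key in keys:
        return False
    keys.add(key)
    return True


def key_from_display_id(display_id: str, namespace: str) -> Key:
    """Derive accession and version from an assumed gi|<gi>|<db>|<acc>.<ver>| id."""
    fields = display_id.split("|")
    accession, _, version = fields[3].partition(".")
    return unique_key(accession, int(version) if version else None, namespace)


display_ids = ["gi|6912649|ref|NM_012431.1|", "gi|51459331|ref|XM_498785.1|"]

# Accession and version left unset: every record maps to the same key.
seen_default: Set[Key] = set()
default_results = [store(seen_default, unique_key(None, None, "refseq"))
                   for _ in display_ids]      # [True, False]

# Accession and version taken from the display id: the keys are distinct.
seen_parsed: Set[Key] = set()
parsed_results = [store(seen_parsed, key_from_display_id(d, "refseq"))
                  for d in display_ids]       # [True, True]
```

The second record is rejected in the first run for the same reason as the
"Duplicate entry 'unknown-1-0'" message in the quoted error below: the
stored key carries nothing that distinguishes one sequence from the next.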

The easiest way to deal with this is to write your own 
SequenceProcessor (see Bio::Factory::SequenceProcessorI and 
Bio::Seq::BaseSeqProcessor) and then pipeline it using the --pipeline 
argument to load_seqdatabase.pl.

Simple examples for how to write your own SeqProcessor have been posted 
before, e.g., by Marc Logghe:

http://portal.open-bio.org/pipermail/bioperl-l/2005-February/018158.html

and by myself

http://portal.open-bio.org/pipermail/bioperl-l/2003-June/012369.html

                 -hilmar

On Aug 22, 2005, at 7:57 AM, Amit Indap wrote:

> Hi,
>
> I am new to biosql. I am trying to load fasta formatted
> RefSeq records into the biosql schema. When I run the
> load_seqdatabase.pl script, I get the following error:
>
> load_seqdatabase.pl --host 127.0.0.1 --port 2022 --dbname testbiosql --namespace refseq --format fasta refseq.fa
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
> ("gi|51459331|ref|XM_498785.1|","gi|51459331|ref|XM_498785.1|","unknown","PREDICTED: Homo sapiens LOC440641 (LOC440641), mRNA","0","") FKs (1,<NULL>)
> Duplicate entry 'unknown-1-0' for key 2
> ---------------------------------------------------
> Could not store unknown:
> ------------- EXCEPTION  -------------
> MSG: You're trying to lie about the length: is 1316 but you say 6474
> STACK Bio::PrimarySeq::length /usr/lib/perl5/site_perl/5.8.5/Bio/PrimarySeq.pm:418
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
> STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.5/Bio/Seq.pm:612
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
> STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BiosequenceAdaptor.pm:236
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1310
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
> STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
> STACK Bio::DB::BioSQL::SeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/SeqAdaptor.pm:279
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1341
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:205
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
> STACK (eval) ./load_seqdatabase.pl:542
> STACK toplevel ./load_seqdatabase.pl:525
>
> --------------------------------------
>  at ./load_seqdatabase.pl line 555
>
> I think my fasta headers are incorrect since it says it cannot store
> unknown. The first fasta record in my refseq.fa is this:
>
> >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3E (SEMA3E), mRNA
>
> Do I need to reformat that header? I downloaded the NM series of
> Refseqs in fasta form from NCBI's ftp site and wanted to load them
> into the biosql schema.
>
> Thanks,
>
> Amit Indap
> Dept. of Biological Statistics and Computational Biology
> Cornell University
>
>
> (error message)
> Loading refseq.fa ...
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

