[Bioperl-l] Re: [BioSQL-l] loading fasta records with
load_seqdatabase.pl - correct fasta headers
mark.schreiber at novartis.com
Tue Aug 23 04:53:21 EDT 2005
The NCBI 'standard' is to format the header like this:
>gi|{identifier}|{namespace}|{accession}.{version}|{accession} description
e.g.
>gi|123456|gb|AE657483.3|AE657483.3 Gene of interest from Flying Spaghetti Monster.
BioJava will adopt this format when the appropriate information is
available.
- Mark
Mark Schreiber
Principal Scientist (Bioinformatics)
Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com
phone +65 6722 2973
fax +65 6722 2910
Hilmar Lapp <hlapp at gnf.org>
Sent by: biosql-l-bounces at portal.open-bio.org
08/23/2005 02:18 AM
To: Amit Indap <indapa at gmail.com>
cc: Bioperl <bioperl-l at bioperl.org>, Biosql <biosql-l at open-bio.org>, (bcc:
Mark Schreiber/GP/Novartis)
Subject: Re: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correct
fasta headers
Amit,
this problem is inherent in the fasta format, which has no precise
definition of what to put as identifier and/or accession. The Bioperl
fasta parser doesn't set the accession, so it defaults to "unknown"
(it cannot be undef). Fasta format also has no defined place for the
version, so the version will be undef (i.e., zero for biosql) for
every entry. All your sequences therefore get the same unique key of
(accession,version,namespace), which violates the constraint as soon
as a second sequence is stored.
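To make the collision concrete, here is a minimal Python sketch. It is not BioPerl and does not use the real BioSQL schema: the `entry` table, its columns, and the `load` helper are hypothetical, and the only thing assumed from the explanation above is a unique key over (accession, version, namespace).

```python
import sqlite3


def load(records, namespace):
    """Insert (accession, version) pairs into a toy table.

    Returns the pairs that were rejected by the unique constraint.
    The table and column names are illustrative only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry ("
        " accession TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " namespace TEXT NOT NULL,"
        " UNIQUE (accession, version, namespace))"
    )
    rejected = []
    for accession, version in records:
        try:
            conn.execute(
                "INSERT INTO entry VALUES (?, ?, ?)",
                (accession, version, namespace),
            )
        except sqlite3.IntegrityError:
            # Same (accession, version, namespace) as an earlier row.
            rejected.append((accession, version))
    conn.close()
    return rejected


# Every fasta entry arrives with accession "unknown" and version 0,
# so only the first one can be stored.
unparsed = [("unknown", 0), ("unknown", 0), ("unknown", 0)]

# With accession and version filled in, the keys are distinct.
parsed = [("NM_012431", 1), ("XM_498785", 1)]

print(load(unparsed, "refseq"))
print(load(parsed, "refseq"))
```

The first call rejects every record after the first, which corresponds to the "Duplicate entry 'unknown-1-0'" message in the report below; the second call stores everything.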
The easiest way to deal with this is to write your own
SequenceProcessor (see Bio::Factory::SequenceProcessorI and
Bio::Seq::BaseSeqProcessor) and then pipeline it using the --pipeline
argument to load_seqdatabase.pl.
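As a rough illustration of what such a processor would have to do, here is the id-fixing logic sketched in Python. This is only a stand-in for the logic: the actual processor must be a Perl class built on the modules named above, and the `Record` class and `fix_ids` function here are hypothetical names, not BioPerl API. It assumes display ids of the form shown in the error report, e.g. 'gi|6912649|ref|NM_012431.1|'.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a parsed fasta entry (not a BioPerl class)."""
    display_id: str
    accession: str = "unknown"       # the default described above
    version: Optional[int] = None    # undef until set
    identifier: Optional[str] = None


def fix_ids(rec: Record) -> Record:
    """Fill accession, version and GI number from an NCBI-style display id.

    Ids that do not look like 'gi|<number>|<db>|<accession>.<version>|'
    are returned unchanged.
    """
    fields = rec.display_id.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        return rec
    accession, _, version = fields[3].partition(".")
    if not accession:
        return rec
    rec.identifier = fields[1]
    rec.accession = accession
    rec.version = int(version) if version.isdigit() else None
    return rec


rec = fix_ids(Record("gi|6912649|ref|NM_012431.1|"))
print(rec.accession, rec.version, rec.identifier)
```

Once each record carries its own accession and version, the (accession,version,namespace) key is distinct per sequence and the constraint is no longer violated.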
Simple examples of how to write your own SeqProcessor have been posted
before, e.g., by Marc Logghe:
http://portal.open-bio.org/pipermail/bioperl-l/2005-February/018158.html
and by myself:
http://portal.open-bio.org/pipermail/bioperl-l/2003-June/012369.html
-hilmar
On Aug 22, 2005, at 7:57 AM, Amit Indap wrote:
> Hi,
>
> I am new to using the biosql. I am trying to load fasta formatted
> RefSeq records into the biosql schema. When I try to use the
> load_seqdatabase.pl script I get the following error
>
> load_seqdatabase.pl --host 127.0.0.1 --port 2022 --dbname testbiosql
> --namespace refseq --format fasta refseq.fa
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values
> were
> ("gi|51459331|ref|XM_498785.1|","gi|51459331|ref|XM_498785.1|","unknown
> ","PREDICTED:
> Homo sapiens LOC440641 (LOC440641), mRNA","0","") FKs (1,<NULL>)
> Duplicate entry 'unknown-1-0' for key 2
> ---------------------------------------------------
> Could not store unknown:
> ------------- EXCEPTION -------------
> MSG: You're trying to lie about the length: is 1316 but you say 6474
> STACK Bio::PrimarySeq::length
> /usr/lib/perl5/site_perl/5.8.5/Bio/PrimarySeq.pm:418
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:
> 553
> STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.5/Bio/Seq.pm:612
> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:
> 553
> STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BiosequenceAdaptor.pm:236
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:1310
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:976
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:855
> STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
> STACK Bio::DB::BioSQL::SeqAdaptor::attach_children
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/SeqAdaptor.pm:279
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:1341
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:976
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:855
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:205
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/
> BasePersistenceAdaptor.pm:254
> STACK Bio::DB::Persistent::PersistentObject::store
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:
> 272
> STACK (eval) ./load_seqdatabase.pl:542
> STACK toplevel ./load_seqdatabase.pl:525
>
> --------------------------------------
> at ./load_seqdatabase.pl line 555
>
> I think my fasta headers are incorrect since it says it cannot store
> unknown. The first fasta record in my refseq.fa is this:
>
> >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
> domain (Ig), short basic domain, secreted, (semaphorin) 3E (SEMA3E),
> mRNA
>
> Do I need to reformat that header? I downloaded the NM series of
> Refseqs in fasta form from NCBI's ftp site and wanted to load them
> into the biosql schema.
>
> Thanks,
>
> Amit Indap
> Dept. of Biological Statistics and Computational Biology
> Cornell University
>
>
> (error message)
> Loading refseq.fa ...
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------