[Bioperl-l] Re: [BioSQL-l] loading fasta records with
load_seqdatabase.pl - correct fasta headers
Hilmar Lapp
hlapp at gnf.org
Tue Aug 23 15:43:56 EDT 2005
It may be worth depositing a suitable SeqProcessor for this type of ID
in the repository, as many people would probably find it useful.
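
As a rough sketch of the parsing such a processor would need, here is the ID logic written out in Python purely for illustration. The names FastaId, parse_ncbi_id and unique_key are made up for this example and are not part of Bioperl or BioSQL. A real processor for load_seqdatabase.pl would have to be a Perl class derived from Bio::Seq::BaseSeqProcessor. The fallback key below only mirrors the defaults described further down in this thread (accession "unknown", version zero).

```python
import re
from typing import NamedTuple, Optional


class FastaId(NamedTuple):
    """Fields recovered from an NCBI-style fasta ID (hypothetical helper type)."""
    gi: str
    db: str          # NCBI database tag, e.g. "ref" or "gb"
    accession: str
    version: int


# Matches gi|<gi>|<db>|<accession>.<version>| with an optional leading '>'.
_HEADER_RE = re.compile(r"^>?\s*gi\|(\d+)\|([a-z]+)\|([A-Za-z0-9_]+)\.(\d+)\|")


def parse_ncbi_id(header: str) -> Optional[FastaId]:
    """Return the parsed ID, or None if the header does not follow the pattern."""
    m = _HEADER_RE.match(header)
    if m is None:
        return None
    gi, db, accession, version = m.groups()
    return FastaId(gi, db, accession, int(version))


def unique_key(header: str, namespace: str) -> tuple:
    """Build an (accession, version, namespace) key for a fasta header.

    Headers that cannot be parsed fall back to ("unknown", 0, namespace),
    which is why unprocessed fasta records all end up with the same key.
    """
    parsed = parse_ncbi_id(header)
    if parsed is None:
        return ("unknown", 0, namespace)
    return (parsed.accession, parsed.version, namespace)
```

With the two IDs quoted in this thread, unique_key("gi|51459331|ref|XM_498785.1|", "refseq") gives ("XM_498785", 1, "refseq") and the NM_012431.1 header gives ("NM_012431", 1, "refseq"), so the keys no longer collide.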
On Aug 23, 2005, at 1:53 AM, mark.schreiber at novartis.com wrote:
> The NCBI 'standard' is to format the header like this:
>
> >gi|{identifier}|{namespace}|{accession}.{version}|{accession} description
>
> eg
>
> >gi|123456|gb|AE657483.3|AE657483.3 Gene of interest from Flying Spaghetti Monster.
>
> Biojava is going to be adopting this approach when the appropriate
> information is available.
>
> - Mark
>
> Mark Schreiber
> Principal Scientist (Bioinformatics)
>
> Novartis Institute for Tropical Diseases (NITD)
> 10 Biopolis Road
> #05-01 Chromos
> Singapore 138670
> www.nitd.novartis.com
>
> phone +65 6722 2973
> fax +65 6722 2910
>
> Hilmar Lapp <hlapp at gnf.org>
> Sent by: biosql-l-bounces at portal.open-bio.org
> 08/23/2005 02:18 AM
>
>
> To: Amit Indap <indapa at gmail.com>
> cc: Bioperl <bioperl-l at bioperl.org>, Biosql <biosql-l at open-bio.org>,
>     (bcc: Mark Schreiber/GP/Novartis)
> Subject: Re: [BioSQL-l] loading fasta records with load_seqdatabase.pl -
>     correct fasta headers
>
>
> Amit,
>
> this is a problem inherent in the fasta format, as there is no precise
> definition of what to put as identifier and/or accession. The Bioperl
> fasta parser doesn't set the accession, so it defaults to "unknown"
> (it cannot be undef). Since fasta format also doesn't have the version
> in a defined place, the version will be undef (i.e., zero for biosql)
> for every entry. All your sequences will therefore have the same unique
> key of (accession,version,namespace), which violates the constraint
> as soon as a second sequence is stored.
>
> The easiest way to deal with this is to write your own
> SequenceProcessor (see Bio::Factory::SequenceProcessorI and
> Bio::Seq::BaseSeqProcessor) and then pipeline it using the --pipeline
> argument to load_seqdatabase.pl.
>
> Simple examples for how to write your own SeqProcessor have been posted
> before, e.g., by Marc Logghe:
>
> http://portal.open-bio.org/pipermail/bioperl-l/2005-February/018158.html
>
> and by myself
>
> http://portal.open-bio.org/pipermail/bioperl-l/2003-June/012369.html
>
> -hilmar
>
> On Aug 22, 2005, at 7:57 AM, Amit Indap wrote:
>
>> Hi,
>>
>> I am new to BioSQL. I am trying to load fasta-formatted
>> RefSeq records into the biosql schema. When I try to use the
>> load_seqdatabase.pl script, I get the following error:
>>
>> load_seqdatabase.pl --host 127.0.0.1 --port 2022 --dbname testbiosql
>> --namespace refseq --format fasta refseq.fa
>>
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
>> ("gi|51459331|ref|XM_498785.1|","gi|51459331|ref|XM_498785.1|","unknown","PREDICTED: Homo sapiens LOC440641 (LOC440641), mRNA","0","") FKs (1,<NULL>)
>> Duplicate entry 'unknown-1-0' for key 2
>> ---------------------------------------------------
>> Could not store unknown:
>> ------------- EXCEPTION -------------
>> MSG: You're trying to lie about the length: is 1316 but you say 6474
>> STACK Bio::PrimarySeq::length /usr/lib/perl5/site_perl/5.8.5/Bio/PrimarySeq.pm:418
>> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
>> STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.5/Bio/Seq.pm:612
>> STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
>> STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BiosequenceAdaptor.pm:236
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1310
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
>> STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
>> STACK Bio::DB::BioSQL::SeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/SeqAdaptor.pm:279
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1341
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:205
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
>> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
>> STACK (eval) ./load_seqdatabase.pl:542
>> STACK toplevel ./load_seqdatabase.pl:525
>>
>> --------------------------------------
>> at ./load_seqdatabase.pl line 555
>>
>> I think my fasta headers are incorrect since it says it cannot store
>> unknown. The first fasta record in my refseq.fa is this:
>>
>> >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3E (SEMA3E), mRNA
>>
>> Do I need to reformat that header? I downloaded the NM series of
>> Refseqs in fasta form from NCBI's ftp site and wanted to load them
>> into the biosql schema.
>>
>> Thanks,
>>
>> Amit Indap
>> Dept. of Biological Statistics and Computational Biology
>> Cornell University
>>
>>
>> (error message)
>> Loading refseq.fa ...
>>
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at open-bio.org
>> http://open-bio.org/mailman/listinfo/biosql-l
>>
> --
> -------------------------------------------------------------
> Hilmar Lapp email: lapp at gnf.org
> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
> -------------------------------------------------------------
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------