[Bioperl-l] Error while running load_seqdatabase.pl

Tue Jan 30 17:32:09 UTC 2007

Hi Hilmar,

  I have been trying to get around this problem for some days now, but havent had much luck. My file has about 12000 odd records, and when I try to load it with a pipeline to the package, I have about 483 records that get loaded. 

  sub process_seq
{
  my ($self, $seq) = @_;
#  $seq->accession_number($seq->display_id);
  my @ids = split(/\|/,$seq->display_id);
  $seq->accession_number($ids[0]);
  $seq->primary_id($ids[2]);
  return ($seq);
}

  I am assuming this is because I use the ids[2] for the primary_id, and possibly the file has 483 unique records for that field. When I use some other reference like ids[3] etc(coz it has more unique values), I get the same error about the length being more than 40. 

  I printed out the values for the display_id, primary_id and accession_number. The display_id has the entire first line of the file, so does the primary_id, if I dont set it in my new script. The accession number is split correctly and printed. 

  This is how the first few lines in my file look like, 

  >FGENESHT0000001||AC155633|570|4400|1
MTERKRKEIEDRKRKISGPQPGSSNRPRFSGNQPQQFRQNQRPPQQHQQFQRQYPQHQYQNRQSNQSGGQFQRQNQQAPR
LPAPAAQQNSQATPAQVGNRACFHCGEQGHWVMQCPKKAAQQQSGPNAPAKQNVPQPRAGNRSQPRYNHGRLNHLEAEAV
QETPSMIVGMFPVDSHIAEVLFDTGATHSFITASWVEAHNLPITTMSTPIQIDSAGGRIRADSICLNICVEIRGIAFPAN
LIVMGTQGIDVILGMNWLDKYQAVISCDKRTIKLMSPLGEEVVTELVPPEPKRGSCYQLAVDSSEVDPIESIRVVSEFPD
VFPKDLPGMPPERKVEFAIELLPGTAPIFKRAYRISGPELVELKEQIDELSEKGYIRPSTSPWAAPVLFVEKKDGTKRMC
IDYRALNEVTIKNKYPLPRIEDLFDQLRGASVFSKIDLRSAFFMNLMNSVFMDYLDKFVVVFIDDILVYSQSEEEHADHL
KMVLQRLREHQLYAKLSKCEFWINEVLFLGHIINKEGLAVDPKKVANILNWKAPTDARGIKSFIGMVGYYRRFIEGFSKI
AKPMTALLGNKVEFKWTQKCQEAFEALKEKLTIAPVLVLPDVHKPFSVYCDACYTGLGCVLMQEGRVVAYSSRQLKVHEK
NYPIHDLELAAVVHALKTWRHYLYGQKCDVYTDHKSLKYIFTQSELNMRQRRWLELIKDYELEIHYHPGKANVVADALSR
KSQVNLMVARPMPYELAKEFDRLSLGFLNNSRGVTVELEPTLEREIKEAQKNDEKISEIRRLILDGRGKDFREDAEGVIW
FKDRLCVPNVQSIRELILKEAHETAYSIHPGSEKMYQDLKKKFWWYGMKREIAEHVAMCDSCRRIKAEHQRPAGLLQPLQ
IPQWKWDEIGMDFIVGLPRTRAGYDSIWVVVDRLTKSAHFIPVKTNYSSAVLAELYMSRIVCLHGVPKKIVSDRGTQFTS
HFWRQLHEALGTHLNFSSAYHPQTDGQTERTNQILEDMLRACALQDQSGWDKRLPYAEFSYNNSYQASLKMSPFQALYGR
SCRTPLQWDQPGEKQVFGPDILLEAEENIKMVRENLKIAQSRQRSYADTRRRELSFEVGDFVYLKVSPIRGVKRFGVKGK
LAPRYIGSYQILARRGEVAYQLSLPENLSAVHDVFHVSQLKKCLRVPEEQLPVEGLEVQEDLTYVEKPVQILEVADRVTR
RKTIRMCKVRWNHHSEEEATSEREDDLMAKYPELFASQP*

  Any suggestions?

  Thanks!
  George.

Hilmar Lapp <hlapp at gmx.net> wrote:
  That's odd indeed. Did you try and put a print statement before the 
return statement that proves that 1) the codes gets executed, and 2) 
display_id(), primary_id() and accession_number() have the expected 
values?

BTW you might also want to set primary_id() to undef (as the ID found 
in the FASTA files doesn't really count as a primary database- 
specific ID anyway). The identifier column in bioentry (which 
primary_id() maps to) is constrained to 40 chars as well.

-hilmar

On Jan 27, 2007, at 9:31 PM, George Heller wrote:

> Hi Hilmar,
>
> I tried the lookup and noupdate options, and also made changes to 
> the SeqProcessor.pm package for the accession,
> package SeqProcessor::Accession;
> use strict;
> use vars qw(@ISA);
> use Bio::Seq::BaseSeqProcessor;
> use Bio::SeqFeature::Generic;
> @ISA = qw(Bio::Seq::BaseSeqProcessor);
> sub process_seq
> {
> my ($self, $seq) = @_;
> my @ids = split(/|/,$seq->display_id);
> $seq->accession_number($ids[0]);
> return ($seq);
> }
> 1;
> I invoke the load_seqdatabase.pl as,
> perl load_seqdatabase.pl -host localhost -dbname usda-06 -format 
> fasta -dbuser postgres -driver Pg --lookup --noupdate -- 
> pipeline="SeqProcessor::Accession" maize_pep.fasta
> Loading maize_pep.fasta ...
> I get the error,
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values 
> were ("FGENESHT0000021||AC155633|113788| 
> 114708|-1","FGENESHT0000021||AC155633|113788| 
> 114708|-1","FGENESHT0000021||AC155633|113788|114708|-1","","0","") 
> FKs (1,)
> ERROR: value too long for type character varying(40)
> ---------------------------------------------------
> Could not store FGENESHT0000021||AC155633|113788|114708|-1:
> ------------- EXCEPTION -------------
> MSG: error while executing statement in 
> Bio::DB::BioSQL::SeqAdaptor::find_by_unique_key: ERROR: current 
> transaction is aborted, commands ignored until end of transaction 
> block
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key / 
> home/akar/local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:951
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key / 
> home/akar/local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /home/akar/ 
> local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:205
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /home/akar/ 
> local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
> STACK Bio::DB::Persistent::PersistentObject::store /home/akar/local/ 
> perl//Bio/DB/Persistent/PersistentObject.pm:272
> STACK (eval) load_seqdatabase.pl:620
> STACK toplevel load_seqdatabase.pl:602
> --------------------------------------
> at load_seqdatabase.pl line 633
> As far as I gather, this error shouldnt appear as we are 
> filtering out the accession as only the first code that appears. 
> Ideas?
>
> George.
>
>
> Hilmar Lapp wrote:
> George,
>
> I don't know you create the FASTA file, but that's probably where the
> root cause is. Based on the message:
>
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values
>> were ("FGENESHT0000001||AC155633|570|4400|1","FGENESHT0000001||
>> AC155633|570|4400|1","FGENESHT0000001||AC155633|570|4400|
>> 1","","0","") FKs (1,)
>> ERROR: duplicate key violates unique constraint
>> "bioentry_accession_key"
>> ---------------------------------------------------
>
> the identifier and accession number are set, so your SeqProcessor
> scriptlet was executed (otherwise you'd also have seen a dynamic
> loading error it e.g. you perl class could be not be found or loaded
> by perl). If you still receive the duplicate key violation, then it
> can only mean that indeed a sequence with the exact same accession
> number was in the database already.
>
> There are different possibilities for why: you may have loaded the
> same file before (use --lookup and related switches if you want to
> update existing sequences), or your FASTA file contains multiple
> sequences with the same ID, or you have a sequence with the same ID
> in different FASTA files, if you are loading from more than one file.
> In either of the two latter cases, you will need to find a way to
> disambiguate the IDs.
>
> BTW you also want to consider to parse the concatenated ID
> 'FGENESHT0000001||AC155633|570|4400|1' apart into its component IDs,
> and then use only one component. For example:
>
> my @ids = split(/|/,$seq->display_id);
> $seq->accession_number($ids[0]);
>
> Obviously, this will only make for a nicer accession number, and not
> solve your duplicate ID problem, as the latter is in the file(s) you
> load.
>
> -hilmar
>
> On Jan 25, 2007, at 8:51 PM, George Heller wrote:
>
>> Hi Hilmar,
>>
>> I still seem to be having problems loading my fasta file. I wrote a
>> new package, SeqProcessor.pm as below,
>>
>> package SeqProcessor::Accession;
>> use strict;
>> use vars qw(@ISA);
>> use Bio::Seq::BaseSeqProcessor;
>> use Bio::SeqFeature::Generic;
>> @ISA = qw(Bio::Seq::BaseSeqProcessor);
>> sub process_seq
>> {
>> my ($self, $seq) = @_;
>> $seq->accession_number($seq->display_id);
>> return ($seq);
>> }
>> 1;
>> I have this file SeqProcessor.pm in my home directory, and I have
>> set the PERL5LIB variable accordingly. When I run 
>> load_seqdatabase.pl,
>>
>> perl load_seqdatabase.pl -host localhost -dbname biodb -format
>> fasta -dbuser postgres -driver Pg --
>> pipeline="SeqProcessor::Accession" maize_pep.fasta
>>
>> I still get the error,
>>
>> Loading maize_pep.fasta ...
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values
>> were ("FGENESHT0000001||AC155633|570|4400|1","FGENESHT0000001||
>> AC155633|570|4400|1","FGENESHT0000001||AC155633|570|4400|
>> 1","","0","") FKs (1,)
>> ERROR: duplicate key violates unique constraint
>> "bioentry_accession_key"
>> ---------------------------------------------------
>> Could not store FGENESHT0000001||AC155633|570|4400|1:
>> ------------- EXCEPTION -------------
>> MSG: error while executing statement in
>> Bio::DB::BioSQL::SeqAdaptor::find_by_unique_key: ERROR: current
>> transaction is aborted, commands ignored until end of transaction
>> block
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /
>> home/akar/local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:951
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /
>> home/akar/local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /home/akar/
>> local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:205
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /home/akar/
>> local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
>> STACK Bio::DB::Persistent::PersistentObject::store /home/akar/local/
>> perl//Bio/DB/Persistent/PersistentObject.pm:272
>> STACK (eval) load_seqdatabase.pl:620
>> STACK toplevel load_seqdatabase.pl:602
>> --------------------------------------
>> at load_seqdatabase.pl line 633
>> Is there something I am missing?
>>
>> Thanks!
>> George.
>>
>>
>> Hilmar Lapp wrote:
>> Hi George, sorry for the sluggish response, I was tied up during the
>> week. This is also why you always want to keep the thread on the 
>> list.
>>
>> Perl is an interpreted language, so no compilation is necessary. The
>> only thing you need to do is have the package in a place where perl
>> can find it. The simplest way to achieve this is by setting the
>> PERL5LIB environment variable:
>>
>> $ export PERL5LIB=/where/you/put/your/perl/package
>>
>> or if PERL5LIB was set already, you'd append it:
>>
>> $ export PERL5LIB=${PERL5LIB}:/where/you/put/your/perl/package
>>
>> I do assume that you didn't really add your code to the SeqAdaptor.pm
>> package - there is no necessity for nor benefit from that, and at
>> worst (and quite likely) perl won't be able to find the package. Note
>> that there is plenty of documentation for how to write packages for
>> perl and how to make them accessible to perl.
>>
>> Hth,
>>
>> -hilmar
>>
>> On Jan 8, 2007, at 11:52 PM, George Heller wrote:
>>
>>> Hi Hilmer.
>>>
>>> Thanks so much for the response. As I am new to Bioperl, I have
>>> another question.
>>>
>>> I have made the changes as suggested by you, and have added the
>>> code below to the SeqAdaptor.pm script.
>>>
>>> package SeqProcessor::Accession;
>>> use strict;
>>> use vars qw(@ISA);
>>> use Bio::Seq::BaseSeqProcessor;
>>> use Bio::SeqFeature::Generic;
>>>
>>> @ISA = qw(Bio::Seq::BaseSeqProcessor);
>>>
>>> sub process_seq
>>> {
>>> my ($self, $seq) = @_;
>>> $seq->accession_number($seq->display_id);
>>> return ($seq);
>>> }
>>>
>>> Now that I have done my changes, do I need to compile or something
>>> for the changes to reflect? If so, can you please let me know the
>>> command for the same, or direct me to any lin that has
>>> documentation for the same?
>>>
>>> Thanks so much for the help.
>>> George.
>>>
>>> Hilmar Lapp wrote:
>>> George,
>>>
>>> this is almost certainly caused by using FASTA format and bioperl's
>>> treatment of it. I am guilty of not having written a FAQ yet for
>>> Bioperl-db, as this would certainly be there.
>>>
>>> Specifically, the Bioperl fasta SeqIO parser (load_seqdatabase.pl
>>> uses Bioperl to parse sequence files) does not extract the accession
>>> number from the description line of the fasta sequence, and instead
>>> sets the accession_number property if sequence objects it creates to
>>> "unknown". Since there is a unique key constraint on
>>> (accession,version,namespace) the second sequence loaded will raise
>>> an exception as it will violate the constraint.
>>>
>>> The simplest way to deal with this is to write a SeqProcessor that
>>> massages the accession_number appropriately and then supply the
>>> module to load_seqdatabase.pl using the --pipeline command line
>>> switch.
>>>
>>> There are several examples for how to do this in the email archives.
>>> See for example this thread on the Biosql list:
>>>
>>> http://lists.open-bio.org/pipermail/biosql-l/2005-August/000901.html
>>>
>>> with two links to examples, and Marc Logghe gives another one in the
>>> thread itself.
>>>
>>> Hth,
>>>
>>> -hilmar
>>>
>>> On Jan 8, 2007, at 3:17 PM, George Heller wrote:
>>>
>>>> Hi all.
>>>>
>>>> I am new to Bioperl and am trying to run the load_seqdatabase.pl
>>>> script to load sequence data from a file into Postgres database. I
>>>> am invoking the script through the following command:
>>>>
>>>> perl load_seqdatabase.pl -host localhost -dbname biodb06 -format
>>>> fasta
>>>> -dbuser postgres -driver Pg
>>>>
>>>> I am getting the following error:
>>>>
>>>> -------------------- WARNING ---------------------
>>>> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values
>>>> were ("FGENES
>>>> HT0000001||AC155633|570|4400|1","FGENESHT0000001||AC155633|570|
>> 4400|
>>>> 1","unknown"
>>>> ,"","0","") FKs (1,)
>>>> ERROR: duplicate key violates unique constraint
>>>> "bioentry_accession_key"
>>>> ---------------------------------------------------
>>>> Could not store unknown:
>>>> ------------- EXCEPTION -------------
>>>> MSG: error while executing statement in
>>>> Bio::DB::BioSQL::SeqAdaptor::find_by_uni
>>>> que_key: ERROR: current transaction is aborted, commands ignored
>>>> until end of t
>>>> ransaction block
>>>> STACK
>>>> Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/
>>>> lib/perl
>>>> 5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:948
>>>> STACK
>> Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /
>>>> usr/lib/perl5
>>>> /site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:852
>>>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/
>>>> perl5/site_perl/5
>>>> .8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:203
>>>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/
>> perl5/
>>>> site_perl/5.
>>>> 8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
>>>> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/
>>>> site_perl/5.8.
>>>> 5/Bio/DB/Persistent/PersistentObject.pm:271
>>>> STACK (eval) load_seqdatabase.pl:620
>>>> STACK toplevel load_seqdatabase.pl:602
>>>> --------------------------------------
>>>> at load_seqdatabase.pl line 633
>>>>
>>>> Can anyone tell me how I can correct this error and get my script
>>>> running? Thanks!!!
>>>>
>>>> George.
>>>>
>>>>
>>>> __________________________________________________
>>>> Do You Yahoo!?
>>>> Tired of spam? Yahoo! Mail has the best spam protection around
>>>> http://mail.yahoo.com
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>> --
>>> ===========================================================
>>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>>> ===========================================================
>>>
>>>
>>>
>>>
>>>
>>>
>>> __________________________________________________
>>> Do You Yahoo!?
>>> Tired of spam? Yahoo! Mail has the best spam protection around
>>> http://mail.yahoo.com
>>
>> -- 
>> ===========================================================
>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>> ===========================================================
>>
>>
>>
>>
>>
>>
>>
>> Looking for earth-friendly autos?
>> Browse Top Cars by "Green Rating" at Yahoo! Autos' Green Center.
>
> -- 
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
>
>
>
>
>
>
>
> ---------------------------------
> Check out the all-new Yahoo! Mail beta - Fire up a more powerful 
> email and get things done faster.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

---------------------------------
Don't pick lemons.
See all the new 2007 cars at Yahoo! Autos.