[Bioperl-l] writing genbank files

gert thijs gert.thijs@esat.kuleuven.ac.be
Thu, 19 Sep 2002 10:53:39 +0200


Jason,

I have 1.0.2 installed, and it works fine apart from the problem with writing 
split locations.
To solve that problem, I downloaded the latest version from the main trunk, as 
Hilmar suggested.
When testing this new version, I ran into the error with the species name.
The problem with this genbank entry (and I have many more similar entries) is 
the validation of the species name. 'eurosids II' does not match the regex 
/^[A-Z][\sa-z]+$/, since it starts with a lowercase letter and ends in 
uppercase letters, and that is why the parser bails out. Other entries do not 
seem to have that problem.
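
To make the mismatch concrete, here is a minimal standalone sketch. It is not 
the actual Bio::Species code: it only copies the two patterns from the diff 
quoted below, name_ok() is a hypothetical helper, and apart from 'eurosids II' 
the names are illustrative inputs I picked.

```perl
use strict;
use warnings;

# Patterns copied from the diff quoted below (Species.pm 1.16 -> 1.17).
my %pattern = (
    'r1.16' => qr/^[A-Z][a-z]+$/,
    'r1.17' => qr/^[A-Z][\sa-z]+$/,
);

# Hypothetical helper, not part of Bio::Species: returns 1 if $name
# matches the pattern of the given revision, 0 otherwise.
sub name_ok {
    my ($rev, $name) = @_;
    return $name =~ $pattern{$rev} ? 1 : 0;
}

# 'eurosids II' comes from the failing entry; the other two names are
# illustrative inputs only.
for my $name ('Eukaryota', 'Homo sapiens', 'eurosids II') {
    printf "%-15s r1.16=%d r1.17=%d\n", "'$name'",
        name_ok('r1.16', $name), name_ok('r1.17', $name);
}
```

Allowing whitespace (1.17) rescues a name with an embedded space, but 
'eurosids II' is rejected by both patterns because of its lowercase first 
letter and the uppercase 'II'.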

Gert


Jason Stajich wrote:
> I did this fix about 1 week ago, don't see how your parsing would have
> worked before unless genbank parsing changed too from the version you
> were using before..
> 
> All I did was:
> 
> RCS file: /home/repository/bioperl/bioperl-live/Bio/Species.pm,v
> retrieving revision 1.16
> retrieving revision 1.17
> diff -r1.16 -r1.17
> 282c282
> <     return 1 if $string =~ /^[A-Z][a-z]+$/;
> ---
> >     return 1 if $string =~ /^[A-Z][\sa-z]+$/;
> 
> 
> 
> Someone needs to refresh the ideas behind the Species object and the
> taxonomic fields in genbank/embl records.  Either we are parsing things
> differently or the values that one can put in the field are changing.  We
> have a lot more taxonomic fields that are not matching what was expected
> when this module was built (James G did the brunt of the work back in the
> day).
> 
> Anyways, I am perfectly happy to turn off the name validation
> altogether, basically it required all fields other than the species to
> start with a capital letter.
> 
> -jason
> 
> 
> On Wed, 18 Sep 2002, Hilmar Lapp wrote:
> 
> 
>>I believe Jason fixed something yesterday in Species.pm in order to
>>allow spaces in certain places. Jason?
>>
>>	-hilmar
>>
>>On Wednesday, September 18, 2002, at 01:48 AM, gert thijs wrote:
>>
>>
>>>Hilmar,
>>>
>>>I just installed the modules from the main trunk. I tried to test
>>>them, but I am now unable to parse input sequences in genbank
>>>format; uploading a genbank flat file fails. There seems to be a
>>>problem while parsing the species name. I guess not having an
>>>upper case starting letter stops the genbank parser. Attached is a
>>>file on which the parser throws the exception.
>>>
>>>------------- EXCEPTION: Bio::Root::Exception -------------
>>>MSG: Invalid name 'eurosids II' (Wrong case?)
>>>STACK: Error::throw
>>>STACK: Bio::Root::Root::throw /users/sista/thijs/perl/lib/site_perl/5.6.0/Bio/Root/Root.pm:318
>>>STACK: Bio::Species::validate_name /users/sista/thijs/perl/lib/site_perl/5.6.0/Bio/Species.pm:283
>>>STACK: Bio::Species::classification /users/sista/thijs/perl/lib/site_perl/5.6.0/Bio/Species.pm:121
>>>STACK: Bio::SeqIO::genbank::_read_GenBank_Species /users/sista/thijs/perl/lib/site_perl/5.6.0/Bio/SeqIO/genbank.pm:884
>>>STACK: Bio::SeqIO::genbank::next_seq /users/sista/thijs/perl/lib/site_perl/5.6.0/Bio/SeqIO/genbank.pm:229
>>>STACK: AnnotatedSequence::new /users/sista/thijs/perl/lib//AnnotatedSequence.pm:66
>>>STACK: GeneIndex.pl:168
>>>-----------------------------------------------------------
>>>
>>>Gert
>>>
>>>
>>>Hilmar Lapp wrote:
>>>
>>>>It should be written as join(complement(...),complement(...),...).
>>>>This is main trunk only though. Do you have an example where this
>>>>is not true?
>>>>    -hilmar
>>>>On Tuesday, September 17, 2002, at 02:06 AM, gert thijs wrote:
>>>>
>>>>>Hello,
>>>>>
>>>>>I have a question about the current status of the genbank file
>>>>>parser/writer.  I noticed that a CDS with a location of the type
>>>>>complement(join()) is written as a join() without the complement.
>>>>>I saw that this problem was a major thread on the list a few
>>>>>weeks ago, but I could not find out whether it has been solved by
>>>>>now and, if so, how.
>>>>>
>>>>>Gert
>>>>>
>>>>>
>>>>>
>>>>>-- 
>>>>>+ Gert Thijs
>>>>>+  K.U.Leuven
>>>>>+  ESAT-SCD
>>>>>+  Kasteelpark Arenberg 10
>>>>>+  B-3001 Leuven-Heverlee
>>>>>+  Belgium
>>>>>+
>>>>>+ Tel  : +32 16 32 85 88
>>>>>+ Fax  : +32 16 32 19 70
>>>>>+ email: gert.thijs@esat.kuleuven.ac.be
>>>>>+
>>>>>+  http://www.esat.kuleuven.ac.be/~thijs
>>>>>+  http://www.esat.kuleuven.ac.be/~dna/BioI/
>>>>>+
>>>>>
>>>>>_______________________________________________
>>>>>Bioperl-l mailing list
>>>>>Bioperl-l@bioperl.org
>>>>>http://bioperl.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>>>-- 
>>>>-------------------------------------------------------------
>>>>Hilmar Lapp                            email: lapp at gnf.org
>>>>GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
>>>>-------------------------------------------------------------
>>>
>>>
