Bioperl: Bio::Species.pm: fixed bug #226

Hilmar Lapp hlapp@gmx.net
Mon, 08 May 2000 10:54:02 +0200


Keith James wrote:
> 
>  Handling rarer locations (123^124) for reading and writing
>  Handling badly wrapped "" within qualifiers
>  Handling locations with < and > for reading and writing
>  Fixed rejection of 1 bp long features on the complement
>  Subspecies handling for EMBL format fixed (to a degree)
> 
> I was concentrating on getting EMBL parsing to work for us as quickly
> as possible, so I neglected Genbank support due to lack of
> time. Likewise, I've had no time to move the fixes to the 0.6
> release. (I'm not familiar with CVS, especially on multiple branches).
> 

I've now succeeded in fixing the code in FTHelper.pm, as well as a couple
of lines in genbank.pm and embl.pm (Bio::SeqIO). Initialization of species was
still not correct in the latter two, the problem being that the genus is
duplicated in the classification array if you push genus and species onto the
array. Anyway, this is not urgent, but I've added a corresponding test to
t/SeqIO.t, which is already committed, so you can reproduce it if you wish.
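
To illustrate the duplication, here is a minimal sketch in Python rather than
Perl. It assumes that the lineage read from the entry already ends with the
genus and that the classification array runs from species upwards; the
function name build_classification and the sample lineage are hypothetical
and not taken from genbank.pm or embl.pm.

```python
def build_classification(lineage, genus, species):
    """Return taxon names ordered from species up to the top-level taxon."""
    ranks = list(lineage)
    # Append the genus only if the lineage does not already end with it;
    # pushing it unconditionally is what would duplicate the genus.
    if not ranks or ranks[-1] != genus:
        ranks.append(genus)
    ranks.append(species)
    ranks.reverse()
    return ranks


# Illustrative lineage only; it already ends with the genus "Homo".
lineage = ["Eukaryota", "Metazoa", "Chordata", "Mammalia",
           "Primates", "Hominidae", "Homo"]
classification = build_classification(lineage, "Homo", "sapiens")
```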

Concerning the unexpected-features problem in FTHelper.pm, I found that
Keith's changes had already fixed the 123^124 kind of location, and obviously
for GenBank as well, since FTHelper.pm is used by both parsers. The replace()
location would still fail. As we can expect more new feature "types" to be
invented in the near future, I think that an adequate solution
1) shall not stop execution of the calling program upon an unrecognized
feature entry unless the caller explicitly wishes so,
2) shall be flexible enough to allow for feature types which are not known yet
(although it is clear that some consensus about the expected syntax is
required).
Flexibility is the harder issue. With the changes mentioned below in
effect, I'm presently running a regression test against the complete primate
section of the most recent GenBank release (117.0) [I just can't afford to
load all of GenBank on my home computer :) ], and entries that cause problems
still keep coming up. I'll post the output of the warn()s later, as
soon as the run has completed.
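
Requirement 1) could look roughly like the following sketch, written in Python
rather than Perl. It assumes that the per-entry parser signals an unrecognized
entry by raising ValueError; parse_all, toy_parse and the strict flag are
hypothetical names, not part of FTHelper.pm.

```python
import warnings


def parse_all(entries, parse_one, strict=False):
    """Parse every entry; warn and skip on failure unless strict is set."""
    parsed = []
    for entry in entries:
        try:
            parsed.append(parse_one(entry))
        except ValueError as err:
            if strict:
                # The caller explicitly asked for execution to stop.
                raise
            warnings.warn(
                f"skipping unrecognized feature entry {entry!r}: {err}")
    return parsed


def toy_parse(entry):
    """Hypothetical stand-in parser that accepts only 'start..end'."""
    start, sep, end = entry.partition("..")
    if not sep:
        raise ValueError("expected start..end")
    return int(start), int(end)
```

By default the calling program keeps running and sees only warnings; passing
strict=True lets the first unrecognized entry propagate as an exception.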

As a first step in this direction, I tried to streamline the code in
FTHelper.pm/_generic_seqfeature() a little bit: it now allows any string
in place of "replace" and stores this string as a tag, with the value
determined by what follows the location within the bracketing construct
(e.g., replace(23..24,"at")).
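
The idea can be sketched as follows, in Python rather than Perl. This is not
the code in _generic_seqfeature(); the regular expressions, the function name
parse_location and the returned dictionary layout are assumptions made for
illustration. Structural operators such as complement() and join() are
deliberately left out and are reported as unrecognized here.

```python
import re

# operator(location) or operator(location,"value"), e.g. replace(23..24,"at")
_OPERATOR_RE = re.compile(
    r'^(?P<op>[A-Za-z_]\w*)\((?P<loc>[^,()]+)(?:,\s*"(?P<val>[^"]*)")?\)$')
# Plain locations: 42, 23..24, 123^124, <1..>200
_LOCATION_RE = re.compile(
    r'^<?(?P<start>\d+)(?:(?:\.\.|\^)>?(?P<end>\d+))?$')
# Operators that describe the location itself rather than tagging it.
_STRUCTURAL = {"complement", "join", "order"}


def parse_location(text):
    """Return start, end and tags; raise ValueError on unknown syntax."""
    text = text.strip()
    tags = {}
    match = _OPERATOR_RE.match(text)
    if match and match.group("op") not in _STRUCTURAL:
        # Any operator name is accepted and stored as a tag; its value is
        # whatever follows the location inside the brackets (may be empty).
        tags[match.group("op")] = match.group("val") or ""
        text = match.group("loc").strip()
    loc = _LOCATION_RE.match(text)
    if loc is None:
        raise ValueError(f"unrecognized location syntax: {text!r}")
    start = int(loc.group("start"))
    end = int(loc.group("end") or start)
    return {"start": start, "end": end, "tags": tags}
```

Because the operator name is not hard-coded, a not-yet-invented operator that
follows the same operator(location,"value") shape is stored as a tag without
any change to the parser.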

So far I have been working on the main branch, but once this works I suggest
simply copying the respective files into the 06 branch. I've added
GenBank entries for testing as well, and more tests to t/SeqIO.t, which can be
copied to the 06 branch too, to check whether any basic code breaks there.

Note that I have no idea whether or not the SwissProt parser uses FTHelper.pm
as well or has its own feature parser, and likewise for all other formats
apart from GenBank and EMBL. Similarly, I didn't do anything with the code
*writing* entries, so I have no idea whether this is still correct or affected
at all.

Cheers,

	Hilmar

-- 
-----------------------------------------------------------------------
Hilmar Lapp                                      email: hlapp@gmx.net
NFI Vienna, IFD/Bioinformatics                   phone: +43 1 86634 631
A-1235 Vienna                                      fax: +43 1 86634 727
ROI: Bioinformatics (arrays, expression, seqs), Programming, Databases,
     Mountain Biking (hard tail, hard fork: feel the trail)
-----------------------------------------------------------------------
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================