Bioperl: Re: FTHelper changes

Hilmar Lapp hlapp@gmx.net
Thu, 11 May 2000 13:54:49 +0200


James Gilbert wrote:
> 
> 
> I've been looking over your new code, and have a
> few queries:
> 
> In Bio::SeqIO::FTHelper::_generic_seqfeature if
> _parse_loc fails when making sub_SeqFeatures, we
> are just skipping making a sub_SeqFeature, but I
> think we should fail to add the parent SeqFeature
> (and warn), because the SeqFeature may now be
> invalid (missing an exon or something).

Agreed.
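
A minimal sketch of that behaviour, for illustration only. The names parse_loc_sketch and build_feature_sketch are hypothetical, plain hashes stand in for Bio::SeqFeature::Generic objects, and only simple start..end ranges are assumed; this is not the FTHelper code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _parse_loc: returns a hash ref on success,
# undef (in scalar context) if the string is not a plain start..end range.
sub parse_loc_sketch {
    my ($loc) = @_;
    return unless $loc =~ /^([\d\<\>\?]+)\.\.([\d\<\>\?]+)$/;
    return { start => $1, end => $2 };
}

# Build a parent feature from its sub-locations. As soon as one
# sub-location cannot be parsed, warn and drop the whole parent feature
# instead of returning it with a part missing.
sub build_feature_sketch {
    my ($key, @sublocs) = @_;
    my @subs;
    for my $loc (@sublocs) {
        my $sub = parse_loc_sketch($loc);
        unless ($sub) {
            warn "Skipping feature '$key': cannot parse sub-location '$loc'\n";
            return;
        }
        push @subs, $sub;
    }
    return { primary => $key, subs => \@subs };
}

my $feature = build_feature_sketch('CDS', '1..10', '20..30');
print "built CDS with ", scalar @{ $feature->{subs} }, " sub-features\n" if $feature;
```

The point of returning early is that the caller never sees a parent feature that is silently missing an exon.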

> 
> From Bio::SeqIO::FTHelper::_parse_loc
> 
>     if($loc =~ /^\s*(\w+[A-Za-z])?\(?\<?(\d+)[ \W]{1,3}\>?(\d+)[,;\" ]*([A-Za-z]\w*)?\"?\)?\s*$/) {
>         #print "1 = \"$1\", 2 = \"$2\", 3 = \"$3\", 4 = \"$4\"\n";
>         $fea_type = $1 if $1;
>         $start = $2;
>         $end   = $3;
>         $tagval = $4 if $4;
>         $sf->start($start);
>         $sf->end($end);
> 
> I guess this is for catching features like "allele
> - replace".  I like the idea of having a separate
> _parse_loc subroutine, but I think we're doing too
> much work in it.  I'd like to explicitly catch
> "join", "complement", and "replace" in
> _generic_seqfeature, and have _parse_loc just make
> a Bio::SeqFeature::Generic.  This should make the
> code more explicit, and avoid us having to write
> catch-all pattern matches.
> 

Generally, I partly agree and partly don't. My suggestion would be not to
fiddle around purely for the sake of beautifying, because the SeqIO code is in
no way beautiful anyway. It needs an overhaul, and a clear description of how
and why it complies, in the most generic way, with the format definitions given
by e.g. GenBank and SwissProt.

The code you are quoting is not the current version. If you change any of the
regexps, please do not change the way locations are caught: ([\d\<\>\?]+)
This is to satisfy SwissProt as well, which may have a question mark as the
sole end or start position, indicating that the respective end is unknown.
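
To illustrate why the character class matters, here is a small sketch. The helper split_simple_loc is hypothetical, and the ".." separator is an assumption made for brevity; the real regexps in the parser are more involved.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of FTHelper: split a simple "start..end"
# location into its two position tokens using the character class above.
# Each token may contain digits, '<', '>' or '?'.
sub split_simple_loc {
    my ($loc) = @_;
    if ($loc =~ /^\s*([\d\<\>\?]+)\.\.([\d\<\>\?]+)\s*$/) {
        return ($1, $2);
    }
    return;    # empty list: not a simple range
}

for my $loc ('<1..>200', '?..45', '23..?', '23.45') {
    my @pos = split_simple_loc($loc);
    print "$loc => ",
        (@pos ? "start=$pos[0] end=$pos[1]" : "not a simple range"), "\n";
}
```

Narrowing the class to (\d+) would make the '?' and '<'/'>' cases fail to match, which is the breakage to avoid.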

Again, I'd vote against deliberately amending other people's regular
expressions where they already work and are reasonably safe. You never know
what you may break, and if it doesn't turn poorly structured code or
inflexible expressions into well-designed code or adequately flexible
expressions, I think it's mostly not worth the risk. 

>     # now that we've returned the extra location information to the outbound file
>     # remove the extra tags
> 
>     $sub->remove_tag('_part_feature');
> 
> We shouldn't do this.  Someone may want to print
> their Bio::Seq to EMBL and then to GenBank, and
> will get different features in the two files.

I haven't added or changed any code to/in printing methods.

> 
> Finally, we don't (as far as I can see) explicitly
> trap locations which we can't yet model such as:
> 
> 23.45   Fuzzy location (a single base somewhere
>         between 23 and 45 inclusive)
> 
> J00194:(100..202),1..245,300..422
>         Part of this feature is bases 100..202 of
>         entry with accession number J00194.

We do trap these, as you would see if you ran GenBank through the
parser. What happens is that warnings are issued.
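
As a rough illustration of that kind of trap (a sketch only: unsupported_loc_kind is a hypothetical name, and the two patterns are assumptions based on the two examples quoted above, not the parser's actual code):

```perl
use strict;
use warnings;

# Hypothetical classifier: name the location forms from the quoted
# examples that cannot be modelled, or return undef if neither applies.
# The remote-entry check comes first so that a versioned accession
# such as "J00194.1:" is not mistaken for a fuzzy location.
sub unsupported_loc_kind {
    my ($loc) = @_;
    return 'remote entry reference'
        if $loc =~ /\b[A-Z]+\d+(?:\.\d+)?:/;
    return 'fuzzy single-base location'
        if $loc =~ /(?<![.\d])\d+\.\d+(?![.\d])/;
    return;
}

for my $loc ('23.45', 'J00194:(100..202),1..245,300..422', '23..45') {
    my $kind = unsupported_loc_kind($loc);
    if (defined $kind) {
        warn "Cannot model location '$loc' ($kind), skipping feature\n";
    }
    else {
        print "location '$loc' looks supported\n";
    }
}
```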

> 
> In both cases I think we should fail to return a
> SeqFeature, and issue a warning.
> 

This is exactly what happens.

> I'll go ahead and make these changes myself.
> 

If you do this on the main trunk, I have to admit I'm not that happy with it.
It took me 1.5 hours last night just to understand what had happened to the
fixes I had applied before, and that the reverted changes had re-introduced
bugs I thought I'd fixed. I don't really want to spend hours every time just
to see what has happened to the fixes. I wrote that I ran the code on the
GenBank primate section, so there may be some reason to assume that I know
what is trapped and what isn't.

I don't know why but the mail dispatcher I'm connected to at home is unable to
deliver anything to sanger.ac.uk. So, maybe Ewan can forward it to you, so
that you get around the list server delay, which is really annoying for such
topics.

Cheers,

	Hilmar

-- 
-----------------------------------------------------------------------
Hilmar Lapp                                      email: hlapp@gmx.net
NFI Vienna, IFD/Bioinformatics                   phone: +43 1 86634 631
A-1235 Vienna                                      fax: +43 1 86634 727
ROI: Bioinformatics (arrays, expression, seqs), Programming, Databases,
     Mountain Biking (hard tail, hard fork: feel the trail)
-----------------------------------------------------------------------