[Bioperl-l] Bio::Location::Split

Mon Oct 16 19:00:28 UTC 2006

> -----Original Message-----
> From: Chris Fields [mailto:cjfields at uiuc.edu] 
> >
> > I'm converting CDS files from one format to another. E.g., I read an
> > EMBL file with a chromosome and CDS features, and want to output the
> > location in a FASTA header.> > 
> > I get the wrong results for multi-exon CDSs on the -1 strand, as
> > described in the bug report.
> > 
>
> Could you let me know specifically which EMBL file contains the odd
> locations?  The bug report uses theoretical locations, not 
> actual ones, so
> it would be nice to have a real-world example to test against. 

I downloaded candida glabrata chromosome B from EBI:
http://www.ebi.ac.uk/genomes/eukaryota.html, CR380948

testportal>perl location.pl new_glabrata_B.embl > bio
testportal>perl -wlne 'print $1 if /^FT\s+CDS\s+(.*)/'
new_glabrata_B.embl > nonbio
testportal>wc bio nonbio
 217  217 4537 bio
 217  217 4549 nonbio
 434  434 9086 total
testportal>diff bio nonbio
4c4
< complement(join(10632..11157,10347..10372))
---
> join(complement(10632..11157),complement(10347..10372))

Just one example here, but see below.

> As for the lack of catching this, the particular types of 
> locations that
> cause the issue are quite rare.  

Really? I guess our definitions of rare depend on which sequences we're
working with. I'm doing fungal genomes, and here's a grep for a few
species' entire genomes:

testportal>foreach i ( *.embl )
foreach? echo $i
foreach? grep CDS $i | grep join | grep -c complement
foreach? end
glabrata_orf.embl
29
hansenii_orf.embl
151
lactis_orf.embl
70
lipolytica_orf.embl
337
pombe_orf.embl
1137

You might like to use pombe as a test case, as it has lots of these
complement joins, including ones with multiple introns.

Anyway, I'd question the "rare" designation. It seems to me like any
species that has introns will have situations like this in their CDSs.
Not to mention any other sequence that uses Bio::Location::Split. (Since
I'm not a Real Biologist, I can't think up mor examples here, but I'm
sure they exist.)

Or are you saying it's rare to use join (complement(C..D),
complement(A..B)) instead of complement(join(A..B, C..D)). In that case,
I guess I just got really unlucky in that five fungal genomes I was
using decided to use the "rare" syntax. 

> Note that there are two bugs 
> for that bug
> report.  The first (and more serious) is still unresolved.  The second
> (where remote locations are treated differently in 
> Location::Split, which
> caused more problems than it was worth) had a fix committed 
> about a month
> ago.  

Sadly, it's the first (and in my case, more common (I have no remote
locations.)) bug for me.

> Any fixes I have made for the first bug invariably break several other
> methods, which use the current Location::Split object logic 
> for retrieving
> sequences, building feature strings, etc.  Since a new RC is 
> imminent and
> the bug only affects a small number of locations, I have held 
> off until
> after a final release is made (the last thing I want to do is 
> fix something
> that breaks ~6-8 other methods), but I'll try looking at it 
> again this week.

IMO this is a pretty serious bug (if these kinds of sequences aren't
that rare as I've shown above), because you're outputting sequence
descriptions that are just plain wrong. Anyone who uses
FTLocationFactory to read these output description will have incorrect
sequence, incorrect translated proteins, etc. And it's even more serious
if other methods are depending on it.

I know I can't dictate your time, and should be volunteering to work on
fixing it. But if it affects other modules, then I will no doubt break
things even more than you have in your attempts.  

-Amir

> Christopher Fields
> Postdoctoral Researcher - Switzer Lab
> Dept. of Biochemistry
> University of Illinois Urbana-Champaign
> 
>