[Bioperl-l] split location problems

Chris Fields cjfields at uiuc.edu
Tue Oct 17 19:05:59 UTC 2006

> > From: Jason Stajich [mailto:jason.stajich at gmail.com]
> >
> > The whole point of split locations is to represent genes with
> > introns
> > so that is not the "rare" case.
> Absolutely.

Right, but that specific kind of join statement is not commonly used in
GenBank files, and GenBank seems to be the predominantly used format (no
offense to EBI).  This may explain why we haven't seen this pop up more often.

I believe what we're seeing is a difference in the way these locations are
described at NCBI vs EBI, which Nadeem Faruque seems to corroborate.  He
indicated that EBI may move to using similar GenBank-like location strings.
Regardless, FTLocationFactory and Bio::Location::Split should handle both
forms if they are present, but they only seem to like the GenBank version.
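
To make the equivalence of the two forms concrete, here is a minimal,
standalone sketch in Python.  It is not BioPerl code and not how
FTLocationFactory works internally; the function names are made up for
illustration, and it only understands plain start..end spans, join() and
complement().  It reduces both styles to the same list of
(start, end, strand) segments in transcription order:

```python
import re

SPAN = re.compile(r"^(\d+)\.\.(\d+)$")


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def parse_location(text):
    """Return [(start, end, strand), ...] in transcription order.

    Hypothetical helper: handles only simple spans, join() and complement().
    """
    text = text.strip()
    if text.startswith("complement(") and text.endswith(")"):
        inner = parse_location(text[len("complement("):-1])
        # Complementing a whole location flips every strand and
        # reverses the order in which the segments are read.
        return [(s, e, -strand) for s, e, strand in reversed(inner)]
    if text.startswith("join(") and text.endswith(")"):
        segments = []
        for part in split_top_level(text[len("join("):-1]):
            segments.extend(parse_location(part))
        return segments
    match = SPAN.match(text)
    if not match:
        raise ValueError("unsupported location string: " + text)
    return [(int(match.group(1)), int(match.group(2)), 1)]


genbank_style = "complement(join(10347..10372,10632..11157))"
embl_style = "join(complement(10632..11157),complement(10347..10372))"
same = parse_location(genbank_style) == parse_location(embl_style)
```

With the CAGL0B00242g CDS from this thread, both strings come out as the
same two minus-strand segments, 10632..11157 followed by 10347..10372.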

> > I've added code to test this to bug 2101 including a C.glabrata
> > chromosome downloaded from genbank.  Perhaps the problem is on the
> > EMBL parsing side, I didn't test that.
> Well, I don't know whether it's EMBL parsing, or a bit further down the
> pipe, but I downloaded C.glabrata chromosome B for GenBank (NC_005968),
> and it describes the complement/joins in the way that Bioperl is
> handling correctly.
> GenBank:
>      CDS             complement(join(10347..10372,10632..11157))
>                      /locus_tag="CAGL0B00242g"
> FT   CDS             join(complement(10632..11157),complement(10347..10372))
> FT                   /locus_tag="CAGL0B00242g"

Yes, something that I found out independently (and corroborated by Nadeem).

> Here's the diff when I run the location-printing script I posted
> yesterday:
> diff biogb bio
> 1c1,5
> < complement(join(10347..10372,10632..11157))
> ---
> > complement(1701..2651)
> > complement(2635..3345)
> > complement(3980..4408)
> > complement(join(10632..11157,10347..10372))
> > 10379..10615
> 209a214,217
> > 498198..498890
> > 499712..500062
> > 499851..500702
> > 500579..501364
> As you can see, the complement/join CDS is written out in a different
> order, which is Bad.

I think this can be handled directly in to_FTstring().  I'll have to add a
method to get the strand info from the Split object w/o going through

However, I'm thinking about trying a different tack which is a bit simpler
and, if it proves fruitful, may simplify Split locations somewhat.  It won't
be ready for 1.5.2 but maybe the next release.
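
As a rough illustration of the output-ordering issue with to_FTstring()
mentioned above, here is a small Python sketch (again not BioPerl code; the
function name and the (start, end, strand) tuple layout are assumptions made
for the example).  When every segment is on the minus strand, it writes the
GenBank-style complement(join(...)) with the spans in ascending coordinate
order; otherwise it complements each minus-strand span individually:

```python
def to_genbank_string(segments):
    """Serialize [(start, end, strand), ...] given in transcription order.

    Hypothetical helper for illustration only.
    """
    if not segments:
        raise ValueError("no segments to serialize")

    def span(start, end):
        return "{}..{}".format(start, end)

    strands = {strand for _, _, strand in segments}
    if strands == {-1}:
        # All segments are on the minus strand: list the spans in
        # ascending coordinate order inside one outer complement().
        spans = [span(s, e) for s, e, _ in reversed(segments)]
        body = spans[0] if len(spans) == 1 else "join(" + ",".join(spans) + ")"
        return "complement(" + body + ")"

    # Plus-strand or mixed-strand: keep the order, wrap minus spans one by one.
    parts = [
        span(s, e) if strand == 1 else "complement(" + span(s, e) + ")"
        for s, e, strand in segments
    ]
    return parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"


cds = [(10632, 11157, -1), (10347, 10372, -1)]
text = to_genbank_string(cds)
```

For the CDS in Amir's diff this gives the order found in the GenBank file
rather than the reversed one.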

> (I looked at at least one of the other differences: the GB file says
> it's a "misc feature" and EMBL says it's a CDS. But they don't seem to
> be relevant here.)
> -Amir

Probably not but something to keep in mind.

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign
