[Bioperl-l] split location problems

Tue Oct 17 16:53:19 UTC 2006

> From: Jason Stajich [mailto:jason.stajich at gmail.com]
> 
> The whole point of split locations is to represent genes with introns
> so that is not the "rare" case.

Absolutely.

> I have processed the genbank fungal genomes into GFF3 and have had no
> problems so I'm confused where you are breaking down.  If I write
> them out as embl I also get the correct thing.  This is using the CVS
> version of bioperl from the HEAD.
> 
> I've added code to test this to bug 2101 including a C.glabrata
> chromosome downloaded from genbank.  Perhaps the problem is on the
> EMBL parsing side, I didn't test that.

Well, I don't know whether it's EMBL parsing, or a bit further down the
pipe, but I downloaded C.glabrata chromosome B from GenBank (NC_005968),
and it writes the complement/joins in the form that Bioperl handles
correctly.

GenBank:
     CDS             complement(join(10347..10372,10632..11157))
                     /locus_tag="CAGL0B00242g"

EMBL:
FT   CDS             join(complement(10632..11157),complement(10347..10372))
FT                   /locus_tag="CAGL0B00242g"

Here's the diff when I run the location-printing script I posted
yesterday:

diff biogb bio
1c1,5
< complement(join(10347..10372,10632..11157))
---
> complement(1701..2651)
> complement(2635..3345)
> complement(3980..4408)
> complement(join(10632..11157,10347..10372))
> 10379..10615
209a214,217
> 498198..498890
> 499712..500062
> 499851..500702
> 500579..501364

As you can see, the sub-locations of the complement/join CDS are written
out in a different order, which is Bad.
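
To make the ordering issue concrete: in the feature-table notation,
complement(join(a,b)) and join(complement(b),complement(a)) describe the
same exons in the same 5'->3' order, so the GenBank and EMBL lines above
agree with each other, while complement(join(b,a)) does not. The sketch
below is plain Python, not Bioperl code and not the location-printing
script mentioned above. The helper names parse_location and
split_top_level are made up for illustration, and it only understands
simple start..end ranges, join() and complement().

```python
import re


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_location(text):
    """Return exons as (start, end, strand) tuples in 5'->3' order.

    Hypothetical helper: handles only start..end, join(...) and
    complement(...), which is enough for the example in this thread.
    """
    text = text.strip()
    if text.startswith("complement(") and text.endswith(")"):
        inner = parse_location(text[len("complement("):-1])
        # Complementing flips every strand and reverses the exon order.
        return [(s, e, -strand) for (s, e, strand) in reversed(inner)]
    if text.startswith("join(") and text.endswith(")"):
        exons = []
        for part in split_top_level(text[len("join("):-1]):
            exons.extend(parse_location(part))
        return exons
    match = re.fullmatch(r"(\d+)\.\.(\d+)", text)
    if match is None:
        raise ValueError("unsupported location: " + text)
    return [(int(match.group(1)), int(match.group(2)), 1)]


# The three strings that appear in this thread.
genbank = parse_location("complement(join(10347..10372,10632..11157))")
embl = parse_location(
    "join(complement(10632..11157),complement(10347..10372))")
written = parse_location("complement(join(10632..11157,10347..10372))")

print(genbank == embl)     # the two database forms agree
print(genbank == written)  # the form in the diff output does not
```

Comparing the parsed exon lists rather than the raw strings sidesteps
the purely cosmetic difference between the two database styles and
leaves only the real change in exon order.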

(I checked at least one of the other differences: the GB file says
it's a "misc feature" and EMBL says it's a CDS. The other differences
don't seem to be relevant here.)

-Amir

> 
> On the technical side, I still am not sure I fully know where the
> strand information should be stored - the top level container or the
> sub-features.  I'll try and stay up on the discussion if anything has
> been decided that I should know about.
> 
> -jason