[Bioperl-l] Getting GFF output in UCSC-specific format

Sun Feb 11 17:47:03 UTC 2007

Good morning folks,

I've got sort of a newbie question regarding how to get gff's out of 
Bio::Tools:GFF objects that are formatted according to the UCSC browser 
conventions, described here:

http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
(Ignore the custom track headers and what-not. I just need the fields to 
be set up according to the descriptions in 1 - 9).

The write_feature($feature) method isn't doing it for me, as I get lines 
like the following (newlines excepted):

chr1    EMBL/GenBank/SwissProt  gene    1712    2848    .       +       
.       db_xref=GeneID:4063728;gene=dnaN;locus_tag=MGAS10270_Spy0002
chr1    EMBL/GenBank/SwissProt  CDS     1712    2848    .       +       
.       
EC_number=2.7.7.7;codon_start=1;db_xref=GI:94989511,GeneID:4063728;gene=dnaN;locus_tag=MGAS10270_Spy0002;product=DNA+polymerase+III%2C+beta+chain;protein_
id=YP_597611.1;transl_table=11;translation=MIQFSINRTLFIHALNATKRAISTKNAIPILSSIKIEVTSTGVTLTGSNGQISIENTIPVSNENAGLLITSPGAILLEASFFINIISSLPDISINVKEIEQHQVVLTSGKSEITLKGKDVDQYPRLQEVSTENPLILKTKLLKSIIAETAFAASLQESRPILTGVHIVLSNHKDFKAVATDSHRMSQRLIT
LENTSADFDVVIPSKSLREFSAVFTDDIETVEVFFSPSQILFRSEHISFYTRLLEGNYPDTDRLLMTEFETEVVFNTQSLRHAMERAFLISNATQNGTVKLEITQNHISAHVNSPEVGKVNEDLDIVSQSGSDLTISFNPTYLIESLKAIKSETVKIHFLSPVRPFTLTPGDEEESFIQLITPVRTN

As you can see, field 8, which should be frame according to UCSC 
conventions is blank, and field 9, group according to UCSC, has frame, 
along with ID, etc. All this extra stuff causes the UCSC browser to 
choke. First off, it can't identify which features are the same (it does 
this by matching the group field), and second, it can't interpret the 
CDS's into translated proteins because it lacks frame data.

Basically what I need to do is, for CDS features, extract frame (or 
codon_start, as it were), from the last field, parse out the integer 
value and store that in field 8 (as frame), then parse out locus_tag 
from the last field, clear out everything else and store the locus_tag 
only in that field (preferably without the qualifier locus_tag=). For 
feature type gene, I just want to do the last step, so that the gene and 
CDS features for the same feature have matching group fields that are as 
simple as possible. Let me know if this is not clear.

The way I've been trying to do this is by stringifying each gff object, 
splitting into an array, @tmp1, splitting @tmp1[8] into @tmp2 with the 
following code:  my @tmp2 = split /\;\, $tmp1[8]; and finally, trying to 
parse out the bits I need with regular expressions and store back to 
@tmp1[n].  -- This does not work, because perl wants to interpret every 
/ + etc. as a metacharacter!

I am assuming there's a simple way to get at each value in the last 
field of the gff object using methods supplied by Bio::Tools::GFF, but 
the API docs seem a bit lacking in this area. Could anyone steer me 
towards what I need to know to do this? Please let me know if I can 
clarify any details!

Cheers,
Adam Diehl