[Bioperl-l] Getting GFF output in UCSC-specific format
Adam Diehl
agd27 at cornell.edu
Sun Feb 11 17:47:03 UTC 2007
Good morning folks,
I've got sort of a newbie question regarding how to get gff's out of
Bio::Tools:GFF objects that are formatted according to the UCSC browser
conventions, described here:
http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
(Ignore the custom track headers and what-not. I just need the fields to
be set up according to the descriptions in 1 - 9).
The write_feature($feature) method isn't doing it for me, as I get lines
like the following (newlines excepted):
chr1 EMBL/GenBank/SwissProt gene 1712 2848 . +
. db_xref=GeneID:4063728;gene=dnaN;locus_tag=MGAS10270_Spy0002
chr1 EMBL/GenBank/SwissProt CDS 1712 2848 . +
.
EC_number=2.7.7.7;codon_start=1;db_xref=GI:94989511,GeneID:4063728;gene=dnaN;locus_tag=MGAS10270_Spy0002;product=DNA+polymerase+III%2C+beta+chain;protein_
id=YP_597611.1;transl_table=11;translation=MIQFSINRTLFIHALNATKRAISTKNAIPILSSIKIEVTSTGVTLTGSNGQISIENTIPVSNENAGLLITSPGAILLEASFFINIISSLPDISINVKEIEQHQVVLTSGKSEITLKGKDVDQYPRLQEVSTENPLILKTKLLKSIIAETAFAASLQESRPILTGVHIVLSNHKDFKAVATDSHRMSQRLIT
LENTSADFDVVIPSKSLREFSAVFTDDIETVEVFFSPSQILFRSEHISFYTRLLEGNYPDTDRLLMTEFETEVVFNTQSLRHAMERAFLISNATQNGTVKLEITQNHISAHVNSPEVGKVNEDLDIVSQSGSDLTISFNPTYLIESLKAIKSETVKIHFLSPVRPFTLTPGDEEESFIQLITPVRTN
As you can see, field 8, which should be frame according to UCSC
conventions is blank, and field 9, group according to UCSC, has frame,
along with ID, etc. All this extra stuff causes the UCSC browser to
choke. First off, it can't identify which features are the same (it does
this by matching the group field), and second, it can't interpret the
CDS's into translated proteins because it lacks frame data.
Basically what I need to do is, for CDS features, extract frame (or
codon_start, as it were), from the last field, parse out the integer
value and store that in field 8 (as frame), then parse out locus_tag
from the last field, clear out everything else and store the locus_tag
only in that field (preferably without the qualifier locus_tag=). For
feature type gene, I just want to do the last step, so that the gene and
CDS features for the same feature have matching group fields that are as
simple as possible. Let me know if this is not clear.
The way I've been trying to do this is by stringifying each gff object,
splitting into an array, @tmp1, splitting @tmp1[8] into @tmp2 with the
following code: my @tmp2 = split /\;\, $tmp1[8]; and finally, trying to
parse out the bits I need with regular expressions and store back to
@tmp1[n]. -- This does not work, because perl wants to interpret every
/ + etc. as a metacharacter!
I am assuming there's a simple way to get at each value in the last
field of the gff object using methods supplied by Bio::Tools::GFF, but
the API docs seem a bit lacking in this area. Could anyone steer me
towards what I need to know to do this? Please let me know if I can
clarify any details!
Cheers,
Adam Diehl
More information about the Bioperl-l
mailing list