[Bioperl-l] Getting GFF output in UCSC-specific format
Jason Stajich
jason at bioperl.org
Sun Feb 11 23:29:16 UTC 2007
I assume you are getting your features from a Bio::SeqIO parse of a
Genbank file?
you get back a Bio::SeqFeature::Generic objects so you want to look
at the docs for that module to see what the API is.
you will need to set frame via
$feature->frame($frame)
You are going to have to determine the frame yourself if that isn't
part of the feature, we don't calculate it for you.
For the 9th column, this is available through the tags methods
has_tag, add_tag_values, get_tag_values, get_all_tags, and remove_tag
so you can remove all the tags you don't want through remove_tag (or
if you want to remove them all)
my $locus;
for my $tag ( $feature->get_all_tags ) {
if( $tag eq 'locus_tag' ) { # save the locus_tag when we see it
($locus) = $feature->get_tag_values($tag);
}
$feature->remove_tag($tag);
}
You will also want to set the GFF format when you call
Bio::Tools::GFF - I think the UCSC site is only supporting GFF1, I
don't know exactly how you set the tag then when they aren't paired
with key=>value, you'll need to set the tag to 'group' so
$feature->add_tag_value('group', $locus);
If this is all unsatistfactory you can easily write your own GFF
write for your flavor of the data with the
print join("\t",
$feat->seq_id,
$feat->source_tag,
$feat->primary_tag,
$feat->start,
$feat->end,
$feat->score,
$feat->strand > 0 ? '+' : '-',
$feat->frame,
$locus), "\n";
-jason
On Feb 11, 2007, at 9:47 AM, Adam Diehl wrote:
> Good morning folks,
>
> I've got sort of a newbie question regarding how to get gff's out of
> Bio::Tools:GFF objects that are formatted according to the UCSC
> browser
> conventions, described here:
>
> http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
> (Ignore the custom track headers and what-not. I just need the
> fields to
> be set up according to the descriptions in 1 - 9).
>
> The write_feature($feature) method isn't doing it for me, as I get
> lines
> like the following (newlines excepted):
>
> chr1 EMBL/GenBank/SwissProt gene 1712 2848 . +
> . db_xref=GeneID:4063728;gene=dnaN;locus_tag=MGAS10270_Spy0002
> chr1 EMBL/GenBank/SwissProt CDS 1712 2848 . +
> .
> EC_number=2.7.7.7;codon_start=1;db_xref=GI:94989511,GeneID:
> 4063728;gene=dnaN;locus_tag=MGAS10270_Spy0002;product=DNA+polymerase
> +III%2C+beta+chain;protein_
> id=YP_597611.1;transl_table=11;translation=MIQFSINRTLFIHALNATKRAISTKNA
> IPILSSIKIEVTSTGVTLTGSNGQISIENTIPVSNENAGLLITSPGAILLEASFFINIISSLPDISINVK
> EIEQHQVVLTSGKSEITLKGKDVDQYPRLQEVSTENPLILKTKLLKSIIAETAFAASLQESRPILTGVHI
> VLSNHKDFKAVATDSHRMSQRLIT
> LENTSADFDVVIPSKSLREFSAVFTDDIETVEVFFSPSQILFRSEHISFYTRLLEGNYPDTDRLLMTEFE
> TEVVFNTQSLRHAMERAFLISNATQNGTVKLEITQNHISAHVNSPEVGKVNEDLDIVSQSGSDLTISFNP
> TYLIESLKAIKSETVKIHFLSPVRPFTLTPGDEEESFIQLITPVRTN
>
> As you can see, field 8, which should be frame according to UCSC
> conventions is blank, and field 9, group according to UCSC, has frame,
> along with ID, etc. All this extra stuff causes the UCSC browser to
> choke. First off, it can't identify which features are the same (it
> does
> this by matching the group field), and second, it can't interpret the
> CDS's into translated proteins because it lacks frame data.
>
> Basically what I need to do is, for CDS features, extract frame (or
> codon_start, as it were), from the last field, parse out the integer
> value and store that in field 8 (as frame), then parse out locus_tag
> from the last field, clear out everything else and store the locus_tag
> only in that field (preferably without the qualifier locus_tag=). For
> feature type gene, I just want to do the last step, so that the
> gene and
> CDS features for the same feature have matching group fields that
> are as
> simple as possible. Let me know if this is not clear.
>
> The way I've been trying to do this is by stringifying each gff
> object,
> splitting into an array, @tmp1, splitting @tmp1[8] into @tmp2 with the
> following code: my @tmp2 = split /\;\, $tmp1[8]; and finally,
> trying to
> parse out the bits I need with regular expressions and store back to
> @tmp1[n]. -- This does not work, because perl wants to interpret
> every
> / + etc. as a metacharacter!
>
> I am assuming there's a simple way to get at each value in the last
> field of the gff object using methods supplied by Bio::Tools::GFF, but
> the API docs seem a bit lacking in this area. Could anyone steer me
> towards what I need to know to do this? Please let me know if I can
> clarify any details!
>
> Cheers,
> Adam Diehl
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
Jason Stajich
Miller Research Fellow
University of California, Berkeley
lab: 510.642.8441
http://pmb.berkeley.edu/~taylor/people/js.html
http://fungalgenomes.org/
More information about the Bioperl-l
mailing list