[BioRuby] GFF3

Pjotr Prins pjotr.public14 at thebird.nl
Sun Jan 2 12:04:48 UTC 2011


The GFF3 plugin works rather well. Anyone with Ruby 1.9.x installed
can simply run, as a regular user:

  gem install bio-gff3

which also installs BioRuby itself, if needed. Next you can run,
for example,

  gff3-fetch mRNA test/data/gff/MhA1_Contig1133.fa test/data/gff/MhA1_Contig1133.gff3

to assemble all mRNA sequences.

Unfortunately, I am finding some problems with the data. For example,
the reading frame is *wrong* for a predicted gene in this WormBase
data file. The contig starts with (the spaces fall at the gene
boundaries, positions 192 and 346):

>MhA1_Contig3426
TTAATAAATTTAATTCATTAAAATTTTAAAAAGAAAGGGACATTCGAGGGGAAATGAGAGAGAACGAGAGAAAATGGACG
GGAAATTAAATTAAAAAATAAAAAATTAATTTTTATTTTTTTTTATTTAATTTAAAATTAATTTTCTACATTTATTAAAT
CTTAAATTATTAATTTTAAATTAATTTAAAG GCATCCAACAACAACAATTAGAAGTCTTTCCCAGCTCCTCCTCTGCCCC
TCAGCAACAACAATACCCAGCGCAGCAGCTTCAATTAGTTACTCCTTTTATTGCATGCATAGCAGATGAATTGAGGGAGT
TGATAGATGAAATGCGTATGTTTTAG AATATTTTTTAAAAAAAAATTAAAAAAAATTTTTTTTTGCCAAACAGGCTCTCG

and the full record is:

##gff-version 3
##sequence-region MhA1_Contig3426 1 2029
# Gene gene:MhA1_Contig3426.frz3.gene1
MhA1_Contig3426 WormBase        gene    192     346     .       +       .       ID=gene:MhA1_Contig3426.frz3.gene1;Name=MhA1_Contig3426.frz3.gene1;Note=PREDICTED protein_coding;public_name=MhA1_Contig3426.frz3.gene1
MhA1_Contig3426 WormBase        mRNA    192     346     .       +       .       ID=transcript:MhA1_Contig3426.frz3.gene1;Parent=gene:MhA1_Contig3426.frz3.gene1;Name=MhA1_Contig3426.frz3.gene1;public_name=MhA1_Contig3426.frz3.gene1
MhA1_Contig3426 WormBase        exon    192     346     .       +       .       ID=exon:MhA1_Contig3426.frz3.gene1.1;Parent=transcript:MhA1_Contig3426.frz3.gene1
MhA1_Contig3426 WormBase        CDS     192     346     .       +       0       ID=cds:MhA1_Contig3426.frz3.gene1;Parent=transcript:MhA1_Contig3426.frz3.gene1

So the feature is on the forward strand, starts at 192, and has CDS phase 0. The actual sequence is

GCATCCAACA ACAACAATTA GAAGTCTTTC CCAGCTCCTC CTCTGCCCCT CAGCAACAAC AATACCCAGC GCAGCAGCTT
CAATTAGTTA CTCCTTTTAT TGCATGCATA GCAGATGAAT TGAGGGAGTT GATAGATGAA ATGCGTATGT TTTAG

which translates to a valid protein only in frame 2(!), that is, after
skipping the first two bases. This is not compliant with GFF3 under
any interpretation. It turns out that, for this particular GFF3 file,
this happens only with the *first* ORF on every contig, so it is
probably a bug in the gene predictor used. None of the other genes are
in the wrong frame.
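
The frame claim can be checked by hand with a few lines of plain Ruby.
This is only a minimal sketch, not the plugin's code: the names
STOP_CODONS, codons and open_frame? are made up for illustration, and
it assumes the standard genetic code and forward-strand sequence only.
It splits the CDS quoted above into codons at each of the three
offsets and reports whether a stop codon appears before the last codon.

```ruby
# Stop codons of the standard genetic code (assumption: other
# translation tables use different stops).
STOP_CODONS = %w[TAA TAG TGA].freeze

# The 155 bases of the CDS at 192..346, as quoted in this post.
CDS = %w[
  GCATCCAACA ACAACAATTA GAAGTCTTTC CCAGCTCCTC CTCTGCCCCT CAGCAACAAC
  AATACCCAGC GCAGCAGCTT CAATTAGTTA CTCCTTTTAT TGCATGCATA GCAGATGAAT
  TGAGGGAGTT GATAGATGAA ATGCGTATGT TTTAG
].join

# Split seq into complete codons, starting `offset` bases in
# (offset is 0, 1 or 2); an incomplete trailing codon is dropped.
def codons(seq, offset)
  seq[offset..-1].scan(/.{3}/)
end

# True when no stop codon occurs before the final codon.
def open_frame?(seq, offset)
  internal = codons(seq, offset)[0...-1]
  (internal & STOP_CODONS).empty?
end

(0..2).each do |offset|
  status = open_frame?(CDS, offset) ? 'no internal stop' : 'internal stop'
  puts "offset #{offset}: #{status}"
end
```

Only offset 2 comes out free of internal stops, and in that frame the
last codon is TAG.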

I informed WormBase some time ago, but I don't have the impression
that anyone is interested. You can check the record's contents at

  http://www.wormbase.org/db/gb2/gbrowse/m_hapla/?name=id:2258995;dbid=m_hapla:database

I am going to add an option to the GFF3 plugin to test for valid
reading frames, so that these files give the expected results. It
would be good for validation anyway.
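
Roughly, such a test could compare the phase declared in a CDS record
with the phases that actually avoid an internal stop. The following is
a sketch under assumptions, not the option as it will be implemented:
stop_free_phases and check_cds are hypothetical helpers, only the +
strand and single-exon CDS features are handled, the standard genetic
code is assumed, and the GFF3 line is rebuilt with tabs because the
mail shows it with spaces.

```ruby
STOPS = %w[TAA TAG TGA].freeze  # standard genetic code (assumption)

# First 400 bases of MhA1_Contig3426, as quoted in this post.
CONTIG = %w[
  TTAATAAATTTAATTCATTAAAATTTTAAAAAGAAAGGGACATTCGAGGGGAAATGAGAGAGAACGAGAGAAAATGGACG
  GGAAATTAAATTAAAAAATAAAAAATTAATTTTTATTTTTTTTTATTTAATTTAAAATTAATTTTCTACATTTATTAAAT
  CTTAAATTATTAATTTTAAATTAATTTAAAGGCATCCAACAACAACAATTAGAAGTCTTTCCCAGCTCCTCCTCTGCCCC
  TCAGCAACAACAATACCCAGCGCAGCAGCTTCAATTAGTTACTCCTTTTATTGCATGCATAGCAGATGAATTGAGGGAGT
  TGATAGATGAAATGCGTATGTTTTAGAATATTTTTTAAAAAAAAATTAAAAAAAATTTTTTTTTGCCAAACAGGCTCTCG
].join

# The CDS record from this post, with real tab separators.
CDS_LINE = ["MhA1_Contig3426", "WormBase", "CDS", "192", "346", ".", "+", "0",
            "ID=cds:MhA1_Contig3426.frz3.gene1;" \
            "Parent=transcript:MhA1_Contig3426.frz3.gene1"].join("\t")

# Hypothetical helper: phases (0..2) for which the feature has no stop
# codon before its last complete codon.
def stop_free_phases(contig, start, stop)
  seq = contig[(start - 1)..(stop - 1)]  # GFF3 coordinates are 1-based, inclusive
  (0..2).select do |phase|
    codons = seq[phase..-1].scan(/.{3}/)
    (codons[0...-1] & STOPS).empty?
  end
end

# Hypothetical helper: parse one GFF3 CDS line and report whether the
# declared phase is among the stop-free ones.
def check_cds(contig, line)
  f = line.chomp.split("\t")
  start, stop, strand, phase = f[3].to_i, f[4].to_i, f[6], f[7].to_i
  raise ArgumentError, "this sketch handles the + strand only" unless strand == "+"
  valid = stop_free_phases(contig, start, stop)
  { declared: phase, stop_free: valid, ok: valid.include?(phase) }
end

p check_cds(CONTIG, CDS_LINE)
```

For the record above, the declared phase 0 is not among the stop-free
phases, so the record would be flagged. A real check would also need to
handle the reverse strand and CDS features split over several exons.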

Pj.




