[BioRuby] Problem with Bio::GFF::GFF2

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Wed Jun 10 06:14:30 UTC 2009


On Tue, 9 Jun 2009 17:24:38 +0300
George Githinji <georgkam at gmail.com> wrote:

> Thank you so much Naohisa for the excellent explanation!!
> however
> 
> bep_gff.records.each do |record|
>    p record.seqname
> end
> 
> returns
> "seq1   bepipred-1.0b epitope          1     1   0.173  . .   ."
> 
> 
> which is not what is intended and
> record.score, record.start etc all return nil.

This does not seem to be valid GFF2.
In GFF formats, the field delimiter must be a TAB ("\t" in Ruby).
However, in the data above, the characters between "seq1" and
"bepipred-1.0b" appear to be spaces (" " in Ruby) instead of a TAB,
so the whole line is taken as the seqname and the other fields are nil.

Copying and pasting from a terminal or web browser, or the
autocomplete function of a text editor or word processor,
can often produce this kind of degraded data.
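
As a rough illustration, the delimiter can be checked and repaired in
plain Ruby before the text is parsed. The method names tab_delimited? and
retab_gff2_line are made up for this sketch and are not part of BioRuby.
The sketch assumes that none of the first eight fields contains a space,
which holds for the data above but is not guaranteed for GFF2 in general.

```ruby
# Hypothetical helpers, not part of BioRuby.

# True if the line has at least the 8 mandatory GFF2 fields separated by TABs.
def tab_delimited?(line)
  line.chomp.split("\t", -1).size >= 8
end

# Rewrites a whitespace-delimited record line as TAB-delimited.
# Comment lines (starting with "#") and blank lines are returned unchanged.
# Assumption: none of the first eight fields contains a space.
def retab_gff2_line(line)
  return line if line.start_with?('#') || line.strip.empty?
  # A limit of 9 keeps any spaces inside a trailing "attributes" field.
  line.strip.split(/\s+/, 9).join("\t")
end

broken = "seq1   bepipred-1.0b epitope          1     1   0.173  . .   ."
fixed  = retab_gff2_line(broken)

p tab_delimited?(broken)   # false
p tab_delimited?(fixed)    # true
p fixed.split("\t")
```

The repaired lines, joined with "\n", should then be usable as the string
given to Bio::GFF::GFF2.new.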

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


> 
> :(
> 
> 
> 
> 
> 
> On Tue, Jun 9, 2009 at 4:44 PM, Naohisa GOTO
> <ngoto at gen-info.osaka-u.ac.jp>wrote:
> 
> > Hi George,
> >
> > On Tue, 9 Jun 2009 15:26:45 +0300
> > George Githinji <georgkam at gmail.com> wrote:
> >
> > > Hi all,
> > > I am trying to parse a GFF file. The file looks like this:
> > >
> > > ##gff-version 2
> > > ##source-version bepipred-1.0b
> > > ##date 2009-06-09
> > > ##Type Protein seq1
> > > # seqname            source        feature      start   end   score  N/A   ?
> > > # ---------------------------------------------------------------------------
> > > seq1   bepipred-1.0b epitope          1     1   0.173  . .   .
> > > seq1   bepipred-1.0b epitope          2     2  -0.043  . .   .
> > > seq1  bepipred-1.0b epitope          3     3  -0.014  . .   .
> > > seq1   bepipred-1.0b epitope          4     4   0.144  . .   .
> > > seq1   bepipred-1.0b epitope          5     5   0.250  . .   .
> > > seq1   bepipred-1.0b epitope          6     6   0.218  . .   .
> > >
> > > ....truncated
> >
> > The above GFF records do not contain any "attributes".
> > The field definition of each GFF line is:
> > <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
> > [attributes] [comments]
> >
> > When talking about GFF, the word "attributes" refers to the
> > "attributes" field in each GFF line.
> >
> > See the GFF2 specifications document for details.
> > http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
> >
> > > and I have written the following lines with the aim of extracting the
> > > start, end and score attributes. But before that I wanted to know whether
> > > the full attributes are available, so I did the following.
> > >
> > > require 'rubygems'
> > > require 'bio'
> > > bep_gff = Bio::GFF::GFF2.new(File.open('/home/george/bpred.gff'))
> > >
> > >  bep_gff.records.each do |record|
> > >     puts record.attributes_to_hash.inspect
> > > end
> > >
> > > However, I get empty hashes.
> > > Any ideas?
> >
> > Because the Bio::GFF::GFF2::Record#attributes_to_hash method returns
> > the "attributes" field as a hash, and the "attributes" fields in all of
> > the above GFF2 records are empty, getting empty hashes is the expected
> > result.
> >
> > If you really want a hash, adding each field into a hash would
> > be the easiest way. For example,
> >
> >  bep_gff.records.each do |record|
> >    h = {}
> >    h['seqname']    = record.seqname
> >    h['source']     = record.source
> >    h['feature']    = record.feature
> >    h['start']      = record.start
> >    h['end']        = record.end
> >    h['score']      = record.score
> >    h['strand']     = record.strand
> >    h['frame']      = record.frame
> >    h['attributes'] = record.attributes_to_hash
> >    p h
> >  end
> >
> > Bio::GFF::GFF2::Record has seqname, source, feature, start, end,
> > score, strand, and frame attributes (in the Ruby sense of the word),
> > which are inherited from the Bio::GFF::Record class.
> > Normally, it is more natural to use these (Ruby) attributes
> > directly, without creating a hash.
> >
> > Note that using attributes_to_hash may lose some data when
> > there are two or more values with the same tag name in an
> > "attributes" field.
> >
> > When creating new data that uses "attributes" extensively,
> > GFF3 is recommended, because the design of GFF2 attributes is
> > somewhat broken.
> >
> > > Thank you
> > >
> > >
> > > --
> > > ---------------
> > > Sincerely
> > > George
> > >
> > > Skype: george_g2
> > > Blog: http://biorelated.wordpress.com/
> >
> > Your blog is nice!
> >
> > --
> > Naohisa Goto
> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> >
> 
> 
> 
> -- 
> ---------------
> Sincerely
> George
> 
> Skype: george_g2
> Blog: http://biorelated.wordpress.com/
> 




