[BioRuby] GFF attributes

Fri Sep 5 09:21:02 UTC 2008

Hi,

When extracting attributes from a GFF file,
the older implementation seems to have eaten the last character before ";".
The current one (downloaded very recently from github) does not split correctly,
because the regular expression is greedy and matches up to the last ";".
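
To make the difference concrete, here is a minimal sketch that uses only
Ruby's standard StringScanner. The helper name split_with is made up for
illustration and is not part of BioRuby; the two patterns are the ones
from the current code and from the patch at the end of this mail.

```ruby
require 'strscan'

# Hypothetical helper (not BioRuby API): repeatedly scan one attribute
# with the given pattern, falling back to "the rest of the string".
def split_with(pattern, attributes)
  scanner = StringScanner.new(attributes)
  parts = []
  while scanner.scan(pattern) || scanner.scan(/(.+)/)
    parts << scanner[1].strip
  end
  parts
end

attrs = 'name "grail3.0116000101"; proteinId 639579; exonNumber 3'

# Current pattern: ".*" is greedy, so the first match runs to the last ";".
greedy = split_with(/(.*[^\\])\;/, attrs)

# Patched pattern: the match cannot cross an unescaped ";".
patched = split_with(/(([^;]|\\;)*[^\\])\;/, attrs)

p greedy
p patched
```

With this input the greedy pattern should yield two pieces and the
patched one three, which matches the outputs shown further down.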

A patch is included, but I am not sure what the specification requires.
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
The specification says:
> From version 2 onwards, the attribute field must have an tag value  
> structure following the syntax used within objects in a .ace file,  
> flattened onto one line by semicolon separators. Tags must be  
> standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must  
> be quoted with double quotes. Note: all non-printing characters in  
> such free text value strings (e.g. newlines, tabs, control  
> characters, etc) must be explicitly represented by their C (UNIX)  
> style backslash-escaped representation (e.g. newlines as '\n', tabs  
> as '\t').


So, it seems that for proper parsing, double-quoted free text
has to be recognized: a semicolon inside such a quotation is not
an attribute separator, and it need not be preceded by a backslash.
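
A rough sketch of what such quote-aware splitting could look like is
below. It is not what the patch does, and the method names
split_gff2_attributes and parse_gff2_attributes are hypothetical, not
BioRuby API. It also assumes that a backslash inside quotes escapes the
next character (so \" does not close the quotation), which is my reading
of the specification rather than something I have verified.

```ruby
# Hypothetical helper: split a GFF2 attribute field on semicolons,
# ignoring semicolons that appear inside double-quoted free text.
def split_gff2_attributes(attributes)
  fields = []
  current = ''.dup
  in_quotes = false
  escaped = false
  attributes.each_char do |ch|
    if escaped
      # Character after a backslash inside quotes is taken literally.
      current << ch
      escaped = false
    elsif ch == '\\' && in_quotes
      # Assumption: backslash escapes only matter inside quoted text.
      current << ch
      escaped = true
    elsif ch == '"'
      current << ch
      in_quotes = !in_quotes
    elsif ch == ';' && !in_quotes
      fields << current.strip
      current = ''.dup
    else
      current << ch
    end
  end
  fields << current.strip
  fields.reject(&:empty?)
end

# Hypothetical helper: build a tag => value hash, keeping the quotes in
# the value as the existing parser does.
def parse_gff2_attributes(attributes)
  split_gff2_attributes(attributes).each_with_object({}) do |field, hash|
    key, value = field.split(' ', 2)
    hash[key] = value
  end
end

p parse_gff2_attributes('name "grail3.0116000101"; proteinId 639579; exonNumber 3')
p parse_gff2_attributes('note "a; b"; id 1')
```

The second call is the case a regular expression that only looks for ";"
gets wrong: the semicolon inside "a; b" stays part of the value.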

Anyway, the file I am looking at now is not that complex,
and I will go with a quick hack for now.

Best regards,

Tomoaki

the test program
$ cat test-gff.rb
#!/usr/local/bin/ruby
require 'bio'
gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n"
Bio::GFF.new(gff_str).records.each do |fr|
   p fr
end

output after patch
$ /usr/local/bin/ruby test-gff.rb
#<Bio::GFF::Record:0x2b0ef0eb0648 @frame="0", @start="11052",  
@comments=nil, @strand="-", @feature="CDS", @score=".",  
@source="JGI", @attributes={"name"=>"\"grail3.0116000101\"",  
"proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064",  
@seqname="LG_I">

output from current
#<Bio::GFF::Record:0x2b825ff16640 @frame="0", @start="11052",  
@comments=nil, @strand="-", @feature="CDS", @score=".",  
@source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId  
639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">

older output
#<Bio::GFF::Record:0x1e3674 @end="11064", @seqname="LG_I",  
@frame="0", @start="11052", @comments=nil, @strand="-",  
@feature="CDS", @score=".", @source="JGI", @attributes= 
{"name"=>"\"grail3.0116000101", "proteinId"=>"63957",  
"exonNumber"=>"3"}>

diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb

--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb  2008-09-03 22:24:39.000000000 +0900
+++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900
@@ -122,7 +122,7 @@
        def parse_attributes(attributes)
          hash = Hash.new
          scanner = StringScanner.new(attributes)
-        while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/)
+        while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/(.+)/)
            key, value = scanner[1].split(' ', 2)
            key.strip!
            value.strip! if value


-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan