[BioRuby] GFF attributes

Tomoaki NISHIYAMA tomoakin at kenroku.kanazawa-u.ac.jp
Thu Sep 11 02:34:36 UTC 2008


Hi

> To prevent repeating the bug, I want to use the GFF string
> described in your mail for the test script in BioRuby.
> (test/unit/bio/db/test_gff.rb)
> Can you give permission?

Surely, I have no objection.
The string is one of the lines in the Poplar genome annotation from
the JGI site:
ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/Poptr1_1.JamboreeModels.gff.gz
So, I think acknowledging them is a good idea.

For the test strings, I think another pattern, one with multiple values
for one key, is worth adding.
The example from http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml:
seq1     BLASTX  similarity   101  235 87.1 + 0	Target "HBA_HUMAN" 11 55 ; E_value 0.0003

Perhaps the current implementation will return '"HBA_HUMAN" 11 55' as the
value for 'Target'.
But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more
sensible, or may better represent the meaning of the specification.
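
To make the idea concrete, here is a minimal sketch of how such a value
could be split. It is not the BioRuby implementation: the helper name
split_gff2_value is hypothetical, and it assumes that values are
separated by whitespace and that only double-quoted free text may
contain spaces or backslash escapes.

```ruby
require 'strscan'

# Hypothetical helper (not part of BioRuby): split the value part of a
# GFF2 attribute into its individual values.
def split_gff2_value(value)
  tokens = []
  scanner = StringScanner.new(value)
  until scanner.eos?
    scanner.skip(/\s+/)
    break if scanner.eos?
    # Either a double-quoted free-text value (backslash escapes allowed
    # inside the quotes) or a run of non-whitespace characters.
    tokens << (scanner.scan(/"(?:[^"\\]|\\.)*"/) || scanner.scan(/\S+/))
  end
  tokens
end

p split_gff2_value('"HBA_HUMAN" 11 55')
# prints ["\"HBA_HUMAN\"", "11", "55"]
```

The quotes are kept on the free-text value here so that a caller can
still tell it apart from the unquoted values.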

Since changing this return value will introduce incompatibilities, I'm not
sure whether it can be changed.
But if it is ever to be changed, it is better to change it early, or at
least to state that it will change.
If it is too late, perhaps we can add a method under a different
name so that currently working code will not be affected.
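
As a rough illustration of the "different name" option, the sketch
below keeps the String-returning accessor untouched and adds a second
accessor that returns an Array. The class SketchRecord and both method
names are made up for this example and are not Bio::GFF::Record's API.

```ruby
# Hypothetical record class, for illustration only.
class SketchRecord
  def initialize(attributes)
    @attributes = attributes
  end

  # Existing behaviour: the whole value as a single String.
  def attribute(key)
    @attributes[key]
  end

  # Separately named accessor: the same value split into an Array of
  # quoted free-text values and unquoted tokens.
  def attribute_values(key)
    value = @attributes[key]
    return [] if value.nil?
    value.scan(/"(?:[^"\\]|\\.)*"|\S+/)
  end
end

record = SketchRecord.new('Target' => '"HBA_HUMAN" 11 55',
                          'E_value' => '0.0003')
p record.attribute('Target')        # the String, as before
p record.attribute_values('Target') # the Array form
```

Code that already calls the String accessor keeps working, and new
code can opt in to the Array form.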
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


On 2008/09/09, at 20:47, Naohisa GOTO wrote:

> Hi,
>
> On Fri, 5 Sep 2008 15:43:05 +0900
> Tomoaki NISHIYAMA <tomoakin at kenroku.kanazawa-u.ac.jp> wrote:
>
>> Hi,
>>
>> When extracting attributes from a GFF file,
>> the older implementation seems to have eaten the last character before ";".
>> The current one (downloaded very recently from github) does not split well,
>> as the regular expression searches for the largest match.
>
> Thank you for reporting a bug.
>
>> A patch is included, but I am not sure on the specification.
>> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
>> The specification says:
>>> From version 2 onwards, the attribute field must have an tag value
>>> structure following the syntax used within objects in a .ace file,
>>> flattened onto one line by semicolon separators. Tags must be
>>> standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must
>>> be quoted with double quotes. Note: all non-printing characters in
>>> such free text value strings (e.g. newlines, tabs, control
>>> characters, etc) must be explicitly represented by their C (UNIX)
>>> style backslash-escaped representation (e.g. newlines as '\n', tabs
>>> as '\t').
>
> I also see BioPerl's _from_gff2_string in Bio::Tools::GFF
> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Tools/GFF.html#CODE10
> It seems it still has bugs (as described in comments in their code),
> but semicolons inside double quotes are treated as normal letters
> and not separators for attributes.
>
>> So, it seems that for proper parsing, quotation with double quotes
>> should be checked for free text,
>> and a semicolon inside such a quotation is not a separator
>> for attributes, and a semicolon may not be preceded by a backslash.
>
> I've changed to do so. This means the patch was not used.
>
> http://github.com/ngoto/bioruby/commit/e38fd48aaf41f94eaec39a639a7f6c5db62c22e8
> (This is my repository. Because the change seems severe,
> I'll push to the main bioruby repository later,
> after checking more and more.)
>
> To prevent repeating the bug, I want to use the GFF string
> described in your mail for the test script in BioRuby.
> (test/unit/bio/db/test_gff.rb)
> Can you give permission?
>
> Best regards,
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>>
>> Anyway, the file I am looking now is not that complex,
>> and I will go with a quick hack at this time.
>>
>> Best regards,
>>
>> Tomoaki
>>
>> the test program
>> $ cat test-gff.rb
>> #!/usr/local/bin/ruby
>> require 'bio'
>> gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n"
>> Bio::GFF.new(gff_str).records.each do |fr|
>>    p fr
>> end
>>
>> output after patch
>> $ /usr/local/bin/ruby test-gff.rb
>> #<Bio::GFF::Record:0x2b0ef0eb0648 @frame="0", @start="11052",
>> @comments=nil, @strand="-", @feature="CDS", @score=".",
>> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"",
>> "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064",
>> @seqname="LG_I">
>>
>> output from current
>> #<Bio::GFF::Record:0x2b825ff16640 @frame="0", @start="11052",
>> @comments=nil, @strand="-", @feature="CDS", @score=".",
>> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId
>> 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">
>>
>> older output
>> #<Bio::GFF::Record:0x1e3674 @end="11064", @seqname="LG_I",
>> @frame="0", @start="11052", @comments=nil, @strand="-",
>> @feature="CDS", @score=".", @source="JGI", @attributes=
>> {"name"=>"\"grail3.0116000101", "proteinId"=>"63957",
>> "exonNumber"=>"3"}>
>>
>> diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb
>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb  2008-09-03 22:24:39.000000000 +0900
>> +++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900
>> @@ -122,7 +122,7 @@
>>         def parse_attributes(attributes)
>>           hash = Hash.new
>>           scanner = StringScanner.new(attributes)
>> -        while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/)
>> +        while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/(.+)/)
>>             key, value = scanner[1].split(' ', 2)
>>             key.strip!
>>             value.strip! if value
>>
>>
>> -- 
>> Tomoaki NISHIYAMA
>>
>> Advanced Science Research Center,
>> Kanazawa University,
>> 13-1 Takara-machi,
>> Kanazawa, 920-0934, Japan
>>
>>
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby




