[BioRuby] SPTR problem

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Fri Jan 15 17:19:12 UTC 2010


Hi,

On Tue, 12 Jan 2010 22:52:42 +1000
Ben Woodcroft <donttrustben at gmail.com> wrote:

> Hi,
> 
> While parsing all the yeast UniProt txt files I came across a problem with
> the gn parser - it was returning an array when I expected a hash. Looking at
> the code the problem seems to be this when statement:
> 
>       when /Name=/,/ORFNames=/
>         @data['GN'] = gn_uniprot_parser
>       else
>         @data['GN'] = gn_old_parser
>       end
> 
> http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line:
> 
> GN OrderedLocusNames=YMR084W;
> 
> So GN line had OrderedLocusNames= but not  Name= or ORFNames=, so it didn't
> use the new parser, like the other entries I came across. Should all 4
> possibilities be tested for in the when statement: (Synonyms= being the
> 4th)?

It seems to be a bug. Perhaps there were no (or very few) entries
that had only OrderedLocusNames= when the code was first written
in 2005. The git commit ID was b5c3342437ed698f215a87ea72c6cabf0575709d.

The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004. 
The document http://www.uniprot.org/docs/sp_news.htm says:
| The new format of the GN line is:
| 
| GN   Name=<name>; Synonyms=<name1>[, <name2>...]; OrderedLocusNames=<name1>[, <name2>...];
| GN   ORFNames=<name1>[, <name2>...];
| 
| None of the above four tokens are mandatory. But a "Synonyms" token can only be present if there is a "Name" token.

You are right: all 4 possibilities should be considered.
"Synonyms" could be omitted, because it can only be present together
with "Name", but it is safer to include it.

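A minimal sketch of the corrected check, written as a standalone method
rather than as the actual BioRuby code. The method name gn_parser_for and
the old-style example line are made up for illustration; the method only
returns the name of the parser that would be chosen:

```ruby
# Hypothetical helper (not part of BioRuby): decide which GN parser applies.
# All four tokens of the new GN format are tested, so a line that only has
# OrderedLocusNames= is no longer sent to the old parser.
def gn_parser_for(gn_text)
  case gn_text
  when /Name=/, /Synonyms=/, /OrderedLocusNames=/, /ORFNames=/
    :gn_uniprot_parser
  else
    :gn_old_parser
  end
end

# The line from A2P2R3 reported above.
puts gn_parser_for("OrderedLocusNames=YMR084W;")
# A made-up old-style line (names joined by OR).
puts gn_parser_for("GENE1 OR GENE2.")
```
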
> Also, while I'm here:
> * why does the returned hash have different keys than are in the file? e.g.
> ORFNames becomes :orfs?

I don't know. I now think that using the same names as in the
original entries would be preferable, too.
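
As a rough sketch of what that could look like (this is not the current
BioRuby behaviour; the method name parse_gn_tokens and the gene names in
the example are made up, and multiple genes separated by "and" lines are
not handled):

```ruby
# Hypothetical sketch: parse new-format GN text into a hash whose keys are
# the token names exactly as they appear in the entry.
def parse_gn_tokens(gn_text)
  result = {}
  pattern = /\b(Name|Synonyms|OrderedLocusNames|ORFNames)=([^;]*);/
  gn_text.scan(pattern) do |token, value|
    names = value.strip.split(/\s*,\s*/)
    # "Name" holds a single name; the other tokens hold lists of names.
    result[token] = (token == 'Name') ? names.first : names
  end
  result
end

p parse_gn_tokens("Name=geneA; Synonyms=gA1, gA2; ORFNames=orf1, orf2;")
p parse_gn_tokens("OrderedLocusNames=YMR084W;")
```
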

> * I also found the parsing process for whole genomes quite slow (multiple
> hours for well annotated ones).

Please use a profiler to find the bottlenecks.
 % ruby -rprofile xxx.rb

> * is there any standard way to handle concatenated UniProt files? I wrote my
> own as it was simple.

What kind of "concatenated" file do you mean?
For simple concatenation, such as the original file distributed
from the UniProt FTP site, Bio::FlatFile can be used.
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
(please gunzip before reading!)

 require 'bio'

 ff = Bio::FlatFile.open("uniprot_sprot.dat")
 # iterate over the entries one at a time
 ff.each do |e|
   puts e.entry_id
 end

> 
> Thanks,
> ben

Thank you.

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


