[BioRuby] RFC Caching (was BioRuby standards)

Wed Sep 24 13:38:19 UTC 2008

Hi Pjotr,

I've seen files in your lib/bio/db/microarray, and I suppose
it's still under development and it will be changed frequently,
and I think it's not a time to include them in main bioruby.
So, my comments below are mainly for future improvements.

1. about cache.rb

The "safe = true" argument in 'set' and 'directory' seems
bad idea. I think there is no need to give insecure options
to users.

In 'directory' method, 
>  cache = Dir.mktmpdir(subdir)

The Dir.mktmpdir method is a new feature added in Ruby 1.8.7,
and not available in 1.8.6 and older versions.
Because most users are still using Ruby 1.8.5 and 1.8.6,
to avoid using Dir.mktmpdir is currently a choice.
Alternatively, write a document that the feature can work
only in Ruby 1.8.7 or later.

Note that current requirement of BioRuby is
"Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)".
Also note that FileUtils.remove_entry_secure was introduced
in Ruby 1.8.3.

Finally, I'm wondering if the Cache class can still be
a singleton or not in the future. Currently, only NCBI_GEO
is using the cache, but if it were used from many classes
with different data formats, files with different formats
would be existed in the same cache directory, and file name
conflicts might be happened.

2. About file locations

Below are recommended to be moved to bio/io/,
because their main purpose is file or network I/O,
and not data parsing.
    bio/db/microarray/cache.rb 
    Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb
The class/module names are not needed to be changed.

The files with external dependency to the "biolib" might
also be suggested to be moved from bio/db to the other
location, but no best location found.

3. BIo::Microarray::NCBI_GEO

In bio/db/microarray/ncbi_geo/geo.rb,

>    include REXML

If the aim to include REXML module is only to skip the
REXML:: prefix, I don't like to include it in library,
because the constants and methods defined in REXML are
mixed and they might cause bad side effects.
(Note that unlike in a library, it is free to include
anything in an application.)

> def XML::create(acc)

In my impression, the method name "XML.create" might be
reserved to be used by a method to create XML data structure
from scratch or from some data.

To define a class method, I like 'def self.create(acc)'
because it is easy to change class (module) name.

> def XML::fetch(xmlfn, acc)
>    url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml"

URI escaping is needed, e.g. acc=#{URI.escape(acc)}

>    print "Fetching ",url,"\n" if $VERBOSE
>    r = Net::HTTP.get_response( URI.parse( url ) )

To support proxy, use Bio::Command.get_uri(url).

> def XML::valid_accession?(acc = nil)
>   acc = @acc if not acc
>   acc =~ /^(GSM|GSE|GPL)\d+$/

If "GSM0123\nGSM4567" is invalid, the regular expression
should be /\A(GSM|GSE|GPL)\d+\z/ .

> def XML::parsexml(acc)

Is there no way to get input XML data as String?

>   if XML::valid_accession? acc
>     cache = Cache.instance.directory
>     fn = cache+'/'+acc+'.xml'

Please use File.join.

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Tue, 23 Sep 2008 13:58:52 +0200
pjotr2008 at thebird.nl (Pjotr Prins) wrote:

> Hi Naohisa,
> 
> I fixed the Cache to be secure. It will use a safe Tmpdir if no
> directory is specified and raise SecurityErrors when appropriate.
> 
> See http://github.com/pjotrp/bioruby/tree/master
> 
> Pj.
> 
> On Thu, Sep 18, 2008 at 08:32:37AM +0200, Pjotr Prins wrote:
> > Hi Naohisa,
> > 
> > On Thu, Sep 18, 2008 at 12:16:59PM +0900, Naohisa GOTO wrote:
> > > Hi Pjotr,
> > > 
> > > If you don't want to implement any access control,
> > > using world writable directory like /tmp (comes from
> > > ENV['TMPDIR'] or Dir.tmpdir) by default should be disabled,
> > > because this is vulnerable to a symbolic link attack.
> > > 
> > > About symbolic link attack, please refer documents:
> > > http://www.codeproject.com/KB/web-security/TemporaryFileSecurity.aspx
> > > (Note that Ruby's standard TempFile has no problem.)
> > 
> > I agree - assuming you are running a webservice for microarrays.
> > 
> > > When the "cache" directory isn't explicitly specified
> > > by user by using the environment variable BIORUBY_CACHE
> > > (or command-line options of custom application),
> > > doing without cache should be the default.
> > 
> > NCBI won't be happy with that. But if that is what Bioruby wants...
> > It is not only about my own bandwidth ;-). 
> > 
> > > It is also good to raise SecurityError when the specified
> > > directory is writable by everyone.
> > 
> > I'll remove tmpdir - I introduced it because of an earlier mail.
> > 
> > Disabling the cache is easy - off course. Another option is to use
> > TmpFiles and keep track of those in a Hash (I'd rather not have large
> > IO objects in memory). OK, that is what I'll implement - assuming you
> > want to include the microarray stuff in Bioruby.
> > 
> > Pj.
> > _______________________________________________
> > BioRuby mailing list
> > BioRuby at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioruby