[BioRuby] Bioruby HTML output

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Wed Jan 13 02:58:00 UTC 2010


Hi,

On Tue, 12 Jan 2010 11:11:32 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> On Tue, Jan 12, 2010 at 06:29:57PM +0900, Naohisa GOTO wrote:
> > I'm not sure whether the prefix Bio::Html is suitable or not.
> 
> Me neither ;). This is something to discuss when we meet. See my
> write up on partitioning based on functionality or standards.
> 
> > By the way, I'v tried some of your code in
> > http://github.com/pjotrp/bioruby/blob/color-alignment/
> > and found potential XSS.
> > 
> >   a = Bio::Alignment.new
> >   a.add_seq('ATCCATGG', '<script>alert("a");</script>')
> >   a.add_seq('ATGCATGC', '<script>alert("b");</script>')
> >   a.add_seq('<script>alert("c");</script>', 'c')
> >   simple = Bio::Html::HtmlAlignment.new(a,
> >           :title => '<script>alert("title");</script>')
> >   html = simple.html()
> >   File.open('/tmp/xss.html', 'w') { |w| w.print html }
> > 
> > For sequences, sequence names, and consensus lines,
> > using CGI.escapeHTML() will always be needed.
> >
> > For the :title, if script users can set the title, it
> > should be escaped, but this prevents script programmers
> > using html tags in the title.
> 
> Perhaps the HTML generator should escape its output. Though I
> personally think we should only be worried about security concerns
> when people *enter* new data on input forms. That is when exploits
> show up. I can argue that HTML generation should not concern itself
> with HOW the inputs are presented. One advantage of having a
> programmer set the 'title' is that he *can* embed HTML. Perhaps
> escaping HTML is the responsibility of the programmer providing the
> data. And therefore to the logic that handles input.

Even apart from security, sequence names (and sequences) that
contain html special characters may not be correctly displayed.

For example, sequences with three parameters a, b, and c.

% cat test.aln
CLUSTAL 2.0.9 multiple sequence alignment


1<a<3_b>5_c<7       FKNVFTVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKWFLQ
a>3_5<b<8_c>11      FKNVMSVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKTFLQ
                    ****::*********************************** ***
% irb -r bio
irb> report = Bio::ClustalW::Report.new(File.read('test.aln'))
irb> alignment = report.alignment
irb> simple = Bio::Html::HtmlAlignment.new(alignment, :title => 'a,b,c')
irb> File.open('abc.html', 'w') { |w| w.print simple.html() }

The sequence names were correctly treated by ClustalW 2.0.9,
but unexpected representation.

This problem can not be solved with input data escaping.
If the sequence name "1<a<3_b>5_c<7" is escaped to
"1&lt;a&lt;3_b&gt;5_c&lt;7" before calling the method,
text indentation will be broken because of the mismatch of
text length and html display width. To solve this, to
escape when building the html format by output formatting
method will be needed.

> We have had a similar discussion before. We have to decide to what
> level *output* code should concern itself with *input* security. I
> have a feeling that too much of Bioruby classes try to do too much.
> How do we stay away from cluttering the code? How do we decide that
> callers should not use HTML and handle security concerns?

It is difficult not to use HTML-like string which we want
to be treated as normal unformatted string but unexpectedly
treated as HTML by some programs, e.g. the above example.

For security, I'd like to ask security experts.
Anyone in this list?

I think escaping should be done by formatting layer and
should be turned on by default, because:
* Only the output formatting layer knows how the input data
  is processed.
* In many cases, the data comes from outside, and we can not
  expect it is safe enough.
* Different escaping rules are needed for different output types,
  e.g. SQL, html, XML, TeX, GNUPLOT, PostScript, R scripts.
  Escaping by output methods seems natural, and helps to switch
  output formats without concerning escaping issues specific
  to each output format.

> You write:
> 
> >   a.add_seq('ATCCATGG', '<script>alert("a");</script>')
> 
> If a programmer wants that - it is his concern in my opion. If he is
> concerned about exploits he should not allow it. The Alignment class
> does not care either. It is none of its business.

The example is extreme case. For security, please ask experts.
Apart from the security, I wish ">", "<", "&", etc. can be
displayed correctly. I think methods to build HTML format
should concern this.

> BTW I fixed a number of PAML::Codeml bugs on this branch. So you
> can ignore the existing PAML branch. Let's continue with the color
> coding, assuming you can live with the PAML::Codeml implementation,
> as it stands.

When do you want the Bio::PAML::Codeml code to be merged to the
blessed bioruby repository?


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org



More information about the BioRuby mailing list