[Bioperl-l] Some more troubles with HTML module?

Carl Virtanen carl@cimmed.com
Thu, 2 Nov 2000 18:30:56 +1000


Thanks for the help!

I actually decided to spend a bit of time going through this regexp and
figured most of it out, because there was one more problem that wasn't being
picked up by the modifications that Steve C. graciously made.
The problem was with gi's like the following:

gi|9631466 ref|NP_048231.1| ORF MSV160 hypothetical prote...    58  0.85  1
gi|9633032 ref|NP_050140.1| hypothetical protein >emb|CAB...    36  0.86  2

These were still being missed by the HTML parser module.

So, going through HTML.pm I found out where the problem was and added the
following lines:


1) In the HSP alignment section.
The first one is the original and the second is my added version:

# GI hits (GenBank Format):  using a nested (())
  s@^>(gi)\|($Word)( +\(($Word)\))( .*)$@<a name=$4_A></a><b>$1:<a href="$_gi_link$4">$2</a></b>$3$5<br>(<a href="\#$4_H">Back|<a href="\#top">Top</a>)@o;

  s@^>(gi)\|($Word) ref\|($Word)\|( .*)$@<a name=$2_A></a><b>$1:<a href="$_gi_link$3">$2|$3</a></b>$4<br>(<a href="\#$2_H">Back|<a href="\#top">Top</a>)@o;
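
For anyone who wants to try the added alignment-section pattern outside the
module, here is a minimal sketch. It is only an approximation: $Word and
$_gi_link below are stand-ins I made up for the sketch (the real definitions
live in HTML.pm and may differ), and the sample header line is put together
from the ids shown above rather than copied from a real report.

```perl
use strict;
use warnings;

# ASSUMPTION: stand-ins for variables defined inside HTML.pm; the real
# definitions may differ.
my $Word     = '[-\w.]+';                     # hypothetical identifier pattern
my $_gi_link = 'http://example.org/gi?id=';   # placeholder URL, not the real link

# Apply the added "gi|... ref|...|" alignment-header substitution to one line.
# Lines that do not match are returned unchanged.
sub link_alignment_header {
    my ($line) = @_;
    $line =~ s@^>(gi)\|($Word) ref\|($Word)\|( .*)$@<a name=$2_A></a><b>$1:<a href="$_gi_link$3">$2|$3</a></b>$4<br>(<a href="\#$2_H">Back|<a href="\#top">Top</a>)@o;
    return $line;
}

# Sample header assembled from the ids in this post (illustrative only).
print link_alignment_header('>gi|9633032 ref|NP_050140.1| hypothetical protein'), "\n";
```

Running it makes two things easy to see: the link target is built from the
ref accession ($3) rather than the gi number, and the "Back" anchor is never
closed with </a> in the replacement as posted.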


2) In the 'summary lines at top of report' section of the module.
Again, my addition is the second one.

# GI hits (GenBank Format):  using a nested (())
  s@^ ?gi\|($Word)( +\(($Word)\)) ($Descrip)($Int)  ($Signif)(.*)$@gi:<a href="$_gi_link$3">$1</a>$2$4$5  <A href="\#$3_A">$6</a>$7<a name="$3_H"></a>@o;

  s@^ ?gi\|($Word) ref\|($Word)\|($Descrip)($Int +)($Signif)(.*)$@gi:<a href="$_gi_link$2">$1 ref|$2</a> $3$4<A href="\#$1_A">$5</a>$6<a name="$1_H"></a>@o;
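
The summary-line pattern can be checked the same way. Again this is just a
sketch under assumptions: $Word, $Descrip, $Int, $Signif and $_gi_link are my
own guesses at something workable, not the definitions used in HTML.pm, so
the exact captures may differ inside the module.

```perl
use strict;
use warnings;

# ASSUMPTION: all five variables are hypothetical stand-ins for the ones
# defined in HTML.pm.
my $Word     = '[-\w.]+';                     # database identifier
my $Descrip  = '.+?\s{2,}';                   # description plus trailing padding
my $Int      = '\d+';                         # integer score
my $Signif   = '[\de.-]+';                    # P-value / E-value
my $_gi_link = 'http://example.org/gi?id=';   # placeholder URL, not the real link

# Apply the added "gi|... ref|...|" summary-line substitution to one line.
# Lines that do not match are returned unchanged.
sub link_summary_line {
    my ($line) = @_;
    $line =~ s@^ ?gi\|($Word) ref\|($Word)\|($Descrip)($Int +)($Signif)(.*)$@gi:<a href="$_gi_link$2">$1 ref|$2</a> $3$4<A href="\#$1_A">$5</a>$6<a name="$1_H"></a>@o;
    return $line;
}

# First problem line from this post.
print link_summary_line('gi|9631466 ref|NP_048231.1| ORF MSV160 hypothetical prote...    58  0.85  1'), "\n";
```

The significance value becomes a link to the <gi>_A anchor in the alignment
section, and the <gi>_H anchor is the target that the alignment section's
"Back" link points at, so the two substitutions have to agree on using the
gi number as the anchor name.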


Not sure if I broke any conventions or could have done it better (comments
are appreciated!), but it works on my results for WU-BLAST output.

Perhaps these extra lines should be added to the module as well?

Carl Virtanen

> Ewan Birney wrote:
>
> > On Wed, 25 Oct 2000, Carl Virtanen wrote:
> >
> > > Hi folks,
> > >
> > > I'm a little new at checking out some of this stuff, so please bear
> > > with me. I'm using bioperl 0.6.2.
> > > The problem I'm having is that the output from the Blast->to_html
> > > routine is not picking up all of the correct references and 'htmlifying'
> > > them (see my example near the bottom).  I'm just using the standard kinda
> > > usage:
> > > use Bio::Tools::Blast qw(:obj);
> > > $Blast->to_html(file=>$ARGV[0]);
> > >
> > >  I've narrowed the problem down to the HTML.pm module.  Now, call me a
> > > bonehead (if you wish, but that wouldn't be really nice now would it?)
> > > but the regexps in there are some real bad ass ones (if you'll excuse
> > > my colourful explanation)! So tracking down where the problem is is
> > > not so easy for me.  Actually, if somebody would explain to me at
> > > least one of the regexps, for example:
> > >
> > > s@^ ?(gb|emb|dbj)\|($Word)(\|$Word)?($Descrip)($Int +)($Signif)(.*)$@$1:<a href="$DbUrl{'gb_n'}$2">$2$3</a>$4$5<A href="\#$2_A">$6</a>$7<a name="$2_H"></a>@o;
> >
> > Apologies. That looks like a _beast_.
> >
>
> C'mon guys, it's not that bad! (At least there aren't any nested parens in
> this example, as there are in some of the others ;). All of the HTML
> formatting is achieved by a set of substitution regexps that attempt to
> identify the database, sequence id, etc. and then substitute in HTML links
> to either external resources or to internal positions in the document.
>
> So, for example, the <A href="\#$2_A">$6</a> bit creates an internal link
> from the E-value in the description line to the alignment section further
> down in the report. The <a name="..."> bit creates an internal target so you
> can link back to the description line from the alignment section, which
> gets processed by a different substitution regexp. It gets easier to
> understand these after you stare at them for a minute or two.
>
> This is a good example of programming by regexp that is only possible in
> perl (well, easier to do in perl than in other languages). Every line in the
> Blast report is analyzed by the same set of regexps.  Matching lines are
> processed appropriately by the associated substitution. The little 'o' at
> the end compiles the regexps once for efficiency.
>
> I just updated the Blast::HTML module to deal with lines like the ones Carl
> reported. You can obtain this updated version at
> ftp://bio.perl.org/pub/sac/blast/HTML.pm. Just replace the old version of
> Bio/Tools/Blast/HTML.pm with this file (unless you have other
> customizations that you want to save, in which case do a diff).
>
> >
> > This one is for stevec if he is tuning in. I have to admit, I tend to
> > generate html files myself by going through the loops
> >
> > foreach $hit ( ... ){
> >     foreach $hsp ( ... ) {
> >
> >     }
> > }
> >
> > etc. But that is more coding for you....
> >
>
> One advantage of using the built-in HTML formatting functionality of the
> Blast module is that you don't have to parse the whole Blast report into
> memory before generating HTML. The HTML can be generated line by line from
> STDIN. This can come in handy for large reports that you want to examine
> via web browser.
>
> BTW, you don't actually use the Blast::HTML module directly. It is used by
> the to_html() function of the Blast module. For an example of usage, see
> examples/blast/html.pl in the Bioperl distribution.
>
> Steve
> --
> Steve Chervitz
> sac@neomorphic.com
>
> >
> > >
> > >
> > > then I would be very grateful and would even try to track down the
> > > problem myself and possibly contribute a little to all of this. I'm
> > > familiar with basic regexps/substitution and so on, but yikes!
> > >
> > > Anyways, here's the output, and you can see that it's missing a bunch
> > > of gi's. The search was just a routine peek at some proteins in the nr
> > > database:
> > >
> > > Sequences producing significant alignments:                        (bits)  Value
> > >
> > > emb|CAB55683.1| (AL035427) dJ769N13.1 (KIAA0443 protein.) [Homo ...   214  2e-54
> > > ref|NP_055525.1| KIAA0443 gene product >gi|7512985|pir||T00068 h...   214  2e-54
> > > dbj|BAB14367.1| (AK023031) unnamed protein product [Homo sapiens]     181  2e-44
> > > gb:AAF64273.1|AF208859_1 (AF208859) BM-017 [Homo sapiens] >gi|82...   123  6e-27
> > >
> > >
> > >
> > > Thanks!
> > >
> > > Carl Virtanen
> > >
> > >
> > >
> >
> > -----------------------------------------------------------------
> > Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
> > <birney@ebi.ac.uk>.
> > -----------------------------------------------------------------
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l