[BioRuby] Parsing ClustalW files

Pjotr Prins pjotr.public14 at thebird.nl
Sun Dec 27 16:07:47 UTC 2009


On my ALN branch (http://github.com/pjotrp/bioruby/tree/ALN) I have
added a unit test for ClustalW ALN format, as well as an update to the
tutorial. 

I have three comments. First, I think the alignment parser belongs in
./lib/bio/db/clustalw.rb, rather than in ./lib/app/clustalw/report.rb.
I can see how that originated, but it is an independent database
format. This should also change the constructor call to, for example,
Bio::ClustalWFormat.new, analogous to FastaFormat. As ClustalW files
are ubiquitous, we may want to rename this to an ALN format.

Second, I added an index method, [], to Bio::ClustalW::Report, so I can
fetch a Bio::Sequence object *with* its ID/definition (see below).
However it may be more appropriate to have this shared at the
Bio::Alignment level. If you have a better way, I am all ears.

   bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join)
   bioruby> aln.header
   ==> "CLUSTAL 2.0.9 multiple sequence alignment"

Fetch a sequence

   bioruby> seq = aln[1]
   bioruby> seq.definition
   ==> "gi|115023|sp|P10425|"

Get part of the sequence

   bioruby> seq.to_s[60..120]
   ==> "LGYFNG-EAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHAD"

Show the residue match line for the aligned sequences

   bioruby> aln.match_line[60..120]
   ==> "     .     **. .   ..   ::*:       . * : : .        .: .* * *"

Return a Bio::Alignment object

   bioruby> aln.alignment.consensus[60..120]
   ==> "???????????SN?????????????D??????????L??????????????????H?H?D"
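
The idea behind the [] accessor can be illustrated without BioRuby.
What follows is only a rough sketch and not the code on the ALN branch:
NamedSeq and TinyAlignment are hypothetical names made up for this
example, and the real method returns a Bio::Sequence rather than a
Struct. It assumes an Integer key means "position in file order" and
any other key is taken as a sequence ID.

```ruby
# Hypothetical stand-ins for illustration; these are not BioRuby classes.
NamedSeq = Struct.new(:definition, :seq) do
  def to_s
    seq
  end
end

class TinyAlignment
  def initialize
    @ids  = []   # sequence IDs in file order
    @seqs = {}   # id => aligned sequence string
  end

  def add(id, seq)
    @ids << id unless @seqs.key?(id)
    @seqs[id] = seq
    self
  end

  # Integer key: positional lookup. Any other key: lookup by ID.
  # Returns nil when nothing matches.
  def [](key)
    id = key.is_a?(Integer) ? @ids[key] : key
    return nil unless id && @seqs.key?(id)
    NamedSeq.new(id, @seqs[id])
  end
end

aln = TinyAlignment.new
aln.add('seq_a', 'MKV-LA').add('seq_b', 'MKVQLA')
second = aln[1]        # a NamedSeq that still carries its ID
by_id  = aln['seq_a']  # same kind of object, fetched by ID
```

Keeping the ID next to the sequence string is the whole point: the
caller gets one object back and does not have to look up the key
separately. Something along these lines could live at the alignment
level as easily as in the report class.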

I also kinda disagree with the implementation of the current parser
(Report). It has virtually no checking for bad input data, and it
should accept an array of lines in addition to a String. 
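
To make the point about input checking concrete, here is a minimal
sketch of a stricter reader. It is not the Bio::ClustalW::Report
implementation; parse_aln and AlnFormatError are hypothetical names.
It assumes the first non-blank line starts with "CLUSTAL", that match
lines begin with whitespace, and that sequence chunks contain only
letters and gap characters ('-').

```ruby
# Hypothetical sketch, not the Bio::ClustalW::Report implementation.
class AlnFormatError < StandardError; end

# Accepts either one String or an Array of lines.
# Returns [header, { id => aligned_sequence }].
def parse_aln(input)
  lines =
    case input
    when String then input.lines
    when Array  then input
    else raise ArgumentError, "expected String or Array, got #{input.class}"
    end
  lines = lines.map { |l| l.to_s.chomp }

  # The first non-blank line must be the CLUSTAL header (assumption).
  first = lines.index { |l| !l.strip.empty? }
  if first.nil? || !lines[first].start_with?('CLUSTAL')
    raise AlnFormatError, 'no CLUSTAL header line found'
  end
  header = lines[first]

  seqs = {}
  lines[(first + 1)..-1].each do |line|
    next if line.strip.empty?   # separator between blocks
    next if line =~ /\A\s/      # match line ("*", ":", "."), indented
    id, chunk = line.split(/\s+/, 3)  # optional third field: residue count
    unless chunk && chunk =~ /\A[A-Za-z\-]+\z/
      raise AlnFormatError, "cannot parse line: #{line.inspect}"
    end
    seqs[id] = seqs.fetch(id, '') + chunk
  end

  raise AlnFormatError, 'no sequences found' if seqs.empty?
  if seqs.values.map(&:length).uniq.size > 1
    raise AlnFormatError, 'aligned sequences differ in length'
  end
  [header, seqs]
end

text = "CLUSTAL 2.0.9 multiple sequence alignment\n\n\n" \
       "seq_a      MKV-LA\nseq_b      MKVQLA\n           *** **\n\n" \
       "seq_a      GG\nseq_b      GA\n           * \n"
header, seqs = parse_aln(text)        # String input
_, from_lines = parse_aln(text.lines) # Array input gives the same result
```

Normalising both input types to an array of chomped lines up front
keeps the rest of the reader identical for Strings and Arrays, and
raising a dedicated error class lets callers tell a malformed file
apart from a programming mistake.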

Was that three comments already? ;)

Happy new year to everyone, and let 2010 be a strong year for BioRuby
and friends!

Pj.



