[Bioperl-l] discussion/advice on non-bioperl bioinformatics modules

Sean P. Quinlan seanq@darwin.bu.edu
Thu, 23 Aug 2001 14:26:13 -0400

At 01:52 PM 8/23/01 +0200, Hilmar Lapp wrote:
>Sean Quinlan wrote:
>> =from posting
>> Current functions in CompBio.pm:
>> # note - table format refers to tab delimited, such as
>>   id\tsequence[\n|\tother fields\n]
>> new - create new CompBio object
>> check_type - try to determine what format sequence data is in
>> tbl_to_fa - convert sequence data in table format to fasta
>> tbl_to_ig - convert sequence data in table format to IntelliGenetics
>> fa_to_tbl - convert sequence data in fasta to table format
>> ig_to_tbl - convert sequence data in IntelliGenetics format to table
>> dna_to_protein - convert dna sequence to protein sequence
>> complement - convert dna sequence to its complement
>> six_frame - translate dna sequence to protein across all six frames
>> aa_hash - hash lookup of aa using codons as keys - includes ambiguous codes
>> _stop - internal method used by six_frame
>> wu_blast - interface to WUBlast; old, ugly and not portable - next
>>   project after catching up Simple.pm
>> _error - internal method for varying error handling behavior without
>>   extra typing every time
>> Planned (in most cases some code already exists in BMERC::bio or
>> ncbi_blast - interface to NCBI's version of the blast tools
>> parse_blast - simple blast output parser - may need to be separate
>>   versions for WU and NCBI blasts. Return tab delimited data in consistent
>>   format, such as score, %identity, start/stop positions of match, etc.
>> calculate_scores - calculates %equivalent identities and #effective
>>   identities from blast output
>> dnastar_to_tbl - convert sequence data in dna* format to table
>> tbl_to_dnastar - and back
>> gcg_to_tbl - convert sequence data in gcg format to table
>> tbl_to_gcg - and back
>> ncbi_to_tbl - convert sequence data in ncbi 'format' (as cut and pasted
>>   from .gbk reports or ncbi's website) to table
>> tbl_to_ncbi - and back
>> OK, now I'll try to get to the real reason I am making this post. 
>> I'd like to volunteer. Regardless of whether or not they get listed, 
>> I would like to offer any code in the modules, or any of the utilities 
>> attached, to the bioperl project. 
>Sean, thanks for offering your time and code, we appreciate that.
>From your listing I could imagine several useful pieces of code:
>1) SeqIO: table format (what's that?), intelligenic format (sorry,
>never heard of that, what is it used for?), dna* format (again,
>forgive my ignorance, what is it used for), automatic format
>checking (the present code in bioperl determines by extension;
>using actual content has to be switchable, however, because some
>streams may not support rewind).

Table format is just our term for tab delimited, with the constraint that
the signifier is in the first field and the sequence (dna or aa) in the
second field. The latest spec. allows for additional fields after the
sequence, but no whitespace is permitted in the first two. We favor table
format because it is trivial to load into or get out of our database
server. Also, since we are a unix shop, there are a host of command line
tools that can be used for operating on sequence records in table format.
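
To make the table format concrete, here is a minimal sketch of a table-to-fasta
conversion. It is written in Python purely for illustration and is not the
CompBio.pm code (which is Perl); the function name tbl_to_fasta, the 60-column
wrap, and the choice to carry extra fields into the fasta description line are
all assumptions made for the example.

```python
def tbl_to_fasta(lines, width=60):
    """Convert table-format records (id<TAB>sequence[<TAB>other fields])
    into fasta text. Hypothetical helper, not part of CompBio or bioperl."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"not a table-format record: {line!r}")
        seq_id, seq = fields[0], fields[1]
        # Assumption: any fields after the sequence become the description.
        header = " ".join([seq_id] + fields[2:])
        out.append(">" + header)
        # Wrap the sequence into fixed-width lines.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(row + "\n" for row in out)
```

Because each record is a single tab-delimited line, the same data can also be
sliced with ordinary command line tools (for example, taking the first field
to list the identifiers) before any conversion is done.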

IntelliGenetics (usually .ig extension) is _old_. We only still support it
because we have some archaic tools from 'back in the day' that still get
pulled out on occasion which use that format. DNA* is another sequence
analysis program (I think Blattner started the company originally?) which we
see used occasionally.

If that routine would be useful for SeqIO, please feel free to use it. If
you do, I'd be happy to email the maintainer of that module an update or
patch whenever there is a change. Or I could (has to happen someday) figure
out how to use CVS and update that part myself.
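
For reference, a minimal sketch of content-based format checking that also
works on streams which cannot be rewound. It is Python for illustration only
and is not the check_type routine from CompBio.pm; the names guess_format and
sniff are hypothetical, and the rule that an IntelliGenetics file starts with
a ';' comment line is an assumption about that format. The idea is to look at
the first non-blank line and then hand back an iterator that replays the
consumed lines ahead of the rest of the stream.

```python
import itertools

def guess_format(first_line):
    """Guess a sequence format from the first non-blank line (heuristic)."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith(";"):
        return "ig"  # assumption: IntelliGenetics files open with ';' comments
    fields = first_line.rstrip("\n").split("\t")
    # Table format: id and sequence in the first two fields, no whitespace in either.
    if (len(fields) >= 2 and fields[0] and fields[1]
            and not any(c.isspace() for c in fields[0] + fields[1])):
        return "table"
    return "unknown"

def sniff(stream):
    """Return (format, lines); lines replays every line, including those
    already read while guessing, so the stream never has to be rewound."""
    it = iter(stream)
    consumed = []
    fmt = "unknown"
    for line in it:
        consumed.append(line)
        if line.strip():
            fmt = guess_format(line)
            break
    return fmt, itertools.chain(consumed, it)
```

A caller would use the returned iterator in place of the original stream, so
the check could be switched on or off without changing how records are read.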

>2) SeqIO::gcg.pm could certainly need at least a maintainer, i.e,
>someone who takes care that it is up-to-date and supports the
>latest GCG versions (I think it is somewhat outdated, like the
>checksum calculation, but I'm not sure).

I'm probably not the person for that, as I don't use GCG myself; I'm not
even sure if any of the groups I work with have it in use any more (which
is why the converter isn't already installed). I'm expecting to only be
working on it from a spec describing it. However, if SeqIO::gcg.pm isn't
currently maintained let me know. I'll look at the current version and when
I work on the converter for CompBio, I'll try to make sure gcg.pm is up to
date as well. 

>3) Blast parser (BPlite) could probably use more hands, too. Maybe
>you check with Roger Hall (roger@iosea.com) and Jason

Sure. I still need to catch Simple up to CompBio, and then get the basic
blast interface for just launching jobs done, but the blast parser will be
right on the heels of that.  Roger & Jason, please let me know if the
current BPlite is stable for the moment or if you have changes expected to
be made in the next few weeks, and if you have anything in particular that
needs more work (yes, I will look at the TO DO list). Should I drop you a
line when I start porting my blast parser?

>If you'd like to contribute even more, there is plenty of work to
>be done in writing more rigorous tests and fixing bugs.

Unfortunately just getting this project done is taking longer than I hoped,
and most of the code already exists. We'll see what develops over the next
couple months though.

>As for potential clashes of Bioperl with your CompBio work, I'm
>not sure why there could be name clashes. Regarding the APIs,
>these look very much different, so keeping those interoperable to
>me seems to be a big effort you probably don't want to take. Maybe
>I'm missing something.

I don't think you're missing anything; I'm not entirely sure what I mean
either, specifically :) . I'm new to OOP, so my instinct is to try to make
sure the function names are unique to avoid any foreseeable possibility of
clash. I suppose I'll just make a list of all the methods in bioperl and
monitor it to avoid confusion wherever possible, just to be sure. I guess
I'm a little paranoid. As for the APIs, no I don't expect them to be
similar, probably quite the opposite. I just want to make sure CompBio
plays nice with others, and don't expect I know all the ways things can clash.

Thanks Hilmar!

Sean P. Quinlan
"You can discover more about a person in an hour of play than in a year of
conversation" - Plato