[Bioperl-l] Protein alignment CD excision module

Wed Aug 31 12:36:14 EDT 2005

Steve,

I can see the usefulness of what you are doing, but BioPerl is a library and 
needs to stay modular so that other users can easily modify it. What you 
are describing is best implemented as a script that uses several modules.
That example script could be stored in BioPerl separately.

On Wednesday 31 August 2005 15:12, Stephen Gordon Lenk wrote:
> I am converting a module that takes a ClustalW alignment, data mines
> the conserved domains from NCBI, then selectively replaces the CDs
> with IUPAC 'X' and writes a ClustalW file back out. We have several
> uses for this module's functions.

Reading and writing an alignment is already handled by Bio::AlignIO. If you 
hardcode the format in a module, you lose flexibility. So this belongs in a 
script.

"data mines the conserved domains from NCBI"

This needs to be done separately by writing, e.g., a Bio::DB or a 
Bio::Tools::Analysis module for accessing the data. Then you need a storage 
object to hold the conserved residues. You could use Bio::Seq::Meta derived 
objects to do that, or store them as sequence features 
(Bio::SeqFeature::Generic), or roll your own. The main question is whether 
you need to store residue-based information or a few large regions.
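
To make the two storage choices concrete, here is a minimal, language-neutral 
sketch in plain Python rather than BioPerl. The names Region, 
regions_to_flags and flags_to_regions are hypothetical and exist only for 
this illustration; coordinates are assumed to be 1-based and inclusive. The 
per-residue list plays the role a Bio::Seq::Meta style object would, and the 
region list plays the role of a set of sequence features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A conserved-domain hit in 1-based, inclusive sequence coordinates."""
    start: int
    end: int
    name: str = ""


def regions_to_flags(regions, seq_length):
    """Expand a few large regions into one boolean flag per residue."""
    flags = [False] * seq_length
    for r in regions:
        # Clip each region to the sequence so bad coordinates cannot overflow.
        for pos in range(max(r.start, 1), min(r.end, seq_length) + 1):
            flags[pos - 1] = True
    return flags


def flags_to_regions(flags):
    """Collapse per-residue flags back into maximal contiguous regions."""
    regions, start = [], None
    for pos, flagged in enumerate(flags, start=1):
        if flagged and start is None:
            start = pos                      # a new run begins here
        elif not flagged and start is not None:
            regions.append(Region(start, pos - 1))
            start = None
    if start is not None:                    # run extends to the last residue
        regions.append(Region(start, len(flags)))
    return regions
```

If only a handful of domains per protein are expected, the region list is the 
more compact form; the per-residue form only pays off when each position 
carries its own value.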

"then selectively replaces the CDs with IUPAC 'X'"

This could be implemented as a method that takes the alignment and the storage 
object(s) from your analysis and returns the new alignment. 
Bio::Align::Utilities would be a natural home for that method.
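
As a rough illustration of such a method, here is a sketch in plain Python, 
not BioPerl code. The function name mask_regions, the dict-based alignment 
and the (start, end) tuples are all assumptions made for this example. It 
also assumes that the domain coordinates refer to the ungapped sequence 
(1-based, inclusive), so gaps are skipped when counting residues and are 
never overwritten.

```python
def mask_regions(alignment, regions_by_id, mask_char="X", gap_chars="-."):
    """Return a new alignment with the given regions overwritten by mask_char.

    alignment:      dict mapping sequence id -> gapped aligned string
    regions_by_id:  dict mapping sequence id -> list of (start, end) pairs,
                    1-based inclusive, counted on the ungapped sequence
    """
    masked = {}
    for seq_id, row in alignment.items():
        regions = regions_by_id.get(seq_id, [])
        out = []
        residue_pos = 0  # position in the ungapped sequence
        for ch in row:
            if ch in gap_chars:
                out.append(ch)  # gaps are kept so columns stay aligned
                continue
            residue_pos += 1
            hit = any(start <= residue_pos <= end for start, end in regions)
            out.append(mask_char if hit else ch)
        masked[seq_id] = "".join(out)
    return masked
```

For example, masking residues 2 to 4 of a row "MK-TAYIA" gives "MX-XXYIA"; 
rows without any stored regions come back unchanged, and the input alignment 
is not modified.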

> I am converting this to be a Bioperl module to take advantage of
> AlignIO capabilities to read/write multiple alignment file types.

Good idea.

> There is a .pm package excise_cd.pm, which I have placed in Align
> (along with clustalw.pm etc). It is @ISA Bio::Root::Root. I have not

clustalw.pm is in Bio::AlignIO.  Only modules that are subclasses of 
Bio::AlignIO should go there.

> yet written an I file for it, but recognise the necessity of doing so
> for optimum compatibility with Bioperl.

An I file is needed only if you expect that there will be several 
implementations of the interface.

> Only one method from excise_cd is used outside the module - excise(),
> which takes a SimpleAlign object made with AlignIO in the calling
> program and a hash function with options. The excise method extracts

For modularity, the hash storing all the options needs to be turned into 
reusable objects.

> the sequence data from the SimpleAlign object, data mines the CD
> information and uses the options to guide the overwriting of residues
> with 'X'. excise() (will) then create an AlignIO output object of the
> requested format with the excised alignment. This is then returned to
> the caller, which can write out the excised alignment in the desired
> format.

> I think of this from an external perspective as a CD excising (Xing
> out) and data converting filter for alignment files.

From your earlier description, CD finding was the problem. 
Bio::SimpleAlign::slice does the slicing. On the other hand, from the 
description, I am not sure it is necessary to work with the alignment as a 
whole: it might be best to treat each sequence separately. Of 
course, that depends on the reliability of the alignment and on what you 
have actually aligned!

> Is this a reasonable approach? 

> Would this be an appropriate module and 
> script for me to donate to Bioperl when properly done?

Yes, please.

 -Heikki

> Another question - I data mine from NCBI using only gi identifiers for
> the proteins. I have written my own code to do this. Is there a Bioperl
> way to get CD data for a protein and can this way allow me to
> obtain CD regions for PFAM or other identifiers as well?
>
> Thanks,
> Steve Lenk
> slenk at emich.edu

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho    heikki at_ebi _ac _uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambridge, CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________