[Bioperl-l] Genbank parsers

Heikki Lehvaslaiho heikki@ebi.ac.uk
Wed, 27 Mar 2002 08:31:12 +0000


Hong Qin wrote:
> 
> Hi all,
> 
> You can tell my laziness from this question.  Could someone suggest a good
> parser to take CDS sequences from genbank formatted files. (The *.gbff file
> from NCBI FTP site).  If the output file is FASTA, it would be great.


I think this should be in the FAQ. Elia's answer gives the right pointers on
how to do it.

However, the problem is a bit more complex than that. Quite often
the CDS feature contains a join statement:

FT   CDS   join(U21925.1:818..987,U21926.1:258..420,
FT         U21927.1:428..520,U21928.1:196..336,U21929.1:279..415,
FT         U21930.1:895..1014,516..708)

and unless you can fetch the referenced entries from a random-access
data store, you cannot do it.
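
To make the difficulty concrete, here is a minimal sketch (plain Python
rather than Bioperl, standard library only) that splits a join location
like the one above into segments and lists the accessions that would
have to be fetched from elsewhere. The names parse_join and
remote_accessions are hypothetical, not part of any toolkit, and the
sketch assumes only simple "ACC:start..end" or "start..end" segments;
complement(), fuzzy ends such as "<1" and nested operators are not
handled.

```python
import re

# One segment: optional "ACCESSION.VERSION:" prefix, then "start..end".
SEGMENT = re.compile(
    r"^(?:(?P<acc>[A-Za-z0-9_.]+):)?(?P<start>\d+)\.\.(?P<end>\d+)$"
)


def parse_join(location):
    """Return a list of (accession_or_None, start, end) tuples.

    Hypothetical helper: accession is None when the segment refers to
    the entry that carries the feature itself.
    """
    # Location strings wrap across FT lines, so drop all whitespace first.
    text = "".join(location.split())
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = []
    for part in text.split(","):
        match = SEGMENT.match(part)
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append(
            (match.group("acc"), int(match.group("start")), int(match.group("end")))
        )
    return segments


def remote_accessions(segments):
    """List the distinct accessions that live in other entries, in order."""
    seen = []
    for acc, _start, _end in segments:
        if acc is not None and acc not in seen:
            seen.append(acc)
    return seen
```

For the CDS above, six of the seven segments point at other entries and
only the last one (516..708) is local, so a parser that sees a single
entry cannot assemble the coding sequence by itself.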

This would be a nice task for someone wanting to start programming in
bioperl: a Bio::Tools::CDSExtractor that would use Bio::SeqIO and
Bio::DB::BioFetch (using the Registry would be even better). A more generic
module would be Bio::Tools::SeqFeatureExtractor.
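
The shape of such an extractor might look roughly like the sketch
below. It is written in plain Python, not Bioperl, and every name in it
(extract_cds, to_fasta, the fetch callback) is hypothetical. The fetch
callback stands in for whatever random-access store is available (the
role Bio::DB::BioFetch would play in the proposal); here it is assumed
to take an accession and return the full sequence as a string.
Coordinates are taken to be 1-based and inclusive, as in the feature
table, and reverse-strand features are not handled.

```python
def extract_cds(segments, local_seq, fetch):
    """Assemble a CDS from (accession_or_None, start, end) segments.

    segments  -- location pieces in order; accession None means "this entry"
    local_seq -- sequence of the entry that carries the CDS feature
    fetch     -- callable mapping an accession to its sequence string
                 (assumed interface, standing in for a remote database)
    """
    pieces = []
    for acc, start, end in segments:
        source = local_seq if acc is None else fetch(acc)
        # Feature-table coordinates are 1-based and inclusive.
        pieces.append(source[start - 1:end])
    return "".join(pieces)


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

With an in-memory dict as the store, passing store.__getitem__ as fetch
is enough to try it out; a missing accession then raises KeyError, which
is the situation described above when the needed entry cannot be
fetched.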

The EMBOSS program that can do this (provided the sequence database is
populated) is coderet:
http://www.uk.embnet.org/Software/EMBOSS/Apps/coderet.html


	-heikki


> Thanks a lot,
> 
> Hong

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________