[Bioperl-l] Genbank parsing using Bioperl

Fri Apr 21 15:26:53 UTC 2006

I'm adding my 2c since I've got a bit of time on my hands.  I found most of
these answers by looking through the mailing list archives (now searchable
through Gmane) and the BioPerl wiki.

I believe Sean pointed out the HOWTO on the BioPerl wiki: 

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Getting_Sequences

In theory, you should be able to work out from a CDS feature which gene
feature or transcript it belongs to, and normally vice versa.  I may be
wrong (I work mainly with bacterial genome sequences), but I believe this
depends entirely on how well the features are annotated, which can vary
greatly between sequencing centers, so it can be tricky depending on the
source of the GenBank file.  I would instead try a database that is well
curated and has a consistent interface across different genome projects,
such as Ensembl, which Sean suggested:

http://www.ensembl.org/index.html

You can use the Ensembl Perl API to retrieve data from Ensembl databases:

http://www.ensembl.org/info/software/core/core_tutorial.html

You could also have a look at Entrez Gene; Brian's working on modules (in
CVS) for retrieving and parsing Entrez Gene's output:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene

You'll need the Bio::ASN1 parser for Brian's modules:

http://sourceforge.net/projects/egparser

Both Ensembl and Entrez Gene are regularly updated with transcript and
protein information and are likely what you are looking for.
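
If you do stay with the RefSeq GenBank files, here is a rough sketch of one
way to tie each CDS back to its transcript when a gene has several mRNA/CDS
features.  It is a heuristic of my own, not something BioPerl or NCBI
prescribes: within one gene, a CDS is matched to every mRNA whose exons
contain all of the CDS exons and whose splice junctions include all of the
CDS junctions.  It is written in plain Python on toy data so it runs without
any library; the feature dictionaries, the gene name and the IDs (TX_A, PR_A
and so on) are hypothetical.  With real files the IDs would come from the
/transcript_id and /protein_id qualifiers and the exon coordinates from each
feature's join() location.

```python
from collections import defaultdict


def junctions(exons):
    """Return the (exon end, next exon start) pairs between consecutive exons."""
    ordered = sorted(exons)
    return {(a[1], b[0]) for a, b in zip(ordered, ordered[1:])}


def cds_fits_mrna(cds_exons, mrna_exons):
    """Heuristic: every CDS exon lies inside an mRNA exon, and every CDS
    splice junction is also a splice junction of the mRNA."""
    contained = all(
        any(m_start <= c_start and c_end <= m_end for m_start, m_end in mrna_exons)
        for c_start, c_end in cds_exons
    )
    return contained and junctions(cds_exons) <= junctions(mrna_exons)


def pair_transcripts_to_proteins(features):
    """Map each CDS id to the ids of the compatible mRNAs of the same gene."""
    by_gene = defaultdict(lambda: {"mRNA": [], "CDS": []})
    for feat in features:
        if feat["type"] in ("mRNA", "CDS"):
            by_gene[feat["gene"]][feat["type"]].append(feat)

    pairs = {}
    for group in by_gene.values():
        for cds in group["CDS"]:
            pairs[cds["id"]] = [
                mrna["id"]
                for mrna in group["mRNA"]
                if cds_fits_mrna(cds["exons"], mrna["exons"])
            ]
    return pairs


# Hypothetical gene with two splice variants; variant B skips the middle exon.
features = [
    {"type": "mRNA", "gene": "GENE1", "id": "TX_A",
     "exons": [(100, 200), (300, 400), (500, 650)]},
    {"type": "mRNA", "gene": "GENE1", "id": "TX_B",
     "exons": [(100, 200), (500, 650)]},
    {"type": "CDS", "gene": "GENE1", "id": "PR_A",
     "exons": [(150, 200), (300, 400), (500, 600)]},
    {"type": "CDS", "gene": "GENE1", "id": "PR_B",
     "exons": [(150, 200), (500, 600)]},
]

pairs = pair_transcripts_to_proteins(features)
for protein_id, transcript_ids in pairs.items():
    print(protein_id, "->", transcript_ids)
```

A CDS that matches more than one mRNA (for example, variants that differ
only in their UTR exons) will list several transcripts, and one that matches
none will get an empty list, so those cases would still need checking by
hand or against Ensembl/Entrez Gene.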

Chris

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Prabu R
> Sent: Friday, April 21, 2006 8:25 AM
> To: bioperl-l at lists.open-bio.org; Sean Davis
> Subject: Re: [Bioperl-l] Genbank parsing using Bioperl
> 
> Dear All,
> 
> I feel sorry for making a small mistake in my earlier mail
> 
> I am not actually using Genbank releases, But Refseq Genome build gbk
> files
> of NCBI (ftp.ncbi.nih.gov/genomes/)
> 
> Those files are genbank formatted and contains Refseq IDs.
> 
> Kindly help.
> 
> R. Prabu
> 
> ----------------------------
> Dear all!
> 
> I am a novice bioperl user, trying to parse Genbank files with Bioperl
> modules to get some specific features and details.
> 
> Anyone please tell me, whether we can retrieve a Gene, its Transcript ID
> and
> its Protein ID from the Genbank file.
> 
> I mainly need to extract with one to one relationship between TranscriptID
> and Protein ID.
> 
> I was trying this. I was able to take these details if the gene is not
> alternatively spliced.
> 
> If a gene contains multiple mRNA/CDS feature, I am not able to build the
> relationship between Transcript and its Protein.
> 
> Kindly help me to find out whether this is possible in Bioperl.
> 
> Thanks in advance,
> 
> R. Prabu
> 