[Bioperl-l] whole genome annotation

Fri Jul 28 11:59:17 UTC 2006

Richard Birnie wrote:
> Hello all,
> 
> I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point me in the right direction through the labyrinth. This may become a little long-winded but I'll try and get all the annoying newbie questions out of the way in one go.
> 
> Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html). This data is simplified to record only gain/loss/amplification of whole chromosome bands at 862-band resolution, to facilitate combining data from multiple studies.
> 
> What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways.
> 
> I appreciate that is a bit vague but that's sort of my problem. I'm not really a bioinformatician so I'm not sure of the details of what I want. I just happen to have a question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either; a few genbank files for my favourite genes is fine but I need some guidance to work on this scale.
> 
> What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code, just the general concept for each step and which tools from the bioperl set would be most appropriate for each step, so that I can focus what I need to read about. Even a little pseudo-code, if I'm lucky. If I can get the genome data downloaded and set up properly I'll work out how to apply the CGH data to it myself.
> 
> If example code for what I'm trying to describe is included somewhere, great; could someone point me to where?

Hi, Richard.

Bioperl is good for many things, but for simply grabbing all the 
locations of human genes in the genome and chromosome band locations, I 
wouldn't use bioperl.  It sounds to me like you are interested in 
getting the genes associated with each chromosomal band?  If so, just 
download the cytoBand.txt and refFlat.txt files from the UCSC genome
browser site.  cytoBand.txt contains the base pair locations for each of
the cytobands.  refFlat.txt contains the base pair locations of "RefSeq"
genes.  It is then simply a matter of finding overlapping regions (genes 
with cytobands) to determine which genes are in which cytobands.  Since 
the files are tab-delimited text, they are very easy to work with (in 
perl, excel, python, ...).  Don't get me wrong--I really appreciate the
power of bioperl, but in this case, your task lends itself to a simpler
(and MUCH faster) approach.
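
As a rough illustration of the overlap step, here is a minimal Python
sketch.  It assumes the usual UCSC column layouts (cytoband file: chrom,
start, end, band name, stain; refFlat: gene name, transcript name, chrom,
strand, txStart, txEnd, ...) and 0-based, half-open coordinates.  The
function names are made up for this example, so check the column order
against the files you actually download.

```python
from collections import defaultdict


def read_bands(lines):
    """Parse cytoband rows into {chrom: [(start, end, band_name), ...]}.

    Assumed columns: chrom, start, end, band name, stain.
    """
    bands = defaultdict(list)
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) < 4 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        bands[row[0]].append((int(row[1]), int(row[2]), row[3]))
    return bands


def read_genes(lines):
    """Yield (gene_name, chrom, tx_start, tx_end) from refFlat rows.

    Assumed columns: geneName, name, chrom, strand, txStart, txEnd, ...
    """
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) < 6 or row[0].startswith("#"):
            continue
        yield row[0], row[2], int(row[4]), int(row[5])


def genes_by_band(bands, genes):
    """Return {(chrom, band_name): set of gene names overlapping the band}."""
    result = defaultdict(set)
    for gene, chrom, tx_start, tx_end in genes:
        for start, end, name in bands.get(chrom, []):
            # Half-open intervals overlap when each starts before the
            # other ends.
            if tx_start < end and start < tx_end:
                result[(chrom, name)].add(gene)
    return result


def genes_by_band_from_files(band_path, gene_path):
    """Convenience wrapper; both paths are supplied by the caller."""
    with open(band_path) as band_fh, open(gene_path) as gene_fh:
        return genes_by_band(read_bands(band_fh), read_genes(gene_fh))
```

Calling genes_by_band_from_files("cytoBand.txt", "refFlat.txt") on the
unzipped downloads (file names assumed here) gives a dictionary keyed by
(chromosome, band).  A gene that straddles a band boundary is reported
under both bands.  The nested loop is simple rather than fast, but with
only a few hundred bands per genome it should be adequate.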

Sean