[Bioperl-l] whole genome annotation

Fri Jul 28 09:39:34 UTC 2006

Hello all,

I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point me in the right direction through the labyrinth. This may become a little long-winded, but I'll try to get all the annoying newbie questions out of the way in one go.

Let me try to explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html). The data are simplified to record only gain/loss/amplification of whole chromosome bands at 862-band resolution, which makes it easier to combine data from multiple different studies.

What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and, preferably, of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal is to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome (in this case a complete 'normal' human genome), and to see what effect the deletions have on the metabolic pathways.
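To make the deletion step a bit more concrete, here is a minimal sketch of the idea in plain Python (not BioPerl, just to show the logic). Everything in it is an assumption for illustration: the Feature class, the apply_losses function, the gene names and all the coordinates are made up, and real band and gene positions would have to come from a genome annotation source. It only handles losses, by dropping every gene that overlaps a band reported as lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A named interval on a chromosome (1-based, inclusive coordinates)."""
    name: str
    chrom: str
    start: int
    end: int


def overlaps(a: Feature, b: Feature) -> bool:
    """True if the two features share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def apply_losses(genes, bands, lost_band_names):
    """Return the genes that do not overlap any band reported as lost."""
    lost = [b for b in bands if b.name in lost_band_names]
    return [g for g in genes if not any(overlaps(g, b) for b in lost)]


# Hypothetical coordinates, NOT real genome positions.
bands = [
    Feature("1p36.3", "1", 1, 1000),
    Feature("1p36.2", "1", 1001, 2000),
]
genes = [
    Feature("GENE_A", "1", 100, 400),     # entirely inside 1p36.3
    Feature("GENE_B", "1", 900, 1100),    # straddles the band boundary
    Feature("GENE_C", "1", 1500, 1800),   # entirely inside 1p36.2
]

# Mimic a CGH record reporting loss of band 1p36.3.
remaining = apply_losses(genes, bands, {"1p36.3"})
print([g.name for g in remaining])
```

In this sketch a gene that only partly overlaps a lost band (GENE_B) is treated as deleted. Whether that is the right rule, and how gains and amplifications should be represented, is one of the things I'm unsure about.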

I appreciate that is a bit vague, but that's sort of my problem: I'm not really a bioinformatician, so I'm not sure of the details of what I want. I just happen to have a question to answer and BioPerl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main BioPerl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm OK with basic Perl, although I haven't done much object-oriented stuff. I don't really have much experience with handling sequence data on a whole-genome scale either; a few GenBank files for my favourite genes is fine, but I need some guidance to work on this scale.

What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code, just the general concept for each step and which tools from the BioPerl set would be most appropriate for each step, so that I can focus on what I need to read about, or even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and set up properly, I'll work out how to apply the CGH data to it myself.

If example code for what I'm trying to describe is already included somewhere, great; could someone point me to it?

Thanks for your patience.
best regards,
Richard

Dr Richard Birnie
Scientific Officer
Section of Pathology and Tumour Biology
Welcome Brenner Building, LIMM
St James University Hospital
Beckett St, Leeds, LS9 7TF
Tel:0113 3438624
e-mail: r.birnie at leeds.ac.uk