[Bioperl-l] Concatenating Bacterial Genome Sequence

Lin, Xiaoying Xiaoying.Lin at celera.com
Wed Nov 5 09:47:57 EST 2003


I believe NCBI's 'ref' genomes are presented as one sequence per
chromosome. If the genome you are interested in is published, it should
be in that collection.
They are in GenBank format; you can convert them to EMBL with SeqIO if
that is the only format your script takes.

If the genome is not in NCBI genomes yet, I found it easier to load
everything into the database first and do the merging afterwards: the
sequence first, then the feature coordinates. If there are overlaps
between pieces, you have to watch out for features that fall inside an
overlap or span the boundary of an overlap, and decide which sequence to
use if the overlapping sequences differ. This is often an issue when
dealing with BACs, but the pieces in your case were presumably cut
artificially from the chromosome, so the overlapping regions should be
identical and it may not be a problem at all. Depending on how your
database is set up, this script should be about a day's work. I did not
use bioperl (then at version 0.7) when I did this.
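
As a rough illustration of the merge order described above (sequence
first, then feature coordinates), here is a minimal Python sketch. It is
not the script mentioned in this message and uses neither bioperl nor a
database. The names Segment, Feature and merge_segments are hypothetical,
and it assumes the segments are already in genome order, that every
junction shares the same fixed number of overlapping bases, and that
features are simple start/end ranges on one strand.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    # Hypothetical minimal feature: 1-based, inclusive coordinates.
    name: str
    start: int
    end: int


@dataclass
class Segment:
    # One piece of the genome plus the features annotated on it.
    seq: str
    features: List[Feature] = field(default_factory=list)


def merge_segments(segments: List[Segment],
                   overlap: int = 0) -> Tuple[str, List[Feature]]:
    """Join ordered segments and shift features to whole-genome coordinates.

    Assumption: each segment after the first repeats the last `overlap`
    bases of the previous one, and every segment is longer than `overlap`.
    """
    parts: List[str] = []
    merged: List[Feature] = []
    seen = set()      # (name, start, end) already emitted
    total_len = 0     # length of the genome built so far
    prev_seq = ""

    for i, seg in enumerate(segments):
        skip = overlap if i > 0 else 0
        # Refuse to merge if the overlapping bases differ; choosing
        # between two differing copies needs a human decision.
        if skip and prev_seq[-skip:] != seg.seq[:skip]:
            raise ValueError(f"segments {i - 1} and {i} differ in their overlap")

        # Segment position p maps to genome position p + offset.
        offset = total_len - skip
        parts.append(seg.seq[skip:])
        total_len += len(seg.seq) - skip

        for feat in seg.features:
            shifted = Feature(feat.name, feat.start + offset, feat.end + offset)
            key = (shifted.name, shifted.start, shifted.end)
            # A feature annotated on both copies of an overlap lands on
            # the same genome coordinates twice; keep only the first.
            if key not in seen:
                seen.add(key)
                merged.append(shifted)

        prev_seq = seg.seq

    return "".join(parts), merged
```

With no overlaps (overlap=0) this reduces to plain concatenation with a
running offset. The sketch ignores the circular origin, strand, and
multi-part locations, which a real EMBL merge would also have to handle.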

Regards,

Xiaoying 
-----------
Xiaoying Lin, PhD
Senior Manager 
Celera Genomics 
45 West Gude Drive, Rockville, MD 20850 
240-453-3695, 240-453-3768 (FAX), Xiaoying.Lin at celera.com 


> -----Original Message-----
> From: michael watson (IAH-C) [mailto:michael.watson at bbsrc.ac.uk] 
> Sent: Wednesday, November 05, 2003 6:51 AM
> To: 'ensembl-dev at ebi.ac.uk'
> Cc: Bioperl
> Subject: [Bioperl-l] Concatenating Bacterial Genome Sequence
> 
> 
> Hi
> 
> First of all, apologies for posting to both lists at once. I 
> realise a lot of people will get this e-mail twice, but I 
> believe this question is of relevance to both lists.
> 
> Those of you on the ensembl list will be familiar with my 
> (successful!) attempts to put the Salmonella genome into an 
> ensembl (well, actually, an otter) database - the 
> parse_pathogen script, by and large, worked very well and I 
> have a (mostly) functional website.
> 
> The problem comes from the fact that the EMBL entries for the 
> bacterial genomes I am interested in consist of many 
> different sequences which represent segments of the genome.  
> So parse_pathogen handles this by creating a new ensembl 
> "chromosome" for each segment.  Of course these bacterial 
> genomes are circular and constant, so splitting them up into 
> chromosomes doesn't make too much sense, but I can get away 
> with it most of the time with typhi CT18, which is in 20 
> pieces, and typhi Ty2, which is in 16 pieces, but when I come 
> to typhimurium LT2, this is in 220 pieces. If I want to pose 
> the question "Are these two genes adjacent on the genome?", 
> normally a very simple task using ensembl, I will have to do 
> some jumping through hoops figuring out if the genes are at 
> the end of segments, and if so, what are the adjacent 
> segments and are the genes adjacent on the genome but on two 
> different segments... 
> 
> So what would be really great, and this is where bioperl 
> (maybe) comes in, is something that takes the EMBL entry for 
> the S.typhimurium genome, which is actually 220 EMBL 
> sequences, and creates a single EMBL sequence entry for the 
> whole genome, with all the features updated so that their 
> location is relative to the start of the whole genome, and 
> not just of the segment they are on.   Has anyone done this, 
> and would they care to share?  If not, any comments on how 
> difficult/easy this might be using Bioperl would be welcome.
> 
> Regards
> 
> Mick
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org 
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> 


