[Biopython-dev] Fwd: contributing comparative genomics tools

Fri Aug 4 10:38:52 UTC 2006

Begin forwarded message:

> From: Christopher Lee <leec at chem.ucla.edu>
> Date: August 3, 2006 9:11:42 PM EDT
> To: biopython-dev-owner at lists.open-bio.org
> Subject: Fwd: contributing comparative genomics tools
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
> there appears to be an error in your code submission instructions  
> on the biopython.org/wiki, or in the configuration of the biopython- 
> dev list server.  The code submission instructions tell me to  
> submit my proposal by email to biopython-dev at biopython.org, but the  
> list server responds by saying that all mail will automatically be  
> rejected!  Please forward this proposal to the appropriate people  
> (presumably biopython-dev?), and let me know that you have done  
> so.  Otherwise I won't have any way of knowing whether anyone even  
> reads this email address...
>
> Yours with thanks,
>
> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>
> Begin forwarded message:
>
>> You are not allowed to post to this mailing list, and your message  
>> has
>> been automatically rejected.  If you think that your messages are
>> being rejected in error, contact the mailing list owner at
>> biopython-dev-owner at lists.open-bio.org.
>>
>>
>> From: Christopher Lee <leec at chem.ucla.edu>
>> Date: August 3, 2006 3:55:52 PM PDT
>> To: biopython-dev at biopython.org
>> Cc: Namshin Kim <deepreds at gmail.com>
>> Subject: contributing comparative genomics tools
>>
>>
>> Hi Biopython developers,
>> I'd like to contribute some Python tools that my lab has been  
>> developing for large-scale comparative genomics database query.   
>> These tools make it easy to work with huge multigenome alignment  
>> databases (e.g. the UCSC Genome Browser multigenome alignments)  
>> using a new disk-based interval indexing algorithm that gives very  
>> high performance with minimal memory usage.  e.g. whereas queries  
>> of the UCSC 17genome alignment typically take about 30 sec. per  
>> query using MySQL, the same query takes about 200 microsec. per  
>> query, making it possible to run huge numbers of queries for  
>> genome-wide studies.
>>
>> Here's an example usage (click the URL or just look at the code  
>> below)
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/seq- 
>> align.html#SECTION000125000000000000000
>>
>> We've tested this code very extensively in our own research, and  
>> it has had four open source releases so far.  At this point the  
>> code is in production use.  All the code is compatible back to  
>> Python version 2.2, but not 2.1 or before (we use generators).   
>> There is C code (accessed as Python classes) for the high- 
>> performance interval database index.  For details of history see  
>> the website
>> http://www.bioinformatics.ucla.edu/pygr
>>
>> There is also extensive tutorial and reference documentation:
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/
>>
>> Let me know what questions you have, and what process we would  
>> need to follow to contribute this code.
>>
>> Yours with best wishes,
>>
>> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>>
>>
>> ####### EXAMPLE USAGE
>> from pygr import cnestedlist
>> msa=cnestedlist.NLMSA('/usr/tmp/ucscDB/mafdb','r') # OPEN THE  
>> ALIGNMENT DB
>>
>> def printResults 
>> (prefix,msa,site,altID='NULL',cluster_id='NULL',seqNames=None):
>>     'get alignment of each genome to site, print %identity and % 
>> aligned'
>>     for src,dest,edge in msa[site].edges(mergeMost=True): #  
>> ALIGNMENT QUERY!
>>         print '%s\t%s\t%s\t%s\t%2.1f\t%2.1f\t%s\t%s' \
>>               %(altID,cluster_id,prefix,seqNames[dest],
>>                 100.*edge.pIdentity(),100.*edge.pAligned(),src[: 
>> 2],dest[:2])
>>
>> def getAlt3Conservation(msa,gene,start1,start2,stop,**kwargs):
>>     'gene must be a slice of a sequence in our genome alignment msa'
>>     ss1=gene[start1-2:start1] # USE SPLICE SITE COORDINATES
>>     ss2=gene[start2-2:start2]
>>     ss3=gene[stop:stop+2]
>>     e1=ss1+ss2 # GET INTERVAL BETWEEN PAIR OF SPLICE SITES
>>     e2=gene[max(start1,start2):stop] # GET INTERVAL BETWEEN e1 AND  
>> stop
>>     zone=e1+ss3 # USE zone AS COVERING INTERVAL TO BUNDLE fastacmd  
>> REQUESTS
>>     cache=msa[zone].keys(mergeMost=True) # PYGR BUNDLES REQUESTS  
>> TO MINIMIZE TRAFFIC
>>     for prefix,site in [('ss1',ss1),('ss2',ss2),('ss3',ss3), 
>> ('e1',e1),('e2',e2)]:
>>         printResults(prefix,msa,site,seqNames=~ 
>> (msa.seqDict),**kwargs)
>>
>> # RUN A QUERY LIKE THIS...
>> # getAlt3Conservation(msa,some_gene,some_start,other_start,stop)
>>
>> ############ EXPLANATION & NOTES
>> David Haussler's group has constructed alignments of multiple  
>> genomes. These alignments are extremely useful and interesting,  
>> but so large that it is cumbersome to work with the dataset using  
>> conventional methods. For example, for the 8-genome alignment you  
>> have to work simultaneously with the individual genome datasets  
>> for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish, as  
>> well as the huge alignment itself. Pygr makes this quite easy.  
>> Here we illustrate an example of mapping an alternative 3' exon,  
>> which has two alternative splice sites (start1 and start2) and a  
>> single terminal splice site (stop). We use the alignment database  
>> to map each of these splice sites onto all the aligned genomes,  
>> and to print the percent-identity and percent-aligned for each  
>> genome, as well as the two nucleotides consituting the splice site  
>> itself. To examine the conservation of the two exonic regions  
>> (between start1 and start2, and the adjacent region terminated by  
>> stop, we print the same information for each genome's alignment to  
>> these two regions as well. The code first opens the alignment  
>> database. The function (getAlt3Conservation) obtains sequence  
>> slice objects representing the various ``sites'' to be queried.  
>> The actual alignment database query is performed in printResults:
>>
>>     * The alignment database query is in the first line of  
>> printResults(). msa is the database; site is the interval query;  
>> and the edges methods iterates over the results, returning a tuple  
>> for each, consisting of a source sequence interval (i.e. an  
>> interval of site), a destination sequence interval (i.e. an  
>> interval in an aligned genome), and an edge object describing that  
>> alignment. We are taking advantage of Pygr's group-by operator  
>> mergeMost, which will cause multiple intervals in a given sequence  
>> to be merged into a single interval that constitutes their  
>> ``union''. Thus, for each aligned genome, the edges iterator will  
>> return a single aligned interval. The alignment edge object  
>> provides some useful conveniences, such as calculating the percent- 
>> identity between src and dest automatically for you. pIdentity()  
>> computes the fraction of identical residues; pAligned computes the  
>> fraction of aligned residues (allowing you to see if there are big  
>> gaps or insertions in the alignment of this interval). If we had  
>> wanted to inspect the detailed alignment letter by letter, we  
>> would just iterate over the letters attribute instead of the edges  
>> method. (See the NLMSASlice documentation for further information).
>>
>>     * src[:2] and dest[:2] print the first two nucleotides of the  
>> site in gene and in the aligned genome.
>>
>>     * it's worth noting that the actual sequence string  
>> comparisons are being done using a completely different database  
>> mechanism (formerly NCBI's fastacmd, now our own (much faster)  
>> pureseq text format), not the cnestedlist database. Basically,  
>> each genome is being queried as a separate BLAST formatted  
>> database, represented in Pygr by the BlastDB class. Pygr makes  
>> this complex set of multi-database operations more or less  
>> transparent to the user. For further information, see the BlastDB  
>> documentation.
>>
>>     * The other operations here are entirely vanilla: mainly  
>> slicing a gene sequence to obtain the specific sites that we want  
>> to query. Note: gene must itself be a slice of a sequence in our  
>> alignment, or the alignment query msa[site] will raise an  
>> IndexError informing the user that the sequence site is not in the  
>> alignment.
>>
>>     * The only slightly interesting operation here is the use of  
>> interval addition to obtain the ``union'' of two intervals, e.g.  
>> e1=ss1+ss2. This obtains a single interval that contains both of  
>> the input intervals.
>>
>>     * When the print statement requests str() representations of  
>> these sequence objects, Pygr uses fastacmd -L to extract just the  
>> right piece of the corresponding chromosomes from the eight BLAST  
>> databases.
>>
>> (Actually, because of Pygr's caching / optimizations, considerably  
>> more is going on than indicated in this simplified sketch. But you  
>> get the idea: Pygr makes it relatively effortless to work with a  
>> variety of disparate (and large) resources in an integrated way.)
>>
>> Here is some example output:
>>
>> 1       Mm.99996        ss1     hg17    50.0    100.0   AG      GG
>> 1       Mm.99996        ss1     canFam1 50.0    100.0   AG      GG
>> 1       Mm.99996        ss1     panTro1 50.0    100.0   AG      GG
>> 1       Mm.99996        ss1     rn3     100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     hg17    100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     canFam1 100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     panTro1 100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     rn3     100.0   100.0   AG      AG
>> 1       Mm.99996        ss3     hg17    100.0   100.0   GT      GT
>> 1       Mm.99996        ss3     canFam1 100.0   100.0   GT      GT
>> 1       Mm.99996        ss3     panTro1 100.0   100.0   GT      GT
>> 1       Mm.99996        ss3     rn3     100.0   100.0   GT      GT
>> 1       Mm.99996        e1      hg17    78.9    100.0   AG      GG
>> 1       Mm.99996        e1      canFam1 84.2    100.0   AG      GG
>> 1       Mm.99996        e1      panTro1 77.6    100.0   AG      GG
>> 1       Mm.99996        e1      rn3     97.4    98.7    AG      AG
>> 1       Mm.99996        e2      hg17    91.6    99.1    CC      CC
>> 1       Mm.99996        e2      canFam1 88.8    99.1    CC      CC
>> 1       Mm.99996        e2      panTro1 91.6    99.1    CC      CC
>> 1       Mm.99996        e2      rn3     97.2    100.0   CC      CC
>>
>>
>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (Darwin)
>>
>> iD8DBQFE0n8GLQ4dB3bqQz4RApcxAKCIHdZ9mttB1uC4HkY3xXEw1cWYswCeIg4i
>> xhxE2zrffLaiCjSiEp4Eo6k=
>> =BeOe
>> -----END PGP SIGNATURE-----
>>
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (Darwin)
>
> iD8DBQFE0p7iLQ4dB3bqQz4RAkzJAJ4wxiZqi7lZGBUMTFwyquGOCajiKQCfUDBm
> Wx/4AIstFjb+rbqY2QBppLg=
> =fghY
> -----END PGP SIGNATURE-----