[Biopython-dev] [Biopython - Feature #3258] (New) phastCons score parser

redmine at redmine.open-bio.org redmine at redmine.open-bio.org
Sat Jun 25 00:32:31 UTC 2011


Issue #3258 has been reported by Beisi Xu.

----------------------------------------
Feature #3258: phastCons score parser
https://redmine.open-bio.org/issues/3258

Author: Beisi Xu
Status: New
Priority: Normal
Assignee: Beisi Xu
Category: Main Distribution
Target version: 1.57
URL: 


usage:

First download the chr*.phastCons46way.placental.wigFix.gz files:

mkdir -p /home/user/data/hg19/phastcons/
cd /home/user/data/hg19/phastcons/
for i in `seq 1 22` X Y
do
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/placentalMammals/chr${i}.phastCons46way.placental.wigFix.gz
done  

For a quick test, it is enough to download only chr21: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/placentalMammals/chr21.phastCons46way.placental.wigFix.gz

test_phast.py is included with the source code.

#########

Reading a 30M (80k lines) gzipped phastCons file takes about 100 seconds. The parser stores the offset of each record, which allows quick searching,
so a compressed phastCons file needs very little memory.

The uncompressed file format takes more memory. To limit this, only one chromosome is loaded at a time, so scoring is efficient and uses little memory when you score one chromosome after another. It takes more time if your queries alternate between chromosomes, for example: chr1 1, chr2 1, chr1 2, chr2 3
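
The offset idea described above can be sketched roughly as follows. This is only an illustration, not the code submitted with this issue: the names build_index and lookup are hypothetical, the sample scores are made up, and it assumes that fixedStep blocks for a chromosome appear in ascending start order with no blank lines between data values.

```python
import bisect
import io


def build_index(handle):
    """Map chrom -> list of [start, step, count, offset], one per fixedStep block.

    `handle` must be a seekable binary file object. `offset` is the byte
    position of the first data line of the block.
    """
    index = {}
    block = None
    for raw in iter(handle.readline, b""):
        line = raw.decode("ascii").strip()
        if line.startswith("fixedStep"):
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            # Assumption: step defaults to 1 if the field is missing.
            block = [int(fields["start"]), int(fields.get("step", 1)), 0, handle.tell()]
            index.setdefault(fields["chrom"], []).append(block)
        elif block is not None and line:
            block[2] += 1  # count data lines in the current block
    return index


def lookup(handle, index, chrom, position):
    """Return the score at a one-based position, or None if it is not covered."""
    blocks = index.get(chrom, [])
    i = bisect.bisect_right([b[0] for b in blocks], position) - 1
    if i < 0:
        return None
    start, step, count, offset = blocks[i]
    n, remainder = divmod(position - start, step)
    if remainder or n >= count:
        return None
    handle.seek(offset)
    for _ in range(n):  # skip the lines before the requested position
        handle.readline()
    return float(handle.readline().decode("ascii"))


# Made-up data, only to exercise the functions.
data = (
    b"fixedStep chrom=chr21 start=100 step=1\n"
    b"0.1\n0.2\n0.3\n"
    b"fixedStep chrom=chr21 start=500 step=1\n"
    b"0.9\n0.8\n"
)
handle = io.BytesIO(data)
index = build_index(handle)
print(lookup(handle, index, "chr21", 101))
print(lookup(handle, index, "chr21", 501))
print(lookup(handle, index, "chr21", 300))
```

Only the index is kept in memory, not the scores. The same functions should also accept a handle from gzip.open(path, "rb"), since gzip file objects support seek() and tell() on the decompressed stream, although seeking backwards there rewinds and decompresses again, so queries sorted by position are likely to be faster.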

Result for hg19 46-way:

$ time python test_phast.py

    chr21 9411193 0
    chr21 9411194 0.053
    chr21 9411195 0.044
    chr21 9448727 0.009
    chr21 9448728 0
    chr21 9448729 0
    chr21 9448878 0
    chr21 9448879 0.002
    chr21 9448880 0.004

    real    1m47.140s
    user    1m46.996s
    sys     0m0.079s

#########

http://genome.ucsc.edu/goldenPath/help/phastCons.html

phastCons File Format

phastCons data files contain the compressed conservation scores that underlie the Conservation annotation track and the phastCons table. For a detailed description of the algorithm used to produce the scores, see the Genome Browser description page associated with the Conservation track.

File Format (assemblies released Nov. 2004 and later)

When uncompressed, the file contains a declaration line and one column of data in wiggle table fixed-step format:

  fixedStep chrom=scaffold_1 start=3462 step=1
  0.0978
  0.1588
  0.1919
  0.1948
  0.1684

1. Declaration line: The declaration line specifies the starting point of the data in the assembly. It consists of the following fields:

    * fixedStep -- keyword indicating the wiggle track format used to write the data. In fixed step format, the data is single-column with a fixed interval between values.
    * chrom -- chromosome or scaffold on which first value is located.
    * start -- position of first value on chromosome or scaffold specified by chrom. NOTE: Unlike most Genome Browser coordinates, these are one-based.
    * step -- size of the interval (in bases) between values.

A new declaration line is inserted in the file when the chrom value changes, when a gap is encountered (requiring a new start value), or when the step interval changes.

2. Data lines: The first data value below the header shows the score corresponding to the position specified in the header. Subsequent score values step along the assembly in one-base intervals. The score shows the posterior probability that phastCons's phylogenetic hidden Markov model (HMM) is in its most-conserved state at that base position.
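
As an illustration of the fixed-step layout described above, here is a minimal Python sketch that turns declaration and data lines into (chrom, position, score) tuples. The function name iter_fixed_step is hypothetical and is not part of Biopython or of the code submitted with this issue; the second declaration line in the sample is invented to show a gap, and treating a missing step field as 1 is an assumption.

```python
import io


def iter_fixed_step(handle):
    """Yield (chrom, position, score) from wiggle fixedStep text.

    Positions are one-based, as given by the declaration line.
    """
    chrom = None
    position = None
    step = 1
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # A new declaration line resets chrom, start and step.
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            position = int(fields["start"])
            step = int(fields.get("step", 1))  # assumption: default to 1
            continue
        if chrom is None:
            raise ValueError("data line before any fixedStep declaration")
        yield chrom, position, float(line)
        position += step


example = io.StringIO(
    "fixedStep chrom=scaffold_1 start=3462 step=1\n"
    "0.0978\n"
    "0.1588\n"
    "0.1919\n"
    "fixedStep chrom=scaffold_1 start=4000 step=1\n"
    "0.1948\n"
)
records = list(iter_fixed_step(example))
for chrom, position, score in records:
    print(chrom, position, score)
```

The first score is assigned to the start position itself, and each later score advances by step until the next declaration line resets the position.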

File Format (assemblies prior to Nov. 2004)

When uncompressed, the data file contains two columns:

  294   0.0953
  295   0.0948
  296   0.0943
  297   0.0936
  298   0.0929
  299   0.0921

Column #1 contains a one-based position coordinate. Column #2 contains a score showing the posterior probability that phastCons's phylogenetic hidden Markov model (HMM) is in its most conserved state at that base position.
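
The older two-column layout is simpler to read. A small sketch, again with a hypothetical function name (iter_two_column) and assuming whitespace-separated columns as in the sample above:

```python
import io


def iter_two_column(handle):
    """Yield (position, score) from the two-column phastCons format."""
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        position, score = parts
        yield int(position), float(score)


example = io.StringIO("294   0.0953\n295   0.0948\n296   0.0943\n")
scores = dict(iter_two_column(example))
print(scores[295])
```

Because every line carries its own one-based coordinate, no declaration line or running position counter is needed here.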
                                                                                                                                                                     
References for phastCons

Siepel A and Haussler D (2005). Phylogenetic hidden Markov models. In R. Nielsen, ed., Statistical Methods in Molecular Evolution, pp. 325-351, Springer, New York.

Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R. K., Gibbs, R.A., Kent, W.J., Miller, W., and Haussler, D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034-1050 (2005).

For a discussion of the methods used to calculate the phastCons scores, see the description page for the hg17 Conservation track in the Genome Browser.






