[Bioperl-l] [Gmod-gbrowse] scores in Bio::DB::BigBed

Fri Jul 8 08:18:59 UTC 2011

Hi Timothy,

Thanks a lot for sharing this great tool! It worked as you said :-)

Best,
Daniel

On 06.07.2011 22:45, Timothy Parnell wrote:
> Hi Daniel,
> 
> Since you have a need to collapse your data into usable genomic bins, I
> may have a tool that might help you. Have a look at this program:
> http://code.google.com/p/biotoolbox/wiki/Pod_get_datasets
> (Disclosure: I am the author.) This is normally used for data analysis, but
> you can also use it to collapse data into single-value bins.
> 
> You can collect scores from a BigBed file over genomic intervals and the
> scores will be combined in your favorite manner (mean, median, min, max,
> etc). For example, to take the median score value from all bed features in
> 500 bp windows across the genome, the command would look like this:
> 
> get_datasets.pl --new --db chromosomes.gff3 --feature genome --win 500
> --method median --dataf my_data_file.bb --out output.txt
> 
> where chromosomes.gff3 is just a simple GFF3 file containing the
> chromosomes or contigs, and my_data_file.bb is your BigBed file. The other
> options simply tell the program to make a new genomic interval data file
> across the genome.
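
For readers who want to see the idea rather than the tool, here is a minimal Python sketch of the windowed median described above. It is not how get_datasets.pl is implemented. The function name `windowed_median`, the 0-based half-open coordinates, and the rule that a feature counts toward every window it overlaps are all assumptions made for illustration.

```python
from statistics import median


def windowed_median(features, chrom_length, window=500):
    """Yield (start, end, median_score) for fixed-width windows on one chromosome.

    features: iterable of (start, end, score), 0-based half-open coordinates.
    Assumption: a feature contributes its score to every window it overlaps.
    Windows without any feature get None as their score.
    """
    n_windows = (chrom_length + window - 1) // window
    buckets = [[] for _ in range(n_windows)]
    for start, end, score in features:
        # Clamp the feature to the chromosome, then find the windows it touches.
        first = max(start, 0) // window
        last = (min(end, chrom_length) - 1) // window
        for i in range(first, last + 1):
            buckets[i].append(score)
    for i, scores in enumerate(buckets):
        w_start = i * window
        w_end = min(w_start + window, chrom_length)
        yield w_start, w_end, (median(scores) if scores else None)
```

Swapping `median` for `statistics.mean`, `min`, or `max` gives the other combination methods mentioned above.
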
> 
> Once you have your data file, you can then convert it to a wig or bigWig
> file using data2wig.pl, found in the same biotoolbox collection.
> 
> Hope that helps you
> Tim
> 
> 
> On 7/6/11 1:54 AM, "Daniel Lang" <Daniel.Lang at biologie.uni-freiburg.de>
> wrote:
> 
>> Hi all,
>>
>> thanks a lot for your input on this!
>>
>> I want to explore the repeat structure of our model genome derived by
>> lastz self-alignments (using %id as score).
>> Since this is a HUGE file and I initially wanted to have the ability to
>> access the information for individual repeat regions also in gbrowse, I
>> wanted to use BigBed. Having the data in hand, it seems not to be such a
>> good idea anyway since the resulting repeat graph is much more complex
>> than I expected. So summarizing using the score and/or coverage will do
>> just fine;-)
>>
>> But as they are repeats, they're overlapping. So if I see it correctly,
>> BigWig/BedGraph aren't an option. Due to the size limitations, I have
>> not stored individual CIGAR strings that I could use to generate
>> full-blown SAM files. Or can I use BAM without sequence/qual data?
>>
>> Or is there an existing tool that would allow me to collapse overlapping
>> ranges with average scores for use in BigWig?
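
One way to picture the collapse asked about here: cut the overlapping ranges at every start and end position, then give each resulting disjoint segment the mean score of the features covering it. The sketch below is a hedged illustration in Python, not an existing tool. The names `collapse_to_segments` and `to_bedgraph_lines` are hypothetical, and coordinates are assumed to be 0-based half-open, as in bedGraph.

```python
def collapse_to_segments(intervals):
    """Turn overlapping (start, end, score) intervals into disjoint segments.

    Returns a list of (start, end, mean_score); uncovered stretches are skipped.
    """
    events = []
    for start, end, score in intervals:
        if end > start:  # ignore empty or inverted intervals
            events.append((start, 1, score))
            events.append((end, -1, score))
    events.sort(key=lambda e: e[0])

    segments = []
    depth = 0      # number of intervals covering the current position
    total = 0.0    # sum of their scores
    prev = None    # position of the previous breakpoint
    i = 0
    while i < len(events):
        pos = events[i][0]
        if depth > 0 and pos > prev:
            segments.append((prev, pos, total / depth))
        # Apply every start/end event that falls on this position.
        while i < len(events) and events[i][0] == pos:
            _, delta, score = events[i]
            depth += delta
            total += delta * score
            i += 1
        prev = pos
    return segments


def to_bedgraph_lines(chrom, segments):
    """Format segments as bedGraph lines: chrom, start, end, value."""
    return [f"{chrom}\t{s}\t{e}\t{v:g}" for s, e, v in segments]
```

The segments no longer overlap, so the output could be fed to bedGraphToBigWig. For very large inputs the running float sum may accumulate small rounding errors; treat this as a sketch, not a production tool.
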
>>
>> Otherwise, I'll have to live with the coverage graphs for visualization
>> in gbrowse and use Bio::DB::BigBed::features to look at conservation
>> score at individual loci.
>>
>> Chris, the proposed BP page would be extremely helpful :-D
>>
>> Again, thanks a lot!
>>
>> Best,
>> Daniel
>>
>> On 04.07.2011 18:10, Chris Fields wrote:
>>> I generally follow these rules where I want a common set of possibly
>>> volatile features (e.g. specific transcriptome analysis) separate from
>>> my main 'stable' feature database (e.g. gene models):
>>>
>>> 1) BigBed - lightweight bundle of simple features where the ranges may
>>> overlap, but I'm not concerned about score.  I have found BED/BigBed
>>> scores to be of limited use in most cases unless I scale the data
>>> (since they must be integer values from 0 to 1000).  Document it very
>>> well if you do any scaling! YMMV
>>>
>>> 2) SAM/BAM - bundle of (possibly overlapping) features where summary
>>> stats are needed.  I've seen these used for BLAST/BLAT runs, etc.
>>>
>>> 3) BigWig - quantitative data of fixed or varying ranges covering
>>> entire genome, ranges can't overlap
>>>
>>> 4) BedGraph - quantitative sparse data, ranges can't overlap (these are
>>> converted over to BigWig for GBrowse, though)
>>>
>>> 5) Of course, one can also set up separate Bio::DB::SeqFeature::Store
>>> databases as well, depending on your needs (I have used both the SQLite
>>> and MySQL adaptors for this).
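
The scaling mentioned in point 1 can be as simple as a linear map onto the BED score range. The following is a minimal Python sketch. The function name `scale_to_bed_score` and the default input range of 0 to 100 (chosen because the scores in this thread are percent identities) are assumptions for illustration.

```python
def scale_to_bed_score(value, lo=0.0, hi=100.0):
    """Linearly map value from [lo, hi] onto the integer BED score range 0-1000.

    Values outside [lo, hi] are clamped to the nearest bound.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clamped = min(max(value, lo), hi)
    # round() returns an int here; exact halves go to the nearest even integer.
    return round((clamped - lo) / (hi - lo) * 1000)
```

Whatever `lo` and `hi` are used should be recorded alongside the file, since the original values cannot be recovered from the BED scores without them.
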
>>>
>>> I think this is almost begging for a 'best practices' chart/table
>>> somewhere, maybe a GBrowse 'cookbook' of common data representation
>>> cases.
>>>
>>> chris
>>>
>>> On Jul 4, 2011, at 8:22 AM, Lincoln Stein wrote:
>>>
>>>> I had a look at the output of bigBedSummary, which is from Jim Kent's
>>>> source tree (no Perl involved), and it appears that the statistics it
>>>> provides are limited to coverage; so I don't think you can do anything
>>>> with the scores if you're using BigBed indexing. Have a look at
>>>> BedGraph=>BigWig and see if it meets your needs.
>>>>
>>>> Lincoln
>>>>
>>>> On Mon, Jul 4, 2011 at 9:04 AM, Lincoln Stein
>>>> <lincoln.stein at gmail.com> wrote:
>>>>
>>>>> Hi Dan,
>>>>>
>>>>> The documentation for BigBed is scanty; all I know about what the
>>>>> bigbed library provides comes from Jim Kent's bigbed.h include file.
>>>>> I had thought that the scores in BED files would come through into
>>>>> the summary statistics like those in BigWig, but now I'm looking at
>>>>> the example data provided in Jim's source code, and see that the
>>>>> BigBed example source file has scores of "0".
>>>>>
>>>>> I'll investigate whether there is an issue in the Perl layer, but it
>>>>> could easily be a limitation in the library itself. Have you
>>>>> considered using a BedGraph file and indexing it with
>>>>> bedGraphToBigWig? I know that the Bio::DB::BigWig interface works
>>>>> perfectly to retrieve and summarize the scores.
>>>>>
>>>>> Lincoln
>>>>>
>>>>>
>>>>> On Sun, Jul 3, 2011 at 5:48 AM, Daniel Lang <
>>>>> Daniel.Lang at biologie.uni-freiburg.de> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> quick question about the BigBed adaptor: Is it correct that the bin
>>>>>> and summary functions only return statistics about the number of
>>>>>> features in the defined intervals?
>>>>>> I was expecting them to deliver statistics about the score if the
>>>>>> respective bb file has a defined score field.
>>>>>> If this is true, does this also mean that I cannot plot the
>>>>>> distribution of scores in BigBed files in gbrowse?
>>>>>>
>>>>>> This is the first time I'm using BigBed, maybe I'm doing something
>>>>>> wrong...
>>>>>>
>>>>>> I had some trouble formatting the bed files correctly in order to see
>>>>>> the score in the features returned by the Bio::DB::BigBed::features()
>>>>>> routine. It seems the bigbed entries will only have a correctly
>>>>>> assigned score field if you also provide a non-empty name field.
>>>>>> Initially I thought that the order of columns is irrelevant if you
>>>>>> use an .as file in the bedToBigBed call, but that doesn't seem to be
>>>>>> the case.
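
The observation above fits the BED layout, where columns are positional: chrom, start, end, name, score. A score can therefore only sit in column 5 if column 4 is filled. The Python sketch below writes such five-column lines with a placeholder name when none is given. The function name `bed5_line` and the `chrom:start-end` placeholder are illustrative assumptions; whether a placeholder name is enough for Bio::DB::BigBed to report the score is what the paragraph above suggests, not something verified here.

```python
def bed5_line(chrom, start, end, score, name=None):
    """Build one tab-separated BED line with name (column 4) and score (column 5)."""
    if not name:
        # Hypothetical placeholder so the name column is never empty.
        name = f"{chrom}:{start}-{end}"
    return "\t".join([chrom, str(start), str(end), name, str(score)])


if __name__ == "__main__":
    print(bed5_line("chr1", 100, 200, 875))
    print(bed5_line("chr1", 150, 400, 910, name="repeat_42"))
```
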
>>>>>>
>>>>>> Best,
>>>>>> Daniel
>>>>>> --
>>>>>>
>>>>>> Dr. Daniel Lang
>>>>>> University of Freiburg, Plant Biotechnology
>>>>>> Schaenzlestr. 1, D-79104 Freiburg
>>>>>> fax:        +49 761 203 6945
>>>>>> phone:      +49 761 203 6989
>>>>>> homepage:   http://www.plant-biotech.net/
>>>>>>             http://www.cosmoss.org/
>>>>>> e-mail:     daniel.lang at biologie.uni-freiburg.de
>>>>>>
>>>>>> #################################################
>>>>>> My software never has bugs.
>>>>>> It just develops random features.
>>>>>> #################################################
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gmod-gbrowse mailing list
>>>>>> Gmod-gbrowse at lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lincoln D. Stein
>>>>> Director, Informatics and Biocomputing Platform
>>>>> Ontario Institute for Cancer Research
>>>>> 101 College St., Suite 800
>>>>> Toronto, ON, Canada M5G0A3
>>>>> 416 673-8514
>>>>> Assistant: Renata Musa <Renata.Musa at oicr.on.ca>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
> 

-- 

Dr. Daniel Lang
University of Freiburg, Plant Biotechnology
Schaenzlestr. 1, D-79104 Freiburg
fax:        +49 761 203 6945
phone:      +49 761 203 6989
homepage:   http://www.plant-biotech.net/
            http://www.cosmoss.org/
e-mail:     daniel.lang at biologie.uni-freiburg.de

#################################################
My software never has bugs.
It just develops random features.
#################################################