[Biopython-dev] Sequences and simple plots

Peter biopython at maubp.freeserve.co.uk
Fri Sep 26 16:11:50 UTC 2008


On Fri, Sep 26, 2008 at 2:40 PM, Jared Flatow <jflatow at northwestern.edu> wrote:
>
> On Sep 26, 2008, at 5:15 AM, Peter wrote:
>
>> Cut and paste for people to comment on directly,
>
> Ok, cool.
>
>> The first shows a histogram of sequence lengths in a FASTA file (based
>> having recently done this for some real assembly data).  Sample output:
>> http://biopython.org/DIST/docs/tutorial/images/hist_plot.png
>>
>> ...
>
> Its a perfectly fine example, my only comment would be to do something like
> this:
>
> seqs = list(SeqIO.parse(handle, 'fasta'))
> hist([len(seq) for seq in seqs], bins=20)
>
> I like to keep the whole sequences in memory, especially if I am just
> digging around the data.

I see what you mean - and maybe that is more realistic.

One change I'd make is avoiding using seq or seqs are variable names
for SeqRecord objects.  I've generally tried to use record and records in
the documentation.

i.e. maybe like this:

import pylab
from Bio import SeqIO
records = list(SeqIO.parse(open("ls_orchid.fasta"), "fasta")

#Histogram of lengths
pylab.hist([len(record) for records in records], bins=20)
pylab.title("%i orchid sequences\nLengths %i to %i" \
      % (len(sizes),min(sizes),max(sizes)))
pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
pylab.show()

> Also I use the alpha parameter a lot for histograms, especially when
> doing overlapping ones. So then you can also do something like this:
>
> hist([len(seq) for seq in seqs if GC(seq.seq) < .5], bins=20, alpha=.5,
> fc='r')
> hist([len(seq) for seq in seqs if GC(seq.seq) >= .5], bins=20, alpha=.5,
> fc='b')
>

Fun.  I didn't want to get into anything too advanced on the pylab side,
rather I wanted to focus on the bioinformatics.  Does anyone else think
more advanced graphical demonstrations would be worthwhile?

>> The second is based on the GC% example we used for the BOSC 2008
>> presentation: http://biopython.org/DIST/docs/tutorial/images/gc_plot.png
>>
>> ...
>
> Again, if you had all the sequences in a list:
>
> plot(sorted(GC(seq.seq) for seq in seqs))

I like the use of sorted here, rather than the two step make a list
then sort it.

Peter



More information about the Biopython-dev mailing list