[Biopython-dev] [Biopython (old issues only) - Bug #2705] (Migrated) Nicer GC and AT content and skew functions

redmine at redmine.open-bio.org redmine at redmine.open-bio.org
Thu Jul 5 13:45:41 UTC 2018


Issue #2705 has been updated by Peter Cock.

Description updated
Status changed from New to Migrated

Migrated to GitHub issue https://github.com/biopython/biopython/issues/1719

----------------------------------------
Bug #2705: Nicer GC and AT content and skew functions
https://redmine.open-bio.org/issues/2705#change-15406

* Author: Peter Cock
* Status: Migrated
* Priority: Normal
* Assignee: Biopython Dev Mailing List
* Category: Main Distribution
* Target version: Not Applicable
* URL: 
----------------------------------------
This bug started out as a discussion on Bug 2671, based on some nucleotide scoring functions in GenomeDiagram which were used for plotting sequence properties along a sequence using a sliding window.  The basic underlying functions could make a nice addition under Bio.SeqUtils (rather than hiding them under Bio.Graphics.GenomeDiagram).

In particular, GenomeDiagram's Utilities.py included the following (non-windowed) nucleotide composition functions:

calc_gc_content - returns a float in the range 0 to 1.
calc_at_content - returns a float in the range 0 to 1.
calc_gc_skew - returns a float [*]
calc_at_skew - returns a float [*]

[*] As discussed on Bug 2671, these currently give zero if there is no AT content, which was a reasonable shortcut given these functions were originally used for plotting only.  They should instead raise an exception or return None or NaN instead.

Also, as implemented in GenomeDiagram, these functions do not cope with mixed case sequences (easily rectified).  Also, for GC and AT content these do not deal with ambiguous nucleotides (where we could follow the existing Bio.SeqUtils convention).

Bio.SeqUtils already has several related functions including:

GC - returns a float (a percentage in the range 0 to 100)
GC123 - returns a tuple of four floats (percentages between 0 and 100)

GC_skew - returns a list of floats using a default window size of 100bp.  Gives
a floating point exception if there is no GC content in any window.

Personally I don't like the fact that the existing GC function returns a number
between 0 and 100 (rather than 0 and 1).  Leighton agreed.

I don't think the current GC_skew function is intuitive and doesn't cover the
non-windowed use-case where you want the GC_skew of the whole sequence passed
in.  This is important if you want to do your own windowing (e.g. comparing GC
skew of individual genes to the whole genome).

Because they differ from the existing Bio.SeqUtils code, I think there is a
case for adding the four non-windowed functions from GenomeDiagram's
Utilities.py under Bio.SeqUtils.  Each would take a single argument, a sequence (coping with a string, Seq object or MutableSeq object).  I have no particularly strong views on the naming of these functions.  Perhaps they could be located under a sub module like Bio.SeqUtils.Nucleotides or Bio.SeqUtils.NucUtils?  The existing GC functions in Bio.SeqUtils could be deprecated or at least declared obsolete.

This would also be a good opportunity to explicitly specify what we expect to get back for the GC content when there are ambiguous nucleotides.

e.g. Following Bio.SeqUtils.GC, only count C, G and S (which means C or G) (in either case) and divide by the length giving a lower bound.  Here GC("ACGTN") is 40%.  An alternative approach might be to treat an N as 50% GC, and H (which is A, C or T) as 66.6% GC etc, meaning GC("ACGTN") gives 50%.

The same approach should be used for the AT percentage, for example the current lower bound approach would count only A, T and W characters (in either case).



-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython-dev/attachments/20180705/724c260f/attachment.html>


More information about the Biopython-dev mailing list