[Biopython] arrays in Biopython

Mon Jan 27 13:43:56 UTC 2020

Hi Michiel,

I only just saw your email; Google had mistagged it as spam.

Multiple sequence alignments are also often viewed as arrays of characters.
The critical difference is that there the row and column indices are
typically integers (or sequence names), never letters (amino acids or
nucleotides).
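
To make the contrast concrete, here is a minimal sketch of an alignment
held as a NumPy array of characters. The three sequences and their names
are made up for illustration, and this is not how Bio.Align stores
alignments; it only shows that the letters are the array contents while
the indices stay integers (a sequence name is just looked up to get a row
number).

```python
import numpy as np

# Three made-up aligned sequences, purely for illustration.
records = {"seq1": "ACGT-A", "seq2": "ACGTTA", "seq3": "AC-TTA"}
names = list(records)

# One row per sequence, one column per alignment position.
alignment = np.array([list(s) for s in records.values()])

# Columns are addressed by integer position.
column = alignment[:, 2]

# A sequence name is translated to a row number before indexing.
row = alignment[names.index("seq2")]
```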

I agree with you that subclassing NumPy arrays is tricky (that is certainly
my impression from following the NumPy mailing list), to the point that I
wonder if we would be better off avoiding it altogether. You do make a good
point, though, that if we subclass at all we'd be better off doing it once,
rather than N times for these similar use cases. I take it there is no
standard NumPy array subclass with the kind of letter-based (member-element)
indexing you want to use?
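
For anyone following the thread, this is roughly what the subclassing
involves. It is a minimal sketch under my own assumptions, not the code in
Bio.Align.substitution_matrices: the class name LetterArray and the helper
_translate are hypothetical, and it ignores pickling, repr, and most edge
cases. The awkward part is that views, slices and ufunc results never go
through __new__, so the alphabet has to be copied across in
__array_finalize__.

```python
import numpy as np


class LetterArray(np.ndarray):
    """Hypothetical ndarray subclass indexed by letters (illustration only)."""

    def __new__(cls, alphabet, dims=2):
        # Allocate a zero-filled array and reinterpret it as this subclass.
        obj = np.zeros((len(alphabet),) * dims).view(cls)
        obj.alphabet = alphabet
        return obj

    def __array_finalize__(self, obj):
        # Called for views, slices and ufunc results, which bypass __new__.
        if obj is not None:
            self.alphabet = getattr(obj, "alphabet", None)

    def _translate(self, key):
        # Map letters to integer positions; leave ints and slices alone.
        if isinstance(key, tuple):
            return tuple(self._translate(k) for k in key)
        if isinstance(key, str):
            return self.alphabet.index(key)
        return key

    def __getitem__(self, key):
        return super().__getitem__(self._translate(key))

    def __setitem__(self, key, value):
        super().__setitem__(self._translate(key), value)


a = LetterArray("ACGT")
a["C", "A"] = 6
a["A", :] = 4
b = np.sin(a)  # the ufunc result is still a LetterArray with the alphabet
```

Even this small version shows why doing it once in a shared class is
attractive: every module that wants letter indexing would otherwise repeat
the same __new__ / __array_finalize__ / __getitem__ boilerplate.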

Peter

On Sun, Jan 12, 2020 at 1:00 PM Michiel de Hoon <mjldehoon at yahoo.com> wrote:

> Dear all,
>
> Currently there are four classes in Biopython that model an array where
> the letters can be used as indices:
>
> Bio.Align.substitution_matrices: Array class
> Bio.Align.AlignInfo: PSSM class
> Bio.Phylo.TreeConstruction: _Matrix class
> Bio.motifs.matrix: GenericPositionMatrix
>
> (and the FreqTable class in Bio.SubsMat.FreqTable is similar).
>
> For example, the Array class in Bio.Align.substitution_matrices allows
> you to do things like
> >>> from Bio.Align.substitution_matrices import Array
> >>> a = Array("ACGT", dims=2)
> >>> a
> Array([[0., 0., 0., 0.],
>        [0., 0., 0., 0.],
>        [0., 0., 0., 0.],
>        [0., 0., 0., 0.]],
>          alphabet='ACGT')
>
> >>> a['C','A'] = 6
> >>> a
> Array([[0., 0., 0., 0.],
>        [6., 0., 0., 0.],
>        [0., 0., 0., 0.],
>        [0., 0., 0., 0.]],
>          alphabet='ACGT')
> >>> sum(a['C'])
> 6.0
> >>> a[3,'G'] = 1
> >>> a['A',:] = 4
> >>> a
> Array([[4., 4., 4., 4.],
>        [6., 0., 0., 0.],
>        [0., 0., 0., 0.],
>        [0., 0., 1., 0.]],
>          alphabet='ACGT')
> >>> sum(a[:, 'A'])
> 10.0
>
> >>> from numpy import sin
> >>> sin(a)
> Array([[-0.7568025 , -0.7568025 , -0.7568025 , -0.7568025 ],
>        [-0.2794155 ,  0.        ,  0.        ,  0.        ],
>        [ 0.        ,  0.        ,  0.        ,  0.        ],
>        [ 0.        ,  0.        ,  0.84147098,  0.        ]],
>          alphabet='ACGT')
>
> This class was implemented as a subclass of a numpy array. This has the
> big advantage that the array acts as a numpy array (e.g. you can apply
> numpy functions to it and get back an array of the same class, as in the
> example above), but unfortunately subclassing numpy arrays is not easy (see
> the code in Bio.Align.substitution_matrices).
>
> Would it then make sense to make this class available as a general-purpose
> array class where strings can be used as indices, for example inside a
> new Bio.math module?
> Other modules in Biopython could then either make use of this class
> directly, or subclass it if needed.
>
> Thanks,
> -Michiel
>
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> https://mailman.open-bio.org/mailman/listinfo/biopython