[Biopython-dev] [Bug 2643] Proposal: fastPhaseOutputIO for SeqIO

Mon Nov 10 16:34:34 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2643

------- Comment #21 from biopython-bugzilla at maubp.freeserve.co.uk  2008-11-10 11:34 EST -------
Hi Marco,

Looking at your example, the important part of the file is this bit:

...
BEGIN GENOTYPES
Ind1  # subpop. label: 6  (internally 1)
T T T T T G A A A C C A A A G A C G C T G C G T C A G C C T G C A A T C T G
T T T T T G C C C C C A A A A G C G C G T C G T C A G T C T A A G A C C T A
Ind2  # subpop. label: 6  (internally 1)
C T T T T G C C C T C A A A A G T G C T G T G C C A G T C T A C G G C C T G
T T T T T G A A A C C A A A G A C G C T T C G T C A G T A T A C G A T C T A
END GENOTYPES

Quoting the manual again, "Output files for inferred haplotypes or imputed
genotypes contain two lines per given diploid individual, with the order of
individuals corresponding to that supplied in the input file."
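
To make the "two lines per individual" layout concrete, here is a rough
sketch of how that block could be read lazily.  It uses plain Python only;
the function name iter_genotypes is made up for illustration, and it assumes
every individual has a one-line header followed by exactly two haplotype
lines, as in the example above.  It is not the proposed SeqIO code.

```python
def iter_genotypes(lines):
    """Yield (name, haplotype1, haplotype2) for each individual in the block."""
    in_block = False
    name = None
    haps = []
    for line in lines:
        line = line.strip()
        if line == "BEGIN GENOTYPES":
            in_block = True
        elif line == "END GENOTYPES":
            break
        elif not in_block or not line:
            # Ignore everything outside the block, and blank lines inside it.
            continue
        elif name is None:
            # Header line, e.g. "Ind1  # subpop. label: 6  (internally 1)".
            # Assumption: the name is everything before the first '#'.
            name = line.split("#", 1)[0].strip()
        else:
            # Haplotype line: drop the spaces between the symbols.
            haps.append("".join(line.split()))
            if len(haps) == 2:
                yield name, haps[0], haps[1]
                name, haps = None, []


# Shortened, made-up haplotypes in the same layout as the example file.
sample = """\
junk before the block
BEGIN GENOTYPES
Ind1  # subpop. label: 6  (internally 1)
T T G A
T T G C
Ind2  # subpop. label: 6  (internally 1)
C T G C
T T G A
END GENOTYPES
""".splitlines()

for name, hap1, hap2 in iter_genotypes(sample):
    print(name, hap1, hap2)
```

Because this is a generator, only one individual is held in memory at a time.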

In this example we have two individuals, Ind1 and Ind2 (presumably with
automatically assigned names).  In a real world example, how many individuals
would you expect to use?  Does it make more sense to return a pairwise
alignment for each individual, rather than one large combined alignment?  One
of the main reasons for using iterators/generators is that they let us deal
with very large files without keeping everything in memory.  I don't have a
feel for how large fastPhase output files can get, so maybe a single large
alignment is fine.

That is, one combined alignment:

IUPACUnambiguousDNA() alignment with 4 rows and 38 columns
TTTTTGAAACCAAAGACGCTGCGTCAGCCTGCAATCTG Ind1_all1
TTTTTGCCCCCAAAAGCGCGTCGTCAGTCTAAGACCTA Ind1_all2
CTTTTGCCCTCAAAAGTGCTGTGCCAGTCTACGGCCTG Ind2_all1
TTTTTGAAACCAAAGACGCTTCGTCAGTATACGATCTA Ind2_all2

versus one pairwise alignment per individual:

IUPACUnambiguousDNA() alignment with 2 rows and 38 columns
TTTTTGAAACCAAAGACGCTGCGTCAGCCTGCAATCTG Ind1_all1
TTTTTGCCCCCAAAAGCGCGTCGTCAGTCTAAGACCTA Ind1_all2

IUPACUnambiguousDNA() alignment with 2 rows and 38 columns
CTTTTGCCCTCAAAAGTGCTGTGCCAGTCTACGGCCTG Ind2_all1
TTTTTGAAACCAAAGACGCTTCGTCAGTATACGATCTA Ind2_all2

I think you'll have to decide this (unless anyone else following this has a
view - Tiago maybe?)
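
To illustrate the trade-off, here is a small sketch of the two options in
plain Python.  The function names and the (name, hap1, hap2) record layout
are invented for this example, and lists of (id, sequence) tuples stand in
for real alignment objects.

```python
def per_individual(records):
    """Lazily yield one two-row 'alignment' per individual."""
    for name, hap1, hap2 in records:
        yield [(name + "_all1", hap1), (name + "_all2", hap2)]


def combined(records):
    """Build one alignment with every row; all rows end up in memory."""
    rows = []
    for pair in per_individual(records):
        rows.extend(pair)
    return rows


# Made-up records: (individual name, first haplotype, second haplotype).
records = [("Ind1", "TTGA", "TTGC"), ("Ind2", "CTGC", "TTGA")]

for pair in per_individual(records):
    print(pair)

print(len(combined(records)), "rows in the combined alignment")
```

The per-individual version never holds more than two rows at once, while the
combined version grows with the number of individuals in the file.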

P.S. Have you tried with and without the -n option to automatically name the
individuals?  What happens if the name includes a hash character (#)?  I would
hope fastPhase would treat this as an error, but it could end up in the output
file and confuse the parser.
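
As a sketch of why a hash in the name matters: a parser that splits the
header line at the first '#' would truncate such a name.  The alternative
below assumes the annotation always starts with "# subpop. label:", which is
only what the example file shows; I have not checked what fastPhase really
writes in this case.  The function names are hypothetical.

```python
MARKER = "# subpop. label:"  # assumption, taken from the example output only


def naive_name(header):
    """Take everything before the first '#'; breaks if the name holds a '#'."""
    return header.split("#", 1)[0].strip()


def safer_name(header):
    """Take everything before the last occurrence of the annotation marker."""
    name, sep, _rest = header.rpartition(MARKER)
    if not sep:
        # No annotation found, so treat the whole line as the name.
        return header.strip()
    return name.strip()


header = "Ind#1  # subpop. label: 6  (internally 1)"
print(repr(naive_name(header)))
print(repr(safer_name(header)))
```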

P.P.S. Based on the examples in the manual, typical output might use lower case
nucleotides (a, t, c, g) or numbers (0, 1).  I presume upper case nucleotides
are also fine, but defaulting to a DNA alphabet such as IUPACUnambiguousDNA is
a bad idea.  Please default to Bio.Alphabet.single_letter_alphabet, which seems
to be the safest choice (we shouldn't guess).
