[Biopython-dev] [Bug 2643] Proposal: fastPhaseOutputIO for SeqIO

bugzilla-daemon at portal.open-bio.org
Thu Nov 6 12:14:04 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2643





------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2008-11-06 07:14 EST -------
Hi Marco,

This looks interesting :)

Could you attach the individual valid sample fastPHASE files as separate
attachments, so they can be integrated into the existing unit tests?  You seem
to have picked very small files in order to use them as doctests; larger, more
realistic examples would be better for the unit tests (a few files of about
5kb each should be OK - not too big).

Do you have a URL for the file format documentation?  Are the sequences always
DNA, for example, or is RNA also possible?

If you want to include a fastPHASE parser in Bio.SeqIO it should ideally cope
with any valid fastPHASE output.  In the doctests you have an example:

... BEGIN GENOTYPES
... Ind1  # subpop. label: 6  (internally 1)
... T
... T C
... Ind2  # subpop. label: 6  (internally 1)
... C
... T
... END GENOTYPES

You're treating this as an error - "Two chromosomes with different length". 
Why isn't it parsed as four short sequences (of different lengths): "T", "TC",
"C", "T"?

Similarly, the final example:

... BEGIN GENOTYPES
... Ind1  # subpop. label: 6  (internally 1)
... T T T T T G A A A C C A A A G A C G C T G C G T C A G C C T G C A A T C T G
... Ind2  # subpop. label: 6  (internally 1)
... C T T T T G C C C T C A A A A G T G C T G T G C C A G T C T A C G G C C T G
... T T T T T G A A A C C A A A G A C G C T T C G T C A G T A T A C G A T C T A
... END GENOTYPES

Again, you raised an error - "Missing sequence in input file".  If this is a
valid file shouldn't it be parsed as three sequences?
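
To illustrate the permissive reading, here is a rough sketch in plain Python
(not the code from the attached patch, and untested against real fastPHASE
output).  It assumes that any line containing "#" inside the GENOTYPES block is
an individual's header line and that every other non-empty line is one
sequence.  The function name iter_genotype_sequences and the "Ind1_1" style
identifiers are made up for illustration only.

```python
def iter_genotype_sequences(lines):
    """Yield (identifier, sequence) pairs from a GENOTYPES block.

    Assumption: header lines contain '#'; all other non-empty lines
    inside the block are sequences, whatever their length.
    """
    in_block = False
    current_id = None
    count = 0
    for line in lines:
        line = line.strip()
        if line == "BEGIN GENOTYPES":
            in_block = True
        elif line == "END GENOTYPES":
            in_block = False
        elif not in_block or not line:
            continue
        elif "#" in line:
            # e.g. "Ind1  # subpop. label: 6  (internally 1)" -> "Ind1"
            current_id = line.split("#", 1)[0].strip()
            count = 0
        else:
            count += 1
            # Drop the spaces between alleles: "T C" -> "TC"
            yield "%s_%d" % (current_id, count), line.replace(" ", "")
```

With this reading the first example above gives four records ("T", "TC", "C",
"T") and the second gives three, with no error raised in either case.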

On the other hand, are these hand-edited files which deliberately break the
rules?  If fastPHASE files SHOULD always come in allele groups (of the same
length), then it would be better to integrate the parser into Bio.AlignIO,
giving one pairwise alignment per individual (and you would be able to read
the file via Bio.SeqIO automatically as well).
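
As a rough sketch of the stricter reading (again plain Python, not the attached
patch, and not the Bio.AlignIO API), the following assumes that each individual
must have exactly two sequences of equal length and that header lines are the
ones containing "#".  The names check_pair and iter_allele_pairs are
hypothetical.

```python
def check_pair(ind_id, seqs):
    """Return (id, seq1, seq2), or raise ValueError if not a valid pair."""
    if len(seqs) != 2:
        raise ValueError("Expected 2 sequences for %s, found %d"
                         % (ind_id, len(seqs)))
    if len(seqs[0]) != len(seqs[1]):
        raise ValueError("Sequences for %s differ in length" % ind_id)
    return ind_id, seqs[0], seqs[1]


def iter_allele_pairs(lines):
    """Yield one (id, seq1, seq2) tuple per individual in a GENOTYPES block."""
    in_block = False
    ind_id, seqs = None, []
    for line in lines:
        line = line.strip()
        if line == "BEGIN GENOTYPES":
            in_block = True
        elif line == "END GENOTYPES":
            if ind_id is not None:
                yield check_pair(ind_id, seqs)
            in_block, ind_id, seqs = False, None, []
        elif in_block and line:
            if "#" in line:
                # A new header closes the previous individual's group
                if ind_id is not None:
                    yield check_pair(ind_id, seqs)
                ind_id, seqs = line.split("#", 1)[0].strip(), []
            else:
                seqs.append(line.replace(" ", ""))
```

Each tuple corresponds to what would become one two-row alignment; a file with
a missing sequence or unequal lengths is rejected with a ValueError.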

P.S. Your suggested format name "fastPhaseOutput" breaks the lower case rule. 
Would "fastphase" be OK, or is there more than one format?  e.g. an input
format which might be confused with this?

Peter

