[Biopython-dev] [Bug 2643] Proposal: fastPhaseOutputIO for SeqIO

Thu Nov 6 16:06:46 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2643

------- Comment #8 from dalloliogm at gmail.com  2008-11-06 11:06 EST -------
(In reply to comment #4)
> Hi Marco,

Hi!! :)

> This looks interesting :)
> 
> Could you attach the individual valid sample fastPHASE files as separate
> attachments (so they can be integrated into the existing unit tests).  You seem
> to have picked very small files in order to use them as doctests; a larger more
> realistic example would be better for the unit tests (a few 5kb in size should
> be OK - not too big).

ok
Actually, I have been using files that come from our laboratory's analyses, so
I would first like to ask whether I can include them here, and how.

> Do you have URL for the file format documentation?  

The fastPHASE format seems to be described only in the fastPHASE manual, which
is accessible only after accepting a license agreement.
I could contact the authors of the program and ask them to publish the format
specification. It would be in their interest, as otherwise the format could be
considered non-standard.
I'll let you know.

> Are they always DNA for example, or is RNA also possible?

They should be DNA. In principle they could also be genes, or other kinds of
characters, but this software is designed for reconstructing haplotypes from
SNPs/microsatellites.
Maybe Tiago has more experience with this.

> If you want to include a fastPHASE parser in Bio.SeqIO it should ideally cope
> with any valid fastPHASE output.  In the doctests you have an example:
> 
> ... BEGIN GENOTYPES
> ... Ind1  # subpop. label: 6  (internally 1)
> ... T
> ... T C
> ... Ind2  # subpop. label: 6  (internally 1)
> ... C
> ... T
> ... END GENOTYPES
> You're treating this as an error - "Two chromosomes with different length". 
> Why isn't it parsed as four short sequences (of different lengths): "T", "TC",
> "C", "T"?

You should not have a file in which one chromosome is longer than the other.
Instead, you should have a '?' marking data that the program could not infer.

> Similarly, the final example:
> 
> ... BEGIN GENOTYPES
> ... Ind1  # subpop. label: 6  (internally 1)
> ... T T T T T G A A A C C A A A G A C G C T G C G T C A G C C T G C A A T C T G
> ... Ind2  # subpop. label: 6  (internally 1)
> ... C T T T T G C C C T C A A A A G T G C T G T G C C A G T C T A C G G C C T G
> ... T T T T T G A A A C C A A A G A C G C T T C G T C A G T A T A C G A T C T A
> ... END GENOTYPES
> 
> Again, you raised an error - "Missing sequence in input file".  If this is a
> valid file shouldn't it be parsed as three sequences?

Because that would mean that one individual has only one chromosome.
It doesn't make sense to run fastPHASE on a haploid individual.

> On the other hand, are these hand edited files which deliberately break the
> rules?  

Yes. Usually you shouldn't have either of the two cases. But I find it useful
when a script tells me if there are weird things in my files (I could have
modified them accidentally).
This could be refactored into a check_fileformat function.
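
A minimal sketch of what such a check_fileformat function could look like,
assuming the layout of the examples quoted above: a BEGIN GENOTYPES / END
GENOTYPES block in which each individual has one header line (recognised here
by the '#' it contains) followed by exactly two haplotype lines. The signature
and the message wording are only illustrative, not existing Biopython code:

```python
def check_fileformat(lines):
    """Return a list of problems found in a fastPHASE genotypes block.

    Assumption: a line containing '#' starts a new individual, and every
    other non-empty line inside the block is one haplotype.
    """
    problems = []
    records = []      # one (individual_id, [haplotypes]) entry per individual
    inside = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN GENOTYPES":
            inside = True
        elif line == "END GENOTYPES":
            inside = False
        elif inside and line:
            if "#" in line:
                # Header line: the first token is taken as the individual id.
                records.append((line.split()[0], []))
            elif records:
                # Haplotype line: alleles are separated by whitespace.
                records[-1][1].append(line.split())
            else:
                problems.append("haplotype line before any individual header")
    for ind_id, haps in records:
        if len(haps) != 2:
            problems.append("%s: expected 2 chromosomes, found %d"
                            % (ind_id, len(haps)))
        elif len(haps[0]) != len(haps[1]):
            problems.append("%s: two chromosomes with different length"
                            % ind_id)
    return problems
```

It returns the problems instead of raising, so the parser itself could stay
permissive and the check could be run separately on a suspicious file.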

> If fastPHASE files SHOULD always come in allele groups (of the same
> length), then it would be better to integrate the parser into Bio.AlignIO
> giving pairwise alignments (and you would be able to read it via Bio.SeqIO
> automatically as well).

This is a good idea, I hadn't thought of it.
But how should I modify the module to produce AlignIO objects?
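
As a first guess, the parser could yield one group of two equal-length
haplotypes per individual, and each group would then be wrapped in whatever
alignment object Bio.AlignIO expects. The sketch below stops before that
wrapping step, since I am not sure of the exact API. The function name
iter_haplotype_pairs is hypothetical, and it assumes single-character alleles
and the header layout of the quoted examples (a line containing '#'):

```python
def iter_haplotype_pairs(lines):
    """Yield (individual_id, haplotype_1, haplotype_2) tuples.

    Assumption: each individual has one header line containing '#',
    followed by exactly two haplotype lines of single-character alleles.
    """
    inside = False
    ind_id = None
    haps = []
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN GENOTYPES":
            inside = True
            continue
        if line == "END GENOTYPES":
            break
        if not inside or not line:
            continue
        if "#" in line:
            # A new header while the previous individual is still open
            # means that individual had fewer than two haplotypes.
            if ind_id is not None:
                raise ValueError("Missing sequence for %s" % ind_id)
            ind_id = line.split()[0]
            haps = []
        else:
            if ind_id is None:
                raise ValueError("Sequence line without an individual header")
            # Drop the spaces between alleles to get one string per haplotype.
            haps.append("".join(line.split()))
            if len(haps) == 2:
                if len(haps[0]) != len(haps[1]):
                    raise ValueError(
                        "Two chromosomes with different length for %s" % ind_id)
                yield ind_id, haps[0], haps[1]
                ind_id = None
    if ind_id is not None:
        raise ValueError("Missing sequence for %s" % ind_id)
```

Each yielded tuple is then a natural candidate for one pairwise alignment.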

> P.S. Your suggested format name "fastPhaseOutput" breaks the lower case rule. 
> Would "fastphase" be OK, or is there more than one format?  e.g. an input
> format which might be confused with this?

I agree. I wasn't sure of Biopython's naming conventions.

> 
> Peter
> 
Scheet and Stephens (2006)
