[Biopython-dev] [Bug 2643] Proposal: fastPhaseOutputIO for SeqIO

Thu Nov 6 18:09:55 UTC 2008

On Thu, Nov 6, 2008 at 6:33 PM,  <bugzilla-daemon at portal.open-bio.org> wrote:
>
> ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk  2008-11-06 12:33 EST -------
> (In reply to comment #8)
>>
>> ok
>> Actually I have been using files which come from our laboratory analysis,
>> so I would like to ask first whether I can include them here, and how.
>
> If you can get permission to include a real example (and it's not too big) that
> would be great.  Ideally something with at least three alleles.

ok..

>> > Do you have URL for the file format documentation?
>>
>> The fastphase format seems to be described only in fastphase's manual,
>> which is only accessible after accepting a license agreement.
>> I could contact the authors of the program to ask them to publish the format
>> specifications publicly. It would be in their interest, as otherwise the
>> format could be considered non-standard.  I'll let you know.
>
> It's not very open, is it :(
>
> Are there any other tools that output this file format?  Do you think the
> author might be willing to just add an option to output the sequences in
> another format (e.g. FASTA, or better an alignment format designed for more
> than one alignment).  This would be a neater solution in the long run (and
> would benefit anyone using fastPhase - not just Biopython).

Not to my knowledge.
Anyway, consider that a fastPhase run can take days for medium or big
samples. In some situations it would be faster to convert its existing
output to FASTA (or another format) directly than to re-calculate the
results.

>> > Are they always DNA for example, or is RNA also possible?
>>
>> They should be DNA. In principle they could also be genes, or other kinds of
>> characters, but this software is designed for the purpose of reconstructing
>> haplotypes from SNPs/microsatellites.
>> Maybe Tiago has some more experience in this..
>
> If it is for DNA only, the sequences/alignments returned should ideally specify
> a DNA alphabet.

mmm ok...
Basically it could also be used with characters like genes and other
markers, but in that case it would not make sense to parse it as a
sequence, so nobody would try to do it.

>> Because that would mean that one individual has only one chromosome.
>> It doesn't make sense to run fastPhase on a haploid individual.
>
> Is fastPhase only for haploids?  Could it be used with polyploidy (e.g.
> plants)?

I think not... It would be another class of problem.
What fastPhase does is try to infer haplotypes from genotype data.

Humans and most eukaryotes are diploid, so they have two copies of
each chromosome; when you genotype markers, for every individual you
get two alleles per marker (e.g. 'AC' for a SNP).
Let's say you are studying two SNPs in a single individual: you will
have 'AC' for the first marker, and 'GT' for the second (you already
know that they are on the same chromosome).
You want to know what the haplotypes are, that is, whether the 'A'
from the first SNP is on the same molecule as the 'G' from the second
SNP, and so on.

For example, you could have a chromosome with 'AG' and the other with
'CT'; or a chromosome with 'AT' and the other with 'CG', and fastPhase
tries to calculate which is the most likely (I won't be able to
explain all the details properly).
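
To make the ambiguity concrete, here is a minimal sketch in plain Python
that only enumerates the haplotype pairs compatible with an unphased
genotype. It is not what fastPhase does internally (fastPhase estimates
which pair is most likely, this just lists the candidates), and the
function name possible_haplotype_pairs is made up for the illustration:

```python
from itertools import product


def possible_haplotype_pairs(genotypes):
    """Return every distinct haplotype pair compatible with unphased genotypes.

    genotypes: list of 2-character strings, one per marker, e.g. ["AC", "GT"].
    (Hypothetical helper for illustration only, not part of any library.)
    """
    pairs = set()
    # For each marker, choose which of the two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotypes)):
        first = "".join(g[c] for g, c in zip(genotypes, choice))
        second = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        # The two chromosomes are unordered, so sort to avoid counting a pair twice.
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)


print(possible_haplotype_pairs(["AC", "GT"]))
# prints [('AG', 'CT'), ('AT', 'CG')]
```

With n heterozygous markers there are 2^(n-1) candidate pairs, which is
why a statistical method is needed for real data.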

Moreover, fastPhase (like other programs of this kind) can infer missing
genotype data, which is useful when you have big collections of SNPs.

That said, I don't know if it is able to infer haplotypes in polyploid
organisms, but I don't think so, as it would be a different (more
complex) class of problem.
I thought the best thing to do is not to support polyploidy for now;
if someone who uses fastPhase on polyploid data comes along, it
would be easy to adapt the module (it would just require adding
an option).

>> > On the other hand, are these hand edited files which deliberately break the
>> > rules?
>>
>> Yes. Usually you shouldn't have either of the two cases. But I find it
>> useful when a script tells me if there are weird things in my files (I
>> could have modified them accidentally).
>
> Yes - negative test cases are good.  However, having them as a doctest made the
> docstring rather confusing.

mmm I know, that doctest could be refactored.
I have started using test recently... I find it is a lot better.

>
>> > If fastPHASE files SHOULD always come in allele groups (of the same
>> > length), then it would be better to integrate the parser into Bio.AlignIO
>> > giving pairwise alignments (and you would be able to read it via Bio.SeqIO
>> > automatically as well).
>>
>> This is a good idea, I didn't think of it.
>> But how should I modify the module to produce AlignIO objects?
>
> Essentially, instead of:
>
> yield record_one
> yield record_two
>
> you'd do something like this:
>
> alignment = Alignment(generic_dna)
> alignment.add_sequence(id_one, seq_one)
> alignment.add_sequence(id_two, seq_two)
> yield alignment

sounds easy :)
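
Just to check that I understand the idea, here is a rough sketch of the
grouping step in plain Python, without any Biopython classes. I am
assuming the parser already produces (id, sequence) tuples, two per
individual and in order; pair_records is a hypothetical name, and each
"alignment" is just a list of two tuples standing in for a real
alignment object:

```python
def pair_records(records):
    """Group consecutive (id, sequence) tuples into two-row alignments.

    Hypothetical helper: each yielded item is a plain list of two
    (id, sequence) tuples, a stand-in for a real alignment object.
    """
    pending = None
    for rec_id, seq in records:
        if pending is None:
            # First haplotype of the individual: wait for its partner.
            pending = (rec_id, seq)
            continue
        # Rows of an alignment must have the same length.
        if len(pending[1]) != len(seq):
            raise ValueError(
                "Haplotypes %s and %s differ in length" % (pending[0], rec_id)
            )
        yield [pending, (rec_id, seq)]
        pending = None
    if pending is not None:
        # An odd number of records means one haplotype has no partner.
        raise ValueError("Unpaired haplotype for %s" % pending[0])


records = [("ind1_a", "AG"), ("ind1_b", "CT"), ("ind2_a", "AT"), ("ind2_b", "CG")]
for alignment in pair_records(records):
    print(alignment)
```

The length check and the unpaired-record check would also cover the
"weird file" cases from the negative tests.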

>
>> > P.S. Your suggested format name "fastPhaseOutput" breaks the lower case
>> > rule.  Would "fastphase" be OK, or is there more than one format?  e.g.
>> > an input format which might be confused with this?
>>
>> I agree.. I wasn't sure of biopython's naming conventions.
>>
>
> This is written down elsewhere - but the format name is a lowercase string (and
> this is enforced in the API), and the same names are used in both SeqIO and
> AlignIO. Where possible we use the same name as BioPerl's SeqIO and EMBOSS.
>
> (In reply to comment #9)
>> (In reply to comment #7)
>> > Finally could you try the -Z command line argument for the simplified output
>> > format (described as two lines per individual, without "id" lines,
>> > subpopulation labels or summary information from the run).  Does this have
>> > the sequences?  If so this may be a more parser friendly set of output to
>> > parse for Bio.SeqIO and/or Bio.AlignIO.
>>
>> ok, I can try to implement both formats, but for the moment I would
>> prefer to concentrate on one.
>
> I was actually thinking the -Z format might be much simpler to deal with (I
> didn't mean to suggest supporting both).  On the other hand, the documentation
> does say the -Z is "not intended for general use".

The problem is that a fastPhase run can take days... most of
the time you want the longer format, and then proceed to parse it.
Anyway, it should not be a big problem to implement it (I am just
putting all of that information in SeqRecord.description).

>
> Peter

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it