[Biopython-dev] [Bug 2643] Proposal: fastPhaseOutputIO for SeqIO

bugzilla-daemon at portal.open-bio.org
Thu Nov 6 17:33:12 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2643





------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk  2008-11-06 12:33 EST -------
(In reply to comment #8)
> 
> ok
> Actually I have been using files which come from our laboratory analysis,
> and I would like to ask first whether I can include them here, and how.

If you can get permission to include a real example (and it's not too big) that
would be great.  Ideally something with at least three alleles.

> > Do you have URL for the file format documentation?  
> 
> The fastphase format seems to be described only in fastphase's manual,
> which is only accessible after accepting a license agreement.
> I could contact the authors of the program to ask them to publish the format
> specifications publicly. It would be in their interest, as otherwise the
> format could be considered non-standard.  I'll let you know.

It's not very open, is it :(

Are there any other tools that output this file format?  Do you think the
author might be willing to just add an option to output the sequences in
another format (e.g. FASTA, or better, an alignment format designed for more
than one alignment)?  This would be a neater solution in the long run (and
would benefit anyone using fastPhase - not just Biopython).

> > Are they always DNA for example, or is RNA also possible?
> 
> They should be DNA. In principle they could also be genes, or other kinds of
> characters, but this software is designed for the purpose of reconstructing
> haplotypes from SNPs/microsatellites.
> Maybe Tiago has some more experience in this..

If it is for DNA only, the sequences/alignments returned should ideally specify
a DNA alphabet.

> ...
> Because that would mean that one individual has only one chromosome.
> It doesn't make sense to run fastPhase on a haploid individual.

Is fastPhase only for diploids?  Could it be used with polyploids (e.g.
plants)?

> > On the other hand, are these hand edited files which deliberately break the
> > rules?  
> 
> Yes. Usually you shouldn't have either of the two cases. But I find it
> useful when a script tells me if there are weird things in my files (I
> could have modified them accidentally).

Yes - negative test cases are good.  However, having them as a doctest made the
docstring rather confusing.
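
To illustrate keeping the negative cases out of the docstring, here is a rough sketch of how they could live in a separate unittest instead. The validator check_allele_pair and the length-mismatch case are made up for illustration; they are not taken from the submitted module, and the real checks may differ.

```python
import unittest


def check_allele_pair(seq_one, seq_two):
    """Hypothetical validator: both alleles of an individual must match in length."""
    if len(seq_one) != len(seq_two):
        raise ValueError(
            "Allele lengths differ: %i vs %i" % (len(seq_one), len(seq_two))
        )
    return True


class NegativeCases(unittest.TestCase):
    """Deliberately broken input belongs here rather than in a doctest."""

    def test_length_mismatch(self):
        # A hand-edited file with a truncated second allele should be rejected.
        with self.assertRaises(ValueError):
            check_allele_pair("ACGT", "ACG")

    def test_valid_pair(self):
        # A well-formed pair passes the check.
        self.assertTrue(check_allele_pair("ACGT", "ACGA"))
```

The docstring can then show only the normal usage, with the error cases documented by the test names.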

> > If fastPHASE files SHOULD always come in allele groups (of the same
> > length), then it would be better to integrate the parser into Bio.AlignIO
> > giving pairwise alignments (and you would be able to read it via Bio.SeqIO
> > automatically as well).
> 
> This is good idea, I didn't think of it.
> But how should I modify the module to produce AlignIO objects?

Essentially, instead of:

yield record_one
yield record_two

you'd do something like this:

alignment = Alignment(generic_dna)
alignment.add_sequence(id_one, seq_one)
alignment.add_sequence(id_two, seq_two)
yield alignment
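
As a library-free sketch of the same grouping idea: the helper name pair_alleles and the (id, sequence) tuples below are stand-ins invented for illustration, not the Bio.AlignIO API. A real parser would build Biopython alignment objects where this yields plain lists.

```python
def pair_alleles(records):
    """Group consecutive (id, sequence) tuples into two-row 'alignments'.

    Hypothetical helper for illustration only. Each yielded item is a
    list of two records of equal length, standing in for one pairwise
    alignment per individual.
    """
    records = iter(records)
    for first in records:
        try:
            second = next(records)
        except StopIteration:
            # An unpaired trailing record means one allele is missing.
            raise ValueError("Record %r has no partner allele" % first[0]) from None
        if len(first[1]) != len(second[1]):
            raise ValueError(
                "Alleles %r and %r differ in length" % (first[0], second[0])
            )
        yield [first, second]


# Example input: two individuals, two alleles each (made-up data).
example = [
    ("ind1_a", "ACGT"),
    ("ind1_b", "ACGA"),
    ("ind2_a", "TCGT"),
    ("ind2_b", "TCGT"),
]
alignments = list(pair_alleles(example))
```

Yielding one unit per individual also gives a natural place to report the "weird" files (a missing allele, or alleles of different lengths) as errors.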

> > P.S. Your suggested format name "fastPhaseOutput" breaks the lower case
> > rule.  Would "fastphase" be OK, or is there more than one format?  e.g.
> > an input format which might be confused with this?
> 
> I agree.. I wasn't sure of biopython's naming conventions.
> 

This is written down elsewhere - but the format name is a lowercase string (and
this is enforced in the API), and the same names are used in both SeqIO and
AlignIO. Where possible we use the same name as BioPerl's SeqIO and EMBOSS.
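
For illustration, a check along these lines would reject a mixed-case name such as "fastPhaseOutput". This is only a sketch of the rule; check_format_name is a made-up function and the messages are not copied from Biopython's code.

```python
def check_format_name(format_name):
    """Return the name unchanged if it is a non-empty lowercase string.

    Hypothetical helper showing the naming rule, not Biopython's own code.
    """
    if not isinstance(format_name, str):
        raise TypeError("Need a string for the file format (lower case)")
    if not format_name:
        raise ValueError("Format required (lower case string)")
    if format_name != format_name.lower():
        raise ValueError("Format string %r should be lower case" % format_name)
    return format_name


accepted = check_format_name("fastphase")
```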

(In reply to comment #9)
> (In reply to comment #7)
> > Finally could you try the -Z command line argument for the simplified output
> > format (described as two lines per individual, without “id” lines,
> > subpopulation labels or summary information from the run).  Does this have
> > the sequences?  If so this may be a more parser friendly set of output to
> > parse for Bio.SeqIO and/or Bio.AlignIO.
> 
> ok, I can try to implement both of the two formats, but for the moment I will
> prefer to concentrate on one.

I was actually thinking the -Z format might be much simpler to deal with (I
didn't mean to suggest supporting both).  On the other hand, the documentation
does say the -Z is "not intended for general use".

Peter

