[BioPython] [PopGen] a random Haplotype Sets generator

Thu Nov 13 18:57:30 UTC 2008

Giovanni Marco Dall'Olio wrote:
> On Thu, Nov 13, 2008 at 4:29 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>   
>> Tiago Antão wrote:
>>     
>>>> This is right: which word can I use, then?
>>>> HaplotypesSampler? RandomHaplotypesSpawner?
>>>> HaplotypesCreator?
>>>>
>>>>         
>>> Considering that this is probably a small piece of code in the long
>>> run (correct me if I am wrong), I suggest creating
>>> Bio.PopGen.Utils.NameToBeDecided.py
>>>       
>> Hi,
>> I really don't mean to be negative, but you have certain responsibilities
>> once you release code into the Biopython community. Part of my concern is
>> that some of this is being overlooked, especially in terms of the user of
>> the code. I do see that simulation of SNPs is useful for users, so it is
>> important that it is integrated correctly.
>>
>> I think Michiel's recent comment in 'a sequence set object in biopython'
>> thread is important here as well:
>>
>> "Adding new classes to Biopython should be done very carefully ... once
>> they're in, it's difficult to remove them again. In the past, removing
>> classes that turned out to be less than ideal was a real headache."
>>
>> While I have not looked at the code, my view is that it must remain
>> integrated into the PopGen module. I would expect that a user would use
>> some Biopython (PopGen) modules with some simulated SNPs. I would prefer
>> that Biopython remains as much as possible a set of integrated tools rather
>> than just a collection of tools. This is a clear example where, if it is not
>> totally integrated, then I don't see the point in including it in Biopython.
>>
>> The second aspect is that it must have a very stable API. Similarly to
>> Michiel's comment, changing APIs after a release is also a pain,
>> especially if the module has been around a long time. Based on your first
>> post, I would argue that you are not quite at this stage yet.
>>     
>
> ehi, wait :) I wasn't proposing to integrate this module in biopython,
> at least not yet!! :)
>   
Oh, am I on the right list? It does say Biopython... :-)

> This is a module to generate test sets to help the development of the
> other future PopGen modules.
>   
Great!

> For example, we wanted to write a function to calculate the Fst
> statistic over SNP data.
> The Fst is an index that tells you whether two given populations
> follow the same pattern of variability, and can therefore be
> considered two subpopulations of the same population or not.
> To test such a script, you will need a module like the one I wrote
> here: for example, you could create two samples of 200 individuals
> with the same frequencies at every site, and see what your Fst script
> tells you. Then, probably, compare the results with another tool that is
> already known to calculate the Fst correctly.
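
The test described in the quoted paragraph could be sketched roughly as
follows. This is only an illustration under stated assumptions: haploid
individuals, biallelic sites, and a simple (Ht - Hs) / Ht estimator. The
names sample_counts and fst are hypothetical and are not part of
Bio.PopGen.

```python
import random


def sample_counts(freqs, n_individuals, rng):
    """Draw n haploid individuals; return per-site counts of the first allele."""
    return [sum(rng.random() < p for _ in range(n_individuals)) for p in freqs]


def fst(counts_a, counts_b, n_a, n_b):
    """Simple Fst = (Ht - Hs) / Ht, summed over biallelic sites.

    Assumption: this is a basic heterozygosity-based estimator chosen for
    illustration, not necessarily the one a real Fst script would use.
    """
    hs_total = 0.0
    ht_total = 0.0
    for ca, cb in zip(counts_a, counts_b):
        pa, pb = ca / n_a, cb / n_b
        p = (ca + cb) / (n_a + n_b)  # pooled allele frequency
        # Size-weighted mean of the within-population heterozygosities.
        hs = (n_a * 2 * pa * (1 - pa) + n_b * 2 * pb * (1 - pb)) / (n_a + n_b)
        ht = 2 * p * (1 - p)  # heterozygosity of the pooled population
        hs_total += hs
        ht_total += ht
    return (ht_total - hs_total) / ht_total if ht_total > 0 else 0.0


rng = random.Random(42)
freqs = [0.2, 0.5, 0.7, 0.4] * 25  # made-up frequencies for 100 sites
pop_a = sample_counts(freqs, 200, rng)
pop_b = sample_counts(freqs, 200, rng)
print(fst(pop_a, pop_b, 200, 200))
```

With both samples drawn from the same frequencies, the printed value
should be close to zero; drawing the second sample from clearly different
frequencies should push it upwards.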
>
> So I was just asking for any suggestions - which models should I
> implement in this generator? And how? Which parameters should it
> accept? Should it use the random module?
>
>   
The API matters more than the actual implementation, as the later posts
by Tiago indicate.

Some coding-related comments:
freqs_per_site and alleles_per_site are lists.
This is a problem because these lists could get very large, the design
is inflexible, and the two lists could get out of sync.

While you do check the lengths, the error should say which list has the
different length.
You also need to check for valid inputs (frequencies between 0 and 1,
bases in ACGT).
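
As a rough sketch of what I mean, and only under assumptions: I am
guessing that each entry of alleles_per_site and freqs_per_site is a
sequence for one site. The function names below are hypothetical. The
idea is to merge the two lists into one allele -> frequency mapping per
site, so they cannot drift apart, and to report exactly what is wrong.
The check that frequencies sum to 1 is an extra assumption of mine.

```python
VALID_BASES = frozenset("ACGT")


def sites_from_lists(alleles_per_site, freqs_per_site):
    """Merge two parallel lists into one {allele: frequency} dict per site."""
    if len(alleles_per_site) != len(freqs_per_site):
        raise ValueError(
            f"alleles_per_site has {len(alleles_per_site)} sites but "
            f"freqs_per_site has {len(freqs_per_site)}"
        )
    sites = []
    for index, (alleles, freqs) in enumerate(zip(alleles_per_site, freqs_per_site)):
        if len(alleles) != len(freqs):
            raise ValueError(
                f"site {index}: {len(alleles)} alleles but {len(freqs)} frequencies"
            )
        sites.append(dict(zip(alleles, freqs)))
    return sites


def validate_sites(sites):
    """Raise ValueError naming the first site with an invalid allele or frequency."""
    for index, site in enumerate(sites):
        bad = sorted(set(site) - VALID_BASES)
        if bad:
            raise ValueError(f"site {index}: invalid allele(s) {bad}, expected ACGT")
        for allele, freq in site.items():
            if not 0.0 <= freq <= 1.0:
                raise ValueError(
                    f"site {index}: frequency {freq} for {allele} is outside [0, 1]"
                )
        total = sum(site.values())
        if abs(total - 1.0) > 1e-9:  # assumption: frequencies at a site sum to 1
            raise ValueError(f"site {index}: frequencies sum to {total}, not 1")


sites = sites_from_lists([("A", "T"), ("C", "G")], [(0.3, 0.7), (0.5, 0.5)])
validate_sites(sites)
```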

Some other comments:

Perhaps I misunderstood the situation, but the major problem I have is
that the locations are treated as independent, so your model assumes
unlinked loci. I just don't find this a useful scenario.

You assume that the user knows exactly which locations and frequencies to
change. Often you just want a random frequency and a random location. In
that case you need to randomly select locations and frequencies based on
some function, but I do not find the mode=='random' of paramsGenerator
sufficient to address this. Further, you might want a random sequence of
some length but not want all locations to change. While you could
set those locations to zero, a sparser form would be desirable.
Also, there should be a way to limit the randomly generated frequencies
to ranges other than the [0, 1) of random.random. Obviously the
question is whether or not users have to do this themselves.
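
Something along these lines is what I have in mind, as a minimal sketch
only. The function name, its parameters, and the default frequency range
are all hypothetical. It picks the variable locations at random, draws
each frequency uniformly from a user-chosen range, and returns a sparse
position -> frequency mapping, so unchanged locations need no entry.

```python
import random


def random_sparse_sites(seq_length, n_variable, freq_min=0.05, freq_max=0.5, rng=None):
    """Return {position: frequency} for n_variable randomly chosen positions."""
    rng = rng or random.Random()
    if not 0 <= n_variable <= seq_length:
        raise ValueError(
            f"n_variable={n_variable} must be between 0 and seq_length={seq_length}"
        )
    if not 0.0 <= freq_min <= freq_max <= 1.0:
        raise ValueError(f"invalid frequency range [{freq_min}, {freq_max}]")
    # Sample distinct positions without replacement.
    positions = rng.sample(range(seq_length), n_variable)
    # Uniform draw within the requested range instead of the full [0, 1).
    return {pos: rng.uniform(freq_min, freq_max) for pos in sorted(positions)}


sites = random_sparse_sites(1000, 10, freq_min=0.1, freq_max=0.3, rng=random.Random(1))
for position, frequency in sites.items():
    print(position, round(frequency, 3))
```

A uniform draw is just one choice here; any other distribution could be
passed in as the "some function" mentioned above.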

One particular use of generating SNPs pertains to known genes or
sequences. In such cases it would be great to use a known sequence as a
base for the simulation. Further, it would be very useful to be able to
incorporate known SNP data, especially frequencies, from some source like
HapMap (http://www.hapmap.org/). A nice but harder problem is to do this
based on a protein sequence, since many diseases refer to amino acids.
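
For the nucleotide case, a minimal sketch could look like the following.
The function name and the SNP table layout are hypothetical, and the
reference sequence and frequencies are made up for illustration; real
values would have to be supplied by the user, for example from HapMap.
Note that this still draws every site independently, so it has the same
unlinked-loci limitation mentioned earlier.

```python
import random


def simulate_haplotypes(reference, snps, n_haplotypes, rng=None):
    """Generate haplotypes by applying SNPs to a known reference sequence.

    snps maps a 0-based position to (alternative_base, alternative_frequency).
    """
    rng = rng or random.Random()
    for pos in snps:
        if not 0 <= pos < len(reference):
            raise ValueError(f"SNP position {pos} is outside the reference sequence")
    haplotypes = []
    for _ in range(n_haplotypes):
        bases = list(reference)
        for pos, (alt_base, alt_freq) in snps.items():
            # Each site is drawn independently of the others (unlinked loci).
            if rng.random() < alt_freq:
                bases[pos] = alt_base
        haplotypes.append("".join(bases))
    return haplotypes


reference = "ACGTACGTAC"  # made-up reference sequence
snps = {2: ("A", 0.25), 7: ("C", 0.4)}  # made-up positions and frequencies
for haplotype in simulate_haplotypes(reference, snps, 5, rng=random.Random(0)):
    print(haplotype)
```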

Perhaps my biggest 'disappointment' is the lack of ancestry control,
because I am also interested in families or some admixture in a population.
This just generates sequences randomly, assuming you are randomly
selecting individuals from a homogeneous population. I do understand this
usage, so it is not that important to include this here.

Bruce