[Bioperl-l] RFC: SNP::Inherit

Thu Apr 29 20:26:23 UTC 2010

Dear Bioperl community,

I was thinking of uploading a module to CPAN that converts SNP genotype data
to parental allele designations. Below is the perldoc. This is not a
"BioPerl" module per se, so I'm not sure what namespace to put it under.

I would be glad to send anyone the source if they are interested in checking
it out more. I just did not want to send everyone an unsolicited attachment.

Thank you for your time,
Christopher Bottoms (molecules)

NAME
    SNP::Inherit - Module for determining the parental origin of specific
    SNPs based on genotype data.

VERSION
    Version 0.0010_0001

SYNOPSIS
        my $foo = SNP::Inherit->new(
            manifest_filename => 'manifest.tab',
            data_filename     => 'data.tab'
        );

        #Upon object construction, this outputs a summary file
        #   'data.tab_summary.tab' and a detailed file 'data.tab_abh.tab'
        #   containing parental allele designations for each sample that has
        #   parents defined for it in the manifest file

DESCRIPTION
    This is a module for converting Single Nucleotide Polymorphism (SNP)
    genotype data to parental allele designations. This helps with creating
    files suitable for mapping, identifying and characterizing crossovers,
    and also helps with quality control.

SUBROUTINES/METHODS
  BUILD
        Since the integrity of the data in the manifest file is absolutely
        vital, building an object fails if there are duplicate sample ids in
        the manifest file.
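
        The duplicate check described above could be sketched roughly as
        follows. This is only an illustration: duplicate_sample_ids is a
        hypothetical helper, not a method of SNP::Inherit, and the module's
        own BUILD may work differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SNP::Inherit): returns each sample id
# that occurs more than once, in the order in which its first repeat is seen.
sub duplicate_sample_ids {
    my (@sample_ids) = @_;
    my %seen;
    my @duplicates;
    for my $id (@sample_ids) {
        $seen{$id}++;
        push @duplicates, $id if $seen{$id} == 2;    # report each id once
    }
    return @duplicates;
}

# Usage: the first id appears twice, so it is reported.
my @example_ids = qw(WG0096796-DNAA05 WG0096795-DNAA01 WG0096796-DNAA05);
my @duplicates  = duplicate_sample_ids(@example_ids);
print "Duplicate sample ids: @duplicates\n" if @duplicates;
```

        A constructor could call die with such a message instead of
        printing it, which would give the fail-fast behavior described
        for BUILD.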

ATTRIBUTES
  manifest_filename
        Name of the file containing information for each sample id

        Required in the constructor

        The first line contains headers and the remaining lines contain
            tab-delimited fields in the following order:

            sample id     or "Institute Sample Label"     (e.g. "WG0096796-DNAA05")
            sample name   or "Sample name"                (e.g. "B73xB97")
            group name    or "Group"                      (e.g. "NAM F1")
            parentA       or "Mother"                     (e.g. "WG0096795-DNAA01")
            parentB       or "Father"                     (e.g. "WG0096796-DNAF01")
            replicate of  or "Replicate(s)"               (id of the sample that this
                                                           replicates, e.g. "WG0096796-DNAA05")
            AxB F1        or "F1 of parentA and parentB"  (e.g. "WG0096795-DNAA02")

        The last four fields can be blank if they are not applicable.
        However, leaving them blank when they are applicable will prevent
        the program from analyzing the data properly.
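
        As an illustration of the layout above, one manifest line could be
        split into named fields like this. The hash keys and the function
        name parse_manifest_line are assumptions made for this sketch and
        are not taken from the module's source.

```perl
use strict;
use warnings;

# Field names are hypothetical labels for the seven columns listed above.
my @FIELDS = qw(sample_id sample_name group_name
                parent_a parent_b replicate_of axb_f1);

# Hypothetical helper: turn one tab-delimited manifest line into a hash ref.
sub parse_manifest_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                  # strip the line ending
    # A negative limit keeps trailing empty fields, which matters because
    # the last four columns may legitimately be blank.
    my @values = split /\t/, $line, -1;
    my %record;
    @record{@FIELDS} = map { defined($_) ? $_ : '' } @values[ 0 .. $#FIELDS ];
    return \%record;
}

# Usage: a sample with both parents given and the last two fields blank.
my $record = parse_manifest_line(
    "WG0096796-DNAA05\tB73xB97\tNAM F1\tWG0096795-DNAA01\tWG0096796-DNAF01\t\t\n"
);
print "$record->{sample_id} has parents ",
      "$record->{parent_a} and $record->{parent_b}\n";
```

        Without the negative limit, split would drop the trailing blank
        columns, and a line shorter than seven fields would yield undefined
        values, which the map above converts to empty strings.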

  data_filename
        Name of the tab-delimited file containing the data to be processed.

        Required in the constructor.

        The text '[Data]' in a line indicates that the remaining lines are
        all data.
        The next line contains column headers, which are in fact the sample
        ids.
            Sample ids missing from the manifest file will not be processed.
        Each line after that contains the name of a SNP in the first field
            and data in the remaining fields.

        Data must be in the format of SNP_name{tab}AA{tab}GG{tab}.
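
        A reader for this layout might look like the sketch below. It is
        not the module's parser: parse_data_lines is a hypothetical name,
        and the sketch assumes that the first field of the header line is
        a label for the SNP column rather than a sample id.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of a data file and returns
# (array ref of sample ids, hash ref of genotypes keyed by SNP then sample).
sub parse_data_lines {
    my (@lines) = @_;
    my ( @sample_ids, %genotypes_for, $in_data );
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;
        if ( !$in_data ) {
            # Everything up to and including the '[Data]' line is skipped.
            $in_data = 1 if $line =~ /\[Data\]/;
            next;
        }
        next if $line eq '';
        my ( $first, @fields ) = split /\t/, $line, -1;
        if ( !@sample_ids ) {
            # Header line: assumed to be a column label, then sample ids.
            @sample_ids = grep { $_ ne '' } @fields;
            next;
        }
        # Data line: SNP name, then one genotype per sample id. Any extra
        # trailing empty field (from a trailing tab) is ignored.
        @{ $genotypes_for{$first} }{@sample_ids} = @fields;
    }
    return ( \@sample_ids, \%genotypes_for );
}

# Usage with made-up sample ids and SNP names.
my ( $ids, $genotypes ) = parse_data_lines(
    "[Header]", "[Data]", "\tS1\tS2", "snp1\tAA\tGG\t", "snp2\tCG\tCC\t",
);
print "snp1 for $ids->[1]: $genotypes->{snp1}{ $ids->[1] }\n";
```

        Filtering the sample ids against the manifest would be a separate
        step, for example a grep over the returned list of ids.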

OUTPUT FILES
        Upon object construction, two files are produced: one that summarizes
        the input and another that describes the genotypes of samples in
        terms of their "parents". For example, a sample with a genotype of
        "CG" whose 'parentA' has a genotype of "CC" and whose 'parentB' has a
        genotype of "GG" would have a heterozygous genotype, labeled as 'H'.

        Here are the possible allele designations that result:

            Allele designations for informative genotypes:
                A = parentA genotype
                B = parentB genotype
                H = heterozygous genotype

            Allele designations for noninformative genotypes:
                ~ = nonpolymorphic parents (i.e. both parents have the same
                        genotype)
                - = missing data
                -- = missing data for at least one parent
                % = polymorphic parent

            Error codes:
                # = conflict of nonpolymorphic expectation, meaning both
                        parents have the same genotype, but the sample has a
                        different genotype. For example, parentA and parentB
                        both have the genotype 'CC', but the sample has a
                        genotype of 'TT'.

                ! = nonparental genotype, meaning each parent has a different
                        genotype, but the sample has at least one allele not
                        seen in either parent. For example, getting 'AG' for
                        the offspring when the parents have 'GG' and 'TT'.
                        (This should not even be seen when the data was
                        obtained from a biallelic assay.)

                !! = genotype of the F1 for parentA x parentB is incongruent
                        with the genotype for parentA

        See the bundled tests for examples.
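
        To make the designations above concrete, here is a rough sketch of
        how a single genotype could be classified. It is not the module's
        implementation, and it rests on several assumptions: a missing call
        is an empty string or contains '-', "polymorphic parent" is read as
        a heterozygous parent, and the checks are applied in the order
        shown. The '!!' code is left out because it needs the F1 genotype.
        All function names are hypothetical.

```perl
use strict;
use warnings;

# Sort allele characters so that 'GC' and 'CG' compare as equal.
sub normalize_genotype {
    my ($genotype) = @_;
    my @alleles = split //, $genotype;
    return join '', sort @alleles;
}

# Assumption: a missing call is undefined, empty, or contains '-'.
sub is_missing {
    my ($genotype) = @_;
    return 1 if !defined($genotype) || $genotype eq '';
    return $genotype =~ /-/ ? 1 : 0;
}

sub is_heterozygous {
    my ($genotype) = @_;
    my %alleles = map { $_ => 1 } split //, $genotype;
    return keys(%alleles) > 1 ? 1 : 0;
}

# Returns one of: A B H ~ - -- % # !
sub designate {
    my ( $sample, $parent_a, $parent_b ) = @_;

    return '-'  if is_missing($sample);
    return '--' if is_missing($parent_a) || is_missing($parent_b);

    ( $sample, $parent_a, $parent_b )
        = map { normalize_genotype($_) } ( $sample, $parent_a, $parent_b );

    # Assumption: '%' marks a heterozygous (hence uninformative) parent.
    return '%' if is_heterozygous($parent_a) || is_heterozygous($parent_b);

    # Parents share a genotype: the sample either agrees or conflicts.
    if ( $parent_a eq $parent_b ) {
        return $sample eq $parent_a ? '~' : '#';
    }

    # Any allele that neither parent carries is a nonparental genotype.
    my %parental = map { $_ => 1 } split //, $parent_a . $parent_b;
    return '!' if grep { !$parental{$_} } split //, $sample;

    return 'A' if $sample eq $parent_a;
    return 'B' if $sample eq $parent_b;
    return 'H';
}

# Usage: the examples given in the text above.
print designate( 'CG', 'CC', 'GG' ), "\n";    # sample CG, parents CC and GG
print designate( 'TT', 'CC', 'CC' ), "\n";    # parents agree, sample differs
print designate( 'AG', 'GG', 'TT' ), "\n";    # allele 'A' seen in neither parent
```

        The three calls print H, # and ! respectively, matching the
        examples in the descriptions above.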

TODO
        Output report detailing which samples have been processed and in what
        way. Also give descendant and ancestor relationships.

        Document ability to process files using F1 and parentA info (i.e. in
        the absence of parentB info).

        Add simple means of adding map info so that distances and chromosomes
        are output along with the marker names.

        Give crossover info?

        Give introgressions/regions attributable to specific ancestor(s).

        Use benchmarking to find out which (if any) to memoize:
        _nonredundant_chars
        _trim
        _is_comprised_from
        _sorted_characters
        _sort_and_join
        _chars_from
        _sorted_first_two_char

        Test bad file names

DIAGNOSTICS
        TODO

CONFIGURATION AND ENVIRONMENT
       TODO

DEPENDENCIES
       TODO

INCOMPATIBILITIES
       TODO

BUGS
    Please report any you find. None have been reported as of the current
    release.

LIMITATIONS
    Be conscientious with the preparation of your input files (i.e. manifest
    file and data file). Correct results depend on correct input files.

AUTHOR
    Christopher Bottoms, "<molecules at cpan.org>"

SUPPORT
    You can find documentation for this module with the perldoc command.

        perldoc SNP::Inherit

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT
    This program is free software; you can redistribute it and/or modify it
    under the terms of either: the GNU General Public License as published
    by the Free Software Foundation; or the Artistic License.

    See http://dev.perl.org/licenses/ for more information.

    Copyright 2010 Christopher Bottoms.