[Bioperl-l] RFC: SNP::Inherit
Christopher Bottoms
maizemu at gmail.com
Thu Apr 29 20:26:23 UTC 2010
Dear Bioperl community,
I was thinking of uploading a module to CPAN that converts SNP genotype data
to parental allele designations. Below is the perldoc. This is not a
"BioPerl" module per se, so I'm not sure what namespace to put it under.
I would be glad to send anyone the source if they are interested in checking
it out more. I just did not want to send everyone an unsolicited attachment.
Thank you for your time,
Christopher Bottoms (molecules)
NAME
SNP::Inherit - Module for determining the parental origin of specific
SNPs based on genotype data.
VERSION
Version 0.0010_0001
SYNOPSIS
use SNP::Inherit;

my $foo = SNP::Inherit->new(
    manifest_filename => 'manifest.tab',
    data_filename     => 'data.tab',
);

# Upon object construction, this outputs a summary file
# 'data.tab_summary.tab' and a detailed file 'data.tab_abh.tab'
# containing parental allele designations for each sample that has
# parents defined for it in the manifest file.
DESCRIPTION
This is a module for converting Single Nucleotide Polymorphism (SNP)
genotype data to parental allele designations. This helps with creating
files suitable for mapping, identifying and characterizing crossovers,
and also helps with quality control.
SUBROUTINES/METHODS
BUILD
Since the integrity of the data in the manifest file is absolutely
vital, building an object fails if there are duplicate sample ids in
the manifest file.
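As an illustration of the kind of check described above, here is a minimal
sketch of detecting duplicate sample ids. The helper name
duplicate_sample_ids is hypothetical and is not part of the module's
documented interface; the module's own BUILD code may work differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SNP::Inherit's documented API):
# returns the sample ids that occur more than once, in first-seen order.
sub duplicate_sample_ids {
    my @ids = @_;
    my %count;
    $count{$_}++ for @ids;
    my %reported;
    return grep { $count{$_} > 1 && !$reported{$_}++ } @ids;
}

my @dups = duplicate_sample_ids(
    'WG0096796-DNAA05', 'WG0096795-DNAA01', 'WG0096796-DNAA05',
);
print "Duplicate sample ids: @dups\n" if @dups;
```

A constructor could die with such a list in its error message, so that the
offending ids are easy to find in the manifest file.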
ATTRIBUTES
manifest_filename
Name of the file containing information for each sample id.
Required in the constructor.
The first line contains headers and the remaining lines contain
tab-delimited fields in the following order:
sample id or "Institute Sample Label" (e.g. "WG0096796-DNAA05")
sample name or "Sample name" (e.g. "B73xB97")
group name or "Group" (e.g. "NAM F1")
parentA or "Mother" (e.g. "WG0096795-DNAA01")
parentB or "Father" (e.g. "WG0096796-DNAF01")
replicate of or "Replicate(s)" (id of the sample that this one
    replicates, e.g. "WG0096796-DNAA05")
AxB F1 or "F1 of parentA and parentB" (e.g. "WG0096795-DNAA02")
The last four fields can be blank if they are not applicable. However,
leaving them blank when they are applicable will prevent the program
from analyzing the data properly.
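To make the field order concrete, the following sketch splits one manifest
line into named fields. The hash keys and the helper name
parse_manifest_line are chosen for illustration only and are assumptions,
not the names used inside the module.

```perl
use strict;
use warnings;

# Illustrative field names; the order follows the manifest description.
my @FIELDS = qw(sample_id sample_name group parentA parentB replicate_of f1);

# Hypothetical helper: turn one manifest line into a hash reference.
sub parse_manifest_line {
    my ($line) = @_;
    chomp $line;

    # A limit of -1 keeps trailing empty fields (blank parent columns).
    my @values = split /\t/, $line, -1;

    my %record;
    @record{@FIELDS} = map { defined $_ ? $_ : '' } @values[ 0 .. $#FIELDS ];
    return \%record;
}

my $rec = parse_manifest_line(
    "WG0096796-DNAA05\tB73xB97\tNAM F1\tWG0096795-DNAA01\tWG0096796-DNAF01\t\t\n"
);
printf "%s: parents %s x %s\n",
    $rec->{sample_id}, $rec->{parentA}, $rec->{parentB};
```

Blank or absent trailing fields come back as empty strings, so a caller can
test them without triggering "uninitialized value" warnings.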
data_filename
Name of the tab-delimited file containing the data to be processed.
Required in the constructor.
The text '[Data]' in a line indicates that the remaining lines are all
data.
The next line contains column headers, which are in fact the sample
ids. Sample ids missing from the manifest file will not be processed.
Each following line contains the name of a SNP in the first field and
data in the remaining fields.
Data must be in the format of SNP_name{tab}AA{tab}GG{tab}.
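The layout just described could be read along the following lines. This is
only a sketch under stated assumptions: the helper name read_genotypes is
hypothetical, the first field of the header line is assumed to label the
SNP column, and the preamble line and SNP name in the sample input are made
up for the example.

```perl
use strict;
use warnings;

# Hypothetical reader: returns the sample ids (array ref) and a hash ref of
# SNP name => { sample id => genotype }.
sub read_genotypes {
    my ($fh) = @_;

    # Skip everything up to and including the line containing '[Data]'.
    while ( my $line = <$fh> ) {
        last if index( $line, '[Data]' ) >= 0;
    }

    my $header = <$fh>;
    return ( [], {} ) unless defined $header;
    $header =~ s/\r?\n\z//;

    # Assumption: the first header field labels the SNP name column.
    my ( undef, @sample_ids ) = split /\t/, $header;

    my %genotype_for;
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;
        next if $line eq '';
        my ( $snp, @calls ) = split /\t/, $line, -1;
        for my $i ( 0 .. $#sample_ids ) {
            $genotype_for{$snp}{ $sample_ids[$i] } =
                defined $calls[$i] ? $calls[$i] : '';
        }
    }
    return ( \@sample_ids, \%genotype_for );
}

# Made-up input: one preamble line, the '[Data]' marker, a header, one SNP.
my $text = join "\n",
    'preamble line',
    '[Data]',
    "SNP\tWG0096795-DNAA01\tWG0096796-DNAF01",
    "snp_1\tAA\tGG",
    '';
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my ( $ids, $calls ) = read_genotypes($fh);
close $fh;
printf "snp_1 in %s: %s\n", $ids->[0], $calls->{snp_1}{ $ids->[0] };
```

Sample ids that do not appear in the manifest file could then be dropped
before any further processing.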
OUTPUT FILES
Upon object construction, two files are produced: one that summarizes
the input and another that describes the genotypes of samples in terms
of their "parents". For example, a sample with a genotype of "CG" whose
'parentA' has a genotype of "CC" and whose 'parentB' has a genotype of
"GG" would have a heterozygous genotype, labeled as 'H'.
Here are the possible allele designations that result:
Allele designations for informative genotypes:
A = parentA genotype
B = parentB genotype
H = heterozygous genotype
Allele designations for noninformative genotypes:
~ = nonpolymorphic parents (i.e. both parents have the same genotype)
- = missing data
-- = missing data for at least one parent
% = polymorphic parent
Error codes:
# = conflict of nonpolymorphic expectation, meaning both parents have
    the same genotype, but the sample has a different genotype. For
    example, parentA and parentB both have the genotype 'CC', but the
    sample has a genotype of 'TT'.
! = nonparental genotype, meaning each parent has a different
    genotype, but the sample has at least one allele not seen in
    either parent. For example, getting 'AG' for the offspring when
    the parents have 'GG' and 'TT'. (This should not even be seen
    when the data was obtained from a biallelic assay.)
!! = genotype of the F1 for parentA x parentB is incongruent with the
     genotype for parentA
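The designations above can be sketched as a small function. This is not the
module's code, and it rests on several assumptions: a call counts as
missing unless it is exactly two of A, C, G, T; a "polymorphic parent" is
taken to mean a heterozygous parent; the order in which the checks are
applied is a guess; and the '!!' code is left out because it needs the F1
genotype. All subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Sort the two allele characters so that 'GC' and 'CG' compare equal.
sub normalize {
    my ($genotype) = @_;
    return join '', sort split //, uc $genotype;
}

# Assumption: a call is "missing" unless it is exactly two of A, C, G, T.
sub is_missing {
    my ($genotype) = @_;
    return !defined $genotype || $genotype !~ /\A[ACGT]{2}\z/i;
}

# True when the two alleles of a (non-missing) genotype differ.
sub is_het {
    my ( $x, $y ) = split //, $_[0];
    return $x ne $y;
}

# Hypothetical sketch of the designation rules (without '!!').
sub designate {
    my ( $sample, $parent_a, $parent_b ) = @_;
    return '-'  if is_missing($sample);
    return '--' if is_missing($parent_a) || is_missing($parent_b);

    ( $sample, $parent_a, $parent_b ) =
        map { normalize($_) } $sample, $parent_a, $parent_b;

    # Assumption: "polymorphic parent" means a heterozygous parent.
    return '%' if is_het($parent_a) || is_het($parent_b);

    if ( $parent_a eq $parent_b ) {
        return $sample eq $parent_a ? '~' : '#';
    }
    return 'A' if $sample eq $parent_a;
    return 'B' if $sample eq $parent_b;

    # Both parents are homozygous and differ: one allele from each.
    my $expected_het =
        normalize( substr( $parent_a, 0, 1 ) . substr( $parent_b, 0, 1 ) );
    return $sample eq $expected_het ? 'H' : '!';
}

printf "%s %s %s\n",
    designate( 'CG', 'CC', 'GG' ),
    designate( 'TT', 'CC', 'CC' ),
    designate( 'AG', 'GG', 'TT' );
```

The three calls correspond to the examples given in the text for 'H', '#'
and '!' respectively.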
See the bundled tests for examples.
TODO
Output report detailing which samples have been processed and in what
way.
Also give descendant and ancestor relationships.
Document ability to process files using F1 and parentA info (i.e. in
the absence of parentB info).
Add simple means of adding map info so that distances and chromosomes
are output along with the marker names.
Give crossover info?
Give introgressions/regions attributable to specific ancestor(s).
Use benchmarking to find out which (if any) to memoize:
_nonredundant_chars
_trim
_is_comprised_from
_sorted_characters
_sort_and_join
_chars_from
_sorted_first_two_char
Test bad file names.
DIAGNOSTICS
TODO
CONFIGURATION AND ENVIRONMENT
TODO
DEPENDENCIES
TODO
INCOMPATIBILITIES
TODO
BUGS
Please report any you find. None have been reported as of the current
release.
LIMITATIONS
Be conscientious when preparing your input files (i.e. the manifest
file and the data file). Correct results depend on correct input files.
AUTHOR
Christopher Bottoms, "<molecules at cpan.org>"
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc SNP::Inherit
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
This program is free software; you can redistribute it and/or modify it
under the terms of either: the GNU General Public License as published
by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.
Copyright 2010 Christopher Bottoms.