[Bioperl-l] How to assess sequencing quality with Bioperl?
Cook, Malcolm
MEC at Stowers-Institute.org
Fri Sep 2 10:24:08 EDT 2005
Andrew,
Following is a manpage for a program I wrote using Bioperl (and phred).
If you'd like it, let me know, I'll slap our institute approved software
license on it an put a `make install`able tarball on a webpage
somewhere...
Malcolm Cook - mec at stowers-institute.org - 816-926-4449
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City, MO USA
________________________________________________________________________
_________
SEQTRACESTATS(1) User Contributed Perl Documentation
SEQTRACESTATS(1)
SYNOPSIS
seqtracestats --bins=max1,max2,max3 --informat trace_format
--readname-
convention attributelist=expression --mtime daycount --abi2phd <
trace
files > quality score summary
Given a set of DNA sequence traces (or possibly a directory to
find
them in based on their age), produce summary descriptive
statistics of
the read quality (after optionally (re)base calling using phred).
OPTIONS
Their case is ignored, and they may be abbreviated to uniqueness
(i.e.
--v instead of --verbose).
Options may be specified on the command line, and may optionally
also
be read from files by providing on the command line the path to
the
file preceded by a '@'. These option files provide simple access
to
typical calling scenarios (such as an analysis that is repeatedly
invoked from the command line with the same parameters).
Additionally,
if the current directory contains a file named
seqtracestats.config, it
will be automatically used as an option file.
--informat = PHD|SCF|ABI
Format of input files. Defaults to PHD. Can be any
Bio::SeqIO
format that contains quality values, such as SCF.
TODO: currently only PHD works. Get the staden integration
for ABI
and SCF files working with BioPerl.
--bins = n,m,o,p
A comma separated list of numbers which should be in
increasing
order. They define the ranges into which the sequence
quality
score for each base-pair read is binned; each number sets the
maxi-
mum allowable value in the bin, with the minimum value being
deter-
mined by the previous bin, if any.
Note that any quality scores encountered which are greater
than the
largest bin will not be included in any bin. Otherwise, the
sum of
the values in each of the bins will equal the entire sequence
length.
I have not encountered any documentation specifying a maximum
phred
score assigned in practice by any base caller. However, I
have
never encountered a score, or a report of a score, greater
than 50.
So, you can pretty safely use 60 as your highest bin, which
corre-
sponds to an error probability of one in a million.
Defaults to --bins=20,30,40,50,60.
--mtimeatt timeattributes
Allows specification of time attributes to appear in report.
--readnameconvention attributelist=expression
A method of decoding attributes such as are typically encoded
in
the name of files holding seqreads.
Implemented as an association between a comma delimited list
of
attribute names and a perl expression which, when evaluated
yeilds
corresponding values for each attribute. The expression is
evalu-
ated in the context of $_ being the filename holding a
seqread.
Any attributes containing the word 'ignore' as a substring
are
supressed from the summary.
For example:
"--readnameconvention='gene,library,plate,primer,platewellco-
ord,=split /[-_.]/'"
when reading the file: VPL3-15-I-BSB460_A01.seq
will assign to the current read attributes as follows:
gene => VPL3
library => 15
plate => I
primer => BSB460
platewellcoord => A01
Named aliases exist as shorthand methods for decoding
commonly used
naming conventions. The following are predefined:
TODO: IMPLEMENT THESE - THEY ARE NOT.
simr1
Short for:
'(gene,library,plate,primer,platewellcoord)=split
/[-_.]/'
--mtime
Interpret the input files as directorys to search for ab1
files
created (modified) mtime days ago. If none supplied,
defaults to
current working directory.
This value is used as --mtime argument to the unix find
command
(along with -daystart), and as such may be prefixed with '-'
or
'+', to mean 'less than' or 'greater than' mtime days ago.
May be specified multiple times to produce data ranges. I.e.
--mtime -31 --mtime +0 will select files between 1 and 30
days old
(inclusive)
Implies --abi2phd
--abi2phd
Input files are ab1 chromatogram files and they should be
replaced
with corresponding quality file, generated using phred as
needed.
If there is an existing corresponding .phd (or .phd.1) file,
it
will be used.
Otherwise, the ab1 files will be (re) base-called by phred,
gener-
ating sequence quality files in the same directory with
addition
extension of ".phd".
Note: if the input files already have corresponding '.phd' or
'.phd.1' file, it will NOT be re-basecalled and the existing
QV
scores will be used.
--help
Display command line usage with options.
--man
Display complete manual page and exit.
--verbose
Provides a trace of processing on STDERR.
--version
Display the scripts version number and exit.
DESCRIPTION
Output is in csv format (comma separated values) and may be
visualied
in Excel.
TODO: Elaborate this description?
EXAMPLES
seqtracestats --mtime 1 > seqtracestats.txt
Summarize the result of run phred on all ab1 files created
yester-
day which are subordinate to current directory, putting the
results
in file seqtracestats.txt
./seqtracestats --mtime -10 '/n/facility/Genomics/Sequenc-
ing/2003/Molecular\ Biology/' --readname 'platerow,plate-
col=m/_([A-H])(\d{2})\./' --readname
',,submitter,lab=reverse(split
/\//)'
Summarize reads less than 10 days old, parsing out platerow
and
plate col from the read filename, and interpreting selected
direc-
tory components as the submitter of the sequencing request,
and the
lab of which s/he is a member.
seqtracestats --mtime +1,-30 ./2003 > ./2003/febqual.txt
Summarize trace files that are less than 30 but more than 1
days
old which are within the directory ./2003 which is in turn
within
the current directory. Put the results in
./2003/febqual.txt.
Note: if this example is preceded by
cd /n/facility/Genomics/Sequencing/
then additional parameters will be picked up from the file
seq-
tracestats.config in that directory. This file implements
the
standard read naming conventions.
seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
--read-
nameconvention='gene,primer,platenumber,platecoord,=split
/[-_.]/'
Result of summarize the PHD formatted files into three bins,
decod-
ing the selected attributes from the read name:
trace_file_path,display_id,gene,primer,platenumber,platecoord,clone,vect
or,FR,plate_id,platerow,platecol,length,gc_content,mean,variance,20,30,6
0,20,30,60
./t/data/MC1081-M13F-20_A01.phd.1,MC1081-M13F-20_A01.ab1,MC1081,M13F,20,
A01,MC1081,M13,F,20,A,01,807,0.563816604708798,39.2094175960347,118.5479
01273288,54,83,670,54,83,670
./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13R,48,
A01,MC1081,M13,R,48,A,01,802,0.553615960099751,39.2418952618453,113.7466
57077655,54,76,672,54,76,672
./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13R,48,
A01,MC1081,M13,R,48,A,01,802,0.553615960099751,39.2418952618453,113.7466
57077655,54,76,672,54,76,672
...
seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
--read-
nameconvention='clone,vector,FR,plate_id,platerow,plate-
col=m/^([^\-]+)-(\w+)([FR])-(\d+)_(\w{1})(\d+)\..*$/'
Result of running on the same data but parsing the readname
differ-
ently, including pulling out read direction and plate row and
col-
umn:
trace_file_path,display_id,clone,vector,FR,plate_id,platerow,platecol,le
ngth,gc_content,mean,variance,num bp q<=20,num bp q<=30,num bp q<=60
./t/data/MC1081-M13F-20_A01.phd.1,MC1081-M13F-20_A01.ab1,MC1081,M13,F,20
,A,01,807,0.563816604708798,39.2094175960347,118.547901273288,54,83,670
./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13,R,48
,A,01,802,0.553615960099751,39.2418952618453,113.746657077655,54,76,672
./t/data/MC1082-M13F-20_B01.phd.1,MC1082-M13F-20_B01.ab1,MC1082,M13,F,20
,B,01,816,0.498774509803922,29.9803921568627,105.805750030073,143,236,43
7
...
seqtracestats ./t/data/*.phd.1 @./eg.config
Process those same files, taking command line options from
eg.con-
fig in the current directory.
seqtracestats ./t/data/*.phd.1
Process those same file, taking command line options from
seq-
tracestats.config in the current working directory.
AUTHOR
Malcolm Cook (mec at stowers-institute.org)
DEPENDENCIES
perl, BioPerl, bioperl-ext, Time::Seconds,
Statistics::Descriptive,
/usr/local/bin/find
Time::Piece
AVAILABILITY
Email the author for sources.
VERSION
$Revision: 1.4 $
TODO
include support for ABI files
support other output formats & destinations? tab v. csv format
v.
write to (mysql?) database. Use DBD::AnyData for implementation?
create textual report similar to ABI's software
allow for reporting on trimmed sequences
contibute patch to Bio::Tools::SeqStats to BioPerl
REFERENCES
see the unix command trev, part of staden package, for viewing
.scf
files under unix
regarding the conversion of ABI files using SEQIO
http://open-bio.org/pipermail/bioperl-l/2003-January/010749.html -
perl v5.8.6 2005-09-02
SEQTRACESTATS(1)
More information about the Bioperl-l
mailing list