[Bioperl-l] How to assess sequencing quality with Bioperl?

Fri Sep 2 10:24:08 EDT 2005

Andrew,

Following is a manpage for a program I wrote using Bioperl (and phred).

If you'd like it, let me know, I'll slap our institute approved software
license on it an put a `make install`able tarball on a webpage
somewhere...

Malcolm Cook - mec at stowers-institute.org - 816-926-4449
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City, MO  USA 

________________________________________________________________________
_________

SEQTRACESTATS(1)      User Contributed Perl Documentation
SEQTRACESTATS(1)

SYNOPSIS
       seqtracestats --bins=max1,max2,max3 --informat trace_format
--readname-
       convention attributelist=expression  --mtime daycount --abi2phd <
trace
       files > quality score summary

       Given a set of DNA sequence traces (or possibly a directory to
find
       them in based on their age), produce summary descriptive
statistics of
       the read quality (after optionally (re)base calling using phred).

OPTIONS
       Their case is ignored, and they may be abbreviated to uniqueness
(i.e.
       --v instead of --verbose).

       Options may be specified on the command line, and may optionally
also
       be read from files by providing on the command line the path to
the
       file preceded by a '@'.  These option files provide simple access
to
       typical calling scenarios (such as an analysis that is repeatedly
       invoked from the command line with the same parameters).
Additionally,
       if the current directory contains a file named
seqtracestats.config, it
       will be automatically used as an option file.

       --informat = PHD|SCF|ABI
           Format of input files.  Defaults to PHD.  Can be any
Bio::SeqIO
           format that contains quality values, such as SCF.

           TODO: currently only PHD works.  Get the staden integration
for ABI
           and SCF files working with BioPerl.

       --bins = n,m,o,p
           A comma separated list of numbers which should be in
increasing
           order.  They define the ranges into which the sequence
quality
           score for each base-pair read is binned; each number sets the
maxi-
           mum allowable value in the bin, with the minimum value being
deter-
           mined by the previous bin, if any.

           Note that any quality scores encountered which are greater
than the
           largest bin will not be included in any bin.  Otherwise, the
sum of
           the values in each of the bins will equal the entire sequence
           length.

           I have not encountered any documentation specifying a maximum
phred
           score assigned in practice by any base caller.  However, I
have
           never encountered a score, or a report of a score, greater
than 50.
           So, you can pretty safely use 60 as your highest bin, which
corre-
           sponds to an error probability of one in a million.

           Defaults to --bins=20,30,40,50,60.

       --mtimeatt timeattributes
           Allows specification of time attributes to appear in report.

       --readnameconvention attributelist=expression
           A method of decoding attributes such as are typically encoded
in
           the name of files holding seqreads.

           Implemented as an association between a comma delimited list
of
           attribute names and a perl expression which, when evaluated
yeilds
           corresponding values for each attribute.  The expression is
evalu-
           ated in the context of $_ being the filename holding a
seqread.

           Any attributes containing the word 'ignore' as a substring
are
           supressed from the summary.

           For example:

           "--readnameconvention='gene,library,plate,primer,platewellco-
           ord,=split /[-_.]/'"

           when reading the file: VPL3-15-I-BSB460_A01.seq

           will assign to the current read attributes as follows:

                   gene => VPL3
                   library => 15
                   plate => I
                   primer => BSB460
                   platewellcoord => A01

           Named aliases exist as shorthand methods for decoding
commonly used
           naming conventions.  The following are predefined:

           TODO: IMPLEMENT THESE - THEY ARE NOT.

           simr1
               Short for:
'(gene,library,plate,primer,platewellcoord)=split
               /[-_.]/'

       --mtime
           Interpret the input files as directorys to search for ab1
files
           created (modified) mtime days ago.  If none supplied,
defaults to
           current working directory.

           This value is used as --mtime argument to the unix find
command
           (along with -daystart), and as such may be prefixed with '-'
or
           '+', to mean 'less than' or 'greater than' mtime days ago.

           May be specified multiple times to produce data ranges.  I.e.
           --mtime -31 --mtime +0 will select files between 1 and 30
days old
           (inclusive)

           Implies --abi2phd

       --abi2phd
           Input files are ab1 chromatogram files and they should be
replaced
           with corresponding quality file, generated using phred as
needed.

           If there is an existing corresponding .phd (or .phd.1) file,
it
           will be used.

           Otherwise, the ab1 files will be (re) base-called by phred,
gener-
           ating sequence quality files in the same directory with
addition
           extension of ".phd".

           Note: if the input files already have corresponding '.phd' or
           '.phd.1' file, it will NOT be re-basecalled and the existing
QV
           scores will be used.

       --help
           Display command line usage with options.

       --man
           Display complete manual page and exit.

       --verbose
           Provides a trace of processing on STDERR.

       --version
           Display the scripts version number and exit.

DESCRIPTION
       Output is in csv format (comma separated values) and may be
visualied
       in Excel.

       TODO: Elaborate this description?

EXAMPLES
       seqtracestats --mtime 1 > seqtracestats.txt
           Summarize the result of run phred on all ab1 files created
yester-
           day which are subordinate to current directory, putting the
results
           in file seqtracestats.txt

       ./seqtracestats --mtime -10 '/n/facility/Genomics/Sequenc-
       ing/2003/Molecular\ Biology/' --readname 'platerow,plate-
       col=m/_([A-H])(\d{2})\./' --readname
',,submitter,lab=reverse(split
       /\//)'
           Summarize reads less than 10 days old, parsing out platerow
and
           plate col from the read filename, and interpreting selected
direc-
           tory components as the submitter of the sequencing request,
and the
           lab of which s/he is a member.

       seqtracestats --mtime +1,-30 ./2003 > ./2003/febqual.txt
           Summarize trace files that are less than 30 but more than 1
days
           old which are within the directory ./2003 which is in turn
within
           the current directory.  Put the results in
./2003/febqual.txt.

           Note: if this example is preceded by

           cd /n/facility/Genomics/Sequencing/

           then additional parameters will be picked up from the file
seq-
           tracestats.config in that directory.  This file implements
the
           standard read naming conventions.

       seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
--read-
       nameconvention='gene,primer,platenumber,platecoord,=split
/[-_.]/'
           Result of summarize the PHD formatted files into three bins,
decod-
           ing the selected attributes from the read name:

trace_file_path,display_id,gene,primer,platenumber,platecoord,clone,vect
or,FR,plate_id,platerow,platecol,length,gc_content,mean,variance,20,30,6
0,20,30,60

./t/data/MC1081-M13F-20_A01.phd.1,MC1081-M13F-20_A01.ab1,MC1081,M13F,20,
A01,MC1081,M13,F,20,A,01,807,0.563816604708798,39.2094175960347,118.5479
01273288,54,83,670,54,83,670

./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13R,48,
A01,MC1081,M13,R,48,A,01,802,0.553615960099751,39.2418952618453,113.7466
57077655,54,76,672,54,76,672

./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13R,48,
A01,MC1081,M13,R,48,A,01,802,0.553615960099751,39.2418952618453,113.7466
57077655,54,76,672,54,76,672
            ...

       seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
--read-
       nameconvention='clone,vector,FR,plate_id,platerow,plate-
       col=m/^([^\-]+)-(\w+)([FR])-(\d+)_(\w{1})(\d+)\..*$/'
           Result of running on the same data but parsing the readname
differ-
           ently, including pulling out read direction and plate row and
col-
           umn:

trace_file_path,display_id,clone,vector,FR,plate_id,platerow,platecol,le
ngth,gc_content,mean,variance,num bp q<=20,num bp q<=30,num bp q<=60

./t/data/MC1081-M13F-20_A01.phd.1,MC1081-M13F-20_A01.ab1,MC1081,M13,F,20
,A,01,807,0.563816604708798,39.2094175960347,118.547901273288,54,83,670

./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13,R,48
,A,01,802,0.553615960099751,39.2418952618453,113.746657077655,54,76,672

./t/data/MC1082-M13F-20_B01.phd.1,MC1082-M13F-20_B01.ab1,MC1082,M13,F,20
,B,01,816,0.498774509803922,29.9803921568627,105.805750030073,143,236,43
7
            ...

       seqtracestats ./t/data/*.phd.1 @./eg.config
           Process those same files, taking command line options from
eg.con-
           fig in the current directory.

       seqtracestats ./t/data/*.phd.1
           Process those same file, taking command line options from
seq-
           tracestats.config in the current working directory.

AUTHOR
       Malcolm Cook (mec at stowers-institute.org)

DEPENDENCIES
       perl, BioPerl, bioperl-ext,  Time::Seconds,
Statistics::Descriptive,
       /usr/local/bin/find

       Time::Piece

AVAILABILITY
       Email the author for sources.

VERSION
       $Revision: 1.4 $

TODO
       include support for ABI files
       support other output formats & destinations?  tab v. csv format
v.
       write to (mysql?) database.  Use DBD::AnyData for implementation?
       create textual report similar to ABI's software
       allow for reporting on trimmed sequences
       contibute patch to Bio::Tools::SeqStats to BioPerl

REFERENCES
       see the unix command trev, part of staden package, for viewing
.scf
       files under unix
       regarding the conversion of ABI files using SEQIO

http://open-bio.org/pipermail/bioperl-l/2003-January/010749.html -

perl v5.8.6                       2005-09-02
SEQTRACESTATS(1)