[Bioperl-l] How to assess sequencing quality with Bioperl?

Fri Sep 2 11:34:07 EDT 2005

There is also a program from TIGR called 'lucy' that can be used to 
process the output from phred.  The Bioperl module for dealing with lucy 
files is Bio::Tools::Lucy.  If you use it, make sure to check out the 
most recent version in CVS, as a bug was recently fixed.

lucy:
http://www.tigr.org/software/sequencing.shtml

Andrew

Cook, Malcolm wrote:
> Andrew,
> 
> Following is a manpage for a program I wrote using Bioperl (and phred).
> 
> If you'd like it, let me know, I'll slap our institute approved software
> license on it an put a `make install`able tarball on a webpage
> somewhere...
> 
> Malcolm Cook - mec at stowers-institute.org - 816-926-4449
> Database Applications Manager - Bioinformatics
> Stowers Institute for Medical Research - Kansas City, MO  USA 
> 
> ________________________________________________________________________
> _________
> 
> SEQTRACESTATS(1)      User Contributed Perl Documentation
> SEQTRACESTATS(1)
> 
> 
> 
> SYNOPSIS
>        seqtracestats --bins=max1,max2,max3 --informat trace_format
> --readname-
>        convention attributelist=expression  --mtime daycount --abi2phd <
> trace
>        files > quality score summary
> 
>        Given a set of DNA sequence traces (or possibly a directory to
> find
>        them in based on their age), produce summary descriptive
> statistics of
>        the read quality (after optionally (re)base calling using phred).
> 
> OPTIONS
>        Their case is ignored, and they may be abbreviated to uniqueness
> (i.e.
>        --v instead of --verbose).
> 
>        Options may be specified on the command line, and may optionally
> also
>        be read from files by providing on the command line the path to
> the
>        file preceded by a '@'.  These option files provide simple access
> to
>        typical calling scenarios (such as an analysis that is repeatedly
>        invoked from the command line with the same parameters).
> Additionally,
>        if the current directory contains a file named
> seqtracestats.config, it
>        will be automatically used as an option file.
> 
>        --informat = PHD|SCF|ABI
>            Format of input files.  Defaults to PHD.  Can be any
> Bio::SeqIO
>            format that contains quality values, such as SCF.
> 
>            TODO: currently only PHD works.  Get the staden integration
> for ABI
>            and SCF files working with BioPerl.
> 
>        --bins = n,m,o,p
>            A comma separated list of numbers which should be in
> increasing
>            order.  They define the ranges into which the sequence
> quality
>            score for each base-pair read is binned; each number sets the
> maxi-
>            mum allowable value in the bin, with the minimum value being
> deter-
>            mined by the previous bin, if any.
> 
>            Note that any quality scores encountered which are greater
> than the
>            largest bin will not be included in any bin.  Otherwise, the
> sum of
>            the values in each of the bins will equal the entire sequence
>            length.
> 
>            I have not encountered any documentation specifying a maximum
> phred
>            score assigned in practice by any base caller.  However, I
> have
>            never encountered a score, or a report of a score, greater
> than 50.
>            So, you can pretty safely use 60 as your highest bin, which
> corre-
>            sponds to an error probability of one in a million.
> 
>            Defaults to --bins=20,30,40,50,60.
> 
>        --mtimeatt timeattributes
>            Allows specification of time attributes to appear in report.
> 
>        --readnameconvention attributelist=expression
>            A method of decoding attributes such as are typically encoded
> in
>            the name of files holding seqreads.
> 
>            Implemented as an association between a comma delimited list
> of
>            attribute names and a perl expression which, when evaluated
> yeilds
>            corresponding values for each attribute.  The expression is
> evalu-
>            ated in the context of $_ being the filename holding a
> seqread.
> 
>            Any attributes containing the word 'ignore' as a substring
> are
>            supressed from the summary.
> 
>            For example:
> 
>            "--readnameconvention='gene,library,plate,primer,platewellco-
>            ord,=split /[-_.]/'"
> 
>            when reading the file: VPL3-15-I-BSB460_A01.seq
> 
>            will assign to the current read attributes as follows:
> 
>                    gene => VPL3
>                    library => 15
>                    plate => I
>                    primer => BSB460
>                    platewellcoord => A01
> 
>            Named aliases exist as shorthand methods for decoding
> commonly used
>            naming conventions.  The following are predefined:
> 
>            TODO: IMPLEMENT THESE - THEY ARE NOT.
> 
>            simr1
>                Short for:
> '(gene,library,plate,primer,platewellcoord)=split
>                /[-_.]/'
> 
>        --mtime
>            Interpret the input files as directorys to search for ab1
> files
>            created (modified) mtime days ago.  If none supplied,
> defaults to
>            current working directory.
> 
>            This value is used as --mtime argument to the unix find
> command
>            (along with -daystart), and as such may be prefixed with '-'
> or
>            '+', to mean 'less than' or 'greater than' mtime days ago.
> 
>            May be specified multiple times to produce data ranges.  I.e.
>            --mtime -31 --mtime +0 will select files between 1 and 30
> days old
>            (inclusive)
> 
>            Implies --abi2phd
> 
>        --abi2phd
>            Input files are ab1 chromatogram files and they should be
> replaced
>            with corresponding quality file, generated using phred as
> needed.
> 
>            If there is an existing corresponding .phd (or .phd.1) file,
> it
>            will be used.
> 
>            Otherwise, the ab1 files will be (re) base-called by phred,
> gener-
>            ating sequence quality files in the same directory with
> addition
>            extension of ".phd".
> 
>            Note: if the input files already have corresponding '.phd' or
>            '.phd.1' file, it will NOT be re-basecalled and the existing
> QV
>            scores will be used.
> 
>        --help
>            Display command line usage with options.
> 
>        --man
>            Display complete manual page and exit.
> 
>        --verbose
>            Provides a trace of processing on STDERR.
> 
>        --version
>            Display the scripts version number and exit.
> 
> DESCRIPTION
>        Output is in csv format (comma separated values) and may be
> visualied
>        in Excel.
> 
>        TODO: Elaborate this description?
> 
> EXAMPLES
>        seqtracestats --mtime 1 > seqtracestats.txt
>            Summarize the result of run phred on all ab1 files created
> yester-
>            day which are subordinate to current directory, putting the
> results
>            in file seqtracestats.txt
> 
>        ./seqtracestats --mtime -10 '/n/facility/Genomics/Sequenc-
>        ing/2003/Molecular\ Biology/' --readname 'platerow,plate-
>        col=m/_([A-H])(\d{2})\./' --readname
> ',,submitter,lab=reverse(split
>        /\//)'
>            Summarize reads less than 10 days old, parsing out platerow
> and
>            plate col from the read filename, and interpreting selected
> direc-
>            tory components as the submitter of the sequencing request,
> and the
>            lab of which s/he is a member.
> 
>        seqtracestats --mtime +1,-30 ./2003 > ./2003/febqual.txt
>            Summarize trace files that are less than 30 but more than 1
> days
>            old which are within the directory ./2003 which is in turn
> within
>            the current directory.  Put the results in
> ./2003/febqual.txt.
> 
>            Note: if this example is preceded by
> 
>            cd /n/facility/Genomics/Sequencing/
> 
>            then additional parameters will be picked up from the file
> seq-
>            tracestats.config in that directory.  This file implements
> the
>            standard read naming conventions.
> 
>        seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
> --read-
>        nameconvention='gene,primer,platenumber,platecoord,=split
> /[-_.]/'
>            Result of summarize the PHD formatted files into three bins,
> decod-
>            ing the selected attributes from the read name:
> 
>  
> trace_file_path,display_id,gene,primer,platenumber,platecoord,clone,vect
> or,FR,plate_id,platerow,platecol,length,gc_content,mean,variance,20,30,6
> 0,20,30,60
>  
> ./t/data/MC1081-M13F-20_A01.phd.1,MC1081-M13F-20_A01.ab1,MC1081,M13F,20,
> A01,MC1081,M13,F,20,A,01,807,0.563816604708798,39.2094175960347,118.5479
> 01273288,54,83,670,54,83,670
>  
> ./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13R,48,
> A01,MC1081,M13,R,48,A,01,802,0.553615960099751,39.2418952618453,113.7466
> 57077655,54,76,672,54,76,672
>  
> ./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13R,48,
> A01,MC1081,M13,R,48,A,01,802,0.553615960099751,39.2418952618453,113.7466
> 57077655,54,76,672,54,76,672
>             ...
> 
>        seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
> --read-
>        nameconvention='clone,vector,FR,plate_id,platerow,plate-
>        col=m/^([^\-]+)-(\w+)([FR])-(\d+)_(\w{1})(\d+)\..*$/'
>            Result of running on the same data but parsing the readname
> differ-
>            ently, including pulling out read direction and plate row and
> col-
>            umn:
> 
>  
> trace_file_path,display_id,clone,vector,FR,plate_id,platerow,platecol,le
> ngth,gc_content,mean,variance,num bp q<=20,num bp q<=30,num bp q<=60
>  
> ./t/data/MC1081-M13F-20_A01.phd.1,MC1081-M13F-20_A01.ab1,MC1081,M13,F,20
> ,A,01,807,0.563816604708798,39.2094175960347,118.547901273288,54,83,670
>  
> ./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13,R,48
> ,A,01,802,0.553615960099751,39.2418952618453,113.746657077655,54,76,672
>  
> ./t/data/MC1082-M13F-20_B01.phd.1,MC1082-M13F-20_B01.ab1,MC1082,M13,F,20
> ,B,01,816,0.498774509803922,29.9803921568627,105.805750030073,143,236,43
> 7
>             ...
> 
>        seqtracestats ./t/data/*.phd.1 @./eg.config
>            Process those same files, taking command line options from
> eg.con-
>            fig in the current directory.
> 
>        seqtracestats ./t/data/*.phd.1
>            Process those same file, taking command line options from
> seq-
>            tracestats.config in the current working directory.
> 
> AUTHOR
>        Malcolm Cook (mec at stowers-institute.org)
> 
> DEPENDENCIES
>        perl, BioPerl, bioperl-ext,  Time::Seconds,
> Statistics::Descriptive,
>        /usr/local/bin/find
> 
>        Time::Piece
> 
> AVAILABILITY
>        Email the author for sources.
> 
> VERSION
>        $Revision: 1.4 $
> 
> TODO
>        include support for ABI files
>        support other output formats & destinations?  tab v. csv format
> v.
>        write to (mysql?) database.  Use DBD::AnyData for implementation?
>        create textual report similar to ABI's software
>        allow for reporting on trimmed sequences
>        contibute patch to Bio::Tools::SeqStats to BioPerl
> 
> REFERENCES
>        see the unix command trev, part of staden package, for viewing
> .scf
>        files under unix
>        regarding the conversion of ABI files using SEQIO
>  
> http://open-bio.org/pipermail/bioperl-l/2003-January/010749.html -
> 
> 
> 
> perl v5.8.6                       2005-09-02
> SEQTRACESTATS(1)
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> 

-- 
------------------------------------------------------------------
Andrew Walsh, M.Sc.
Bioinformatics Software Engineer
IT Unit
Cenix BioScience GmbH
Tatzberg 47
01307 Dresden
Germany
Tel. +49-351-4173 137
Fax  +49-351-4173 109

public key: http://www.cenix-bioscience.com/public_keys/walsh.gpg
------------------------------------------------------------------