[Bioperl-l] How to assess sequencing quality with Bioperl?
Andrew Walsh
walsh at cenix-bioscience.com
Fri Sep 2 11:34:07 EDT 2005
There is also a program from TIGR called 'lucy' that can be used to
process the output from phred. The Bioperl module for dealing with lucy
files is Bio::Tools::Lucy. If you use it, make sure to check out the
most recent version in CVS, as a bug was recently fixed.
lucy:
http://www.tigr.org/software/sequencing.shtml
Andrew
Cook, Malcolm wrote:
> Andrew,
>
> Following is a manpage for a program I wrote using Bioperl (and phred).
>
> If you'd like it, let me know, I'll slap our institute approved software
> license on it an put a `make install`able tarball on a webpage
> somewhere...
>
> Malcolm Cook - mec at stowers-institute.org - 816-926-4449
> Database Applications Manager - Bioinformatics
> Stowers Institute for Medical Research - Kansas City, MO USA
>
> ________________________________________________________________________
> _________
>
> SEQTRACESTATS(1) User Contributed Perl Documentation
> SEQTRACESTATS(1)
>
>
>
> SYNOPSIS
> seqtracestats --bins=max1,max2,max3 --informat trace_format
> --readname-
> convention attributelist=expression --mtime daycount --abi2phd <
> trace
> files > quality score summary
>
> Given a set of DNA sequence traces (or possibly a directory to
> find
> them in based on their age), produce summary descriptive
> statistics of
> the read quality (after optionally (re)base calling using phred).
>
> OPTIONS
> Their case is ignored, and they may be abbreviated to uniqueness
> (i.e.
> --v instead of --verbose).
>
> Options may be specified on the command line, and may optionally
> also
> be read from files by providing on the command line the path to
> the
> file preceded by a '@'. These option files provide simple access
> to
> typical calling scenarios (such as an analysis that is repeatedly
> invoked from the command line with the same parameters).
> Additionally,
> if the current directory contains a file named
> seqtracestats.config, it
> will be automatically used as an option file.
>
> --informat = PHD|SCF|ABI
> Format of input files. Defaults to PHD. Can be any
> Bio::SeqIO
> format that contains quality values, such as SCF.
>
> TODO: currently only PHD works. Get the staden integration
> for ABI
> and SCF files working with BioPerl.
>
> --bins = n,m,o,p
> A comma separated list of numbers which should be in
> increasing
> order. They define the ranges into which the sequence
> quality
> score for each base-pair read is binned; each number sets the
> maxi-
> mum allowable value in the bin, with the minimum value being
> deter-
> mined by the previous bin, if any.
>
> Note that any quality scores encountered which are greater
> than the
> largest bin will not be included in any bin. Otherwise, the
> sum of
> the values in each of the bins will equal the entire sequence
> length.
>
> I have not encountered any documentation specifying a maximum
> phred
> score assigned in practice by any base caller. However, I
> have
> never encountered a score, or a report of a score, greater
> than 50.
> So, you can pretty safely use 60 as your highest bin, which
> corre-
> sponds to an error probability of one in a million.
>
> Defaults to --bins=20,30,40,50,60.
>
> --mtimeatt timeattributes
> Allows specification of time attributes to appear in report.
>
> --readnameconvention attributelist=expression
> A method of decoding attributes such as are typically encoded
> in
> the name of files holding seqreads.
>
> Implemented as an association between a comma delimited list
> of
> attribute names and a perl expression which, when evaluated
> yeilds
> corresponding values for each attribute. The expression is
> evalu-
> ated in the context of $_ being the filename holding a
> seqread.
>
> Any attributes containing the word 'ignore' as a substring
> are
> supressed from the summary.
>
> For example:
>
> "--readnameconvention='gene,library,plate,primer,platewellco-
> ord,=split /[-_.]/'"
>
> when reading the file: VPL3-15-I-BSB460_A01.seq
>
> will assign to the current read attributes as follows:
>
> gene => VPL3
> library => 15
> plate => I
> primer => BSB460
> platewellcoord => A01
>
> Named aliases exist as shorthand methods for decoding
> commonly used
> naming conventions. The following are predefined:
>
> TODO: IMPLEMENT THESE - THEY ARE NOT.
>
> simr1
> Short for:
> '(gene,library,plate,primer,platewellcoord)=split
> /[-_.]/'
>
> --mtime
> Interpret the input files as directorys to search for ab1
> files
> created (modified) mtime days ago. If none supplied,
> defaults to
> current working directory.
>
> This value is used as --mtime argument to the unix find
> command
> (along with -daystart), and as such may be prefixed with '-'
> or
> '+', to mean 'less than' or 'greater than' mtime days ago.
>
> May be specified multiple times to produce data ranges. I.e.
> --mtime -31 --mtime +0 will select files between 1 and 30
> days old
> (inclusive)
>
> Implies --abi2phd
>
> --abi2phd
> Input files are ab1 chromatogram files and they should be
> replaced
> with corresponding quality file, generated using phred as
> needed.
>
> If there is an existing corresponding .phd (or .phd.1) file,
> it
> will be used.
>
> Otherwise, the ab1 files will be (re) base-called by phred,
> gener-
> ating sequence quality files in the same directory with
> addition
> extension of ".phd".
>
> Note: if the input files already have corresponding '.phd' or
> '.phd.1' file, it will NOT be re-basecalled and the existing
> QV
> scores will be used.
>
> --help
> Display command line usage with options.
>
> --man
> Display complete manual page and exit.
>
> --verbose
> Provides a trace of processing on STDERR.
>
> --version
> Display the scripts version number and exit.
>
> DESCRIPTION
> Output is in csv format (comma separated values) and may be
> visualied
> in Excel.
>
> TODO: Elaborate this description?
>
> EXAMPLES
> seqtracestats --mtime 1 > seqtracestats.txt
> Summarize the result of run phred on all ab1 files created
> yester-
> day which are subordinate to current directory, putting the
> results
> in file seqtracestats.txt
>
> ./seqtracestats --mtime -10 '/n/facility/Genomics/Sequenc-
> ing/2003/Molecular\ Biology/' --readname 'platerow,plate-
> col=m/_([A-H])(\d{2})\./' --readname
> ',,submitter,lab=reverse(split
> /\//)'
> Summarize reads less than 10 days old, parsing out platerow
> and
> plate col from the read filename, and interpreting selected
> direc-
> tory components as the submitter of the sequencing request,
> and the
> lab of which s/he is a member.
>
> seqtracestats --mtime +1,-30 ./2003 > ./2003/febqual.txt
> Summarize trace files that are less than 30 but more than 1
> days
> old which are within the directory ./2003 which is in turn
> within
> the current directory. Put the results in
> ./2003/febqual.txt.
>
> Note: if this example is preceded by
>
> cd /n/facility/Genomics/Sequencing/
>
> then additional parameters will be picked up from the file
> seq-
> tracestats.config in that directory. This file implements
> the
> standard read naming conventions.
>
> seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
> --read-
> nameconvention='gene,primer,platenumber,platecoord,=split
> /[-_.]/'
> Result of summarize the PHD formatted files into three bins,
> decod-
> ing the selected attributes from the read name:
>
>
> trace_file_path,display_id,gene,primer,platenumber,platecoord,clone,vect
> or,FR,plate_id,platerow,platecol,length,gc_content,mean,variance,20,30,6
> 0,20,30,60
>
> ./t/data/MC1081-M13F-20_A01.phd.1,MC1081-M13F-20_A01.ab1,MC1081,M13F,20,
> A01,MC1081,M13,F,20,A,01,807,0.563816604708798,39.2094175960347,118.5479
> 01273288,54,83,670,54,83,670
>
> ./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13R,48,
> A01,MC1081,M13,R,48,A,01,802,0.553615960099751,39.2418952618453,113.7466
> 57077655,54,76,672,54,76,672
>
> ./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13R,48,
> A01,MC1081,M13,R,48,A,01,802,0.553615960099751,39.2418952618453,113.7466
> 57077655,54,76,672,54,76,672
> ...
>
> seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
> --read-
> nameconvention='clone,vector,FR,plate_id,platerow,plate-
> col=m/^([^\-]+)-(\w+)([FR])-(\d+)_(\w{1})(\d+)\..*$/'
> Result of running on the same data but parsing the readname
> differ-
> ently, including pulling out read direction and plate row and
> col-
> umn:
>
>
> trace_file_path,display_id,clone,vector,FR,plate_id,platerow,platecol,le
> ngth,gc_content,mean,variance,num bp q<=20,num bp q<=30,num bp q<=60
>
> ./t/data/MC1081-M13F-20_A01.phd.1,MC1081-M13F-20_A01.ab1,MC1081,M13,F,20
> ,A,01,807,0.563816604708798,39.2094175960347,118.547901273288,54,83,670
>
> ./t/data/MC1081-M13R-48_A01.phd.1,MC1081-M13R-48_A01.ab1,MC1081,M13,R,48
> ,A,01,802,0.553615960099751,39.2418952618453,113.746657077655,54,76,672
>
> ./t/data/MC1082-M13F-20_B01.phd.1,MC1082-M13F-20_B01.ab1,MC1082,M13,F,20
> ,B,01,816,0.498774509803922,29.9803921568627,105.805750030073,143,236,43
> 7
> ...
>
> seqtracestats ./t/data/*.phd.1 @./eg.config
> Process those same files, taking command line options from
> eg.con-
> fig in the current directory.
>
> seqtracestats ./t/data/*.phd.1
> Process those same file, taking command line options from
> seq-
> tracestats.config in the current working directory.
>
> AUTHOR
> Malcolm Cook (mec at stowers-institute.org)
>
> DEPENDENCIES
> perl, BioPerl, bioperl-ext, Time::Seconds,
> Statistics::Descriptive,
> /usr/local/bin/find
>
> Time::Piece
>
> AVAILABILITY
> Email the author for sources.
>
> VERSION
> $Revision: 1.4 $
>
> TODO
> include support for ABI files
> support other output formats & destinations? tab v. csv format
> v.
> write to (mysql?) database. Use DBD::AnyData for implementation?
> create textual report similar to ABI's software
> allow for reporting on trimmed sequences
> contibute patch to Bio::Tools::SeqStats to BioPerl
>
> REFERENCES
> see the unix command trev, part of staden package, for viewing
> .scf
> files under unix
> regarding the conversion of ABI files using SEQIO
>
> http://open-bio.org/pipermail/bioperl-l/2003-January/010749.html -
>
>
>
> perl v5.8.6 2005-09-02
> SEQTRACESTATS(1)
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
--
------------------------------------------------------------------
Andrew Walsh, M.Sc.
Bioinformatics Software Engineer
IT Unit
Cenix BioScience GmbH
Tatzberg 47
01307 Dresden
Germany
Tel. +49-351-4173 137
Fax +49-351-4173 109
public key: http://www.cenix-bioscience.com/public_keys/walsh.gpg
------------------------------------------------------------------
More information about the Bioperl-l
mailing list