[Bioperl-l] RE: not all sequence is created equal (base quali ty d

Malcolm Cook mcook@dna.com
Thu, 28 Jun 2001 14:28:14 -0700


Hmmm....

re: "quality ... (is) typically represented in a fasta-like format such that
you have a comment line and then the data consists of space-delimited
numbers, one per corresponding base"

Phred can also now generate (instead) a single (.phd) file containing both
bases called and quality score for the base side-by-side (along with trace
position).  This happens if you use the -p option to phred, documented as:

   -p                           Write a PHD file, which is used by the
                                consed editor to display bases.  A PHD
                                file contains a set of comments used by
                                consed for maintaining consistency between
                                the chromat file, the .ace file and
                                the PHD file, and it contains base data
                                as triples consisting of the base call,
                                quality, and position.  ....

I've attached for general interest a BioPerl, Bio::SeqIO::Phd, module for
writing these file.  However, this method as I implemented it does NOT write
a seperate quality score for each base, but, simply the same value again and
again as supplied in a QUALITY_SCORE parameter to the _initialize method.
This was all I needed for my appliction (making dummy traces holding
'reference sequence' entries to assemble against in Phrap for a SNP
detection appliction).  In any case, having a single QUALITY)SCORE for the
entire sequence is more or less akin to Jason's suggestion of:

 - a SeqFeature which spanned the entire sequence and had the
   primary tag 'quality' and a value of the sequence quality.

Note also my implementation of Bio::SeqIO::Phd implements a hack to map
selected Bio::Seq features to PHD files 'TAG' structures (formated more or
less as GFF minust the locations, which appear elsewhere in the 'TAG'
record).  Doing this allows the DNA analyst to edit/QA the assembly in the
context of features specifying the reference sequence's genomic structure.

I would  like to work with anyone who cares to implement the next_seq method
for it.  I don't need it, but, they go together and would make a nice
submission to BioPerl.  Perhaps Chad Matsala would find this a good
collab???

Regards,

Malcolm

Oh, FYI, the .phd files look like this (my example has some extra tags at
end as written by polyseq, the het detection pkg):

BEGIN_SEQUENCE scn5a_ex10_102_f
BEGIN_COMMENT
CHROMAT_FILE: scn5a_ex10_102_f
ABI_THUMBPRINT: 066346015071001314340165223200
PHRED_VERSION: 0.000925.c
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Mon Apr  9 15:35:29 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 5940
TRIM: 10 303 0.0500
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 9 8
t 9 15
t 9 18
t 9 22
a 4 48
a 4 66

...etc etc

g 6 5933
END_DNA
END_SEQUENCE

WR{
template determineReadTypes 010409:153646
name: scn5a_ex10
type: pcr
}

WR{
primer determineReadTypes 010409:153646
type: univ fwd
}

BEGIN_TAG
TYPE: homozygoteCC 
SOURCE: polyPhred
UNPADDED_READ_POS: 209 209 
DATE: 04/09/01 15:38:42 
END_TAG

BEGIN_TAG
TYPE: dataNeeded 
SOURCE: polyPhred
UNPADDED_READ_POS: 1 10 
DATE: 04/09/01 15:38:42 
END_TAG

BEGIN_TAG
TYPE: dataNeeded 
SOURCE: polyPhred
UNPADDED_READ_POS: 311 487 
DATE: 04/09/01 15:38:42 
END_TAG