[Biopython-dev] 454 GSFlex quality score files

Peter biopython-dev at maubp.freeserve.co.uk
Tue Oct 16 16:50:15 UTC 2007


Hi Jared,

>>> I have also needed to create a modified FASTA parser so that I can  
>>> read things like quality score files.
>>
>> Could you be a little more specific - what exactly do you mean by
>> quality score files (links and/or examples)?  It may be that this
>> warrants setting up a new file format in Bio.SeqIO
> 
> That is what I did. The quality score files I meant are simply
> FASTA-like records that indicate the quality of each base pair read
> from a sequencing machine, on a scale of something like 1 to 64. The
> values are tab separated and correspond to 'reads' in another FASTA
> file that contains the actual sequences read. This is the way the 454
> GSFlex machines output their sequencing reads, so for every set of
> reads there will be a pair of 454Reads.fna, 454Reads.qual files. The
> only difference between a parser that processes these qual files and
> one that processes the sequence files is that it shouldn't get rid of
> spaces, and the newlines should not be stripped but converted into
> spaces (when 454 writes a newline of scores they omit the space).
> Essentially I have made a duplicate of FastaIO's iterator, named it
> something else, made these two small changes and put an entry for it
> in the SeqIO file.

Patches and emails don't do well together.  Could you file an 
enhancement bug, and then upload your code as an attachment?  If you 
have a few examples of matched pairs of FASTA files and quality files 
which you can contribute that would be very helpful too.

It looks like you are trying to construct a "sequence" of numerical
values (rather than a sequence of letters like nucleotides/amino acids).
As written I don't think it would work for element access/slicing
etc. However, with some extra work I suppose we could stretch the Seq
object in this way - and define a new "IntegerAlphabet".

But on balance, I don't think "lists of quality values" should be 
treated in the same way as sequences (and thus it doesn't seem to belong 
in Bio.SeqIO).

Alternatively you could regard the quality scores as sequence meta-data 
or annotation.  One idea would be to generate SeqRecord objects 
containing dummy sequences of the correct length made up of the 
ambiguous character "N", with the associated quality scores held as a 
list of integers in the SeqRecord's annotation dictionary.  Then it 
would fit into the Bio.SeqIO framework [I was thinking of something 
similar for PTT files, NCBI Protein tables, where again we have 
annotation but not the actual sequence].
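
To make the placeholder idea a bit more concrete, here is a rough
sketch.  It uses a small stand-in class rather than the real SeqRecord
so that it runs without Biopython; the class name QualRecord, the
function name and the "quality" annotation key are all made up for
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QualRecord:
    """Hypothetical stand-in for a SeqRecord (not the Biopython class)."""
    id: str
    seq: str
    annotations: dict = field(default_factory=dict)


def placeholder_record(name, scores):
    """Build a record whose sequence is all "N", one letter per score."""
    record = QualRecord(id=name, seq="N" * len(scores))
    # The "quality" key is an assumption, not an established convention.
    record.annotations["quality"] = list(scores)
    return record


# Made-up read name and scores, purely for demonstration.
example = placeholder_record("read_1", [40, 38, 12, 7])
print(example.seq, example.annotations["quality"])
```

The dummy sequence carries no information beyond its length, which is
the main weakness of this approach.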

Maybe there should just be a separate parser for GSFlex quality records
which returns an iterator giving each record name with a list of
integers. A more elegant scheme would read in the pair of files together
(the FASTA file and the quality file) and generate nicely annotated
SeqRecords with the sequence and the quality.  This isn't really
possible with the Bio.SeqIO framework.
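
For what it's worth, a stand-alone version could look something like
the sketch below.  It assumes the layout you describe (a ">" title line
followed by whitespace separated integer scores, possibly wrapped over
several lines) and that both files list the reads in the same order.
The function names and the example data are made up, and it does not
use Bio.SeqIO at all.

```python
import io


def read_blocks(handle):
    """Yield (name, list of data lines) for each ">" record."""
    name, lines = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, lines
            # Take the first word of the title line as the record name.
            words = line[1:].split()
            name = words[0] if words else ""
            lines = []
        elif name is not None and line.strip():
            lines.append(line)
    if name is not None:
        yield name, lines


def parse_qual(handle):
    """Yield (name, list of integer scores) for each quality record."""
    for name, lines in read_blocks(handle):
        # Joining with a space stops the last score on one line from
        # running into the first score on the next line.
        yield name, [int(value) for value in " ".join(lines).split()]


def parse_paired(fasta_handle, qual_handle):
    """Yield (name, sequence, scores), assuming matching read order.

    Note that zip() stops at the shorter file without complaint.
    """
    pairs = zip(read_blocks(fasta_handle), parse_qual(qual_handle))
    for (name, seq_lines), (qual_name, scores) in pairs:
        seq = "".join(seq_lines).replace(" ", "")
        if name != qual_name:
            raise ValueError("Name mismatch: %r vs %r" % (name, qual_name))
        if len(seq) != len(scores):
            raise ValueError("Length mismatch for %r" % name)
        yield name, seq, scores


# Made-up data standing in for a 454Reads.fna / 454Reads.qual pair.
fna = io.StringIO(">read_1 length=6\nACGTAC\n>read_2 length=4\nGG\nTT\n")
qual = io.StringIO(
    ">read_1 length=6\n40 38 12\n7 30 31\n>read_2 length=4\n20\t21\t22\t23\n"
)
reads = list(parse_paired(fna, qual))
```

Each item could then be turned into a SeqRecord with the scores stored
as annotation, but that step would sit outside the Bio.SeqIO parsers.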

Peter


