[Biopython-dev] [Bug 2767] New: Bio.SeqIO support for FASTQ and QUAL files

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Thu Feb 19 17:37:56 UTC 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2767

           Summary: Bio.SeqIO support for FASTQ and QUAL files
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


This is an enhancement bug for adding support to Bio.SeqIO for two commonly
used file formats for storing sequencing quality information (i.e. error
rates).

The format FASTQ (or FastQ) contains both sequences and PHREP style quality
scores.  This file format appears to have been introduced at the Sanger Centre,
but there is no official specification that I am aware of.  I would suggest for
Bio.SeqIO we call this format "fastq" (as in BioPerl). See:
http://maq.sourceforge.net/fastq.shtml
http://www.bioperl.org/wiki/FASTQ_sequence_format

Also note that Solexa/Illumina sequencers can produce FASTQ-like files which
use a different score mapping and are therefore cannot be treated in the same
way.  These would have to be treated as a different file format (e.g. Bio.SeqIO
format name "fastq-solexa" might do).

QUAL or qual files do not contain sequences but just the PHREP style quality
score.  Roche 454 sequencers also appear to use this style file (see also Bug
2382), where again I believe that PHREP style scores are used.  Because they
don't hold the actual sequence, Qual files normally come with a matching FASTA
file containing the sequence for each entry (in the same order within the
file).  I would suggest we call this the "qual" format in Bio.SeqIO (to match
BioPerl).  See:
http://www.bioperl.org/wiki/Qual_sequence_format
http://www.cees.uio.no/research/facilities/roche454/resultsfiles.html

I will attach a preliminary set of code to support this shortly.  For the
"qual" format Bio.SeqIO would return SeqRecord objects without any sequence
(perhaps as None, although we do know the sequence length...).  For both the
"qual" and "fastq" formats the SeqRecord object would need to store the PHRED
quality scores, ideally as a list of integers.

Where we put this information is open to debate.  The simple option is to just
add the list of integers to the annotation dictionary, perhaps under key name
"phred_quality" (with "solexa_quality" used when parsing a Solexa/Illumina
style FASTQ file).  This will then work with BioSQL (although the qualities
will get stored in the database as strings rather than integers).  However,
this does not facilitate slicing a SeqRecord (i.e. it would make implementing
enhancement Bug 2507 much harder).

In order to use a paired "fasta" and "qual" file you might do this:

def merge_fasta_qual(fasta_record, qual_record) :
    """Modifies the fasta_record in place, and also returns it."""
    assert fasta_record.id == qual_record.id
    assert len(f_rec) == len(q_rec.annotations["phred_quality"])
    f_rec.annotations["phred_quality"] = q_rec.annotations["phred_quality"]
    return f_rec

from Bio import SeqIO
records = [merge_fasta_qual(f_rec, q_rec) for (f_rec, q_rec) in \
           zip(SeqIO.parse(open("example.fasta"), "fasta"),
               SeqIO.parse(open("example.qual"), "qual"))]

I think it would probably make sense to offer this kind of functionality in the
Bio.SeqIO.QualityIO module itself, as this code above has several draw backs
(e.g. the zip makes a list in memory, rather than a generator).


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.



More information about the Biopython-dev mailing list