[Open-bio-l] Naming for FASTQ example files

Peter biopython at maubp.freeserve.co.uk
Sat Aug 8 12:53:17 UTC 2009


On Thu, Aug 6, 2009 at 9:17 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I am planning on compiling a set of FASTQ files, for use by
> Biopython, BioPerl, EMBOSS and anyone else that wants to test a
> parser. Modest size contributions will be welcome (no big files
> though).
>
> I will have two types of files: valid ones, and invalid ones. The
> basic idea is any parser should understand what we consider to be
> valid files (we may need to provide matching FASTA and QUAL files or
> something like this for verification), but also reject all the files
> we consider to be invalid.
>
> Regarding names, does "error_*.fastq" or "invalid_*.fastq" sound fine?
>
> Any preference for meaningful names ("error_qual_short.fastq",
> "error_qual_bad_char.fastq", ...) versus numbers ("error_001.fastq",
> "error_002.fastq", ...). Either way I think a README file would need
> to accompany the dataset stating what we think makes each example
> invalid (e.g. quality string shorter than sequence, invalid character
> in quality string, ...).

I've gone for "error_*.fastq" and have tried to use meaningful names
rather than numbers. Currently these files are only in the Biopython
repository (under biopython/Tests/Quality), but could be added to the
(currently) unused Biodata repository - although that is still on CVS:

http://lists.open-bio.org/pipermail/open-bio-l/2009-January/000511.html

As these examples are all small and we don't expect to change them,
I could also just email them (off the mailing list) to EMBOSS/BioPerl
people directly on request.

Currently my error examples are as follows, broken down into groups.

Quality strings with invalid ASCII characters (not the full set, but
we could do that):

error_qual_null.fastq
error_qual_vtab.fastq
error_qual_tab.fastq
error_qual_escape.fastq
error_qual_unit_sep.fastq
error_qual_space.fastq
error_qual_del.fastq
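
A parser can reject this group with a simple range check. The following
is only a minimal sketch, not what Biopython actually does: it assumes
quality characters must be printable ASCII in the range 33 to 126, and
the function name is made up for illustration.

```python
def invalid_quality_chars(quality):
    """Return the sorted ASCII codes in `quality` that fall outside 33..126.

    The 33..126 range is an assumption of this sketch (printable ASCII
    without the space character).
    """
    return sorted({ord(ch) for ch in quality if not 33 <= ord(ch) <= 126})


# Code points implied by the file names in this group.
examples = {
    "null": "\x00",
    "tab": "\t",
    "vtab": "\x0b",
    "escape": "\x1b",
    "unit_sep": "\x1f",
    "space": " ",
    "del": "\x7f",
}

for name, ch in examples.items():
    # Embed the bad character in an otherwise valid quality string.
    print(name, invalid_quality_chars("IIII" + ch + "IIII"))
```

Returning the offending codes, rather than just True or False, makes it
easier to write a useful error message.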

Misc errors:

error_diff_ids.fastq
error_spaces.fastq
error_tabs.fastq
error_short_qual.fastq
error_long_qual.fastq
error_no_qual.fastq
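
Most of these can be caught by checks on a single record. A rough sketch
follows, assuming the simple four lines per record layout (no line
wrapping) and assuming that spaces or tabs inside the sequence or quality
line should be rejected. The function name and the messages are
hypothetical, and which check each file above trips is not spelled out
here.

```python
def check_record(title, seq, plus, qual):
    """Return a list of problems found in one four-line FASTQ record."""
    problems = []
    if not title.startswith("@"):
        problems.append("title line does not start with '@'")
    if not plus.startswith("+"):
        problems.append("separator line does not start with '+'")
    elif len(plus) > 1 and plus[1:] != title[1:]:
        # The identifier on the '+' line is optional, but if present
        # it should repeat the one on the '@' line.
        problems.append("identifier after '+' differs from the '@' line")
    if any(ch in " \t" for ch in seq + qual):
        problems.append("space or tab inside sequence or quality")
    if not qual:
        problems.append("quality string is missing")
    elif len(qual) < len(seq):
        problems.append("quality string shorter than sequence")
    elif len(qual) > len(seq):
        problems.append("quality string longer than sequence")
    return problems


print(check_record("@read1", "ACGT", "+", "IIII"))
print(check_record("@read1", "ACGT", "+read2", "III"))
```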

Simulated truncation part way through a file:

error_trunc_at_plus.fastq
error_trunc_at_qual.fastq
error_trunc_at_seq.fastq
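
For truncation on a line boundary, counting lines is enough to say which
part of the last record is missing. This is just a sketch assuming four
lines per record with no wrapping; the function name is hypothetical. A
file cut in the middle of a line would instead show up as a length
mismatch between sequence and quality.

```python
def truncation_point(lines):
    """Return the name of the first missing line of the last record.

    Returns None when the number of lines is a multiple of four, i.e.
    the file does not look truncated at a line boundary.
    """
    expected = ["title", "sequence", "plus", "quality"]
    remainder = len(lines) % 4
    if remainder == 0:
        return None
    # With `remainder` lines of the final record present, the next
    # expected line is the one that is missing.
    return expected[remainder]


print(truncation_point(["@read1", "ACGT", "+", "IIII"]))
print(truncation_point(["@read1", "ACGT", "+"]))
```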

Note they are all based on the same example file which, because of the
quality characters it uses, can be interpreted as any of the three FASTQ
variants we're supporting (Sanger, Solexa, Illumina 1.3+). This was
deliberate. Additional examples of files which could be Sanger or
Solexa but not Illumina 1.3+ (or which are valid Sanger but can't be
Solexa or Illumina 1.3+) would also be a good idea.
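
To pick quality strings for such examples, it helps to check which
variants a given string is compatible with. A minimal sketch, assuming
the usual ASCII ranges (Sanger 33 to 126, Solexa 59 to 126, Illumina
1.3+ 64 to 126); the names used here are for illustration only.

```python
# Assumed ASCII ranges (inclusive) for the quality characters of each variant.
VARIANT_ASCII_RANGES = {
    "sanger": (33, 126),         # PHRED 0..93, offset 33
    "solexa": (59, 126),         # Solexa -5..62, offset 64
    "illumina-1.3+": (64, 126),  # PHRED 0..62, offset 64
}


def compatible_variants(quality):
    """Return the variants whose ASCII range covers every character."""
    codes = [ord(ch) for ch in quality]
    return [
        name
        for name, (low, high) in VARIANT_ASCII_RANGES.items()
        if all(low <= code <= high for code in codes)
    ]


print(compatible_variants("hhhh"))  # all characters >= 64
print(compatible_variants(";<=>"))  # characters 59..62
print(compatible_variants("!!II"))  # '!' is 33
```

A character in the range 59 to 63 rules out Illumina 1.3+ only, and
anything from 33 to 58 leaves Sanger as the only option.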

Note that in many of these examples the error is part way into the
file, so there are initially some valid reads and then an error.
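
A test for a streaming parser can therefore check both things: how many
reads come back before the failure, and that the failure is reported.
The sketch below uses a toy parser written for this purpose (four lines
per record, no wrapping), not the Biopython one; both function names are
hypothetical.

```python
def iter_fastq(lines):
    """Yield (identifier, seq, qual) tuples; raise ValueError at the first bad record."""
    lines = [line.rstrip("\n") for line in lines]
    for start in range(0, len(lines), 4):
        record = lines[start:start + 4]
        if len(record) < 4:
            raise ValueError("truncated record at line %d" % (start + 1))
        title, seq, plus, qual = record
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("bad marker line in record at line %d" % (start + 1))
        if len(seq) != len(qual):
            raise ValueError(
                "sequence and quality lengths differ at line %d" % (start + 1)
            )
        yield title[1:], seq, qual


def reads_before_error(lines):
    """Return (number of valid reads seen, error message or None)."""
    count = 0
    try:
        for _ in iter_fastq(lines):
            count += 1
    except ValueError as err:
        return count, str(err)
    return count, None


data = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "III"]
print(reads_before_error(data))
```

Because the parser is a generator, the first read is handed back before
the problem in the second record is noticed.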

Peter


