[Open-bio-l] FASTQ identifiers

Peter biopython at maubp.freeserve.co.uk
Mon Aug 3 09:30:09 UTC 2009


On Sun, Aug 2, 2009 at 2:25 AM, Charles Plessy
<charles-listes+open-bio at plessy.org> wrote:
> On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
>> The situation is similar to the FASTA format (and others), in that there
>> are a number of reasonably well documented conventions in use
>> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
>> there are thousands of ad hoc local conventions.
>
> Hello,
>
> I would just like to mention one such ad hoc convention in use at my
> workplace: with FASTQ sequences we sometimes replace the original
> name with the sequence itself. This can be useful, for instance, to
> troubleshoot some sequence manipulations.
>
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +EAS54_6_R1_2_1_413_324
> ;;3;;;;;;;;;;;;7;;;;;;;88
>
> becomes:
>
> @CCCTTCTTGTCTTCAGCGTTTCTCC
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +CCCTTCTTGTCTTCAGCGTTTCTCC
> ;;3;;;;;;;;;;;;7;;;;;;;88
>

That certainly demonstrates we can't make any big assumptions
about the title line formatting ;)
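
For anyone wanting to try the renaming Charles describes, here is a
minimal sketch in plain Python. It is not his actual code: the function
name rename_to_sequence is made up for illustration, and it assumes
simple four-line FASTQ records with no line wrapping.

```python
from io import StringIO


def rename_to_sequence(handle):
    """Yield FASTQ records as strings, using each read's sequence as its title.

    Hypothetical helper: assumes four-line records (no line wrapping).
    """
    while True:
        title = handle.readline()
        if not title:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Not a four-line FASTQ record: %r" % title)
        # The original identifier is discarded; the sequence takes its
        # place on both the '@' and '+' lines.
        yield "@%s\n%s\n+%s\n%s\n" % (seq, seq, seq, qual)


example = StringIO(
    "@EAS54_6_R1_2_1_413_324\n"
    "CCCTTCTTGTCTTCAGCGTTTCTCC\n"
    "+EAS54_6_R1_2_1_413_324\n"
    ";;3;;;;;;;;;;;;7;;;;;;;88\n"
)
for record in rename_to_sequence(example):
    print(record, end="")
```

Note that identical reads end up with identical titles, so the new
names are no longer unique identifiers.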

Your example is interesting - but I don't quite understand why you
do this. Surely any debug message or output file for bad reads
would (normally) have a unique read ID which (indirectly) tells
you the read sequence? If you are writing the code which gives
these error messages, can't you explicitly give the read sequence?
Is the aim to be able to look at error messages from third-party
tools (which just give the read name) and see the read sequence
directly (without looking up the read name in the original FASTQ
file)?
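
The lookup step I have in mind (read name to read sequence) could be
sketched roughly as below. This is only an illustration under some
assumptions: four-line records, unique read IDs, and the ID taken to be
the first word of the title line. The function name
index_sequences_by_id is hypothetical.

```python
from io import StringIO


def index_sequences_by_id(handle):
    """Return a dict mapping each read ID to its sequence.

    Hypothetical helper: assumes four-line records and takes the ID to be
    the first whitespace-separated word of the title line.
    """
    index = {}
    while True:
        title = handle.readline()
        if not title:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' line, ignored
        handle.readline()  # quality line, ignored
        if not title.startswith("@"):
            raise ValueError("Expected a title line, got %r" % title)
        words = title[1:].split()
        read_id = words[0] if words else ""  # blank titles give an empty ID
        if read_id in index:
            raise ValueError("Duplicate read ID: %r" % read_id)
        index[read_id] = seq
    return index


reads = StringIO(
    "@EAS54_6_R1_2_1_413_324\n"
    "CCCTTCTTGTCTTCAGCGTTTCTCC\n"
    "+EAS54_6_R1_2_1_413_324\n"
    ";;3;;;;;;;;;;;;7;;;;;;;88\n"
)
index = index_sequences_by_id(reads)
print(index["EAS54_6_R1_2_1_413_324"])
```

A dictionary like this holds every sequence in memory, so for a full
sequencing run something less naive would probably be needed.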

This is similar in some ways to my comment that I could see a real
use for FASTQ (and FASTA) files with no record identifiers:

>> Related to this, what about the corner case of reads with NO
>> identifier? The FASTQ (and indeed the FASTA) formats can
>> hold such things - just use a blank title line. In the case of
>> next generation sequencing reads, the names themselves
>> are not actually that important - so you can imagine a pipeline
>> which doesn't actually bother with them at all.

In your pipeline you clearly don't care about the original FASTQ
identifiers, so (if the pipeline would accept them) blank title
lines might also work (and would certainly save disk space).
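
To make the blank title idea concrete, here is a rough sketch that
strips the identifiers from both the '@' and '+' lines. The function
name blank_titles is invented for this example, it assumes four-line
records, and whether any given downstream tool accepts the output is
exactly the open question above.

```python
from io import StringIO


def blank_titles(handle):
    """Yield FASTQ records as strings with the identifiers removed.

    Hypothetical helper: assumes four-line records (no line wrapping).
    """
    while True:
        title = handle.readline()
        if not title:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Not a four-line FASTQ record: %r" % title)
        # Keep only the '@' and '+' markers; drop the names entirely.
        yield "@\n%s\n+\n%s\n" % (seq, qual)


original = (
    "@EAS54_6_R1_2_1_413_324\n"
    "CCCTTCTTGTCTTCAGCGTTTCTCC\n"
    "+EAS54_6_R1_2_1_413_324\n"
    ";;3;;;;;;;;;;;;7;;;;;;;88\n"
)
stripped = "".join(blank_titles(StringIO(original)))
print(stripped, end="")
print("%i characters before, %i after" % (len(original), len(stripped)))
```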

Peter



