[Open-bio-l] FASTQ identifiers

Sun Aug 2 01:25:37 UTC 2009

On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
> The situation is similar to the FASTA format (and others), in that there
> are a number of reasonably well documented conventions in use
> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
> there are thousands of ad hoc local conventions.

Hello,

I would just like to mention one such ad hoc convention in use at my workplace:
with FASTQ sequences we sometimes replace the original name with the sequence
itself. This can be useful, for instance, to troubleshoot sequence
manipulations, since the name keeps a record of the untouched read.

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88

becomes:

@CCCTTCTTGTCTTCAGCGTTTCTCC
CCCTTCTTGTCTTCAGCGTTTCTCC
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;3;;;;;;;;;;;;7;;;;;;;88

and after some arbitrary trimming at the ends:

@CCCTTCTTGTCTTCAGCGTTTCTCC
TTCTTGTCTTCAGCGTTTCT
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;;;;;;;;;;;7;;;;;;;
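
The renaming step could be sketched roughly as follows in Python. This is only
an illustration, not an existing library function: the name
rename_fastq_by_sequence is hypothetical, and the code assumes strict four-line
FASTQ records (no line wrapping of sequence or quality).

```python
def rename_fastq_by_sequence(lines):
    """Yield FASTQ lines with each record's name replaced by its sequence.

    Assumption: every record is exactly four lines
    (@name, sequence, +name, quality), with no line wrapping.
    """
    # Strip line endings and ignore blank lines between records.
    cleaned = [line.rstrip("\r\n") for line in lines]
    cleaned = [line for line in cleaned if line]
    if len(cleaned) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")

    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # The original sequence becomes the identifier on both name lines.
        yield "@" + seq
        yield seq
        yield "+" + seq
        yield qual
```

Any later trimming then changes only the second and fourth lines, so the name
still shows what the read looked like before manipulation.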

With FASTA format, we sometimes eliminate redundant sequences and record how
many times they occurred by adding the count to the name.

For instance:

>seq1
AAATTT
>seq2
AAATAT
>seq3
AAATTT

becomes:

>AAATTT_2
AAATTT
>AAATAT_1
AAATAT
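
A minimal Python sketch of this collapsing step might look like the following.
The function name collapse_fasta is hypothetical, and the choice to keep
sequences in order of first appearance is an assumption made to match the
example above.

```python
def collapse_fasta(lines):
    """Collapse identical FASTA sequences into '>SEQUENCE_COUNT' records.

    Returns a list of output lines. Sequences are reported in order of
    first appearance (an assumption; dicts keep insertion order in
    Python 3.7+). Original names are discarded.
    """
    counts = {}
    parts = []  # pieces of the current, possibly line-wrapped, sequence

    def flush():
        # Record the sequence accumulated so far, if any.
        if parts:
            seq = "".join(parts)
            counts[seq] = counts.get(seq, 0) + 1
            parts.clear()

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()  # a new header ends the previous record
        else:
            parts.append(line)
    flush()  # do not forget the last record

    out = []
    for seq, n in counts.items():
        out.append(">%s_%d" % (seq, n))
        out.append(seq)
    return out
```

This keeps every distinct sequence in memory, which may not suit very large
files.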

If these conventions are popular elsewhere, it may be useful to implement
functions that apply them efficiently.

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan