[Open-bio-l] FASTQ identifiers

Charles Plessy charles-listes+open-bio at plessy.org
Sun Aug 2 01:25:37 UTC 2009


On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
> The situation is similar to the FASTA format (and others), in that there
> are a number of reasonably well documented conventions in use
> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
> there are thousands of ad hoc local conventions.

Hello,

I would just like to mention one such ad hoc convention in use at my workplace:
with FASTQ sequences, we sometimes replace the original name with the sequence
itself. This can be useful, for instance, for troubleshooting sequence
manipulations, because the original read stays visible in the name after
later processing steps.

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88

becomes:

@CCCTTCTTGTCTTCAGCGTTTCTCC
CCCTTCTTGTCTTCAGCGTTTCTCC
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;3;;;;;;;;;;;;7;;;;;;;88

and after some arbitrary trimming at the ends:

@CCCTTCTTGTCTTCAGCGTTTCTCC
TTCTTGTCTTCAGCGTTTCT
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;;;;;;;;;;;7;;;;;;;
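
The renaming and trimming steps above could be sketched in Python roughly as
follows. This is only an illustration, not the code we actually use: the
function names (rename_by_sequence, trim, format_fastq) are made up, and
records are assumed to be plain (name, sequence, quality) tuples taken from
strict four-line FASTQ entries.

```python
def rename_by_sequence(record):
    """Replace the name of a (name, sequence, quality) record by its sequence."""
    _name, seq, qual = record
    return (seq, seq, qual)


def trim(record, left, right):
    """Remove `left` bases from the start and `right` bases from the end.

    The name is left untouched, so it still shows the untrimmed read.
    """
    name, seq, qual = record
    if left < 0 or right < 0 or left + right > len(seq):
        raise ValueError("cannot trim %d + %d bases from a read of length %d"
                         % (left, right, len(seq)))
    end = len(seq) - right
    return (name, seq[left:end], qual[left:end])


def format_fastq(record):
    """Format a record as four FASTQ lines, repeating the name after '+'."""
    name, seq, qual = record
    return "@{0}\n{1}\n+{0}\n{2}\n".format(name, seq, qual)


# Hypothetical usage with the read shown above.
read = ("EAS54_6_R1_2_1_413_324",
        "CCCTTCTTGTCTTCAGCGTTTCTCC",
        ";;3;;;;;;;;;;;;7;;;;;;;88")
renamed = rename_by_sequence(read)
trimmed = trim(renamed, left=3, right=2)
print(format_fastq(trimmed), end="")
```

Trimming 3 bases on the left and 2 on the right reproduces the last record
above; the quality string is sliced with the same offsets so that it stays
aligned with the sequence.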


With FASTA format, we sometimes eliminate redundant sequences and record how
many times they occurred by adding the count to the name.

For instance:

>seq1
AAATTT
>seq2
AAATAT
>seq3
AAATTT

becomes:

>AAATTT_2
AAATTT
>AAATAT_1
AAATAT
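
As a rough sketch of the collapsing step above, again in Python: the function
names (read_fasta, collapse_duplicates) are hypothetical, sequences are
compared as exact strings, and the output order follows the first occurrence
of each sequence (this relies on dict ordering in Python 3.7 or later).

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def collapse_duplicates(records):
    """Return one (sequence_count, sequence) pair per distinct sequence."""
    counts = Counter(seq for _name, seq in records)
    return [("{}_{}".format(seq, n), seq) for seq, n in counts.items()]


# Hypothetical usage with the three records shown above.
fasta = [">seq1", "AAATTT", ">seq2", "AAATAT", ">seq3", "AAATTT"]
for name, seq in collapse_duplicates(read_fasta(fasta)):
    print(">" + name)
    print(seq)
```

The original names (seq1, seq2, seq3) are discarded, as in the example; a
variant that needs them could keep a list of names per sequence instead of a
count.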

If these conventions are popular elsewhere, it may be useful to implement
functions that support them efficiently.

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan


