[EMBOSS] vectorstrip on FASTQ files

Wed Aug 19 11:08:26 UTC 2009

Hi,

I'm trying to use vectorstrip on FASTQ files (as a simple way to
remove adaptor or primer sequences). However, it seems that on output
the FASTQ qualities are lost (all set to the double quote, ASCII 34,
meaning PHRED quality 1, i.e. essentially random). Is this a known bug
(or rather, a missing feature)?
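
As a quick check of that reading, here is a minimal Python sketch of the
Sanger FASTQ encoding (PHRED score = ASCII code minus 33). It is only the
standard arithmetic, not anything taken from EMBOSS, and the function
names are made up for illustration.

```python
def sanger_phred(char):
    """Decode one Sanger FASTQ quality character (ASCII offset 33)."""
    return ord(char) - 33


def error_probability(phred):
    """Convert a PHRED score to the estimated probability of a wrong base call."""
    return 10 ** (-phred / 10)


placeholder = '"'                      # the character vectorstrip writes
score = sanger_phred(placeholder)
print(ord(placeholder), score)         # 34 1
print(round(error_probability(score), 3))  # 0.794
```

So a quality line made up entirely of double quotes says each base has
roughly a 79% chance of being wrong, i.e. no real quality information.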

For illustration I am using a Sanger style FASTQ file from the NCBI
SRA (short reads originally from Solexa/Illumina), SRR014849.fastq
which you can download from
ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz

I am pretending "GTTGGAACCG" is a 5' adaptor sequence, and want to find
any matches in some FASTQ reads and trim them off, keeping only the
sequence to the right. For simplicity I'm allowing no mismatches.
Here is the start of the file:

$ head -n 12 SRR014849.fastq
@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGAAAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAACCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7F@71,'";C?,B;?6B;:EA1EA1EA5'9B:?:#9EA0D@2EA5':>5?:%A;A8A;?9B;D@/=<?7=9<2A8==
@SRR014849.3 EIXKN4201D4ZBL length=119
GGGGGGGGGCTGTTGGCCGAGGTTGGAGTAGCCAGGGGGAAGGCATGGCCAGCCGTTGAGAAATGCTTGTTGAAGTTTTCGATAATAATGGATTTATCGGTGGTGACCGTGTTACCTAG
+SRR014849.3 EIXKN4201D4ZBL length=119
;3.*(&$"";<=A9@8A9;<B;B;B;8=<==B;<FB8/'@8B:==<B;A9<<A8=B;==;A=)=<<B;=A9<@7<FB5(<<=<B;<B;:A9=EA0;<;B:<A8=<<@8<<<B;<A99=<
@SRR014849.9 EIXKN4201AL42E length=84
AACATAAAGAGCAATAGACAGTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA
+SRR014849.9 EIXKN4201AL42E length=84
B:=8<EA087<;@8<<<8<:8A9=3>5B;4B>+C?,EA09B;@;9E@/EA/E@/B:;1B:B:;A9<5<B;;8EA0<<B;FB6)7

Notice the "adaptor" in the third sequence, SRR014849.9:
AACATAAAGAGCAATAGACAGTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA
This should be trimmed to just:
AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA

Using FASTA as output looks fine:

$ vectorstrip -sequence SRR014849.fastq -sformat fastq-sanger \
    -readfile N -alinker "GTTGGAACCG" -blinker "" -osformat fasta \
    -outseq SRR014849_5trimmed.fasta -mismatch 0 -besthits Y \
    -outfile SRR014849_5trimmed.txt
Removes vectors from the ends of nucleotide sequence(s)

$ head -n 2 SRR014849_5trimmed.fasta
>SRR014849.9_from_31_to_84 EIXKN4201AL42E length=84
AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA

Using Sanger FASTQ as the output format also runs:

$ vectorstrip -sequence SRR014849.fastq -sformat fastq-sanger \
    -readfile N -alinker "GTTGGAACCG" -blinker "" -osformat fastq-sanger \
    -outseq SRR014849_5trimmed.fastq -mismatch 0 -besthits Y \
    -outfile SRR014849_5trimmed.txt
Removes vectors from the ends of nucleotide sequence(s)

But the output is missing the quality scores:

$ head -n 4 SRR014849_5trimmed.fastq
@SRR014849.9_from_31_to_84 EIXKN4201AL42E length=84
AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA
+
""""""""""""""""""""""""""""""""""""""""""""""""""""""
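
To make the expected behaviour concrete, here is a rough Python sketch
that slices the quality string at the same position as the sequence,
using the SRR014849.9 read shown above. It is not how vectorstrip works
internally: it only handles a single exact match of a 5' adaptor, and
the function name trim_5prime_adaptor is hypothetical.

```python
def trim_5prime_adaptor(seq, qual, adaptor):
    """Return (seq, qual) to the right of the first exact adaptor match.

    Returns None if the adaptor is not found. No mismatches are allowed.
    """
    start = seq.find(adaptor)
    if start == -1:
        return None
    end = start + len(adaptor)
    # Slice both strings at the same offset so each base keeps its own score.
    return seq[end:], qual[end:]


# Read SRR014849.9 from the FASTQ excerpt above.
seq = ("AACATAAAGAGCAATAGACAGTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTGG"
       "TTCCAACTTGTCTTGCTTTAGCCTTTTA")
qual = ("B:=8<EA087<;@8<<<8<:8A9=3>5B;4B>+C?,EA09B;@;9E@/EA/E@/B:"
        ";1B:B:;A9<5<B;;8EA0<<B;FB6)7")

trimmed_seq, trimmed_qual = trim_5prime_adaptor(seq, qual, "GTTGGAACCG")
print(trimmed_seq)
print(trimmed_qual)
```

The trimmed sequence matches the FASTA output above, and the quality
line is simply the last 54 characters of the original one rather than a
run of double quotes.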

Is this something simple to add to vectorstrip? What about other
annotation (e.g. running vectorstrip on annotated GenBank or EMBL
files)?

Thanks,

Peter C.

P.S. This is with EMBOSS 6.1.0 with a patch from Peter Rice, running
on Mac OS X.