[emboss-dev] Quality scores in union and splitter output (was: partial GenBank)

Peter biopython at maubp.freeserve.co.uk
Thu Jun 17 17:58:03 UTC 2010


Hi Peter R et al,

This email was prompted by a discussion on the main EMBOSS list:

On Thu, Jun 17, 2010 at 9:38 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> On 16/06/2010 13:55, Roi Brodo wrote:
>>
>> After some more reading I think I can do it using union. The problem is
>> that after I create the list (of the two ranges) using yank, union dies on
>> "union terminated: Bad value for '-sequence' and no prompt". Why is that?
>> Shouldn't i use a yank file?
>
> Yes, yank and union is the correct approach.
>
> The output of yank is a list file, so the input to union should be @filename
> to read a list of sequence addresses from the file.
>
> If you just give the filename it assumes it is sequences (perhaps a fasta
> file of sequences to be joined).
>
> We will add this to our feature requests - it should be possible to make
> seqret handle ranges from circular sequences. This will be after the next
> release as it requires rewriting the way several library functions work to
> allow the circular range.

This is an interesting example and I couldn't resist working out how I
would solve it with Biopython - something like this if anyone cares:

from Bio import SeeqIO
old = SeqIO.read("input.gbk", "gb")
new = old[800000:] + old[:100000]
SeqIO.write(new, "output.gbk", "gb")

We already have several unit tests for Biopython which check some
functionality against an EMBOSS tool. I should probably try using
EMBOSS yank and union to verify our record slicing and addition...
As part of this it occurred to me to try union on FASTQ files and I
found it does not handle the quality scores properly:

$ more example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333


$ union --version
EMBOSS:6.2.0

$ union -sequence example.fastq -sformat fastq-sanger -osformat
fastq-sanger -stdout -auto
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCCTTGGCAGGCCAAGGCCGATGGATCAGTTGCTTCTGGCGTGGGTGGGGGGG
+
;;3;;;;;;;;;;;;7;;;;;;;88!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The output is well formed, and has the correct quality scores
from the first entry, but has failed to include the quality scores
of the subsequence sequences (defaulting to PHRED zero,
the exclamation mark).

There is a similar issue with splitter which seems to default
to PHRED one, the double quote, for all entries:

$ splitter -size 5 -sequence example.fastq -sformat fastq-sanger
-osformat fastq-sanger -stdout -auto
@EAS54_6_R1_2_1_413_324_1-5
CCCTT
+
"""""
@EAS54_6_R1_2_1_413_324_6-10
CTTGT
+
"""""
...
[cut]

Now admittedly neither of these operations seem to be very
natural for short read data - although it might make sense
to take a FASTQ contig and shred it using splitter for
feeding into another assembly tool?

Anyway, I thought I should report these issues.

Regards,

Peter C.



More information about the emboss-dev mailing list