[emboss-dev] EMBOSS 6.3.0 released - SAM/BAM

Peter biopython at maubp.freeserve.co.uk
Fri Aug 13 09:40:35 UTC 2010


On Tue, Aug 3, 2010 at 9:12 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>>
>>> Just for the record, EMBOSS 6.3.1 does not append anything to the
>>> read names, meaning paired end reads cannot be distinguished if
>>> output as FASTA or FASTQ.
>>>
>>> I'm not sure my idea of appending /1 or /2 for paired reads is the
>>> best solution (especially since there are other naming schemes
>>> out there like _f and _r as suffixes). Nevertheless, it seems like a
>>> practical solution. Would including a slash character within a
>>> sequence name cause problems in EMBOSS (a potential issue
>>> you raised earlier)?
>>
>> The /1 and /2 would cause horrible problems. The sequence names are
>> used to generate default output file names so a '/' would have to be
>> removed or converted, most likely to _1 and _2
>
> Oh :(
>
> I thought they might cause confusion with slashes in filenames, but
> yes, they can't be used in filenames can they.

Thinking about this more, I don't think there is a problem. There are
two main reasons. First, with SAM/BAM/FASTQ files there are typically
so many reads that you would never want to create one file per read.

Second, there are plenty of other file formats where the record ID
can, or indeed usually does, contain a slash. In particular, Stockholm
format alignments from PFAM use IDs of the form name/start-stop, e.g.
http://emboss.sourceforge.net/docs/themes/seqformats/pfam
Surely EMBOSS has already got a mechanism for dealing with
slashes in IDs when asked to use the IDs as filenames?
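
To illustrate the kind of mechanism I mean, here is a minimal sketch in
Python. It is not EMBOSS code and I have not checked how EMBOSS actually
does this; the function name safe_filename, the underscore replacement
and the "unnamed" fallback are all my own assumptions.

```python
import re


def safe_filename(record_id, replacement="_"):
    """Turn a sequence ID into something usable as a file name.

    Hypothetical helper, not part of EMBOSS: any character outside a
    conservative whitelist (including '/') becomes `replacement`.
    """
    # Keep letters, digits, dot, underscore and hyphen; replace the rest.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", replacement, record_id)
    # Fall back to a fixed (assumed) name if the ID was empty.
    return cleaned or "unnamed"
```

With this, "read42/1" would become "read42_1". Note the mapping is
lossy: "read42/1" and "read42_1" end up with the same file name, so a
real implementation would need some way to handle collisions.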

I think I mentioned storing the original read name in the tags had
been suggested on the samtools-devel list. In the latest draft of
the SAM/BAM spec, a new tag FS (fragment name suffix) has been
proposed, so that the original read names could be recovered by
taking the fragment name (the ID in SAM/BAM) and appending
this suffix. See this thread from earlier in August 2010:

[Samtools-devel] Recording original read name in tags
http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTimg%2BvNU3CkW-63Mmug-Qt0md183dyJ_nRqva1rv%40mail.gmail.com&forum_name=samtools-devel
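
As a rough sketch of how a reader might use such a tag (Python, working
on a single tab-separated SAM line): the FS tag is only a proposal in
the draft spec, and I am assuming here that it would be stored as a
string (type Z) in the usual TAG:TYPE:VALUE form. The function name
original_read_name is hypothetical.

```python
def original_read_name(sam_line):
    """Rebuild the original read name from one SAM alignment line.

    Assumes the proposed (draft) FS tag is written as FS:Z:<suffix>.
    If no FS tag is present, the fragment name is returned unchanged.
    """
    fields = sam_line.rstrip("\n").split("\t")
    fragment_name = fields[0]
    # The first 11 columns are mandatory; optional tags follow them.
    for tag in fields[11:]:
        if tag.startswith("FS:Z:"):
            return fragment_name + tag[len("FS:Z:"):]
    return fragment_name
```

So a record named "frag1" carrying FS:Z:/1 would give back "frag1/1",
while a record without the tag keeps its name as is.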

Finally, on the samtools-help list, it was pointed out that the
hydra-sv project has a bamToFastq tool; see this thread:

[Samtools-help] BAM to fastq how?
http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTinBnm%2B8V8bXD_ii9jn8-O%2B0_N1MgWBxBFnqm2Mk%40mail.gmail.com&forum_name=samtools-help

and http://code.google.com/p/hydra-sv/
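
For comparison, here is a minimal sketch of the /1 and /2 naming idea
applied when turning a SAM line into FASTQ. This is my own illustration
in Python, not how bamToFastq or EMBOSS works. It relies only on the
standard SAM flag bits (0x1 paired, 0x10 reverse strand, 0x40 first in
pair, 0x80 second in pair); the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def pair_suffix(flag):
    """Return '/1' or '/2' for paired reads, else an empty string."""
    first = bool(flag & 0x40)
    second = bool(flag & 0x80)
    if flag & 0x1 and first != second:
        return "/1" if first else "/2"
    return ""


def sam_line_to_fastq(sam_line):
    """Convert one SAM alignment line into a FASTQ record string."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]
    if flag & 0x10:
        # SAM stores reverse-strand reads reverse complemented, so undo
        # that to get back the read as sequenced.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    return "@%s%s\n%s\n+\n%s\n" % (name, pair_suffix(flag), seq, qual)
```

Unpaired reads keep their name unchanged, so only paired data would
ever pick up the slash.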

Peter C.


