[EMBOSS] SAM format

Stephen Taylor stephen.taylor at imm.ox.ac.uk
Thu Jan 20 10:36:20 UTC 2011


On 20/01/2011 09:36, Peter Rice wrote:
> On 01/20/11 09:06, Stephen Taylor wrote:
>> Hi Peter,
>>
>>>>
>>>> Is EMBOSS planning to release tools that produce SAM format in the near
>>>> future or is it more likely to be on the customary July 15th release?
>>>
>>> The last release EMBOSS 6.3.1 has SAM as an output format for sequences
>>> and pairwise alignments (-oformat sam and -aformat sam respectively).
>
> ... oops, -osformat for sequences of course.
>
>
>> Sadly, fuzznuc doesn't seem to work using aformat or oformat. Is that
>> due to be supported?
>
> Ah, fuzznuc reports features so we hadn't implemented SAM there.
>
> However, you can use -rformat listfile to get USAs for the features, and
> then seqret -osformat sam @listfilename to get the sequences in SAM format.
>
> Possibly scope to do more there. What would you like to see in SAM
> output for fuzznuc?

My motivation was to build BAM tracks showing matches of lots of patterns in the genome sequence. I hadn't thought about
proteins but I guess you could do something similar.

The SAM file would have one line per match, giving the position of the match, with the CIGAR string describing the
matched pattern and SEQ (col 10) containing the query pattern expanded to show the match. The original pattern could
go in the OPT field. I see there is a tag for mismatching positions (MD), which would work for regex-style matches (so
good for 'dreg'), but I am not sure it would be strictly legal for a PROSITE-like pattern.

e.g. for [CG](5)TG{A}N(1,5)C

Could you have

MD:Z:[CG](5)TG{A}N(1,5)C

?

It looks like {,} is not allowed. So perhaps you would have to translate the pattern to a regex or generate an 
alternative optional tag. I am not a SAM expert so apologies if I am proposing to violate the format rules!
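
To make the idea concrete, here is a rough Python sketch of the "translate to a regex and use an alternative optional
tag" route. It is only an illustration and not anything EMBOSS does: the function names, the XP tag and the subset of
pattern syntax handled are all my own assumptions, the MD check uses the format as I read it in the SAM specification,
and mismatches and reverse-strand matches are ignored.

```python
import re

# MD tag format as I read it in the SAM specification (assumption).
MD_FORMAT = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")


def pattern_to_regex(pattern):
    """Translate a simplified PROSITE-style nucleotide pattern to a regex.

    Hypothetical helper: handles only [..], {..}, (n), (n,m) and N/X.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":                      # [CG]  -> [CG]
            j = pattern.index("]", i)
            out.append("[" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "{":                    # {A}   -> [^A]
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":                    # (5)   -> {5}, (1,5) -> {1,5}
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif ch in "NnXx":                 # any base
            out.append("[ACGT]")
            i += 1
        else:                              # literal base
            out.append(re.escape(ch.upper()))
            i += 1
    return "".join(out)


def sam_lines(ref_name, ref_seq, pattern):
    """Yield one SAM alignment line per forward-strand match of pattern."""
    regex = re.compile(pattern_to_regex(pattern))
    for n, m in enumerate(regex.finditer(ref_seq.upper()), start=1):
        seq = m.group(0)
        fields = [
            f"match{n}",          # QNAME (made-up naming scheme)
            "0",                  # FLAG: forward strand
            ref_name,             # RNAME
            str(m.start() + 1),   # POS is 1-based
            "255",                # MAPQ: not available
            f"{len(seq)}M",       # CIGAR: the whole match aligned
            "*", "0", "0",        # RNEXT, PNEXT, TLEN unused
            seq,                  # SEQ: the matched bases
            "*",                  # QUAL: not available
            f"XP:Z:{pattern}",    # user-defined tag holding the pattern
        ]
        yield "\t".join(fields)


if __name__ == "__main__":
    pat = "[CG](5)TG{A}N(1,5)C"
    print(MD_FORMAT.fullmatch(pat))        # None: not a legal MD value
    print(pattern_to_regex(pat))
    for line in sam_lines("chr1", "AAACCGGCTGTACGCAAA", pat):
        print(line)
```

I picked a tag starting with X because, as far as I can tell, tags beginning with X, Y or Z are left for end users,
and a Z-type value can hold any printable string, so the braces and commas should be fine there even though they are
not in MD.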

Incidentally, I would use dreg but it doesn't allow mismatches to be easily specified.

Steve






