[EMBOSS] SAM format

Peter Rice pmr at ebi.ac.uk
Thu Jan 20 10:54:33 UTC 2011


On 01/20/11 10:36, Stephen Taylor wrote:
> On 20/01/2011 09:36, Peter Rice wrote:
>> Possibly scope to do more there. What would you like to see in SAM
>> output for fuzznuc?
>
> My motivation was to build BAM tracks showing matches of lots of
> patterns in the genome sequence. I hadn't thought about proteins but I
> guess you could do something similar.
>
> The SAM file would show the position of each match per line and the
> CIGAR string containing the matched pattern and SEQ (col 10) containing
> the query pattern expanded to show the match. The original pattern could
> be in the OPT field.

Interesting. The sequence becomes the reference. We would have to do a
little extra work to generate the CIGAR string for various patterns, but
that should be possible by modifying the pattern matching code.
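
As a rough illustration of the record layout being discussed, here is a
minimal Python sketch that formats one ungapped pattern hit as a SAM
line. The function name sam_line, the read name "hit1" and the XP tag
are assumptions made for this example, not existing fuzznuc output; XP
is chosen only because tags starting with X are left for user-defined
use. The CIGAR here is a plain run of M operations, which would not
cover every pattern.

```python
def sam_line(ref_name, start0, matched, pattern, qname="hit1"):
    """Format one ungapped pattern hit as a tab-separated SAM record.

    start0 is the 0-based start of the hit on the reference sequence;
    SAM positions are 1-based, so 1 is added.
    """
    fields = [
        qname,                # QNAME: hypothetical name for the hit
        "0",                  # FLAG: forward strand, no other bits set
        ref_name,             # RNAME: the searched sequence is the reference
        str(start0 + 1),      # POS: 1-based leftmost position
        "255",                # MAPQ: 255 means "not available"
        f"{len(matched)}M",   # CIGAR: ungapped match over the hit length
        "*",                  # RNEXT: no mate
        "0",                  # PNEXT
        "0",                  # TLEN
        matched,              # SEQ: the matched residues
        "*",                  # QUAL: not stored
        f"XP:Z:{pattern}",    # assumed user tag holding the original pattern
    ]
    return "\t".join(fields)
```

For example, sam_line("chr1", 2, "CCGGCTGTAAC", "[CG](5)TG{A}N(1,5)C")
gives a record at position 3 with CIGAR 11M.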

> I see there is a tag for Mismatching positions (MD)
> which would work for regex style matches (so good for 'dreg'), but I am
> not sure it would be strictly legal for a PROSITE like pattern.
>
> e.g for [CG](5)TG{A}N(1,5)C
>
> Could you have
>
> MD:Z:[CG](5)TG{A}N(1,5)C

That will need some investigation. Maybe prosite patterns can be 
translated to regex for this purpose - many will convert easily.

> It looks like {,} is not allowed. So perhaps you would have to translate
> the pattern to a regex or generate an alternative optional tag. I am not
> a SAM expert so apologies if I am proposing to violate the format rules!

N(1,5) is equivalent to NN?N?N?N? ... though prosite ranges can go over 
100 positions.
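
A minimal Python sketch of that kind of translation, handling only the
constructs in the example pattern above: [..] sets, {..} exclusions,
(n) and (m,n) repeat counts, and N as "any symbol". The function name
pattern_to_regex is an assumption for this example and is not part of
EMBOSS; it uses regex counted repeats rather than the expanded
NN?N?N?N? form, and it ignores other ambiguity codes and anchors.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate a simple fuzznuc/PROSITE-style pattern to a regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":                       # [CG] stays a character class
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "{":                     # {A} becomes the negated class [^A]
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":                     # (5) -> {5}, (1,5) -> {1,5}
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif ch in "Nn":                    # N matches any single symbol
            out.append(".")
            i += 1
        else:                               # literal residue
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


regex = pattern_to_regex("[CG](5)TG{A}N(1,5)C")
print(regex)  # [CG]{5}TG[^A].{1,5}C
```

The resulting string can be passed to re.search to locate hits, and the
match start and end then give the position and length for a SAM line.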

> Incidentally, I would use dreg but it doesn't allow mismatches to be
> easily specified.

True, that's a regular expression library issue.

regards,

Peter


