[Biopython-dev] Sequential SFF IO

Kevin Jacobs <jacobs@bioinformed.com> bioinformed at gmail.com
Fri Jan 28 12:14:39 UTC 2011

On Thu, Jan 27, 2011 at 8:32 AM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Wed, Jan 26, 2011 at 11:24 PM, Kevin wrote:
> > On Wed, Jan 26, 2011 at 2:44 PM, Peter wrote:
>  >> I'm currently looking at trimming 5' and 3' PCR  primer sequences -
> >> which could equally be used for barcodes etc. I'd probably wrap this
> >> as a Galaxy tool (using Biopython).
>  >
> > I have 90% of such a tool written.  I use a banded Smith-Waterman
> > alignment to match barcodes and generic PCR adapters/consensus
> > sequence to ensure that adapters and barcodes can be detected at
> > both ends of reads.
> Interesting - and yes, we do seem to have similar aims here. I have
> been doing ungapped alignments, allowing 0 or 1 (maybe in future 2)
> mismatches, working on getting this running at reasonable speed.

For just 5' barcode detection, I am using a memoized scheme that computes
anchored alignments and then stores the result in a hash table
(match/mismatch, edit distance).  This approach allows me to reject barcodes
with too small an edit distance to the next best candidate.  It is
reasonably fast for our fairly long 454 barcode set (10-'mers), though I do
have an optional Cython version of the edit distance routine.  The
pure-Python version is pretty zippy and can decode a 454 run in a minute or

> Gapped alignments would be particularly important in 454 reads
> with homopolymer errors, but most barcodes and PCR primers
> will avoid homopolymer runs so I don't expect this to be a common
> problem in this use case. Do you have good reasons to go to the
> expense of a gapped alignment?
When only trimming short 5' adapters, a gapped alignment may be a bit
overkill.  However, for our short-amplicon libraries we have a bit of a
challenge using simpler approaches.  Instead of hand-waving, here are the
gory details:  We're using Fluidigm Access Arrays to generate libraries and
sequence fragments with the following structure:

Forward: <flow key> <barcode> <consensus seq1> <target 1..n> <consensus 2
rc> <barcode rc>
Reverse: <flow key> <barcode> <consensus seq2> <target 1..n> <consensus 1
rc> <barcode rc>

where on, e.g., 454 and Ion Torrent:

  <flow key>  = TCAG
  <barcode> = non-homopolymeric 10-mer (up to 192 of them, min. edit
distance 4)
  <consensus seq1> = 22-mer generic PCR primer (distinct from cs2)
  <consensus seq2> = 22-mer generic PCR primer (distinct from cs1)
  <target> = ~30-150-mer genomic DNA target (

and "rc" denoting reverse compliment.  Our designs for Illumina are a bit
different, so I won't go into those right now.

I use the procedure outlined before to determine the barcode.  Then I
compute left-anchored gapped alignments between the read and constructs that
represent perfect matches to the targets to find the most likely boundaries
to trim the 5' and 3' elements from the target sequence.  I'm in the process
of adding position specific scoring and gap penalties, since this adds
virtually no computational cost and improves the boundary detection.  The
results go to a  genotype calling algorithm to classify known and novel

This approach is a bit overkill for some sequencing platforms with shorter
reads (e.g. Illumina or IonTorrent with 100 bp reads), but on 454 (and soon
PacBio, we hope) we routinely sequence through the targets and into the 3'
elements and have to trim.


More information about the Biopython-dev mailing list