[Biopython-dev] Sequential SFF IO

Thu Feb 3 12:04:08 UTC 2011

On Wed, Jan 26, 2011 at 7:44 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> I'm currently looking at trimming 5' and 3' PCR primer sequences -
> which could equally be used for barcodes etc. I'd probably wrap this
> as a Galaxy tool (using Biopython).
>

If anyone is interested, see this thread on the Galaxy-dev mailing list:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004290.html

In terms of SFF output, I'm only writing one SFF file so the issues
Jacob is concerned about (when writing one SFF file per barcode)
do not apply.

On Fri, Jan 28, 2011 at 12:34 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> I wrote up a barcode detector, remover and sorter for our Illumina
> reads. There is nothing especially tricky in the implementation: it
> looks for exact matches and then checks for approximate matches,
> with gaps, using pairwise2:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
>
> The "best_match" function could be replaced with different
> implementations, using the rest of the script as scaffolding to do
> all of the other sorting, trimming and output.
>
> Brad

The computationally interesting part is matching the primer/adapter/
barcode to the read (both of which may contain IUPAC ambiguity codes),
which as you point out can be replaced once you have a working
framework for the input, output, trimming, etc.
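
For illustration, here is a minimal sketch of ambiguity-aware matching
in Python. It is not the code from either tool mentioned in this thread.
It assumes the standard IUPAC nucleotide codes, and the function names
(primer_to_regex, trim_left_primer) are hypothetical. Each primer letter
is expanded into a regular-expression character class of every read
letter that shares at least one base with it, so ambiguity codes on
either side are tolerated:

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a primer into a regex (hypothetical helper).

    Each primer code becomes a character class listing every IUPAC code
    that shares at least one base with it.
    """
    parts = []
    for code in primer.upper():
        bases = set(IUPAC[code])
        allowed = sorted(c for c, b in IUPAC.items() if bases & set(b))
        parts.append("[" + "".join(allowed) + "]")
    return re.compile("".join(parts))


def trim_left_primer(read, primer):
    """Return the read after the first primer match, or None if absent."""
    match = primer_to_regex(primer).search(read.upper())
    return read[match.end():] if match else None
```

For example, trim_left_primer("AAGTACTTTT", "GTNC") finds GTAC at
offset 2 and returns the remaining "TTTT". This only covers exact
(ungapped, zero mismatch) matches.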

Currently I'm using regular expressions, which is fast enough for my
own needs - and this task could easily be parallelised by breaking
up the input reads. Beyond that perhaps something based on
Hamming distances (the number of mismatches, with no gaps) or
Levenshtein searches (which also allow gaps) might be quicker. I guess
speed is more of an issue with Illumina than with 454 due to the
number of reads?
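
To make the Hamming idea concrete, here is a rough sketch (an
assumption about how such a search might look, not an existing
Biopython function). It slides the primer along the read and keeps the
ungapped window with the fewest mismatches. The names hamming and
best_hamming_match are hypothetical, and for simplicity ambiguity codes
are compared as plain characters:

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def best_hamming_match(primer, read, max_mismatches=1):
    """Return (offset, mismatches) of the best ungapped match, or None.

    Ties are resolved in favour of the leftmost offset.
    """
    best = None
    width = len(primer)
    for start in range(len(read) - width + 1):
        distance = hamming(primer, read[start:start + width])
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (start, distance)
    return best
```

With primer "ACGT" and read "TTACCTGG", allowing one mismatch, this
returns (2, 1). A Levenshtein-based search would additionally need to
consider insertions and deletions, which this sketch does not attempt.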

Brad - you mentioned using approximate matches with gaps. Did you
find gapped matches made a big difference to the number of matches
found? i.e. is it worthwhile on your data?

Peter