[Biopython] index_db or two separate indices

Peter Cock p.j.a.cock at googlemail.com
Wed Oct 19 12:58:58 UTC 2016


Hi Liam,

The simplest solution is to use SeqIO.index_db twice, once for
each file, e.g.:

from Bio import SeqIO
# idx_r1 and idx_r2 are the SQLite index filenames (created on first use)
r1_index = SeqIO.index_db(idx_r1, "sorted_5000_R1.fq", "fastq")
r2_index = SeqIO.index_db(idx_r2, "sorted_5000_R2.fq", "fastq")

This will use two separate SQLite indexes. Since index_db keeps the record
offsets on disk rather than in memory, the memory overhead shouldn't be a
problem.
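
As a rough sketch of how the pairs could then be pulled out: the helper
name iter_pairs below is made up for illustration, and it only assumes
that both indexes behave like dictionaries keyed by the same read ID,
which is what index_db returns.

```python
def iter_pairs(r1_index, r2_index):
    """Yield (read_id, r1_record, r2_record) for IDs found in both indexes.

    Hypothetical helper: works with any two dict-like objects, such as
    the ones returned by SeqIO.index_db.
    """
    for read_id in r1_index:
        # Skip reads whose mate is missing from the second file.
        if read_id in r2_index:
            yield read_id, r1_index[read_id], r2_index[read_id]
```

With the two indexes above this would be used as
"for read_id, r1, r2 in iter_pairs(r1_index, r2_index): ...", so only one
pair of records is parsed at a time.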

It may be possible to do it in a single index via the key_function argument
(appending either /1 or /2 to the ID depending on the filename), but right
now I can't see how to do that nicely...
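
One possible workaround, sketched here on the assumption that a thin
wrapper is acceptable in place of a true single SQLite index: keep the
two indexes and route keys ending in /1 or /2 to the right one. The class
name PairedIndex is made up for this example and is not part of Biopython.

```python
class PairedIndex:
    """Look up 'ID/1' or 'ID/2' across two dict-like indexes (hypothetical)."""

    def __init__(self, r1_index, r2_index):
        self._by_mate = {"1": r1_index, "2": r2_index}

    def _split(self, key):
        # "read7/2" -> (index for mate 2, "read7"); anything else is rejected.
        read_id, sep, mate = key.rpartition("/")
        if not sep or mate not in self._by_mate:
            raise KeyError(key)
        return self._by_mate[mate], read_id

    def __getitem__(self, key):
        index, read_id = self._split(key)
        return index[read_id]

    def __contains__(self, key):
        try:
            index, read_id = self._split(key)
        except KeyError:
            return False
        return read_id in index
```

For example, PairedIndex(r1_index, r2_index)["some_read_id/2"] would fetch
that read from the R2 file.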

Peter


On Wed, Oct 19, 2016 at 12:03 PM, Liam Thompson <dejmail at gmail.com> wrote:
> Hi everyone
>
> I'm attempting to demultiplex PE reads from an Illumina run (2 files x
> 3.5gb).
>
> I thought of creating an index_db containing both R1 and R2 reads as I need
> to pull out each pair R1 and R2 read, identify the primer+barcode sequence
> in the read sequence, and put the sequence in its designated file.
>
> My problem is that reading the files into index_db creates a problem with
> duplicate keys as the ID does not seem to include the 1 or 2 strand
> designation as found in the header (perhaps it is not strictly part of the
> header), and as the callback function only contains the ID, I can't access
> the other fields one would normally be able to with SeqRecord.
>
> index_list = SeqIO.index_db(idx_name,
> ["sorted_5000_R1.fq","sorted_5000_R2.fq"], 'fastq', generic_dna, get_record)
>
>
>
> Is it best then to just create two separate indices using SeqIO.index and
> pull out the sequences from there ? I would prefer to not have to load both
> indices into memory, though perhaps it is not as big as I think it might be.
>
> Any suggestions ?
>
> Thanks
> Liam
>
> Gothenburg, Sweden

