[Biopython] index_db or two separate indices
Liam Thompson
dejmail at gmail.com
Wed Oct 19 11:03:28 UTC 2016
Hi everyone
I'm attempting to demultiplex paired-end (PE) reads from an Illumina run
(2 files x 3.5 GB).
I thought of creating an index_db containing both the R1 and R2 reads, as I
need to pull out each R1/R2 read pair, identify the primer+barcode
sequence in the read sequence, and write the sequence to its designated file.
My problem is that reading both files into index_db produces duplicate
keys, as the ID does not seem to include the 1 or 2 read designation
found in the header (perhaps it is not strictly part of the header).
As the callback function only receives the ID, I can't access the other
fields one would normally be able to with a SeqRecord.
from Bio import SeqIO
from Bio.Alphabet import generic_dna
index_list = SeqIO.index_db(idx_name,
    ["sorted_5000_R1.fq", "sorted_5000_R2.fq"], 'fastq', generic_dna, get_record)
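
To make the duplicate-key problem concrete, here is a minimal pure-Python
sketch of how an Illumina-style header splits into an ID and a description.
The read name is made up, and split_fastq_header is a hypothetical helper,
not a Biopython function. It only illustrates that both mates share the text
before the first space, which is the part used as the ID.

```python
def split_fastq_header(header_line):
    """Split a FASTQ title line into (read ID, remaining description)."""
    # Drop the trailing newline and the leading '@' marker.
    title = header_line.rstrip("\r\n")[1:]
    # The ID is everything up to the first space; the rest is the description.
    read_id, _, description = title.partition(" ")
    return read_id, description


# Hypothetical headers for the two mates of one read pair.
r1_header = "@M01234:55:000000000-ABCDE:1:1101:15589:1332 1:N:0:1"
r2_header = "@M01234:55:000000000-ABCDE:1:1101:15589:1332 2:N:0:1"

r1_id, r1_desc = split_fastq_header(r1_header)
r2_id, r2_desc = split_fastq_header(r2_header)

# Both mates yield the same ID, so they collide when used as index keys.
same_key = r1_id == r2_id
# The read number (1 or 2) only appears in the description part.
read_numbers = (r1_desc.split(":")[0], r2_desc.split(":")[0])
```

Under these assumptions the read number is only visible in the description,
which a callback that is handed just the ID never sees.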
Is it best then to just create two separate indices using SeqIO.index
and pull out the sequences from there? I would prefer not to have to
load both indices into memory, though perhaps they are not as big as I
think they might be.
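
As a rough way to reason about the memory question, below is a sketch of the
general idea behind an offset index: one dictionary per file that maps each
read ID to the byte offset of its record, so only IDs and integers are held
in memory and the records stay on disk. This is not Biopython's
implementation. build_offset_index and fetch_record are hypothetical
helpers, the two tiny files are made up, and the code assumes strict
four-line FASTQ records.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each read ID to the byte offset of its 4-line FASTQ record."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break
            # ID = first whitespace-delimited token after the leading '@'.
            read_id = header[1:].split(None, 1)[0].decode("ascii")
            index[read_id] = offset
            # Skip the sequence, '+' separator, and quality lines.
            for _ in range(3):
                handle.readline()
    return index


def fetch_record(path, index, read_id):
    """Seek to a stored offset and return the record's four lines."""
    with open(path, "rb") as handle:
        handle.seek(index[read_id])
        return [handle.readline().decode("ascii").rstrip() for _ in range(4)]


# Tiny made-up paired files standing in for the real R1/R2 files.
tmpdir = tempfile.mkdtemp()
r1_path = os.path.join(tmpdir, "R1.fq")
r2_path = os.path.join(tmpdir, "R2.fq")
with open(r1_path, "w", newline="\n") as out:
    out.write("@read1 1:N:0:1\nACGT\n+\nIIII\n@read2 1:N:0:1\nGGCC\n+\nIIII\n")
with open(r2_path, "w", newline="\n") as out:
    out.write("@read1 2:N:0:1\nTTAA\n+\nIIII\n@read2 2:N:0:1\nCCGG\n+\nIIII\n")

# One index per file, so the same ID can be a key in both without clashing.
r1_index = build_offset_index(r1_path)
r2_index = build_offset_index(r2_path)

# Pull out both mates of one pair by the shared ID.
pair = (fetch_record(r1_path, r1_index, "read2"),
        fetch_record(r2_path, r2_index, "read2"))
```

Keeping one index per file sidesteps the duplicate keys, since the file
itself tells you whether a record is R1 or R2.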
Any suggestions?
Thanks
Liam
Gothenburg, Sweden