[Biopython-dev] Getting raw unparsed records with SeqIO?

Peter biopython at maubp.freeserve.co.uk
Tue Feb 2 18:37:25 UTC 2010


Hi all,

Over on enhancement Bug 3000, Martin was asking about
getting raw unparsed strings for each record in a sequence file:
http://bugzilla.open-bio.org/show_bug.cgi?id=3000

This makes sense for sequential files like FASTA and GenBank,
but not for interlaced files like PHYLIP, and has less obvious
uses when there is any kind of header or footer (e.g. XML or
SFF files).

The particular example Martin gave was selecting a subset of
records in a large GenBank file (I've done this myself in the past).
While this can be done via Bio.SeqIO, the process of parsing
the data into a SeqRecord and saving it again is lossy. While
there is room for improvement. For this particular example, I
suggested Martin use the "old" iterator class in Bio.GenBank.

In general things like white space and wrapping mean that a
SeqIO parse/write cannot guarantee a 100% unaltered round
trip, and will also be slower than using the raw record as a string.

Martin suggested adding an optional argument to the parse
function. I'm not sure this is a good API choice, as it would
dramatically alter the return values. Perhaps we could have
a new iterator function in Bio.SeqIO for suitable sequential
files only which returns a series of strings, one for each
record, unmodified?

Either way I don't see how this would be used - surely
the user would need to do some basic analysis of each
raw record to decide how to process it? In this example,
they would need to extract the ID/accession to see if they
want to output the record or not. While parsing the record
into a SeqRecord may not be needed, in most cases the
record identifier would be very useful - and this has some
big overlaps with the Bio.SeqIO.index() code which already
breaks up files into records and extracts their identifiers.

i.e. A top level Bio.SeqIO function to iterate over a file
returning tuples of the record identifier and the raw
record as strings *could* be useful. Implementing this
nicely would mean re-factoring Bio.SeqIO.index()
extensively.

Another solution to this task (extracting the raw GenBank
records from a large file) would seem to be to extend the
Bio.SeqIO.index functionality. The patch I'm about to
attach to Bug 3000 adds a new "get_raw" method to the
dictionary like object we return. Unlike the __getitem__
and get methods which return a SeqRecord this just gives
the raw string.

Note that I haven't implemented this for all the index
support file formats yet, and this has had only very basic
testing. Writing this email took longer than writing the
code. However, I hope it illustrates the idea enough for a
discussion. As an example how the index function could
be used with this patch:

>>> from Bio import SeqIO
>>> data = SeqIO.index("cor6_6.gb", "gb")
>>> data.keys()
['L31939.1', 'AJ237582.1', 'X62281.1', 'AF297471.1', 'X55053.1', 'M81224.1']
>>> print data.get_raw("X62281.1")
LOCUS       ATKIN2        880 bp    DNA             PLN       23-JUL-1992
DEFINITION  A.thaliana kin2 gene.
ACCESSION   X62281
...
//

What are people's thoughts on this?

Peter



More information about the Biopython-dev mailing list