[Biopython-dev] Bio.IntelliGenetics

Peter biopython at maubp.freeserve.co.uk
Wed Jul 2 13:48:31 UTC 2008


On Wed, Jul 2, 2008 at 2:30 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Bio.IntelliGenetics contains a parser for sequence data in the IntelliGenetics format.

Just to be upfront, I'm not familiar with this format, but I've had a
look at the examples.

> In this format, each sequence has a name and comments, and in addition there can
> also be an overall comment to the file.

OK.  This is also the case in other file formats.  For example, GenBank
files can have a free-format text header at the start, but we ignore
it.

How would you separate the file header comment from the first record
comment?  Some files include what looks like a file header, but the
lines all seem to start with "; ", just like the record comments.
Maybe look for "; LOCUS..."?  Given that the whole comment seems to be
free format, I don't think this is very nice.

On the other hand, some of the sample inputs include a number of
lines starting ";; Modified by ..." which would be easy to separate
(one semicolon versus two semicolons).  These are clearly file-level
header lines, rather than being part of the first record.
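
As a rough sketch of that two-semicolon idea (plain Python, not existing
Bio.IntelliGenetics code; the function name split_file_header and the
sample lines are made up for illustration):

```python
def split_file_header(lines):
    """Split leading ';;' lines (assumed file header) from the rest.

    Hypothetical helper: it assumes file-level header lines always come
    first and always start with two semicolons.
    """
    header = []
    rest = list(lines)
    while rest and rest[0].startswith(";;"):
        # Drop the two semicolons and surrounding whitespace.
        header.append(rest.pop(0)[2:].strip())
    return header, rest


# Invented sample input, not taken from the real test files.
sample = [
    ";; Modified by a hypothetical tool",
    "; comment for the first record",
    "SEQ1",
    "ACGT1",
]
file_header, remaining = split_file_header(sample)
```

This only helps for files that actually use ";;" for the file header; a
header written with single semicolons would still end up with the first
record.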

> Currently the parser in Bio.IntelliGenetics stores this information in
> Bio.IntelliGenetics.Record.Record objects (one record per sequence; the
> overall comment is inadvertently added to the first sequence in the file). I
> think it makes more sense to use a SeqRecord for that, and to deprecate
> Bio.IntelliGenetics.Record.Record.

If all the data extracted by the Bio.IntelliGenetics parser could be
handled by a SeqRecord-based parser added to Bio.SeqIO, then yes,
deprecating Bio.IntelliGenetics sounds fine.

> In that case, Bio.SeqIO looks like a more suitable place for this parser.
> The user would see something like this:
> >>> from Bio import SeqIO
> >>> handle = open("mydatafile.txt")
> >>> records = SeqIO.parse(handle, "ig")
> >>> records.comment
> "This is the overall comment"
> >>> for record in records:
> # ... record is a SeqRecord.

I see you are using "ig" as the format name, matching EMBOSS.  Good :)
http://emboss.sourceforge.net/docs/themes/seqformats/ig

> Because of the overall comment, SeqIO.parse cannot simply return a
> generator function. It must return a full-fledged class, but one with an iterator.

Not necessarily.  We can still use a simple generator function and
either throw away the header comment, or include it with the first
record (or even with every record).  If you did create an iterator
class, would you make the header available as a property of the
iterator?

Given the apparently fuzzy boundary between the file header and the
first record header, I would just opt to treat it all as a comment for
the first record.  And use a simple generator function.
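
Something along these lines is what I have in mind.  This is only a
sketch in plain Python, not the actual parser: the name ig_records is
hypothetical, it yields plain tuples rather than SeqRecord objects, and
it assumes each record is a run of ";" comment lines, then a title line,
then sequence lines ending in a "1" or "2" terminator.

```python
import io


def ig_records(handle):
    """Yield (title, comment, sequence) tuples, one per record.

    Every leading ';' line, including any ';;' file header lines, is
    treated as part of the comment of the record that follows it.
    """
    line = handle.readline()
    while line:
        if not line.strip():
            # Skip blank lines between records.
            line = handle.readline()
            continue
        comment_lines = []
        while line.startswith(";"):
            comment_lines.append(line.lstrip(";").strip())
            line = handle.readline()
        title = line.strip()
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith(";"):
            # Remove trailing whitespace and any internal spaces.
            seq_parts.append(line.strip().replace(" ", ""))
            line = handle.readline()
        sequence = "".join(seq_parts)
        if sequence[-1:] in ("1", "2"):
            # Drop the assumed end-of-sequence terminator.
            sequence = sequence[:-1]
        yield title, "\n".join(comment_lines), sequence


# Invented example data, not one of the real sample files.
data = (
    ";; Modified by someone\n"
    "; first comment\n"
    "SEQ1\n"
    "ACGT\n"
    "TTGA1\n"
    "; second comment\n"
    "SEQ2\n"
    "GGCC1\n"
)
records = list(ig_records(io.StringIO(data)))
```

Here the ";; Modified by someone" line simply becomes the first line of
the first record's comment, so nothing is lost and no iterator class is
needed.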

Peter


