[Biopython-dev] Bio.SeqIO & Bio.SwissProt; comment lines

Sun Jun 7 12:11:06 UTC 2009

On Sun, Jun 7, 2009 at 12:38 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> Hi everybody,
>
> Comments in SwissProt files such as the following: ...
> are currently being stored differently by Bio.SeqIO and Bio.SwissProt.
>
> Bio.SeqIO stores the comments as one string, as follows: ...
> Note that two endlines appear at the end of each line; I don't know why.

The double newlines sound like a bug to me; we should fix that.

> Bio.SwissProt, on the other hand, stores a list of comments (with
> single newlines): ...

That's just a list containing one string in your example.

> I think that the approach used by Bio.SwissProt is more reasonable,
> although I'd prefer to remove the newlines and to skip the copyright
> statement altogether (since it's the same for all SwissProt records
> anyway).

In the long term, it looks like the new SwissProt comments are
structured in a way that would allow automatic parsing to extract the
data.

> Can we do the same for Bio.SeqIO? Or is there a need to keep
> record.annotations['comments'] as a single string? If they are
> kept as a single string, how about using a single newline between
> comments, and no newlines within comments?

I think there are reasons to keep record.annotations['comments'] as a
single string. The GenBank SeqRecord parser (called from Bio.SeqIO)
also uses a single string for comments (not a list of strings), so the
old SwissProt SeqRecord parser (and thus Bio.SeqIO) is consistent with
that. I'd also have to check if switching to a list of strings would
be OK with the BioSQL code. Finally, such a change would not be
backwards compatible and could break existing scripts.
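
To illustrate the compatibility worry, here is a minimal sketch in
plain Python (no Biopython involved). The function name and the sample
comment text are made up for illustration; the only thing taken from
this thread is the 'comments' annotation key. A script that treats the
value as a string would fail if it became a list:

```python
def count_comment_lines(annotations):
    """Hypothetical user code that assumes the comments are one string."""
    return len(annotations["comments"].split("\n"))


# Made-up annotation dictionaries, standing in for record.annotations
old_style = {"comments": "FUNCTION: some text.\nSIMILARITY: more text."}
new_style = {"comments": ["FUNCTION: some text.", "SIMILARITY: more text."]}

print(count_comment_lines(old_style))

try:
    count_comment_lines(new_style)
except AttributeError as err:
    # A list has no split() method, so the string-based code breaks
    print("Existing script would break:", err)
```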

> This btw is the last inconsistency between Bio.SeqIO and
> Bio.SwissProt. By making this consistent, Bio.SeqIO could
> use Bio.SwissProt as a backend, which is about three times
> faster than the current parser, and has the added benefit
> of having to maintain only one SwissProt parser.

Three times faster sounds very good - assuming it passes all our
existing unit tests, of course ;)

We don't actually need to change the way comments are stored in the
SeqRecord for this parser. As I understand it, your plan is to build a
new Bio.SeqIO SwissProt parser on top of the new record-based
Bio.SwissProt parser, converting the SwissProt records into SeqRecord
objects. At that step, simply concatenate the list of comment strings
into one string for the SeqRecord.
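
As a rough sketch of that concatenation step, something like the
following might do. The helper name join_comments is hypothetical (it
is not an existing Biopython function), and the choice of a single
newline between comments with whitespace collapsed inside each comment
is an assumption, borrowed from the suggestion quoted above rather
than from what the current parser does:

```python
def join_comments(comments):
    """Collapse a list of comment strings into a single string.

    Hypothetical helper, not part of Biopython. Assumes one newline
    between comments and no newlines within a comment.
    """
    # split() with no arguments breaks on any run of whitespace,
    # including newlines, so rejoining with " " unwraps each comment
    cleaned = [" ".join(comment.split()) for comment in comments]
    return "\n".join(cleaned)


# Made-up comment list, standing in for what the record parser returns
comments = ["FUNCTION: some text\nwrapped over two lines.",
            "SIMILARITY: more text."]
print(join_comments(comments))
```

To reproduce the existing Bio.SeqIO output exactly, the separator
would presumably have to match whatever the old parser emits, so the
unit tests should decide the details.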

Then we can use the new faster Bio.SwissProt parser within Bio.SeqIO,
without breaking backwards compatibility, and deprecate the old
Bio.SwissProt.SProt parser :)

Peter