[BioPython] Uniprot Parser

Ruchira Datta ruchira.datta at gmail.com
Sun Feb 24 16:28:33 UTC 2008


On Sun, Feb 24, 2008 at 5:06 AM, Peter <biopython at maubp.freeserve.co.uk>
wrote:

> On Sat, Feb 23, 2008 at 10:44 PM, Ruchira Datta <ruchira.datta at gmail.com>
> wrote:
> > I've been using Bio.SwissProt.SProt to parse this file.  The only glitch
> >  that came up so far is that when some fields span multiple lines (e.g., OS,
> >  the species field), SProt puts a newline in the field.  This is not
> >  correct--it should be just a blank space.  However, this can easily be
> >  corrected within SProt itself without requiring a forked parser.
>
> I'm guessing you are using the parser to return Record objects, which
> are a fairly simple direct mapping of the raw file format - and I can
> understand why the newlines were included.  If you use the parser to
> get SeqRecord objects (which are generic and not tied to the
> SwissProt/UniProt format), then the newlines are removed.
>

Hi Peter,

I had tried SeqRecord first, but it didn't include the references, which I
absolutely need.

While inclusion of newlines may be understandable, it's a bug.  The newline
is stripped from several other fields by _RecordConsumer, e.g.,

    def reference_number(self, line):
        rn = line[5:].rstrip()
        ...

and it needs to be stripped from this one, instead of

    def organism_species(self, line):
        self.data.organism += line[5:]

The newlines are never significant in any field.
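
A minimal sketch of what such a fix might look like, assuming continuation
lines should be joined with a single blank.  The Record and
RecordConsumerSketch classes are hypothetical stand-ins, not the actual
Bio.SwissProt.SProt code, and the two OS lines are made-up input for
illustration only:

```python
class Record:
    """Hypothetical stand-in for the parsed record; only one field is modelled."""

    def __init__(self):
        self.organism = ""


class RecordConsumerSketch:
    """Hypothetical, pared-down consumer; not the real _RecordConsumer."""

    def __init__(self):
        self.data = Record()

    def organism_species(self, line):
        # Skip the line code and padding (first five columns), then strip
        # the trailing newline, as reference_number already does.
        text = line[5:].rstrip()
        if self.data.organism:
            # Continuation line: join with one blank instead of a newline.
            self.data.organism += " " + text
        else:
            self.data.organism = text


# Made-up OS field spanning two lines.
consumer = RecordConsumerSketch()
for os_line in ("OS   Escherichia coli (strain K12 /\n",
                "OS   MC4100).\n"):
    consumer.organism_species(os_line)

print(consumer.data.organism)
```

With this change the organism field comes out as one line with no embedded
newline.  Whether a blank is the right separator for every multi-line field
is an assumption that would need checking against the format specification.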

In a couple of weeks I might be able to check out the CVS
version and provide a patch.

--Ruchira

>
> >  At least two other parsers for this file have been written by people in my
> >  group, but I have pushed and implemented standardization on the BioPython
> >  one.  Part of the point of BioPython is to have one central repository for
> >  development and maintenance of things like this, so that hundreds of people
> >  don't have to spend their time reinventing the wheel.  It is much preferable
> >  that people contribute changes rather than creating a forked version.
> >
> >  --Ruchira
>


