[Biopython-dev] Location Parser

Matthias Bernt MatatTHC at gmx.de
Fri Dec 21 15:18:45 UTC 2012


Dear Peter,

You are right: the current RefSeq record is valid and can be parsed. In
order to reproduce old results I keep old RefSeq versions (of mitochondrial
genomes) on my hard disk. So this is probably an old RefSeq bug. According to
the documentation (
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.4):


"""
Note: location operator "complement" can be used in combination with either
"join" or "order" within the same location; combinations of "join" and "order"
within the same location (nested operators) are illegal.
"""
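
The rule quoted above can be checked without any heavy regular expression. What follows is only a minimal sketch, not Biopython code: the function name `has_illegal_nesting` is made up for illustration, and it assumes that "nested" means a "join" or "order" operator appearing anywhere inside another "join" or "order".

```python
import re

# Matches an opening parenthesis (optionally preceded by an operator name)
# or a closing parenthesis. Hypothetical helper, not part of Biopython.
_PAREN = re.compile(r"(join|order|complement)?\(|\)")


def has_illegal_nesting(location):
    """Return True if join/order occurs inside another join/order."""
    stack = []  # operators (or None for a bare parenthesis) currently open
    for match in _PAREN.finditer(location):
        if match.group(0) == ")":
            if stack:
                stack.pop()
            continue
        operator = match.group(1)
        if operator in ("join", "order") and any(
            outer in ("join", "order") for outer in stack
        ):
            return True
        stack.append(operator)
    return False
```

Under that assumption, "complement(join(...))" passes, while an "order(...)" containing a "join(...)" is flagged.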

Since this was urgent I fixed the files manually by removing the nested
operators. I was not able to find a file in other RefSeq versions that
reproduces the bug (i.e. the parser seemingly takes forever [>5 min] and does
not raise an exception). You may still be able to reproduce the bug by pasting
the location line into another GenBank file.

I agree that the desired behaviour would be to issue a warning and skip the
feature.
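
To illustrate the warn-and-skip behaviour, here is a rough sketch in which the cheap substring test runs before any regular expression is tried. It is not the actual parser code: `parse_compound_location` and `_SIMPLE_COMPOUND` are hypothetical names, and the pattern only handles plain "start..end" parts.

```python
import re
import warnings

# Simplified stand-in for the parser's real patterns (an assumption):
# it only accepts join(...) or order(...) with plain start..end parts.
_SIMPLE_COMPOUND = re.compile(r"^(join|order)\(\d+\.\.\d+(?:,\d+\.\.\d+)*\)$")


def parse_compound_location(location_line):
    """Return (operator, [(start, end), ...]) or None if the location is skipped."""
    # Cheap check first, so mixed join/order never reaches the regex.
    if "order" in location_line and "join" in location_line:
        warnings.warn("Skipping location mixing join and order: %s" % location_line)
        return None
    match = _SIMPLE_COMPOUND.match(location_line)
    if match is None:
        warnings.warn("Skipping unsupported location: %s" % location_line)
        return None
    operator = match.group(1)
    inner = location_line[len(operator) + 1:-1]  # text between the parentheses
    parts = [tuple(int(x) for x in part.split("..")) for part in inner.split(",")]
    return operator, parts
```

A caller would then drop any feature for which this returns None instead of waiting on the regular expression.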

Regards,
Matthias




2012/12/21 Peter Cock <p.j.a.cock at googlemail.com>

> On Tue, Dec 18, 2012 at 12:40 PM, Matthias Bernt <MatatTHC at gmx.de> wrote:
> > Dear list,
> >
> > I have some problems with the GenBank parser in version 1.60. It's again
> > nested location strings like:
> >
> >
> > order(6867..6872,6882..6890,6906..6911,6918..6923,6930..6932,7002..7004,7047..7049,7056..7061,7068..7073,7077..7082,7086..7091,7098..7100,7119..7136,7146..7151,7158..7163,7170..7172,7179..7184,7212..7214,join(7224..7229,8194..8208),8218..8223,8245..8247,8401..8403)
> > as found in NC_003048.
>
> Do you have a URL for that? This looks OK to me:
> http://www.ncbi.nlm.nih.gov/nuccore/NC_003048.1
>
> Perhaps the entry came from the FTP site?
> e.g. one of these files?: ftp://ftp.ncbi.nih.gov/refseq/release/fungi/
>
> > What happens is that the parser stalls. It seems as if it takes forever to
> > match _re_complex_compound and never gets to the if statement that
> > checks whether order and join both appear in the location string.
> >
> > I suggest moving the if statement before the regular expressions are
> > tested.
> >
> > I remember that I posted something like this before, but I cannot remember
> > whether or how this was solved.
> >
> > Regards,
> > Matthias
>
> Where similar odd locations have come up, in some cases they did
> seem to be NCBI bugs - could you raise a query with the NCBI
> for this case please?
>
> If this is valid (which I doubt), then our object model doesn't cope.
>
> If this is invalid, then Biopython should give a warning and skip
> this location. Right now I can't find the file to test this (see
> query above about where it came from).
>
> Regards,
>
> Peter
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NC_001326.gb
Type: application/octet-stream
Size: 65527 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20121221/911b2ce3/attachment-0002.obj>