[Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Wed May 5 16:22:09 UTC 2010


http://bugzilla.open-bio.org/show_bug.cgi?id=3069





------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk  2010-05-05 12:22 EST -------
(In reply to comment #8)
> (In reply to comment #7)
> 
> > However, if the only out-of-specification thing in the IMGT EMBL files is
> > the feature indentation and long feature keys, maybe your original request
> > to make the EMBL parser more tolerant is the best route.
> 
> I think it will actually be a headache to do so.  Unless you want to rewrite
> the EMBL parser the way that I wrote the IMGT parser.  The only thing that
> needed changing was handling the header lines.  Once it finds an FH line, it
> uses the position of the "Location..." string to determine how indented the
> qualifiers are.


Hi Uri,

Could you retest as "embl" format with the trunk? I would expect some warnings
from these over-indented features in IMGT, and we can certainly remove the
warning if we decide not to introduce a separate IMGT format variant.

http://github.com/biopython/biopython/commit/e6ba962dd60fe585baa1237445d33f67d47dd57f

This change takes a slightly different approach from your work on GitHub, and
is quite similar to your two-line patch, but it should also still work with
another odd form:

FH   Key             Location/Qualifiers
FT   L-V-D-J-C-SEQUEN1..1151
FT                   /db_xref="taxon:32630"
FT                   /organism="synthetic construct"
FT   5'UTR           1..37
...

In the above example (generated by Biopython itself), the strict EMBL column
limits have been obeyed, but the feature key has been truncated to just
L-V-D-J-C-SEQUEN rather than L-V-D-J-C-SEQUENCE, and it runs straight into
the location with no separating space. This raises a related question: when
asked to output such a feature in EMBL or GenBank format, should we raise an
exception? We could add a warning instead, and either leave the code as is,
or output this:

FH   Key             Location/Qualifiers
FT   L-V-D-J-C-SEQUE 1..1151
FT                   /db_xref="taxon:32630"
FT                   /organism="synthetic construct"
FT   5'UTR           1..37
...
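
To make the two options concrete, here is a minimal sketch of how a writer
could handle an over-long key. It is not the code in the commit above: the
function name format_feature_key_line and its strict flag are made up for
illustration, and it assumes the standard EMBL layout of a five-character
"FT   " prefix with locations starting at column 22.

```python
import warnings

HEADER = "FT   "       # EMBL feature table line prefix
QUALIFIER_INDENT = 21  # number of characters before the location/qualifiers


def format_feature_key_line(key, location, strict=False):
    """Return an EMBL 'FT' line (hypothetical helper, for illustration only).

    Keys too long for the key column are truncated with a warning, or
    rejected with ValueError when strict is True.
    """
    # Leave one column free so the key never touches the location.
    max_key = QUALIFIER_INDENT - len(HEADER) - 1  # 15 characters
    if len(key) > max_key:
        message = "Feature key %r exceeds %i characters" % (key, max_key)
        if strict:
            raise ValueError(message)
        warnings.warn(message + ", truncating")
        key = key[:max_key]
    return HEADER + key.ljust(max_key + 1) + location
```

With strict left as False, the key L-V-D-J-C-SEQUENCE comes out in the second
form shown above (truncated to 15 characters plus a space) along with a
warning; with strict=True the caller gets an exception instead.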

> > Thinking ahead would you also want to be able to write out IMGT variant
> > EMBL files?
> 
> I personally don't need this functionality, but I am willing to write it to
> complement the IMGT parser that I wrote.

If we go down the route of formalising IMGT as an EMBL variant with a different
feature indent, it should just be a trivial subclass of the existing EMBL
writer object but with the indentation constant changed.
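
As a rough sketch of what I mean, using toy classes rather than the real
writer (the names ToyEmblWriter and ToyImgtWriter are invented here, and the
IMGT indent of 25 is an assumption that would need checking against real IMGT
files):

```python
class ToyEmblWriter:
    """Hypothetical stand-in for an EMBL feature table writer."""

    HEADER = "FT   "
    QUALIFIER_INDENT = 21  # standard EMBL offset for locations/qualifiers

    def feature_lines(self, key, location, qualifiers):
        """Return the FT lines for one feature as a list of strings."""
        pad = self.QUALIFIER_INDENT - len(self.HEADER)
        lines = [self.HEADER + key.ljust(pad) + location]
        prefix = self.HEADER + " " * pad
        for name, value in qualifiers:
            lines.append('%s/%s="%s"' % (prefix, name, value))
        return lines


class ToyImgtWriter(ToyEmblWriter):
    """IMGT variant: identical logic, only the indent constant differs."""

    QUALIFIER_INDENT = 25  # assumed IMGT indent, not verified here
```

Because every offset is derived from the class constant, the subclass needs
no method overrides at all.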

Note there are other problems in the IMGT data, including locations like
"1..428>" and "<1..328>" where the greater-than sign should come BEFORE the
end position, i.e. "1..>428" and "<1..>328" (but we could probably cope with
this all the same), and just "1." where half the location is missing (which
we can't really do much with, other than perhaps treat it as simply "1").
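
One way to cope would be to tidy such location strings before handing them to
the normal location parser. The following is only a sketch: the function
normalise_location is hypothetical, it handles just the simple "start..end"
and "start." cases above, and it passes anything else (join(...),
complement(...), etc.) through unchanged.

```python
import re

# "start..end" with an optional "<" before the start, and an optional ">"
# either before the end (as specified) or after it (as seen in IMGT).
_RANGE = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)(>?)$")
# A start position followed by a single stray dot, e.g. "1."
_HALF = re.compile(r"^(<?\d+)\.$")


def normalise_location(text):
    """Return a cleaned-up copy of a simple location string (hypothetical)."""
    text = text.strip()
    match = _RANGE.match(text)
    if match:
        lt, start, gt_before, end, gt_after = match.groups()
        # Move any trailing ">" to its proper place before the end position.
        gt = ">" if (gt_before or gt_after) else ""
        return "%s%s..%s%s" % (lt, start, gt, end)
    match = _HALF.match(text)
    if match:
        # Half the location is missing; fall back to the single position.
        return match.group(1)
    return text
```

Whether silently treating "1." as "1" is acceptable, or whether it should at
least trigger a warning, is a separate decision.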

Peter

