[Biopython-dev] [Bug 3062] GenBank/EMBL parser breaks on over-indented features

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Tue Apr 27 09:43:14 UTC 2010


http://bugzilla.open-bio.org/show_bug.cgi?id=3062


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
            Summary|GenBank/EMBL parser breaks  |GenBank/EMBL parser breaks
                   |when features have no       |on over-indented features
                   |qualifiers                  |




------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2010-04-27 05:43 EST -------
(In reply to comment #4)
> I did something stupid, and uploaded the wrong IMGT record.  I will upload the
> actual offending record.  However, after stepping through the code with pdb,
> it appears that the problem with the offending record is that the feature
> qualifiers are indented too far, so that the whitespace is not fully stripped
> off.

Thanks for checking and working out what was wrong. Yes, this file does indeed
break.

> Has it ever been considered to parse the features by breaking the line with
> split(), instead of hardcoding the number of columns?  While the official EMBL
> specification may hardcode the size of the fields, the parse may be more
> robust to such errors.  (Though I understand the desire to conform exactly to
> EMBL standards).  Either way, I will notify the curators of the IMGT database.

Please do contact the IMGT curators.
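
On the split() suggestion, a minimal sketch of why column positions still
carry information in the feature table. The helper names and sample lines are
hypothetical (not Biopython code), and the layout assumed is the usual EMBL
one: feature key in columns 6-20, location or qualifier from column 22.

```python
def fields_by_split(line):
    """Hypothetical helper: drop the leading 'FT' tag, split on whitespace."""
    return line.split()[1:]


def has_feature_key(line):
    """Column-based test: is anything present in the key field (cols 6-20)?"""
    return line[5:20].strip() != ""


# A line that starts a new feature: key in column 6, location in column 22.
key_line = "FT   " + "CDS".ljust(16) + "1..100"
# A qualifier line: no key, qualifier text starting in column 22.
qualifier_line = "FT" + " " * 19 + '/note="two words"'

# split() handles the key line well...
print(fields_by_split(key_line))
# ...but on the qualifier line it cuts the quoted value at the space, and the
# first field could be mistaken for a feature key.
print(fields_by_split(qualifier_line))

# The column test tells the two kinds of line apart.
print(has_feature_key(key_line), has_feature_key(qualifier_line))
```

With split() alone, the difference between a key line and a qualifier line
has to be guessed from the content rather than read from the position.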

(In reply to comment #6)
> Alternatively, an additional lstrip() call for each line in lines in
> parse_feature() would probably also solve the problem.  What are reasons not
> to do this?

Trying to parse out-of-spec files is a potential nightmare. We do try to be
tolerant of "quirks" in official NCBI or EMBL files (which are occasionally
technically invalid), as long as such corrections look easy and unambiguous.

In this particular case, we can cope with the extra indentation as you suggest
by stripping any leading white space.
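
A minimal sketch of the idea, not the actual change in the commit below. The
function name, the `tolerant` flag and the column offset are assumptions made
for illustration; the offset assumes EMBL qualifier text starts in column 22.

```python
QUALIFIER_COLUMN = 21  # assumed 0-based offset of column 22


def qualifier_text(line, tolerant=True):
    """Return the qualifier part of a feature table line.

    Hypothetical helper: slice at the fixed column, then optionally strip
    any extra leading white space left by an over-indented line.
    """
    text = line[QUALIFIER_COLUMN:].rstrip()
    if tolerant:
        text = text.lstrip()
    return text


# In-spec line: qualifier starts exactly at column 22.
in_spec = "FT" + " " * 19 + '/gene="abc"'
# Over-indented line: two extra spaces before the qualifier.
over_indented = "FT" + " " * 21 + '/gene="abc"'

# Without the extra strip the over-indented text no longer starts with "/".
print(repr(qualifier_text(over_indented, tolerant=False)))
# With it, both lines give the same qualifier text.
print(repr(qualifier_text(in_spec)))
print(repr(qualifier_text(over_indented)))
```

The extra strip is harmless for in-spec lines, which is what makes this
particular correction easy and unambiguous.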

Fixed in the repository:
http://github.com/biopython/biopython/commit/73caa4072898e7d5a71d38138c9e053066f11b24

Thank you Uri,

Peter

