[Biopython] SeqIO.parse for imgt

Peter Cock p.j.a.cock at googlemail.com
Fri Nov 4 16:17:31 UTC 2016


Hello Chang,

It looks like the IMGT file format has changed slightly, and someone
may need to modify the parser code to cope with this.

As you said, the 3.16.0 file parses fine with the current version of Biopython:

$ curl -L -O https://github.com/ANHIG/IMGTHLA/raw/3160/hla.dat

$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> for r in SeqIO.parse("hla.dat", "imgt"): print(r.id)
...
HLA00001
HLA02169
HLA01244

...
HLA02801
HLA02802
HLA02803

I can confirm the latest file is a problem:

$ curl -L -O https://github.com/ANHIG/IMGTHLA/raw/Latest/hla.dat

$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> for r in SeqIO.parse("hla.dat", "imgt"): print(r.id)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/Bio/SeqIO/__init__.py", line 600, in parse
    for r in i:
  File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 479, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 463, in parse
    if self.feed(handle, consumer, do_features):
  File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 430, in feed
    self._feed_first_line(consumer, self.line)
  File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 633, in _feed_first_line
    raise ValueError('Did not recognise the ID line layout:\n' + line)
ValueError: Did not recognise the ID line layout:
ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP.

Technically, the Biopython changes are most likely to be needed in
Bio/GenBank/Scanner.py, in class _ImgtScanner, although if
recent EMBL format files have changed in the same way we may
only need to update class EmblScanner. Specifically, I would think
the EMBL method _feed_first_line needs updating, or a new
IMGT-specific _feed_first_line needs defining.
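
As a rough illustration of what a more tolerant ID line parser might
look like, here is a standalone sketch that accepts both layouts quoted
in this thread. This is not the actual Scanner.py code: the function
name parse_imgt_id_line and the dictionary keys it returns are made up
for this example, and it assumes the only difference is the extra
"; SV n;" field after the accession.

```python
def parse_imgt_id_line(line):
    """Split an IMGT/HLA ID line into its fields (hypothetical helper).

    Handles the two layouts seen in this thread:
      ID   HLA00001   standard; DNA; HUM; 3503 BP.
      ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
    """
    if not line.startswith("ID   "):
        raise ValueError("Not an ID line:\n" + line)
    # Drop the "ID   " prefix and the trailing full stop, then split on ";"
    body = line[5:].rstrip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]

    if len(fields) == 6 and fields[1].startswith("SV "):
        # Newer layout: accession and sequence version are separate fields
        name = fields[0]
        version = fields[1][3:].strip()
        rest = fields[2:]
    elif len(fields) == 4:
        # Older layout: accession and data class share the first field,
        # separated only by whitespace, and there is no version
        parts = fields[0].split(None, 1)
        if len(parts) != 2:
            raise ValueError("Did not recognise the ID line layout:\n" + line)
        name = parts[0]
        version = None
        rest = [parts[1]] + fields[1:]
    else:
        raise ValueError("Did not recognise the ID line layout:\n" + line)

    data_class, molecule, division, length = rest
    if not length.endswith(" BP"):
        raise ValueError("Did not recognise the ID line layout:\n" + line)
    return {
        "name": name,
        "version": version,
        "data_class": data_class,
        "molecule": molecule,
        "division": division,
        "length": int(length[:-3]),
    }
```

A real fix would feed these values to the consumer object inside
_feed_first_line rather than return a dictionary, and would need
checking against whatever else changed in the newer files.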

I'm not familiar with IPD-IMGT/HLA, so if you have any more
information about their release 3.16.0 and what was changed, it would
be very helpful, especially if this is linked to EMBL changes.

Thanks,

Peter

On Fri, Nov 4, 2016 at 3:30 PM, Liu, Chang <cliu32 at wustl.edu> wrote:
> Hi, everyone,
>
> I am new to this mailing list, so please bear with my ignorance.
>
> I am using SeqIO to parse the hla.dat file from the IMGT/HLA database
> (https://github.com/ANHIG/IMGTHLA/tree/3160):
>
> from Bio import SeqIO
>
> handle = 'hla.dat'
>
> records = SeqIO.parse(handle, 'imgt')
>
> The code works for files up to version 3.16.0, but not for any data files
> released after that. The following was raised:
>
> ValueError: Did not recognise the ID line layout:
>
> ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
>
> Apparently the ID line format has changed in the data file. Up to and
> including 3.16.0, the ID line looked like this:
>
> ID   HLA00001   standard; DNA; HUM; 3503 BP.
>
> Could someone tell me how the module can be updated to parse current and
> future data files? Thank you so much!
>
> Chang