[BioPython] Cannot parse GenBank file
Martin MOKREJŠ
mmokrejs at ribosome.natur.cuni.cz
Thu Jun 7 14:26:44 UTC 2007
Hi,
Chris Fields wrote:
> One thing I missed which explains the biopython error: the LOCUS line is
> missing the locus identifier (see the NCBI example record link). This
> doesn't choke the bioperl parser but it appears to stop the biopython
> parser in its tracks (maybe a feature instead of a bug!).
>
> You should try adding a unique identifier (maybe the name of the file or
> record) to the LOCUS line to see if it works:
>
> LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006
>
> The bioperl parser in CVS writes out the correct alphabet when this is
> added:
>
> LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006
>
> I'll try adding a warning to the bioperl parser for this.
I have updated http://bugzilla.open-bio.org/show_bug.cgi?id=2305, but let me
emphasize that the LOCUS line now reads

LOCUS pRL 5428 bp ds-DNA linear 07-JUN-2007

which still does not match the line you proposed, yet it can be parsed by
bioperl-live from CVS. Is it still wrong? The test case is attached as
pRL.gb-new to bugzilla record #2305.
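
For anyone who wants to check a file for the missing locus identifier before
handing it to a parser, here is a minimal sketch in plain Python. It is not
part of Biopython or BioPerl; the function names locus_name_missing and
add_locus_name are made up for illustration. It only looks at whether the
first field after LOCUS is already the sequence length, and it does not check
the column layout of the rest of the line, so whether a given parser version
accepts the result is an assumption to verify.

```python
import re

# A LOCUS line whose first field after the keyword is already the sequence
# length (digits followed by "bp" or "aa") has no locus name.
_MISSING_NAME = re.compile(r"^LOCUS\s+(\d+)\s+(bp|aa)\b")


def locus_name_missing(line):
    """Return True if the LOCUS line jumps from the keyword to the length."""
    return _MISSING_NAME.match(line) is not None


def add_locus_name(line, name):
    """Insert `name` after the LOCUS keyword when the name field is absent.

    Lines that already carry a name are returned unchanged.
    """
    if not locus_name_missing(line):
        return line
    rest = line[len("LOCUS"):].lstrip()
    # Pad the keyword to 12 characters so the name starts in column 13.
    return "LOCUS".ljust(12) + name + " " + rest


if __name__ == "__main__":
    bad = "LOCUS 6499 bp ds-DNA linear 02-AUG-2006"
    print(locus_name_missing(bad))
    print(add_locus_name(bad, "testfile"))
```

Applied to the two lines in this thread, the check flags the original
6499 bp line and leaves the pRL line alone, since that one already has a name.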
Martin
>
> chris
>
> On Jun 5, 2007, at 10:28 AM, Chris Fields wrote:
>
>> Martin,
>>
>> The example file you give in the bioperl bugzilla report has several
>> blank annotation lines which may lead to additional problems. When
>> the BioPerl SeqIO parser finds annotation fields (SOURCE, ORGANISM,
>> DEFINITION, etc) then it expects there will also be relevant data
>> (text descriptions) accompanying it; I assume the BioPython parser
>> expects likewise though I may be wrong.
>>
>> AFAIK the inclusion of field names w/o text isn't GenBank/EMBL-
>> compliant. In GenBank records, fields lacking text either contain a '.'
>> or are left out entirely:
>>
>> http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
>>
>> We could add a fix but you should probably contact the ApE developers
>> and request that field names w/o text be left out or have '.' added.
>>
>> chris
>>
>> On Jun 5, 2007, at 9:04 AM, Martin MOKREJŠ wrote:
>>
>>> Ezequiel Panepucci wrote:
>>>>> genbank entry = parser.parse(fhandle)
>>>>
>>>> there is a space character between "genbank" and "entry".
>>>> It is a syntax error.
>>>> I suppose you meant "genbank_entry" ?
>>>
>>> Yes, the next command was right and showed the error. Sorry, I forgot
>>> to delete the first attempt. ;-)
>>>
>>>>>> genbank_entry = parser.parse(fhandle)
>>> Traceback (most recent call last):
>>> File "<stdin>", line 1, in ?
>>> File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py",
>>> line 187, in parse
>>> self._scanner.feed(handle, self._consumer)
>>> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py",
>>> line 360, in feed
>>> self._feed_first_line(consumer, self.line)
>>> File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py",
>>> line 835, in _feed_first_line
>>> assert False, \
>>> AssertionError: Did not recognise the LOCUS line layout:
>>> LOCUS 6499 bp ds-DNA linear 02-AUG-2006
>>>
>>>>>>
>>>
>>> Martin
>>> _______________________________________________
>>> BioPython mailing list - BioPython at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>>
>
--
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs