[Bioperl-l] [BioPython] Cannot parse GenBank file

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Thu Jun 7 14:26:44 UTC 2007


Hi,

Chris Fields wrote:
> One thing I missed which explains the biopython error: the LOCUS line is 
> missing the locus identifier (see the NCBI example record link).  This 
> doesn't choke the bioperl parser but it appears to stop the biopython 
> parser in its tracks (maybe a feature instead of a bug!).
> 
> You should try adding a unique identifier (maybe the name of the file or 
> record) to the LOCUS line to see if it works:
> 
> LOCUS  testfile           6499 bp ds-DNA     linear       02-AUG-2006
> 
> The bioperl parser in CVS writes out the correct alphabet when this is 
> added:
> 
> LOCUS       testfile                6499 bp    ds-DNA  linear   02-AUG-2006
> 
> I'll try adding a warning to the bioperl parser for this.

I have updated http://bugzilla.open-bio.org/show_bug.cgi?id=2305, but let me
emphasize that the LOCUS line now contains:

LOCUS                      pRL        5428 bp ds-DNA   linear       07-JUN-2007


This still does not match the layout of the line you proposed, yet it can be
parsed by bioperl-live from CVS. Is it still wrong? The test case is attached
to bugzilla record #2305 as pRL.gb-new.
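
For anyone who wants to check a file before handing it to either parser,
here is a minimal sketch of the test discussed above. The helper names
locus_name and add_locus_name are made up for illustration and are not part
of BioPython or BioPerl. The check assumes only that the whitespace-separated
fields read "LOCUS <name> <length> bp ..."; it does not validate the fixed
column positions, so passing it does not guarantee that a strict parser will
accept the line.

```python
def locus_name(line):
    """Return the locus name found on a LOCUS line, or None if it is missing."""
    fields = line.split()
    if not fields or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    # With a name present the fields are: LOCUS <name> <length> bp ...
    # Without one, the length directly follows LOCUS and "bp" is the third field.
    if len(fields) >= 4 and fields[2].isdigit() and fields[3] == "bp":
        return fields[1]
    return None


def add_locus_name(line, name):
    """Insert `name` after LOCUS when the line has no locus name (hypothetical fix-up)."""
    if locus_name(line) is not None:
        return line
    rest = line.split()[1:]
    # Single spaces between fields; this does NOT reproduce GenBank column layout.
    return "LOCUS       %s %s" % (name, " ".join(rest))


broken = "LOCUS               6499 bp ds-DNA     linear       02-AUG-2006"
print(locus_name(broken))
print(add_locus_name(broken, "testfile"))
```

The rebuilt line only restores the missing field; whether its spacing
satisfies a particular parser is a separate question.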

Martin

> 
> chris
> 
> On Jun 5, 2007, at 10:28 AM, Chris Fields wrote:
> 
>> Martin,
>>
>> The example file you give in the bioperl bugzilla report has several
>> blank annotation lines which may lead to additional problems.  When
>> the BioPerl SeqIO parser finds annotation fields (SOURCE, ORGANISM,
>> DEFINITION, etc) then it expects there will also be relevant data
>> (text descriptions) accompanying it; I assume the BioPython parser
>> expects likewise though I may be wrong.
>>
>> AFAIK the inclusion of field names w/o text isn't GenBank/EMBL-
>> compliant.  GenBank records lacking text either have a '.' instead or
>> are left out entirely:
>>
>> http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
>>
>> We could add a fix but you should probably contact the ApE developers
>> and request that field names w/o text be left out or have '.' added.
>>
>> chris
>>
>> On Jun 5, 2007, at 9:04 AM, Martin MOKREJŠ wrote:
>>
>>> Ezequiel Panepucci wrote:
>>>>>     genbank entry = parser.parse(fhandle)
>>>>
>>>> there is a space character between "genbank" and "entry".
>>>> It is a syntax error.
>>>> I suppose you meant "genbank_entry" ?
>>>
>>> Yes, the next command was right and has shown the error. Sorry, I forgot
>>> to delete the first attempt. ;-)
>>>
>>> >>> genbank_entry = parser.parse(fhandle)
>>> Traceback (most recent call last):
>>>  File "<stdin>", line 1, in ?
>>>  File "/usr/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 187, in parse
>>>    self._scanner.feed(handle, self._consumer)
>>>  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 360, in feed
>>>    self._feed_first_line(consumer, self.line)
>>>  File "/usr/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 835, in _feed_first_line
>>>    assert False, \
>>> AssertionError: Did not recognise the LOCUS line layout:
>>> LOCUS               6499 bp ds-DNA     linear       02-AUG-2006
>>>
>>> >>>
>>>
>>> Martin
>>> _______________________________________________
>>> BioPython mailing list  -  BioPython at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>>

-- 
Dr. Martin Mokrejs
Dept. of Genetics and Microbiology
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs