[Biopython] Bug in Geo.parser when reading some GDS files
    Erik C 
    erikclarke at gmail.com
       
    Mon Apr 23 23:54:20 UTC 2012
    
    
  
Hi all,
When parsing an NCBI GEO dataset (GDS) file such as this:
ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS_full/GDS1962_full.soft.gz
the Bio.Geo.parse(handle) function fails with an AssertionError. Example
code:
>>> for record in Geo.parse(open('GDS1962_full.soft')): print record
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Geo/__init__.py", line 54, in parse
    assert key not in record.col_defs
AssertionError
It appears that the parser assumes each column header occurs only once,
whereas GDS files commonly have two columns each titled GO:Function,
GO:Process, and GO:Component. The first column of each duplicate pair holds
the Gene Ontology terms for the probe in that row, and the second holds the
GO IDs for those terms.
From GDS3646_full.soft:
#GO:Function = Gene Ontology Function term
#GO:Process = Gene Ontology Process term
#GO:Component = Gene Ontology Component term
#GO:Function = Gene Ontology Function identifier
#GO:Process = Gene Ontology Process identifier
#GO:Component = Gene Ontology Component identifier
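A quick way to see the duplication is to count the names in these '#name =
description' lines. This is only a sketch for illustration: the helper
duplicate_column_names is hypothetical and not part of Bio.Geo, and it
assumes every column definition line has that '#name = description' form.

```python
from collections import Counter

def duplicate_column_names(lines):
    """Return the column names defined more than once (hypothetical helper)."""
    names = []
    for line in lines:
        # Column definitions look like '#name = description'.
        if line.startswith("#") and "=" in line:
            names.append(line[1:].split("=", 1)[0].strip())
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)

# The six definition lines quoted from GDS3646_full.soft.
header = [
    "#GO:Function = Gene Ontology Function term",
    "#GO:Process = Gene Ontology Process term",
    "#GO:Component = Gene Ontology Component term",
    "#GO:Function = Gene Ontology Function identifier",
    "#GO:Process = Gene Ontology Process identifier",
    "#GO:Component = Gene Ontology Component identifier",
]
print(duplicate_column_names(header))
```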
While duplicate header names are not ideal for tabular data, these GO
columns appear regularly in GDS files (see GDS1962, GDS3646, and others)
and they consistently break the parser. The assertion should either be
disabled for this particular case or replaced with a more flexible column
header check. I suggest applying the assertion only to the sample columns
(those prefixed with GSM).
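To make the suggestion concrete, here is a rough sketch of such a check. It
is not the actual Bio.Geo code: the function collect_column_definitions and
the choice to return (name, description) pairs are my own assumptions, made
so that repeated names such as GO:Function can be kept in file order.

```python
def collect_column_definitions(lines):
    """Collect (name, description) pairs from '#name = description' lines.

    Hypothetical sketch: only sample columns (names starting with 'GSM')
    must be unique; other names, such as the GO columns, may repeat.
    """
    definitions = []
    seen_samples = set()
    for line in lines:
        if not line.startswith("#") or "=" not in line:
            continue  # not a column definition line
        name, description = [part.strip() for part in line[1:].split("=", 1)]
        if name.startswith("GSM"):
            # Keep the strict uniqueness check for sample columns only.
            assert name not in seen_samples, "duplicate sample column: %s" % name
            seen_samples.add(name)
        definitions.append((name, description))
    return definitions
```

A list of pairs is used rather than a dictionary because a dictionary keyed
on the column name would silently drop one of each duplicated GO column.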
I'm using Biopython 1.59 (the issue also exists in the Git repository) with
Python 2.7.1 on Mac OS X 10.7.3.
Cheers,
Erik
    
    