[Biopython-dev] Bio.GFF and Brad's code

Michiel de Hoon mjldehoon at yahoo.com
Sat Apr 18 04:28:09 UTC 2009


I tried this code to read a GFF file from miRBase containing the genome positions of human microRNAs. The good news is that the code works as advertised. However, for a basic parser (as opposed to a parser integrated with Bio.SeqIO), I find the SeqFeature objects far too complicated.

This is how I used the parser:

>>> from GFFParser import GFFAddingIterator
>>> gff_iterator = GFFAddingIterator()
>>> rec_dict = gff_iterator.get_all_features("Data/miRBase/hsa.gff")
# It would be better to pass a handle to get_all_features
# instead of a file name. The file may be gzipped or bzipped,
# or the user may want to read it from the internet.
>>> len(rec_dict['1'])
50
# fifty microRNAs on chromosome 1
>>> rec_dict['1'].features[0]                 
Bio.SeqFeature.SeqFeature(Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)), type='miRNA', strand=1, id='hsa-mir-1302-2')
>>> rec_dict['1'].features[0].qualifiers['ACC']
['MI0006363']
>>> rec_dict['1'].features[0].qualifiers['ID']
['hsa-mir-1302-2']
# This is still OK, though a bit more deeply nested than I would like.
>>> rec_dict['1'].features[0].location       
Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366))
>>> rec_dict['1'].features[0].location._start
Bio.SeqFeature.ExactPosition(20228)
# Am I supposed to use _start here? It looks like a private variable.
>>> rec_dict['1'].features[0].location._start.position
20228
# Too much typing for everyday usage. I don't think that I would use it.
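
On the point about passing a handle instead of a file name: a minimal sketch of what that could look like from the caller's side, written for current Python 3 and using only the standard library. The names open_annotation and count_feature_lines are hypothetical helpers for illustration and are not part of GFFParser; whether get_all_features accepts a handle is not assumed here.

```python
import bz2
import gzip
import io


def open_annotation(path):
    """Return a text-mode handle for a plain, gzipped, or bzipped file.

    Hypothetical helper: the compression type is guessed from the
    file extension only.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def count_feature_lines(handle):
    """Count feature lines in any iterable of text lines.

    Blank lines and lines starting with '#' are skipped. Because the
    function only iterates over the handle, it works equally well with
    a file on disk, a decompressed stream, or an in-memory buffer.
    """
    return sum(1 for line in handle if line.strip() and not line.startswith("#"))


# In-memory example; the two feature lines are placeholders, not real data.
demo = io.StringIO("# header\nfeature line one\n\nfeature line two\n")
print(count_feature_lines(demo))
```

A parser written against a handle like this leaves the choice of source (local file, compressed file, network stream) to the user.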

For a basic parser, I like the _gff_line_map function much better. Applied to the first line in the GFF file, it returns

>>> result = _gff_line_map(line, params)
>>> result
[('parent', {'quals': {'ACC': ['MI0006363'], 'ID': ['hsa-mir-1302-2']}, 'rec_id': '1', 'location': [20228, 20366], 'is_gff2': False, 'type': 'miRNA', 'id': 'hsa-mir-1302-2', 'strand': 1})]
>>> print result[0][1]
{'quals': {'ACC': ['MI0006363'], 'ID': ['hsa-mir-1302-2']}, 'rec_id': '1', 'location': [20228, 20366], 'is_gff2': False, 'type': 'miRNA', 'id': 'hsa-mir-1302-2', 'strand': 1}

which contains exactly the fields I need, in (almost) the places where I'd expect them.
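
To illustrate the kind of flat result meant here, a rough sketch of a stand-alone line parser in current Python 3. This is not the _gff_line_map implementation; parse_gff_line is a hypothetical name, the sample line is reconstructed from the values shown above rather than copied from the miRBase file, and the conversion of the 1-based GFF start to a 0-based start is an assumption made to match the [20228, 20366] location.

```python
def parse_gff_line(line):
    """Split one tab-separated GFF feature line into a flat dictionary."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(parts))
    seqid, source, ftype, start, end, score, strand, phase, attributes = parts

    # Attributes are 'key=value' pairs separated by ';'. Values are
    # collected in lists so that repeated keys are preserved.
    quals = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        quals.setdefault(key.strip(), []).append(value.strip().strip('"'))

    return {
        "rec_id": seqid,
        "type": ftype,
        # Assumption: shift the 1-based GFF start to a 0-based start.
        "start": int(start) - 1,
        "end": int(end),
        # '+' -> 1, '-' -> -1, anything else -> None
        "strand": {"+": 1, "-": -1}.get(strand),
        "quals": quals,
    }


# Illustrative line, reconstructed from the output shown above.
sample = '1\t.\tmiRNA\t20229\t20366\t.\t+\t.\tACC="MI0006363"; ID="hsa-mir-1302-2";'
feature = parse_gff_line(sample)
print(feature["start"], feature["end"], feature["quals"]["ID"])
```

With a structure like this, the start position is one dictionary lookup away instead of three attribute accesses.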

--Michiel


      


