[Biopython-dev] UniProt GOA parser

Fri May 10 16:26:13 UTC 2013

On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg <idoerg at gmail.com> wrote:
> On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote:
>>
>> Would it make sense to want random access to the GOA files based
>> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That
>> should be fairly straightforward to do, building on the indexing code
>> for Bio.SeqIO and SearchIO.
>
>
> Would that require reading it all into memory? Uniprot_GOA files
> are huge; it is impractical to read them in fully.

Not at all - we'd keep a dictionary mapping each record ID to its offset
in the file on disk, or store this mapping in an SQLite index file.
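
As a rough sketch of the in-memory variant (not existing Biopython API:
the function names build_offset_index and read_block are made up for
illustration, and it assumes a tab-separated GAF file opened in binary
mode, with DB_Object_ID in column 2, "!" comment lines, and all lines
for one ID stored consecutively):

```python
def build_offset_index(handle):
    """Map each DB_Object_ID to the byte offset of its first line.

    `handle` must be opened in binary mode so that tell() returns
    offsets that can later be passed to seek().
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b"!"):
            continue  # header or comment line
        fields = line.split(b"\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        object_id = fields[1].decode("utf-8")
        # Keep only the first offset seen for each ID.
        index.setdefault(object_id, offset)
    return index


def read_block(handle, index, object_id):
    """Return the consecutive lines for one DB_Object_ID as strings."""
    handle.seek(index[object_id])
    lines = []
    for line in iter(handle.readline, b""):
        fields = line.split(b"\t")
        if line.startswith(b"!") or len(fields) < 2:
            break
        if fields[1].decode("utf-8") != object_id:
            break  # reached the next record
        lines.append(line.decode("utf-8").rstrip("\r\n"))
    return lines
```

Only the ID-to-offset dictionary is held in memory; the annotation
lines stay on disk until read_block() seeks to them. For the very
large files the same mapping could go into SQLite instead.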

>> Note here I am picturing combining all the (consecutive) lines
>> for the same DB_Object_ID - currently the parser is line based,
>> but batching by DB_Object_ID would be a straightforward change
>> and may better suit some uses.
>
> Perhaps only for organism-specific files, which in some cases can
> be read fully into memory.

The examples I looked at only seemed to have a dozen or so
lines for each DB_Object_ID - but perhaps these were easy
cases? How many lines per DB_Object_ID in the worst cases?
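
For reference, the batching I have in mind could look something like
this. It is only a sketch: batch_by_object_id is a hypothetical name,
not part of the current parser, and it assumes a text-mode handle on a
tab-separated GAF file with DB_Object_ID in column 2 and the lines for
each ID stored consecutively.

```python
from itertools import groupby


def batch_by_object_id(handle):
    """Yield (DB_Object_ID, list of rows) for each run of consecutive lines.

    Each row is the list of tab-separated fields from one line. Only one
    batch is held in memory at a time.
    """
    rows = (
        line.rstrip("\r\n").split("\t")
        for line in handle
        if not line.startswith("!")  # skip header and comment lines
    )
    # Drop blank or malformed lines that lack a DB_Object_ID column.
    rows = (row for row in rows if len(row) >= 2)
    for object_id, group in groupby(rows, key=lambda row: row[1]):
        yield object_id, list(group)
```

Taking max(len(rows) for _, rows in batch_by_object_id(handle)) over a
full file would answer the worst-case question directly.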

Peter