[Biopython-dev] UniProt GOA parser

Fri May 10 10:06:19 UTC 2013

On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg <idoerg at gmail.com> wrote:
> A new uniprot-GOA parser is available for you to poke around:
>
> https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA
>

I think for the namespace, we might be better off using Bio.UniProt.GOA,
where Iddo's parser would be in Bio/UniProt/GOA.py and any other
UniProt-specific code could also go under Bio/UniProt - for example
a web API.

Some of Bio.SwissProt might also migrate here over time.

> More on Uniprot-GOA: http://www.ebi.ac.uk/GOA
>
> There are three file formats: GAF (gene association file), GPA (gene
> product association) and GPI (gene product information), explained here:
> http://www.ebi.ac.uk/GOA/downloads
>
> Input GAF files can be very large, due to the growth of UniProt-GOA. If you
> would like to test in a timely fashion, I suggest you get historical files,
> which are smaller. Once you get to version numbers above 40, the runtime
> for the example code in UniProtGOA.py goes over 2 minutes (on my i5
> machine).

Would it make sense to want random access to the GOA files based
on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That
should be fairly straightforward to do by building on the indexing
code for Bio.SeqIO and Bio.SearchIO.
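
To make the random access idea concrete, here is a rough sketch in
plain Python. It is not the Bio.SeqIO indexing code and not part of
Iddo's branch: the function names build_offset_index and fetch_rows
are hypothetical, the sample rows are made up and truncated to five
columns, and it assumes that DB_Object_ID is the second tab-separated
column and that header lines start with "!".

```python
import io


def build_offset_index(handle):
    """Map each DB_Object_ID to the byte offset of its first line.

    The handle must be opened in binary mode so that tell() returns
    offsets that are safe to pass back to seek().
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b"!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        # Assumption: column 2 (index 1) holds DB_Object_ID.
        object_id = line.split(b"\t")[1].decode("ascii")
        index.setdefault(object_id, offset)  # keep the first offset only
    return index


def fetch_rows(handle, index, object_id):
    """Return the consecutive rows for object_id, starting at its offset."""
    handle.seek(index[object_id])
    rows = []
    for line in iter(handle.readline, b""):
        fields = line.rstrip(b"\r\n").decode("ascii").split("\t")
        if len(fields) < 2 or fields[1] != object_id:
            break  # the run of lines for this identifier has ended
        rows.append(fields)
    return rows


# Made-up, truncated rows purely for illustration.
header = b"!gaf-version: 2.0\n"
data = header + (
    b"UniProtKB\tP12345\tGENE1\t\tGO:0005515\n"
    b"UniProtKB\tP12345\tGENE1\t\tGO:0005737\n"
    b"UniProtKB\tQ67890\tGENE2\t\tGO:0003677\n"
)
handle = io.BytesIO(data)
index = build_offset_index(handle)
rows = fetch_rows(handle, index, "Q67890")
```

Indexing on DB_Object_Symbol instead would only mean changing which
column is used as the key.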

Note here I am picturing combining all the (consecutive) lines
for the same DB_Object_ID - currently the parser is line based,
but batching by DB_Object_ID would be a straightforward change
and may better suit some uses.
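
As a minimal sketch of the batching I have in mind (again plain
Python rather than the parser's actual API; gaf_rows and
batch_by_object_id are hypothetical names, and the sample rows are
made up and truncated to five columns), itertools.groupby can collect
consecutive lines which share a DB_Object_ID:

```python
import io
from itertools import groupby


def gaf_rows(handle):
    """Yield each annotation line as a list of tab-separated fields."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        yield line.rstrip("\n").split("\t")


def batch_by_object_id(handle):
    """Yield (DB_Object_ID, rows) for each run of consecutive lines.

    Only consecutive lines are combined, so an identifier that appears
    again later in the file would produce a second batch.
    """
    # Assumption: column 2 (index 1) holds DB_Object_ID.
    for object_id, rows in groupby(gaf_rows(handle), key=lambda row: row[1]):
        yield object_id, list(rows)


# Made-up, truncated rows purely for illustration.
example = io.StringIO(
    "!gaf-version: 2.0\n"
    "UniProtKB\tP12345\tGENE1\t\tGO:0005515\n"
    "UniProtKB\tP12345\tGENE1\t\tGO:0005737\n"
    "UniProtKB\tQ67890\tGENE2\t\tGO:0003677\n"
)
batches = list(batch_by_object_id(example))
```

This keeps memory use low since only one batch is held at a time.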

> Old GAF files are available here:
> ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/
>
> Current GPI and GPA files are not very large.
>
> Thanks to Peter for his help on this.
>
> Best,
>
> Iddo

Peter