[Biopython-dev] UniProt GOA parser

Iddo Friedberg idoerg at gmail.com
Fri May 17 21:35:41 UTC 2013


OK. I added a few changes as suggested by Peter.

There is now a parser that groups GAF records by DB_Object_ID, and a
function to write them back out. Random access is not implemented yet.
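
For illustration only, grouping by DB_Object_ID can be sketched as below.
This is a minimal sketch, not the code that was committed: the name
gaf_groups is hypothetical, the sample lines are abbreviated (real GAF
lines have more columns), and it assumes that all lines for one
DB_Object_ID are consecutive in the file.

```python
import io
from itertools import groupby


def gaf_groups(handle):
    """Yield (db_object_id, rows) for each run of consecutive GAF lines.

    Hypothetical helper for illustration; not part of Biopython.
    """
    # GAF header/comment lines start with "!"; skip those and blank lines.
    lines = (ln for ln in handle if ln.strip() and not ln.startswith("!"))
    rows = (ln.rstrip("\n").split("\t") for ln in lines)
    # DB_Object_ID is the second tab-separated column (index 1).
    for db_object_id, group in groupby(rows, key=lambda row: row[1]):
        yield db_object_id, list(group)


# Abbreviated sample data (assumption: only the first few columns shown).
sample = (
    "!gaf-version: 2.0\n"
    "UniProtKB\tP12345\tSYM1\t\tGO:0005515\n"
    "UniProtKB\tP12345\tSYM1\t\tGO:0005634\n"
    "UniProtKB\tQ99999\tSYM2\t\tGO:0003677\n"
)
groups = list(gaf_groups(io.StringIO(sample)))
for db_object_id, rows in groups:
    print(db_object_id, len(rows))
```

Only one group is held in memory at a time, so this should work on the
large Uniprot_GOA files as long as the per-ID groups stay small.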

On Fri, May 10, 2013 at 12:32 PM, Iddo Friedberg <idoerg at gmail.com> wrote:

>
>
> On Fri, May 10, 2013 at 12:26 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
>> On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg <idoerg at gmail.com> wrote:
>> > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote:
>> >>
>> >> Would it make sense to want random access to the GOA files based
>> >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That
>> >> should be fairly straightforward to do, building on the indexing code
>> >> for Bio.SeqIO and SearchIO.
>> >
>> >
>> > Would that require reading it all into memory? Uniprot_GOA files
>> > are huge; it is impractical to read them in fully.
>>
>> Not at all - we'd record a dictionary mapping the record ID to an offset
>> in the file on disk, or record this mapping in an SQLite index file.
>>
>
>  Ok, that's good then
>
>
>> >> Note here I am picturing combining all the (consecutive) lines
>> >> for the same DB_Object_ID - currently the parser is line based,
>> >> but batching by DB_Object_ID would be a straightforward change
>> >> and may better suit some uses.
>> >
>> > Perhaps only for organism-specific files, which in some cases can
>> > be read fully into memory.
>>
>> The examples I looked at only seemed to have a dozen or so
>> lines for each DB_Object_ID - but perhaps these were easy
>> cases? How many lines per DB_Object_ID in the worst cases?
>>
>> Peter
>>
>
>
> I actually thought you were suggesting that the whole file should be
> read into memory, not just buffered by DB_Object_ID.  My mistake.
>
>
> --
> Iddo Friedberg
> http://iddo-friedberg.net/contact.html
> ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
> >>----.<--.>++++++.<<<<------------------------------------.
>
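
The offset mapping Peter describes in the quoted text (record ID to
position in the file on disk) could look roughly like the sketch below.
It is an assumption-laden illustration, not the Bio.SeqIO or SearchIO
indexing code: build_offset_index and fetch_lines are hypothetical names,
the index is a plain in-memory dict rather than an SQLite file, the
sample lines are abbreviated, and lines for one DB_Object_ID are assumed
to be consecutive.

```python
import io


def build_offset_index(handle):
    """Map each DB_Object_ID to the byte offset of its first line.

    The handle must be opened in binary mode so tell() gives byte offsets.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        # Skip GAF header/comment lines ("!") and blank lines.
        if line.startswith(b"!") or not line.strip():
            continue
        db_object_id = line.split(b"\t")[1].decode()
        # Keep only the first offset seen for each ID.
        index.setdefault(db_object_id, offset)
    return index


def fetch_lines(handle, index, db_object_id):
    """Seek to the stored offset and read the consecutive lines for one ID."""
    handle.seek(index[db_object_id])
    wanted = db_object_id.encode()
    lines = []
    for line in iter(handle.readline, b""):
        if line.split(b"\t")[1:2] != [wanted]:
            break
        lines.append(line.decode().rstrip("\n"))
    return lines


# Abbreviated sample data held in memory; a real file would be opened "rb".
data = (
    b"!gaf-version: 2.0\n"
    b"UniProtKB\tP12345\tSYM1\t\tGO:0005515\n"
    b"UniProtKB\tP12345\tSYM1\t\tGO:0005634\n"
    b"UniProtKB\tQ99999\tSYM2\t\tGO:0003677\n"
)
handle = io.BytesIO(data)
index = build_offset_index(handle)
p_lines = fetch_lines(handle, index, "P12345")
q_lines = fetch_lines(handle, index, "Q99999")
```

Only the ID-to-offset dictionary is kept in memory; the annotation lines
themselves are read from disk on demand.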





