[Bioperl-l] Bio::ASN1::EntrezGene now on Github

Wed Sep 11 06:48:13 UTC 2013

On 11 September 2013 04:50, Fields, Christopher J <cjfields at illinois.edu> wrote:
> On Sep 10, 2013, at 12:05 PM, Carnë Draug <carandraug+dev at gmail.com>
>  wrote:
>
>> There's also the thing that next_seq() was always returning an array
>> reference with only 1 element (the one sequence). Now there will be
>> more elements (one per sequence in the Entrezgene set). People
>> expecting the old behaviour will be skipping data unless maybe some
>> warning is printed or some other change is performed. Especially since
>> next_seq returning many sequences (the next set) is misleading.
>
> Would you want a method that implies it only returns one thing (e.g. next_seq)?  Could you make another method that returns data in batches (next_seq_set?  not sure) then rewrite next_seq in terms of it?

I just think that such a method would be more explicit about what it's
actually doing. But I don't have a really good idea of how to handle
this with a good design at the moment. I have been away from this for
too long to give a well-considered solution.
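
For illustration, the suggestion quoted above (a batch method, with
next_seq rewritten in terms of it) could look roughly like the sketch
below. This is only a sketch under assumptions: Sketch::SetReader is a
hypothetical stand-in and not the Bio::ASN1::EntrezGene API, the name
next_seq_set is the one floated in the quoted message, and the "sets"
are plain array references of strings instead of parsed records.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parser; NOT the Bio::ASN1::EntrezGene API.
package Sketch::SetReader;

sub new {
    my ($class, @sets) = @_;
    # Each element of @sets is an array ref pretending to be one parsed set.
    return bless { sets => [@sets], buffer => [] }, $class;
}

# Return the next whole set as an array ref, or undef when exhausted.
sub next_seq_set {
    my ($self) = @_;
    return shift @{ $self->{sets} };
}

# Return exactly one record per call, refilling from next_seq_set as needed.
# Empty sets are skipped. Calling next_seq_set directly while records are
# still buffered here would leave those buffered records pending.
sub next_seq {
    my ($self) = @_;
    until (@{ $self->{buffer} }) {
        my $set = $self->next_seq_set;
        return unless defined $set;
        $self->{buffer} = [@$set];
    }
    return shift @{ $self->{buffer} };
}

package main;

my $reader = Sketch::SetReader->new(['geneA'], ['geneB', 'geneC'], []);
my @seen;
while (defined(my $seq = $reader->next_seq)) {
    push @seen, $seq;
}
print join(',', @seen), "\n";
```

With this shape, callers that expect one sequence per next_seq call keep
working, and callers that want the whole set ask for it explicitly. It
does not address memory use: a whole set is still read at once.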

>> Finally, there's also the thing that the patch reads an entire set
>> which for all we know, can be thousands of sequences.
>>
>> Carnë
>
> That may be more problematic, yes.  The solutions depend on how the ASN1 parser is set up, which I'm not totally familiar with.  If you are worried about thousands of genes, then maybe a Bio::Cluster-like class for grouping data that generates the objects lazily?

I am not worried about it for myself. I'm just commenting that if
someone throws a giant Entrezgene set at it, they might be surprised.
But such sets seem to be rare and the NCBI retracted the change on
their side after I contacted them so maybe it will never be a problem
for anyone.
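
If it ever does become a problem, the lazy generation mentioned in the
quoted text could be sketched along these lines. Again this is only an
assumption-laden illustration: lazy_iterator and the builder callback
are hypothetical names, nothing here comes from Bio::Cluster or the
ASN1 parser, and the raw records are still all held in memory; only
the object construction is deferred.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that builds one object per call,
# so nothing is constructed until the caller actually asks for it.
sub lazy_iterator {
    my ($raw_records, $build) = @_;
    my $i = 0;
    return sub {
        return if $i >= @$raw_records;
        return $build->($raw_records->[$i++]);
    };
}

my $built = 0;    # counts how many objects have been constructed so far
my $next  = lazy_iterator(
    ['rec1', 'rec2', 'rec3'],
    sub { $built++; return { id => $_[0] } },
);

my $first = $next->();
print "$first->{id} built=$built\n";
```

After the first call only one object has been built; the other two raw
records are untouched until the iterator is called again.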

>> [1] http://bioperl.996286.n3.nabble.com/sets-of-sequences-how-to-read-td16940.html
>
> I didn't get back to this right away, but I unfortunately already pushed some changes to the repo.  You should still be able to merge your work in, though, or I can back mine out into a branch and let you merge.  Most of mine are to remove the circular dependency issue with Bioperl.

That was not a problem. It was only a 5-character change so I just
made a new commit. I have then made a bunch more changes to make use
of BioPerl's Dist::Zilla plugin.

Anyway, while going through the code I noticed there's a lot of
repeated code. A diff between EntrezGene.pm and Sequence.pm will give
a good idea of how much we could save by handling it a bit better (and
avoid future maintenance problems when bugs are fixed in one side but
not the other). Same goes for the two Indexer.pm files.

Carnë