[Biojava-dev] Renewed Genbank files parsing support

Paolo Pavan paolo.pavan at gmail.com
Mon Dec 8 16:27:45 UTC 2014


Hi Andreas,
No, that isn't quite the situation; maybe I wasn't clear.
Jacek and I had requirements that the old parser didn't meet, and we worked
together on it to fill those gaps. The parser now works, and we also fixed some
issues that affected other, already existing parts.

So yes, I think our work is ready for a pull request, and it will
definitely be an improvement over the current situation.

There was some confusion about the two ways biojava supports for loading a
genbank file, which required different solutions, but these are on the developer
side, not on the biojava user side. The usage semantics are unchanged, so there
is nothing to worry about and no impact on existing code.

Jacek is just adding new test cases, but I guess that will not take
long.

I hope I have made myself clear.
Regards,
Paolo
On 08 Dec 2014 06:21, "Andreas Prlic" <andreas at sdsc.edu> wrote:

> Hi Paolo,
>
> So to summarize the current situation: Both you and Jacek had requirements
> that the current genbank parser could not meet. You both forked it and
> worked on independent branches. You came up with a solution that works for
> you but you think it is currently still too complicated to share with
> everybody and provide a pull request. Does that sum it up correctly?
>
> Besides the Genbank parser being too complicated, we also have the problem
> that currently nobody really "owns" this feature and is interested in taking
> it to the next level.
>
> Thanks,
>
> Andreas
>
>
>
>
> On Sun Dec 07 2014 at 2:00:58 AM Paolo Pavan <paolo.pavan at gmail.com>
> wrote:
>
>> Dear all,
>> Thank you, Andreas, for your review; I think I've got the sense of it. Thanks
>> also to Jacek for your proposal.
>> If I may add my opinion, both APIs work and now benefit from the newly
>> implemented parser, so there is no urgent need for any change.
>> The problem may be that they are not very intuitive, which might cause
>> delays and discourage other developers from adding new features.
>>
>> For the record, I will summarize here how the proxy system works.
>> GenbankProxySequenceReader implements DatabaseReferenceInterface and
>> FeatureKeywordInterface, which declare the getDatabaseReferences() and
>> getKeyWords() methods. AbstractSequence checks for these interfaces, in
>> our case at construction time through the proxyLoader constructor
>> (AbstractSequence(SequenceReader<C> proxyLoader, CompoundSet<C>
>> compoundSet)), and uses them at that point to populate the instantiated
>> sequence object with database and keyword information.
>>
>> In other words, the original idea was to declare a new interface for every
>> category of property that must populate a sequence object, and to make the
>> AbstractSequence constructor that takes the proxy reader responsible for
>> this logic. A developer must add the loading code there.
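
The mechanism described above can be sketched roughly as follows. This is a minimal illustration and not the BioJava code: SketchReader, SketchKeywordSource, SketchDatabaseReferenceSource, SketchProxyReader, SketchPlainReader and SketchSequence are hypothetical stand-ins, and the return types are simplified. It only shows the pattern of a constructor that tests its reader for optional interfaces and copies the extra properties across.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins: names and return types are NOT the BioJava API.
interface SketchReader {
    String getSequenceAsString();
}

interface SketchKeywordSource {
    List<String> getKeyWords();
}

interface SketchDatabaseReferenceSource {
    Map<String, String> getDatabaseReferences();
}

// A reader that opts in to both optional property categories.
class SketchProxyReader implements SketchReader, SketchKeywordSource, SketchDatabaseReferenceSource {
    public String getSequenceAsString() { return "ATGC"; }
    public List<String> getKeyWords() { return Arrays.asList("example-keyword"); }
    public Map<String, String> getDatabaseReferences() {
        Map<String, String> refs = new LinkedHashMap<>();
        refs.put("example-db", "EX0001"); // placeholder values
        return refs;
    }
}

// A reader that supplies residues only.
class SketchPlainReader implements SketchReader {
    public String getSequenceAsString() { return "GGCC"; }
}

class SketchSequence {
    private final String residues;
    private final List<String> keywords = new ArrayList<>();
    private final Map<String, String> databaseReferences = new LinkedHashMap<>();

    SketchSequence(SketchReader proxyLoader) {
        residues = proxyLoader.getSequenceAsString();
        // One check per property category: a new category means a new
        // interface plus a new branch in this constructor.
        if (proxyLoader instanceof SketchKeywordSource) {
            keywords.addAll(((SketchKeywordSource) proxyLoader).getKeyWords());
        }
        if (proxyLoader instanceof SketchDatabaseReferenceSource) {
            databaseReferences.putAll(((SketchDatabaseReferenceSource) proxyLoader).getDatabaseReferences());
        }
    }

    String getResidues() { return residues; }
    List<String> getKeywords() { return keywords; }
    Map<String, String> getDatabaseReferences() { return databaseReferences; }
}
```

In this sketch the sequence class has to know about every optional interface, which is the coupling the paragraph above points at.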
>>
>> That said, everything works and the current code is high-level. If we
>> were to take up Andreas' cleanup proposal, I think the only effort worth
>> making would be a new, simpler and plainer redesign of the data IO API,
>> rather than adding new interfaces to the current, already overcrowded
>> system. I know this is the hard and long way, but in my opinion it is the
>> only valid improvement we could really make at this point.
>>
>> I have some ideas and some experience with this. I am imagining an API that
>> is easy to extend: a developer who wants to add a new parser just has to
>> write the parser and plug it into the system.
>> I would reduce the sequence class to the mere role of a data structure
>> (indeed the most important one in bioinformatics, along with the
>> alignment). The only methods allowed would be those that manipulate the
>> sequence representation.
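
One possible shape for such a pluggable design is sketched below. Everything in it is an assumption made for illustration: PlainSequence, SketchParser, ParserRegistry and ToyFastaParser are hypothetical names, not existing or proposed BioJava classes, and the toy parser handles only a single FASTA-style record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical plain data holder: it only stores and exposes the sequence.
final class PlainSequence {
    private final String id;
    private final String residues;

    PlainSequence(String id, String residues) {
        this.id = id;
        this.residues = residues;
    }

    String getId() { return id; }
    String getResidues() { return residues; }
    int length() { return residues.length(); }
}

// Hypothetical extension point: a new format only has to implement this.
interface SketchParser {
    PlainSequence parse(BufferedReader in) throws IOException;
}

// Hypothetical registry that maps a format name to its parser.
final class ParserRegistry {
    private final Map<String, SketchParser> parsers = new HashMap<>();

    void register(String format, SketchParser parser) {
        parsers.put(format, parser);
    }

    PlainSequence read(String format, BufferedReader in) {
        SketchParser parser = parsers.get(format);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for format: " + format);
        }
        try {
            return parser.parse(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Toy parser for one FASTA-style record, used only to exercise the registry.
final class ToyFastaParser implements SketchParser {
    public PlainSequence parse(BufferedReader in) throws IOException {
        String header = in.readLine();
        if (header == null || !header.startsWith(">")) {
            throw new IOException("Expected a '>' header line");
        }
        StringBuilder residues = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null && !line.startsWith(">")) {
            residues.append(line.trim());
        }
        return new PlainSequence(header.substring(1).trim(), residues.toString());
    }
}
```

Under these assumptions, adding a format is a single register("name", parser) call, and the sequence class never needs to know which parser produced it.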
>>
>> But anyway, I don't really know whether we want to go down such a long and
>> difficult road, and it certainly cannot be done by just two developers.
>> It's a feature for biojava5, perhaps ;-)
>>
>> Greetings !
>> Paolo
>>
>> On 03 Dec 2014 13:13, "Jacek Grzebyta" <grzebyta.dev at gmail.com>
>> wrote:
>> >
>> > Hi,
>> >
>> > If that is how it looks, I suggest changing the proxy interface. It
>> > could have a getter for a data source instance from
>> > org.biojava3.core.sequence.DataSource. Then create an abstract Proxy
>> > instance which maps a data source to the relevant URI. But we need to
>> > take into consideration that each of them (or most of them) would
>> > require its own API anyway to proxy the data. A long time ago I tried
>> > to do it, but gave up after I discovered RDF and the semantic web.
>> > Anyway, I will make the changes and submit them to my branch repository.
>> >
>> >
>> > Regards,
>> >
>> > Jacek
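
A rough sketch of Jacek's suggestion, for illustration only. SketchDataSource is a hypothetical stand-in for org.biojava3.core.sequence.DataSource, SketchProxy and SketchGenbankProxy are invented names, and the example.org base URIs are placeholders rather than real service addresses.

```java
import java.net.URI;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in for a data-source enum; the values are illustrative.
enum SketchDataSource { GENBANK, UNIPROT }

// Hypothetical abstract proxy: exposes its data source and maps it to a URI.
abstract class SketchProxy {
    // Placeholder base URIs, not real endpoints.
    private static final Map<SketchDataSource, String> BASE_URIS =
            new EnumMap<>(SketchDataSource.class);
    static {
        BASE_URIS.put(SketchDataSource.GENBANK, "https://example.org/genbank/");
        BASE_URIS.put(SketchDataSource.UNIPROT, "https://example.org/uniprot/");
    }

    abstract SketchDataSource getDataSource();

    // Default mapping; a source that needs its own scheme can override this.
    URI resolve(String accession) {
        return URI.create(BASE_URIS.get(getDataSource()) + accession);
    }
}

class SketchGenbankProxy extends SketchProxy {
    @Override
    SketchDataSource getDataSource() { return SketchDataSource.GENBANK; }
}
```

Because resolve() can be overridden, a source that needs its own access scheme is still accommodated, which matches the caveat in the message above.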
>>
>> On 25 Nov 2014 05:15, "Andreas Prlic" <andreas at sdsc.edu>
>> wrote:
>>
>>> Hi Paolo,
>>>
>>> I don't remember the full history of this, but after having reviewed the
>>> code I think the story is like this:
>>>
>>>  The "proxy" means that an entry can be fetched from an external DB
>>> based on a reference ID.
>>>
>>> Then there is another requirement: reading a single record from a file
>>> containing many entries (hence the difference between InputStream and
>>> BufferedReader), which might explain the different approaches.
>>>
>>> Having said that, I do think the API is inconsistent and could benefit
>>> from some cleanup and also we need better documentation for this. Any pull
>>> requests are welcome!
>>>
>>> Andreas
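
To illustrate the second requirement, here is a minimal sketch of pulling one record at a time out of a multi-record stream. It relies only on the fact that each GenBank flat-file record ends with a line containing "//". RecordSplitter and nextRecord are hypothetical names, and this is not how GenbankReader is implemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, for illustration only.
final class RecordSplitter {
    private RecordSplitter() { }

    // Returns the next record, including its "//" terminator line,
    // or null when the reader has no more content.
    static String nextRecord(BufferedReader in) {
        try {
            StringBuilder record = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                record.append(line).append('\n');
                if (line.trim().equals("//")) {
                    return record.toString();
                }
            }
            // End of input: return any unterminated trailing text, else null.
            return record.toString().trim().isEmpty() ? null : record.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The BufferedReader keeps its position between calls, so repeated calls walk through the file record by record without loading it all at once.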
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Nov 22, 2014 at 12:10 PM, Paolo Pavan <paolo.pavan at gmail.com>
>>> wrote:
>>>
>>>> Dear all,
>>>> Jacek Grzebyta and I have added support for reading features,
>>>> qualifiers and nested locations with "split" indications in genbank files,
>>>> and we hope this feature will be included in the next 4.0 release.
>>>>
>>>> However, we found that there are two ways to parse a genbank file: via
>>>> GenbankProxySequenceReader and via GenbankReader. Both use the same
>>>> underlying GenbankSequenceParser, now updated, but in different ways.
>>>>
>>>> Is there a reason I am missing for this two-way design, or is it just
>>>> the result of the efforts of two independent working groups? The
>>>> “proxy” naming suggests to me that it is meant to add something to the
>>>> standard GenbankReader; is that so? Is one of them recommended? One
>>>> difference is that one uses an InputStream and the other a BufferedReader.
>>>>
>>>> Could one of the original authors comment on that?
>>>>
>>>> Thank you very much,
>>>> Paolo
>>>>
>>>> _______________________________________________
>>>> biojava-dev mailing list
>>>> biojava-dev at mailman.open-bio.org
>>>> http://mailman.open-bio.org/mailman/listinfo/biojava-dev
>>>>
>>>
>>>
>>>
>>>