[Biojava-dev] Renewed Genbank files parsing support

Paolo Pavan paolo.pavan at gmail.com
Tue Dec 9 10:57:30 UTC 2014


You're welcome!

Paolo

2014-12-09 3:05 GMT+01:00 Andreas Prlic <andreas at sdsc.edu>:

>> No, this isn't the situation; maybe I wasn't clear.
>>
>
> Apologies for the misunderstanding.
>
>
>> Jacek and I had requirements that the old parser didn't meet, and we
>> worked together to fill them. The parser now works, and we have also
>> fixed some issues that affected other existing parts.
>>
>
> Excellent!
>
>> So, yes, I think that our work is ready for a pull request and will
>> definitely be an improvement on the current situation.
>>
>
>> Jacek is just adding new test instances, but I guess it will not take
>> long.
>>
> Ok great! Thanks to both of you for your efforts and looking forward to
> the pull request!
>
> Andreas
>
>
>
>
>
> On 08 Dec 2014 at 06:21, "Andreas Prlic" <andreas at sdsc.edu> wrote:
>>
>> Hi Paolo,
>>>
>>> So to summarize the current situation: Both you and Jacek had
>>> requirements that the current genbank parser could not meet. You both
>>> forked it and worked on independent branches. You came up with a solution
>>> that works for you but you think it is currently still too complicated to
>>> share with everybody and provide a pull request. Does that sum it up
>>> correctly?
>>>
>>> Besides the Genbank parser being too complicated, we also have the
>>> problem that currently nobody really "owns" this feature or is interested
>>> in taking it to the next level.
>>>
>>> Thanks,
>>>
>>> Andreas
>>>
>>>
>>>
>>>
>>> On Sun Dec 07 2014 at 2:00:58 AM Paolo Pavan <paolo.pavan at gmail.com>
>>> wrote:
>>>
>>>> Dear all,
>>>> Thank you, Andreas, for your review; I think I understand the picture
>>>> now. Thank you also to Jacek for your proposal.
>>>> If I may add my opinion: both APIs work and now benefit from the newly
>>>> implemented parser, so there is no urgent need for any change.
>>>> The drawback is that they are not very intuitive, which might cause
>>>> delays and discourage other developers from adding new features.
>>>>
>>>> For the record, I will summarize here how the proxy system works.
>>>> GenbankProxySequenceReader implements DatabaseReferenceInterface and
>>>> FeatureKeywordInterface, which declare the getDatabaseReferences() and
>>>> getKeyWords() methods. AbstractSequence checks for these interfaces, in
>>>> our case at construction time via the proxyLoader argument
>>>> (AbstractSequence(SequenceReader<C> proxyLoader, CompoundSet<C>
>>>> compoundSet)), and uses them at that point to populate the instantiated
>>>> sequence object with database and keyword information.
>>>>
>>>> In other words, the original idea was to declare a new interface for
>>>> every category of property that must populate a sequence object. The
>>>> loading logic lives in the AbstractSequence constructor that takes the
>>>> proxy reader, so a developer must add the loading code there.
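The mechanism summarized above can be illustrated with a minimal sketch. All types below are simplified stand-ins invented for this illustration (KeywordSource, DatabaseReferenceSource, SketchSequence, DemoProxyReader); they are not the BioJava classes and only mirror the pattern described: the sequence constructor checks which optional interfaces its reader implements and copies the corresponding data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT the BioJava classes.
class ProxyIntrospectionSketch {

    /** One interface per category of property (hypothetical name). */
    interface KeywordSource {
        List<String> getKeyWords();
    }

    /** A second property category (hypothetical name). */
    interface DatabaseReferenceSource {
        List<String> getDatabaseReferences();
    }

    /** Minimal reader contract: anything that can supply the residues. */
    interface SequenceReader {
        String getSequenceAsString();
    }

    /** Hypothetical reader that implements both optional interfaces. */
    static class DemoProxyReader
            implements SequenceReader, KeywordSource, DatabaseReferenceSource {
        public String getSequenceAsString() { return "ACGT"; }
        public List<String> getKeyWords() { return Arrays.asList("demo"); }
        public List<String> getDatabaseReferences() { return Arrays.asList("EXAMPLE:1"); }
    }

    /** Sequence object that inspects its reader at construction time. */
    static class SketchSequence {
        final String residues;
        final List<String> keywords = new ArrayList<>();
        final List<String> databaseReferences = new ArrayList<>();

        SketchSequence(SequenceReader proxyLoader) {
            this.residues = proxyLoader.getSequenceAsString();
            // Every supported property category needs its own check here,
            // which is why adding a category means touching this constructor.
            if (proxyLoader instanceof KeywordSource) {
                keywords.addAll(((KeywordSource) proxyLoader).getKeyWords());
            }
            if (proxyLoader instanceof DatabaseReferenceSource) {
                databaseReferences.addAll(
                        ((DatabaseReferenceSource) proxyLoader).getDatabaseReferences());
            }
        }
    }

    public static void main(String[] args) {
        SketchSequence seq = new SketchSequence(new DemoProxyReader());
        System.out.println(seq.residues + " " + seq.keywords + " " + seq.databaseReferences);
    }
}
```

A reader that implements only SequenceReader still yields a valid sequence, just with empty keyword and reference lists. The cost of the pattern is visible in the constructor: each new property category requires a new interface plus another instanceof branch.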
>>>>
>>>> That said, everything works and the current code is high-level. If we
>>>> take up Andreas' cleanup proposal, I think the only effort worth making
>>>> is a new, simpler, plain redesign of the data I/O API, rather than
>>>> adding new interfaces to the already overcrowded system. I know this is
>>>> the long, hard way, but in my opinion it is the only real improvement
>>>> we could make at this point.
>>>>
>>>> I have some ideas and some experience on this. I am imagining an API
>>>> that is easy to extend: a developer who wants to add a new parser should
>>>> only have to write the parser and plug it into the system.
>>>> I would give the sequence class the sole role of a data structure
>>>> (indeed the most important one in bioinformatics, along with the
>>>> alignment). The only methods allowed would be those that manipulate the
>>>> sequence representation.
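One possible reading of this proposal, sketched under stated assumptions: a registry keyed by format name, a one-method parser interface, and a sequence class that is only a data holder. Every name here (PlainSequence, SequenceParser, ParserRegistry, ToyFastaParser) is hypothetical and chosen for the illustration; none of it is an existing or planned BioJava API, and the toy parser handles a single FASTA record only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical design sketch; not an existing BioJava API.
class PluggableParserSketch {

    /** Plain data structure: it holds data and only manipulates its own representation. */
    static final class PlainSequence {
        final String id;
        final String residues;

        PlainSequence(String id, String residues) {
            this.id = id;
            this.residues = residues;
        }

        /** Returns a new sequence with the residues in reverse order. */
        PlainSequence reversed() {
            return new PlainSequence(id, new StringBuilder(residues).reverse().toString());
        }
    }

    /** The only contract a new parser has to fulfil. */
    interface SequenceParser {
        PlainSequence parse(BufferedReader in) throws IOException;
    }

    /** Parsers are plugged in by format name; the sequence class is never touched. */
    static final class ParserRegistry {
        private final Map<String, SequenceParser> parsers = new HashMap<>();

        void register(String format, SequenceParser parser) {
            parsers.put(format, parser);
        }

        PlainSequence read(String format, BufferedReader in) throws IOException {
            SequenceParser parser = parsers.get(format);
            if (parser == null) {
                throw new IllegalArgumentException("No parser registered for format: " + format);
            }
            return parser.parse(in);
        }

        /** Convenience wrapper for in-memory text. */
        PlainSequence readString(String format, String text) {
            try (BufferedReader in = new BufferedReader(new StringReader(text))) {
                return read(format, in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Toy parser for a single FASTA record, used only to show the plug-in step. */
    static final class ToyFastaParser implements SequenceParser {
        public PlainSequence parse(BufferedReader in) throws IOException {
            String header = in.readLine();
            if (header == null || !header.startsWith(">")) {
                throw new IOException("Expected a FASTA header line");
            }
            StringBuilder residues = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                residues.append(line.trim());
            }
            return new PlainSequence(header.substring(1).trim(), residues.toString());
        }
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register("fasta", new ToyFastaParser());
        PlainSequence seq = registry.readString("fasta", ">seq1 demo\nACGT\nTTAA\n");
        System.out.println(seq.id + " " + seq.residues);
    }
}
```

In this layout, supporting a new format means writing one SequenceParser implementation and registering it once. The sequence class does not change.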
>>>>
>>>> But anyway, I don't really know if we want to start down such a long
>>>> and difficult road, and it could not involve just two developers. It's
>>>> a feature for biojava5, perhaps ;-)
>>>>
>>>> Greetings !
>>>> Paolo
>>>>
>>>> On 03 Dec 2014 at 13:13, "Jacek Grzebyta" <grzebyta.dev at gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > If that is how it looks, then I suggest changing the proxy interface.
>>>> > It could have a getter for a data source instance from
>>>> > org.biojava3.core.sequence.DataSource. Then create an abstract proxy
>>>> > that maps a data source to the relevant URI. But we need to take into
>>>> > consideration that each of them (or most of them) would require its
>>>> > own API anyway to proxy the data. A long time ago I tried to do this
>>>> > but gave up after I discovered RDF and the semantic web. Anyway, I
>>>> > will make the changes and submit them to my branch repository.
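A rough sketch of what this suggestion could look like, under heavy assumptions: SketchDataSource is a stand-in enum defined here rather than org.biojava3.core.sequence.DataSource, the interface and class names are invented, and the URL templates point at example.org as placeholders, not at real service endpoints.

```java
import java.net.URI;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of the proposal; names and URLs are placeholders.
class DataSourceProxySketch {

    /** Stand-in for a data source enum; defined here so the sketch is self-contained. */
    enum SketchDataSource { GENBANK, UNIPROT }

    /** Proxy interface extended with a getter for its data source. */
    interface SourceAwareProxy {
        SketchDataSource getDataSource();
        URI resolve(String accession);
    }

    /** Abstract proxy that maps the data source to a URI template. */
    static abstract class AbstractSourceProxy implements SourceAwareProxy {
        private static final Map<SketchDataSource, String> TEMPLATES =
                new EnumMap<>(SketchDataSource.class);

        static {
            // Placeholder templates; real services each have their own URL scheme.
            TEMPLATES.put(SketchDataSource.GENBANK, "https://example.org/genbank/%s");
            TEMPLATES.put(SketchDataSource.UNIPROT, "https://example.org/uniprot/%s");
        }

        public URI resolve(String accession) {
            return URI.create(String.format(TEMPLATES.get(getDataSource()), accession));
        }
    }

    /** Concrete proxies only state which data source they belong to. */
    static class GenbankProxy extends AbstractSourceProxy {
        public SketchDataSource getDataSource() { return SketchDataSource.GENBANK; }
    }

    static class UniprotProxy extends AbstractSourceProxy {
        public SketchDataSource getDataSource() { return SketchDataSource.UNIPROT; }
    }

    public static void main(String[] args) {
        System.out.println(new GenbankProxy().resolve("X00001"));
    }
}
```

The caveat raised in the message still applies: mapping a source to a URI is the easy part, while fetching and decoding the response would likely need source-specific code in each concrete proxy.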
>>>> >
>>>> >
>>>> > Regards,
>>>> >
>>>> > Jacek
>>>>
>>>> On 25 Nov 2014 at 05:15, "Andreas Prlic" <andreas at sdsc.edu> wrote:
>>>>
>>>> Hi Paolo,
>>>>>
>>>>> I don't remember the full history of this, but after having reviewed
>>>>> the code I think the story is like this:
>>>>>
>>>>>  The "proxy" means that an entry can be fetched from an external DB
>>>>> based on a reference ID.
>>>>>
>>>>> Then there is another requirement to read a single record from a file
>>>>> containing many entries (hence the difference between InputStream and
>>>>> BufferedReader), which might explain the different approaches.
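To make the second requirement concrete, here is a minimal sketch of pulling one record at a time out of a multi-record GenBank flat file. It relies only on the fact that each GenBank record ends with a line containing "//". The class and method names are invented for this illustration and do not show how GenbankReader or GenbankProxySequenceReader are implemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper; not part of BioJava.
class SingleRecordSketch {

    /** An InputStream can always be adapted to the line-oriented BufferedReader. */
    static BufferedReader wrap(InputStream stream) {
        return new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
    }

    /**
     * Reads the next record, up to and including its "//" terminator line.
     * Returns null when no further record text is available.
     */
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            record.append(line).append('\n');
            if (line.trim().equals("//")) {
                return record.toString();
            }
        }
        // End of input: keep an unterminated trailing record only if it has content.
        return record.toString().trim().isEmpty() ? null : record.toString();
    }

    /** Splits in-memory text into records; handy for small inputs and for testing. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String twoRecords = "LOCUS A\nORIGIN\n//\nLOCUS B\nORIGIN\n//\n";
        System.out.println(splitRecords(twoRecords).size());
    }
}
```

Because nextRecord leaves the reader positioned just after the terminator, a caller can stop after the first record or keep calling to walk through the whole file.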
>>>>>
>>>>> Having said that, I do think the API is inconsistent and could benefit
>>>>> from some cleanup and also we need better documentation for this. Any pull
>>>>> requests are welcome!
>>>>>
>>>>> Andreas
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Nov 22, 2014 at 12:10 PM, Paolo Pavan <paolo.pavan at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dear all,
>>>>>> Jacek Grzebyta and I have added support for reading features,
>>>>>> qualifiers and nested locations with "split" indications in GenBank
>>>>>> files, and we hope this feature will be included in the next 4.0
>>>>>> release.
>>>>>>
>>>>>> However, we found that there are two ways to parse a GenBank file: via
>>>>>> GenbankProxySequenceReader and via GenbankReader. Both use the same
>>>>>> underlying GenbankSequenceParser, now updated, but in different ways.
>>>>>>
>>>>>> Is there a reason for this split design that escapes me, or is it
>>>>>> just the result of the efforts of two independent working groups?
>>>>>> The “proxy” naming suggests that it is meant to add something to the
>>>>>> standard GenbankReader, doesn’t it? Is one of them recommended? One
>>>>>> difference is that one uses an InputStream, the other a BufferedReader.
>>>>>>
>>>>>> Can one of the original authors comment on that?
>>>>>>
>>>>>> Thank you very much,
>>>>>> Paolo
>>>>>>
>>>>>> _______________________________________________
>>>>>> biojava-dev mailing list
>>>>>> biojava-dev at mailman.open-bio.org
>>>>>> http://mailman.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>
>
>