[Biojava-dev] Renewed Genbank files parsing support

Andreas Prlic andreas at sdsc.edu
Tue Dec 9 02:05:49 UTC 2014


>
> No, this isn't the situation, maybe I haven't been clear.
>

Apologies for the misunderstanding.


> Jacek and I had requirements that the old parser didn't meet, and we worked
> together to fill those gaps. Now the parser works, and we have also fixed
> some issues that affected other existing parts.
>

Excellent!

> So, yes, I think that our work is ready for a pull request and will
> definitely be an improvement over the current situation.
>

> Jacek is just adding new test cases, but I guess it will not take
> long.
>
Ok great! Thanks to both of you for your efforts and looking forward to the
pull request!

Andreas





On 08 Dec 2014 06:21, "Andreas Prlic" <andreas at sdsc.edu> wrote:
>
> Hi Paolo,
>>
>> So to summarize the current situation: Both you and Jacek had
>> requirements that the current genbank parser could not meet. You both
>> forked it and worked on independent branches. You came up with a solution
>> that works for you but you think it is currently still too complicated to
>> share with everybody and provide a pull request. Does that sum it up
>> correctly?
>>
>> Besides the Genbank parser being too complicated, we also have the
>> problem that currently nobody really "owns" this feature or is interested
>> in taking it to the next level.
>>
>> Thanks,
>>
>> Andreas
>>
>>
>>
>>
>> On Sun Dec 07 2014 at 2:00:58 AM Paolo Pavan <paolo.pavan at gmail.com>
>> wrote:
>>
>>> Dear all,
>>> Thank you, Andreas, for your review; I think I've got the point. Thank you
>>> also to Jacek for your proposal.
>>> If I may add my opinion, both APIs work and now benefit from the newly
>>> implemented parser, so there is no urgent need for any change.
>>> The catch is that they are not very intuitive, which might cause delays
>>> and discourage other developers from adding new features.
>>>
>>> For the record, I will summarize here how the proxy system works.
>>> GenbankProxySequenceReader implements DatabaseReferenceInterface and
>>> FeatureKeywordInterface, which declare the getDatabaseReferences() and
>>> getKeyWords() methods. AbstractSequence introspects these interfaces, in
>>> our case at construction time via the proxyLoader argument
>>> (AbstractSequence(SequenceReader<C> proxyLoader, CompoundSet<C>
>>> compoundSet)), and uses them to populate the newly instantiated sequence
>>> object with database and keyword information.
>>>
>>> In other words, the original idea was to declare a new interface for
>>> every category of property that must populate a sequence object, with the
>>> loading logic placed in the AbstractSequence constructor that takes the
>>> proxy reader. A developer must add the loading code there.
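
To make the mechanism described above concrete, here is a minimal sketch. It assumes the introspection is a plain instanceof check in the constructor, which the thread does not confirm. All names (SketchKeywordSource, SketchDatabaseReferenceSource, SketchGenbankReader, SketchSequence) and the sample values are hypothetical stand-ins, not the BioJava classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the reader-side interfaces described above.
interface SketchKeywordSource {
    List<String> getKeyWords();
}

interface SketchDatabaseReferenceSource {
    List<String> getDatabaseReferences();
}

// A reader that offers both categories of property (values are made up).
class SketchGenbankReader implements SketchKeywordSource, SketchDatabaseReferenceSource {
    @Override
    public List<String> getKeyWords() {
        return Arrays.asList("kinase");
    }

    @Override
    public List<String> getDatabaseReferences() {
        return Arrays.asList("taxon:9606");
    }
}

// The sequence inspects the reader at construction time and copies over
// whatever the reader's interfaces offer.
class SketchSequence {
    final List<String> keywords = new ArrayList<>();
    final List<String> databaseReferences = new ArrayList<>();

    SketchSequence(Object proxyLoader) {
        if (proxyLoader instanceof SketchKeywordSource) {
            keywords.addAll(((SketchKeywordSource) proxyLoader).getKeyWords());
        }
        if (proxyLoader instanceof SketchDatabaseReferenceSource) {
            databaseReferences.addAll(
                ((SketchDatabaseReferenceSource) proxyLoader).getDatabaseReferences());
        }
        // Each new property category needs a new interface plus another
        // instanceof branch here.
    }
}
```

The last comment is the point: under this design the sequence constructor grows with every new kind of annotation a reader can supply.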
>>>
>>> That said, everything works and the current code is high-level. If we
>>> were to take up Andreas' cleanup proposal, I think the only effort worth
>>> making would be a new, simpler, plain redesign of the data IO API, rather
>>> than adding new interfaces to the already overcrowded system. I know this
>>> is the long, hard way, but in my opinion it is the only valid improvement
>>> we could really make at this point.
>>>
>>> I have some ideas and some experience on this. I am imagining an API
>>> that is easy to extend: a developer who wants to add a new parser just has
>>> to write the parser and plug it into the system.
>>> I would give the sequence class the mere role of a data structure (the
>>> most important one in bioinformatics, along with the alignment). The only
>>> methods allowed would be those that manipulate the sequence
>>> representation.
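
As a rough illustration of the "write the parser and plug it in" idea, here is one possible shape, given only as a sketch. PlainSequence, PluggableParser, ParserRegistry and TinyFastaParser are hypothetical names invented for this example; nothing like them is proposed verbatim in the thread, and the FASTA parser is deliberately tiny (one record, no validation of residues).

```java
import java.util.HashMap;
import java.util.Map;

// Plain data holder: no IO logic lives in the sequence class.
final class PlainSequence {
    final String id;
    final String residues;

    PlainSequence(String id, String residues) {
        this.id = id;
        this.residues = residues;
    }
}

// Hypothetical contract that a new parser would implement.
interface PluggableParser {
    PlainSequence parse(String text);
}

// Hypothetical registry: supporting a new format means registering one parser.
final class ParserRegistry {
    private final Map<String, PluggableParser> parsers = new HashMap<>();

    void register(String format, PluggableParser parser) {
        parsers.put(format, parser);
    }

    PlainSequence read(String format, String text) {
        PluggableParser parser = parsers.get(format);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + format);
        }
        return parser.parse(text);
    }
}

// Example plug-in: a single-record FASTA parser.
final class TinyFastaParser implements PluggableParser {
    @Override
    public PlainSequence parse(String text) {
        String[] lines = text.split("\\R");
        if (lines.length == 0 || !lines[0].startsWith(">")) {
            throw new IllegalArgumentException("Missing FASTA header");
        }
        StringBuilder residues = new StringBuilder();
        for (int i = 1; i < lines.length; i++) {
            residues.append(lines[i].trim());
        }
        return new PlainSequence(lines[0].substring(1).trim(), residues.toString());
    }
}
```

Usage would be along the lines of registry.register("fasta", new TinyFastaParser()) followed by registry.read("fasta", text); the sequence class itself never changes when a format is added.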
>>>
>>> But anyway, I don't really know if we want to take such a long and
>>> difficult road, and it certainly cannot involve just two developers. It's
>>> a feature for BioJava 5, perhaps ;-)
>>>
>>> Greetings !
>>> Paolo
>>>
>>> On 03 Dec 2014 13:13, "Jacek Grzebyta" <grzebyta.dev at gmail.com>
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > If it looks like that, then I suggest changing the proxy interface. It
>>> > could have a getter for a data source instance from
>>> > org.biojava3.core.sequence.DataSource. Then create an abstract Proxy
>>> > instance which will map a data source to the relevant URI. But we need
>>> > to take into consideration that each of them (or most) would require a
>>> > unique API anyway to proxy the data. A long time ago I tried to do it
>>> > but gave up after I discovered RDF and the semantic web. Anyway, I will
>>> > make the changes and submit them to my branch repository.
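
One way to read the proposal above, as a sketch only: the proxy interface gains a data-source getter, and an abstract base class turns that source plus an accession into a URI. SketchDataSource is a hypothetical stand-in for org.biojava3.core.sequence.DataSource, the other names are invented for illustration, and the example.org endpoints are placeholders, not real service URLs.

```java
import java.net.URI;

// Hypothetical stand-in for org.biojava3.core.sequence.DataSource.
enum SketchDataSource { GENBANK, UNIPROT }

// Proposed addition: every proxy reader exposes its data source.
interface SketchProxy {
    SketchDataSource getDataSource();
}

// Abstract base that maps a data source to a fetch URI.
abstract class AbstractSketchProxy implements SketchProxy {
    // Placeholder endpoints on example.org; real services differ per database
    // and may each need their own request format.
    URI uriFor(String accession) {
        switch (getDataSource()) {
            case GENBANK:
                return URI.create("https://example.org/genbank/" + accession);
            case UNIPROT:
                return URI.create("https://example.org/uniprot/" + accession);
            default:
                throw new IllegalStateException("Unmapped source: " + getDataSource());
        }
    }
}

// Concrete proxy: only has to say which source it talks to.
class SketchGenbankProxy extends AbstractSketchProxy {
    @Override
    public SketchDataSource getDataSource() {
        return SketchDataSource.GENBANK;
    }
}
```

The caveat raised in the message still applies: a URI template alone may not be enough when a database needs its own API to fetch a record.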
>>> >
>>> >
>>> > Regards,
>>> >
>>> > Jacek
>>>
>>> On 25 Nov 2014 05:15, "Andreas Prlic" <andreas at sdsc.edu> wrote:
>>>
>>> Hi Paolo,
>>>>
>>>> I don't remember the full history of this, but after having reviewed
>>>> the code I think the story is like this:
>>>>
>>>>  The "proxy" means that an entry can be fetched from an external DB
>>>> based on a reference ID.
>>>>
>>>> Then there is another requirement to read a single record from a file
>>>> containing many entries (hence the difference between InputStream and
>>>> BufferedReader), which might explain the different approaches.
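
For the second requirement, a line-oriented reader is a natural fit, because each GenBank flat-file record ends with a line containing only "//". The following is a minimal sketch of pulling one record at a time from a multi-entry stream; RecordSplitter and nextRecord are hypothetical names, and this is not how GenbankReader is actually implemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

final class RecordSplitter {
    // Returns the text of the next record (without its "//" terminator),
    // or null when the input is exhausted.
    static String nextRecord(BufferedReader in) {
        StringBuilder record = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().equals("//")) {
                    return record.toString();
                }
                record.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Input ended without a terminator: hand back any trailing text.
        return record.length() == 0 ? null : record.toString();
    }
}
```

Because the reader keeps its position between calls, the caller can stop after the first record or keep calling until null comes back.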
>>>>
>>>> Having said that, I do think the API is inconsistent and could benefit
>>>> from some cleanup and also we need better documentation for this. Any pull
>>>> requests are welcome!
>>>>
>>>> Andreas
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Nov 22, 2014 at 12:10 PM, Paolo Pavan <paolo.pavan at gmail.com>
>>>> wrote:
>>>>
>>>>> Dear all,
>>>>> Jacek Grzebyta and I have added support for reading features,
>>>>> qualifiers and nested locations with "split" indications in GenBank
>>>>> files, and we hope this feature will be included in the next 4.0 release.
>>>>>
>>>>> However, we found that there are two ways to parse a GenBank file: via
>>>>> GenbankProxySequenceReader and via GenbankReader. Both use the same
>>>>> underlying GenbankSequenceParser, now updated, but in different ways.
>>>>>
>>>>> Is there a reason I am missing for this dual design, or is it just the
>>>>> result of the efforts of two independent working groups? The "proxy"
>>>>> name suggests it is meant to add something to the standard
>>>>> GenbankReader, doesn't it? Is one of them the recommended one? One
>>>>> difference is that one uses an InputStream, the other a BufferedReader.
>>>>>
>>>>> Can any of the original authors comment on that?
>>>>>
>>>>> Thank you very much,
>>>>> Paolo
>>>>>
>>>>> _______________________________________________
>>>>> biojava-dev mailing list
>>>>> biojava-dev at mailman.open-bio.org
>>>>> http://mailman.open-bio.org/mailman/listinfo/biojava-dev
>>>>>
>>>>
>>>>
>>>>
>>>>