[Biojava-dev] Renewed Genbank files parsing support

Paolo Pavan paolo.pavan at gmail.com
Sun Dec 7 10:00:55 UTC 2014


Dear all,
Thank you, Andreas, for your review; I think I have got the idea. Thank you
also to Jacek for your proposal.
If I may add my opinion: both APIs work and now benefit from the newly
implemented parser, so there is no urgent need for any change.
The problem may be that they are not very intuitive, which could cause
delays and discourage other developers from adding new features.

For the record, I will summarize here how the proxy system works.
GenbankProxySequenceReader implements DatabaseReferenceInterface and
FeatureKeywordInterface, which declare the getDatabaseReferences() and
getKeyWords() methods. AbstractSequence checks for these interfaces, in our
case at construction time when a proxyLoader is passed in
(AbstractSequence(SequenceReader<C> proxyLoader, CompoundSet<C>
compoundSet)), and uses them at that point to populate the newly created
sequence object with database and keyword information.
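
A minimal sketch of this mechanism, for illustration only. It is not the
actual BioJava code: the interfaces are simplified stand-ins that reuse the
names above, their return types (plain lists of strings) are an assumption,
and DemoProxyReader and SimpleSequence are hypothetical classes playing the
roles of GenbankProxySequenceReader and AbstractSequence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ProxySketch {
    // Simplified, hypothetical stand-ins for the BioJava types named above.
    interface SequenceReader {
        String getSequenceAsString();
    }

    interface FeatureKeywordInterface {
        List<String> getKeyWords();
    }

    interface DatabaseReferenceInterface {
        List<String> getDatabaseReferences();
    }

    // Plays the role of GenbankProxySequenceReader: one reader, several capabilities.
    static class DemoProxyReader
            implements SequenceReader, FeatureKeywordInterface, DatabaseReferenceInterface {
        public String getSequenceAsString() { return "ATGGCC"; }
        public List<String> getKeyWords() { return Arrays.asList("demo", "sketch"); }
        // Placeholder value, not a real database reference.
        public List<String> getDatabaseReferences() { return Arrays.asList("DB:placeholder-id"); }
    }

    // Plays the role of AbstractSequence: the constructor inspects the loader.
    static class SimpleSequence {
        private final String residues;
        private final List<String> keywords = new ArrayList<>();
        private final List<String> databaseReferences = new ArrayList<>();

        SimpleSequence(SequenceReader proxyLoader) {
            residues = proxyLoader.getSequenceAsString();
            // One instanceof check per category of property: this is the place
            // where a developer has to add loading code for a new category.
            if (proxyLoader instanceof FeatureKeywordInterface) {
                keywords.addAll(((FeatureKeywordInterface) proxyLoader).getKeyWords());
            }
            if (proxyLoader instanceof DatabaseReferenceInterface) {
                databaseReferences.addAll(
                        ((DatabaseReferenceInterface) proxyLoader).getDatabaseReferences());
            }
        }

        String getResidues() { return residues; }
        List<String> getKeywords() { return keywords; }
        List<String> getDatabaseReferences() { return databaseReferences; }
    }

    public static void main(String[] args) {
        SimpleSequence annotated = new SimpleSequence(new DemoProxyReader());
        System.out.println(annotated.getKeywords());
        System.out.println(annotated.getDatabaseReferences());

        // A reader without the extra interfaces yields a sequence with no annotations.
        SimpleSequence bare = new SimpleSequence(() -> "ACGT");
        System.out.println(bare.getKeywords().isEmpty());
    }
}
```

The sketch makes the coupling visible: every new kind of annotation needs
both a new interface and a new branch in the sequence constructor.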

In other words, the original idea was to declare a new interface for every
category of property that must populate a sequence object, with the loading
logic living in the AbstractSequence constructor that takes a proxy reader.
A developer who adds a new category must add the loading code there.

That said, everything works and the current code is high-level. If we
take up Andreas' cleanup proposal, I think the only effort worth making is
a new, simpler, plain redesign of the data IO API, rather than adding new
interfaces to the current, already overcrowded system. I know this is the
hard and long way, but in my opinion it is the only valid improvement we
could really make at this point.

I have some ideas and some experience with this. I am imagining an API that
is easy to extend: a developer who wants to add a new parser just has to
write the parser and plug it into the system.
I would reduce the sequence class to the mere role of a data structure
(the most important one in bioinformatics, along with the alignment). The
only methods allowed would be those that manipulate the sequence
representation.
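
To make the idea concrete, here is a rough sketch under my own assumptions.
None of these names exist in BioJava: PlainSequence, SequenceParser,
ParserRegistry and MinimalFastaParser are all hypothetical, and the FASTA
parser is deliberately minimal (first record only). It only shows one
possible shape of "write the parser and plug it in", with the sequence kept
as a plain data holder.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PluggableIoSketch {
    // The sequence is a plain data structure: no IO or loading logic inside.
    static final class PlainSequence {
        final String id;
        final String residues;

        PlainSequence(String id, String residues) {
            this.id = id;
            this.residues = residues;
        }
    }

    // The only contract a new parser has to fulfil.
    interface SequenceParser {
        PlainSequence parse(Reader in);
    }

    // Parsers are plugged in by format name; the registry knows nothing else about them.
    static final class ParserRegistry {
        private final Map<String, SequenceParser> parsers = new HashMap<>();

        void register(String format, SequenceParser parser) {
            parsers.put(format.toLowerCase(Locale.ROOT), parser);
        }

        PlainSequence read(String format, Reader in) {
            SequenceParser parser = parsers.get(format.toLowerCase(Locale.ROOT));
            if (parser == null) {
                throw new IllegalArgumentException("No parser registered for format: " + format);
            }
            return parser.parse(in);
        }
    }

    // Deliberately minimal: reads only the first record of a FASTA input.
    static final class MinimalFastaParser implements SequenceParser {
        public PlainSequence parse(Reader in) {
            try {
                BufferedReader reader = new BufferedReader(in);
                String header = reader.readLine();
                if (header == null || !header.startsWith(">")) {
                    throw new IllegalArgumentException("Expected a FASTA header line");
                }
                StringBuilder residues = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null && !line.startsWith(">")) {
                    residues.append(line.trim());
                }
                return new PlainSequence(header.substring(1).trim(), residues.toString());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register("fasta", new MinimalFastaParser());

        PlainSequence seq = registry.read("fasta", new StringReader(">seq1 demo\nACGT\nTTGA\n"));
        System.out.println(seq.id + " " + seq.residues);
    }
}
```

With such a design, supporting a new format means adding one class and one
register(...) call; the sequence class itself does not change.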

But anyway, I don't really know whether we want to go down such a long and
difficult road, and it would certainly need more than two developers. It's
a feature for biojava5, perhaps ;-)

Greetings !
Paolo

On 3 Dec 2014 13:13, "Jacek Grzebyta" <grzebyta.dev at gmail.com> wrote:
>
> Hi,
>
> If it looks like that, then I suggest changing the proxy interface. It
> could have a getter for a data source instance from
> org.biojava3.core.sequence.DataSource. Then create an abstract proxy
> instance which will map a data source to the relevant URI. But we need to
> take into consideration that each (or most) of them would require a unique
> API anyway to proxy the data. A long time ago I tried to do it but gave up
> after I discovered RDF and the semantic web. Anyway, I will make the
> changes and submit them to my branch repository.
>
>
> Regards,
>
> Jacek
>
On 25 Nov 2014 05:15, "Andreas Prlic" <andreas at sdsc.edu> wrote:

> Hi Paolo,
>
> I don't remember the full history of this, but after having reviewed the
> code I think the story is like this:
>
>  The "proxy" means that an entry can be fetched from an external DB based
> on a reference ID.
>
> Then there is another requirement to read a single record from a file
> containing many entries (hence the differences between InputStream and
> BufferedReader), which might explain the different approaches.
>
> Having said that, I do think the API is inconsistent and could benefit
> from some cleanup and also we need better documentation for this. Any pull
> requests are welcome!
>
> Andreas
>
> On Sat, Nov 22, 2014 at 12:10 PM, Paolo Pavan <paolo.pavan at gmail.com>
> wrote:
>
>> Dear all,
>> Jacek Grzebyta and I have added support for reading features, qualifiers
>> and nested locations with "split" indications in GenBank files, and we
>> hope this feature will be included in the next 4.0 release.
>>
>> However, we are faced with two ways to parse a GenBank file: via
>> GenbankProxySequenceReader and via GenbankReader. Both use the same
>> underlying GenbankSequenceParser, now updated, but in different ways.
>>
>> Is there a reason that escapes me for this split design, or is it
>> just the result of the efforts of two independent working groups? The
>> “proxy” naming suggests to me that it is meant to add something to the
>> standard GenbankReader; is that so? Is one of them recommended? One
>> difference is that one uses an InputStream, the other a BufferedReader.
>>
>> Can any of the original authors comment on that?
>>
>> Thank you very much,
>> Paolo
>>
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at mailman.open-bio.org
>> http://mailman.open-bio.org/mailman/listinfo/biojava-dev