[Biojava-dev] [Biojava-l] File parsing in BJ3

Richard Holland holland at eaglegenomics.com
Tue Oct 21 16:13:37 UTC 2008


Yup - why not. Feel free to go in and edit. :)

2008/10/21 Andy Yates <ayates at ebi.ac.uk>

> If "Thing" has gone then what impact does this have on remaining
> classes? Considering methods like canReadNextThing() & readNextThing();
> should this be canReadNext() & readNext()?
>
> Just an idle thought ....
>
> Andy
>
> Richard Holland wrote:
> > The two examples I gave would be better as annotations, it's true.
> > Serializable, and Cloneable for that matter, would definitely work better
> > that way.
> >
> > Well, we could do away with Thing altogether then. I'll update the code.
> >
> >
> > 2008/10/21 Mark Schreiber <markjschreiber at gmail.com>
> >
> >> Depending on what you want them for, isMachineGenerated() and
> >> isManuallyCurated() would possibly be better as annotations
> >> (@MachineGenerated, @ManuallyCurated). This is true metadata.
> >>
> >> Probably if Java had annotations in version 1.1 Serializable would
> >> also be an Annotation.  I would agree with the idea that ThingBuilder
> >> etc should be typed on extends Serializable.
> >>
> >> - Mark
> >>
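A minimal sketch of the two options discussed here: a marker annotation for the metadata, and a builder typed on "extends Serializable" instead of on Thing. The names ManuallyCurated, Builder, CuratedRecord and the accession value are made up for illustration and are not taken from the BJ3 code:

```java
import java.io.Serializable;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker annotation; RUNTIME retention keeps it visible to reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ManuallyCurated { }

// Hypothetical builder contract bounded on Serializable rather than on Thing.
interface Builder<T extends Serializable> {
    T build();
}

// A record type that carries the metadata as an annotation, not as a method.
@ManuallyCurated
class CuratedRecord implements Serializable {
    private static final long serialVersionUID = 1L;
    final String accession;
    CuratedRecord(String accession) { this.accession = accession; }
}

class CuratedRecordBuilder implements Builder<CuratedRecord> {
    private String accession = "";
    CuratedRecordBuilder accession(String value) { this.accession = value; return this; }
    @Override public CuratedRecord build() { return new CuratedRecord(accession); }
}

class AnnotationSketch {
    // The annotation is looked up reflectively at run time. The compiler cannot
    // demand it in a type bound the way it demands Serializable above.
    static boolean isManuallyCurated(Object o) {
        return o.getClass().isAnnotationPresent(ManuallyCurated.class);
    }

    public static void main(String[] args) {
        CuratedRecord r = new CuratedRecordBuilder().accession("EXAMPLE1").build();
        System.out.println(r.accession + " curated? " + isManuallyCurated(r));
        System.out.println("String curated? " + isManuallyCurated("plain"));
    }
}
```

The trade-off shows up in the last class: the Serializable bound is checked by the compiler, while the annotation can only be discovered by reflection, which matches the point made further down the thread about compile-time enforcement.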
> >> On Tue, Oct 21, 2008 at 7:14 PM, Richard Holland
> >> <dicknetherlands at gmail.com> wrote:
> >>> For now, yes it's empty. But I can envisage situations where it might
> >>> be nice to have Thing implement some common methods (e.g.
> >>> isMachineGenerated(), isManuallyCurated(), etc.). I'd rather have it
> >>> there now to be a placeholder for future expansion, than have to
> >>> re-engineer everything should we identify a need for common functions
> >>> in future.
> >>>
> >>> You'll see that Thing already extends Serializable, implying that all
> >>> Things must be able to persist to an object backing store. Serializable
> >>> itself is also an empty interface!
> >>>
> >>> Also I like the idea of having Thing, not Object, as a kind of marker
> >>> of intention. To me it makes it clearer when reading code to avoid Object
> >>> wherever possible. Thing may not be any more clever than Object, but it
> >>> immediately declares an intention when reading code as to what kind of
> >>> Object should be expected.
> >>>
> >>>
> >>> 2008/10/21 Mark Schreiber <markjschreiber at gmail.com>
> >>>> Is there any need for Thing at all? Can't a builder be typed to produce
> >>>> something that extends Object?
> >>>>
> >>>> If Thing provides no behaviour contract or meta-information then why
> >>>> does it exist?
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Tue, Oct 21, 2008 at 4:49 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>>> Depends on what you want to program. If you want to have a collection
> >>>>> of objects which are Things & perform a common action on them then
> >>>>> annotations are not the way forward.
> >>>>>
> >>>>> If you want to have some kind of meta-programming occurring & need a
> >>>>> class to be multiple things then annotations are right. There is
> >>>>> currently no way to enforce compile time dependencies on annotations
> >>>>> & my thinking is that this is right. Annotations should be meta data
> >>>>> or provide a way to alter a class in a non-invasive way (think Web
> >>>>> Service annotations creating WS Servers & Clients without any
> >>>>> alteration of the class).
> >>>>>
> >>>>> Andy
> >>>>>
> >>>>> Richard Holland wrote:
> >>>>>> Spot on.
> >>>>>>
> >>>>>> Annotation/interface.... I think Annotation is probably better as you
> >>>>>> suggest, but I'd have to look into that. Not sure how it works with
> >>>>>> collections and generics. If it does turn out to be a better bet,
> >>>>>> I'll change it over.
> >>>>>>
> >>>>>> With the BioSQL dependencies, take a look at the pom.xml file inside
> >>>>>> the biojava-dna module. It declares a dependency on biojava-core. If
> >>>>>> you want to add dependencies to external JARs, take a look at
> >>>>>> biojava-biosql's pom.xml to see how it depends on javax.persistence.
> >>>>>> (The easiest way to add these is via an IDE such as NetBeans, which
> >>>>>> is what I'm using at the moment.)
> >>>>>>
> >>>>>> cheers,
> >>>>>> Richard
> >>>>>>
> >>>>>> 2008/10/21 Mark Schreiber <markjschreiber at gmail.com>
> >>>>>>
> >>>>>>> So if I want to build a BioSQL loader from Genbank then would the
> >>>>>>> classes (or their wrappers) in the BioSQL Entity package need to
> >>>>>>> implement Thing?  Would Maven have an issue with that or would it
> >>>>>>> just create a dependency on core? (You can tell I've never used
> >>>>>>> Maven, right?)
> >>>>>>>
> >>>>>>> From a design point of view should Thing be an interface or an
> >>>>>>> Annotation? The reason I ask is that it doesn't define any methods
> >>>>>>> so it is more of a tag than an interface.
> >>>>>>>
> >>>>>>> Anyway, my understanding is that I would use a Genbank parser (or
> >>>>>>> write one), write an EntityReceiver interface (probably more than
> >>>>>>> one given the number of entities in BioSQL), and implement an
> >>>>>>> EntityBuilder (again possibly more than one) that implements
> >>>>>>> EntityReceiver and builds Entity beans from messages it receives.
> >>>>>>> In this case I probably wouldn't provide a writer as JPA would be
> >>>>>>> writing the beans to the database.  Would this be how you imagine it?
> >>>>>>>
> >>>>>>> - Mark
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Oct 21, 2008 at 1:52 AM, Richard Holland
> >>>>>>> <holland at eaglegenomics.com> wrote:
> >>>>>>>> (From now on I will only be posting these development messages to
> >>>>>>>> biojava-dev, which is the intended purpose of that list. Those of
> >>>>>>>> you who wish to keep track of things but are currently only
> >>>>>>>> subscribed to biojava-l should also subscribe to biojava-dev in
> >>>>>>>> order to keep up to date.)
> >>>>>>>>
> >>>>>>>> As promised, I've committed a new package in the biojava-core
> >>>>>>>> module that should help understand how to do file parsing and
> >>>>>>>> conversion and writing in the new BJ3 modules. Here's an example of
> >>>>>>>> how to use it to write a Genbank parser (note no parsers actually
> >>>>>>>> exist yet!):
> >>>>>>>>
> >>>>>>>> 1. Design yourself a Genbank class which implements the interface
> >>>>>>>> Thing and can fully represent all the data that might possibly
> >>>>>>>> occur inside a Genbank file.
> >>>>>>>>
> >>>>>>>> 2. Write an interface called GenbankReceiver, which extends
> >>>>>>>> ThingReceiver and defines all the methods you might need in order
> >>>>>>>> to construct a Genbank object in an asynchronous fashion.
> >>>>>>>>
> >>>>>>>> 3. Write a GenbankBuilder class which implements GenbankReceiver
> >>>>>>>> and ThingBuilder. Its job is to receive data via method calls, use
> >>>>>>>> that data to construct a Genbank object, then provide that object
> >>>>>>>> on demand.
> >>>>>>>>
> >>>>>>>> 4. Write a GenbankWriter class which implements GenbankReceiver and
> >>>>>>>> ThingWriter. Its job is similar to GenbankBuilder, but instead of
> >>>>>>>> constructing new Genbank objects, it writes Genbank records to file
> >>>>>>>> that reflect the data it receives.
> >>>>>>>>
> >>>>>>>> 5. Write a GenbankReader class which implements ThingReader. It can
> >>>>>>>> read Genbank files and output the data to the methods of the
> >>>>>>>> ThingReceiver provided to it, which in this case could be anything
> >>>>>>>> which implements the interface GenbankReceiver.
> >>>>>>>>
> >>>>>>>> 6. Write a GenbankEmitter class which implements ThingEmitter. It
> >>>>>>>> takes a Genbank object and will fire off data from it to the
> >>>>>>>> provided ThingReceiver (a GenbankReceiver instance) as if the
> >>>>>>>> Genbank object was being read from a file or some other source.
> >>>>>>>>
> >>>>>>>> That's it! OK so it's a minimum of 6 classes instead of the
> >>>>>>>> original 1 or 2, but the additional steps are necessary for
> >>>>>>>> flexibility in converting between formats.
> >>>>>>>>
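For anyone who finds the six roles easier to follow in code, here is a rough sketch using a stripped-down FASTA-like format rather than Genbank. Every class, interface and method name below is a hypothetical stand-in chosen for this example, the Thing* super-interfaces are left out, and nothing here is the committed BJ3 API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Step 1: a plain value object for one record.
final class Fasta {
    final String header;
    final String sequence;
    Fasta(String header, String sequence) { this.header = header; this.sequence = sequence; }
}

// Step 2: the events a reader or emitter fires while walking a record.
interface FastaReceiver {
    void startRecord();
    void header(String text);
    void sequenceLine(String residues);
    void endRecord();
}

// Step 3: a receiver that assembles objects from the events.
final class FastaBuilder implements FastaReceiver {
    private final List<Fasta> built = new ArrayList<>();
    private String header = "";
    private StringBuilder sequence = new StringBuilder();
    @Override public void startRecord() { header = ""; sequence = new StringBuilder(); }
    @Override public void header(String text) { header = text; }
    @Override public void sequenceLine(String residues) { sequence.append(residues); }
    @Override public void endRecord() { built.add(new Fasta(header, sequence.toString())); }
    List<Fasta> results() { return built; }
}

// Step 4: a receiver that turns the same events back into text.
final class FastaWriter implements FastaReceiver {
    private final StringBuilder out;
    FastaWriter(StringBuilder out) { this.out = out; }
    @Override public void startRecord() { }
    @Override public void header(String text) { out.append('>').append(text).append('\n'); }
    @Override public void sequenceLine(String residues) { out.append(residues).append('\n'); }
    @Override public void endRecord() { }
}

// Step 5: a reader that scans text and fires events at whatever receiver it is given.
final class FastaReader {
    static void readAll(String text, FastaReceiver receiver) {
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            boolean open = false;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (open) receiver.endRecord();
                    receiver.startRecord();
                    receiver.header(line.substring(1).trim());
                    open = true;
                } else if (open && !line.trim().isEmpty()) {
                    receiver.sequenceLine(line.trim());
                }
            }
            if (open) receiver.endRecord();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Step 6: an emitter that replays an in-memory object as events.
final class FastaEmitter {
    static void emit(Fasta record, FastaReceiver receiver) {
        receiver.startRecord();
        receiver.header(record.header);
        receiver.sequenceLine(record.sequence);
        receiver.endRecord();
    }
}

final class ReceiverSketch {
    public static void main(String[] args) {
        String input = ">seq1 toy record\nACGT\nTTGA\n>seq2\nGGCC\n";
        FastaBuilder builder = new FastaBuilder();
        FastaReader.readAll(input, builder);              // text -> objects
        StringBuilder text = new StringBuilder();
        for (Fasta f : builder.results()) {
            FastaEmitter.emit(f, new FastaWriter(text));  // objects -> text
        }
        System.out.print(text);
    }
}
```

The reader and the emitter never know whether they are feeding a builder or a writer, which is the property the six-class split is meant to buy.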
> >>>>>>>> Now to use it (you'll probably want a GenbankTools class to wrap
> >>>>>>>> these steps up for user-friendliness, including various options for
> >>>>>>>> opening files, etc.):
> >>>>>>>>
> >>>>>>>> 1. To read a file - instantiate ThingParser with your GenbankReader
> >>>>>>>> as the reader, and GenbankBuilder as the receiver. Use the iterator
> >>>>>>>> methods on ThingParser to get the objects out.
> >>>>>>>>
> >>>>>>>> 2. To write a file - instantiate ThingParser with a GenbankEmitter
> >>>>>>>> wrapping your Genbank object, and a GenbankWriter as the receiver.
> >>>>>>>> Use the parseAll() method on the ThingParser to dump the whole lot
> >>>>>>>> to your chosen output.
> >>>>>>>>
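One way the iterator and parseAll() styles could sit on top of a push-based reader is sketched below. It uses the canReadNext()/readNext() spelling proposed in this thread, but the signatures, the RecordParser class and the toy LineReader and LengthBuilder are all guesses made for illustration and should not be read as the real ThingParser:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical event sink with a single event type to keep the sketch short.
interface RecordReceiver {
    void field(String value);
}

interface RecordBuilder<T> extends RecordReceiver {
    T take(); // hand over the object built from the events received so far
}

interface RecordReader {
    boolean canReadNext();
    void readNext(RecordReceiver receiver); // fire the events for exactly one record
}

// Pulls one record at a time through a push-style reader and builder.
final class RecordParser<T> implements Iterator<T> {
    private final RecordReader reader;
    private final RecordBuilder<T> builder;

    RecordParser(RecordReader reader, RecordBuilder<T> builder) {
        this.reader = reader;
        this.builder = builder;
    }

    @Override public boolean hasNext() { return reader.canReadNext(); }

    @Override public T next() {
        if (!reader.canReadNext()) throw new NoSuchElementException();
        reader.readNext(builder);
        return builder.take();
    }

    // Drains the reader into any receiver (for example a writer) without building objects.
    static void parseAll(RecordReader reader, RecordReceiver receiver) {
        while (reader.canReadNext()) reader.readNext(receiver);
    }
}

// Toy reader: each input line is one record with a single field.
final class LineReader implements RecordReader {
    private final Iterator<String> lines;
    LineReader(List<String> lines) { this.lines = lines.iterator(); }
    @Override public boolean canReadNext() { return lines.hasNext(); }
    @Override public void readNext(RecordReceiver receiver) { receiver.field(lines.next()); }
}

// Toy builder: the "object" is just the length of the field.
final class LengthBuilder implements RecordBuilder<Integer> {
    private int length;
    @Override public void field(String value) { length = value.length(); }
    @Override public Integer take() { return length; }
}

final class ParserSketch {
    static List<Integer> lengths(List<String> lines) {
        List<Integer> out = new ArrayList<>();
        RecordParser<Integer> parser =
                new RecordParser<>(new LineReader(lines), new LengthBuilder());
        while (parser.hasNext()) out.add(parser.next());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lengths(Arrays.asList("ACGT", "GG")));
        StringBuilder sink = new StringBuilder();
        RecordParser.parseAll(new LineReader(Arrays.asList("ACGT", "GG")),
                value -> sink.append(value).append('\n'));
        System.out.print(sink);
    }
}
```

Reading one record per readNext() call is what lets the iterator hand objects back lazily; parseAll() is the same loop with nothing collected.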
> >>>>>>>> The clever bit comes when you want to convert between files.
> >>>>>>>> Imagine you've done all the above for Genbank, and you've also done
> >>>>>>>> it for FASTA. How to convert between them? What you need to do is
> >>>>>>>> this:
> >>>>>>>>
> >>>>>>>> 1. Implement all the classes for both Genbank and FASTA.
> >>>>>>>>
> >>>>>>>> 2. Write a GenbankFASTAConverter class that implements
> >>>>>>>> ThingConverter<FASTA> and GenbankReceiver, and will internally
> >>>>>>>> convert the data received and pass it on out to the receiver
> >>>>>>>> provided, which will be a FASTAReceiver instance.
> >>>>>>>>
> >>>>>>>> 3. Write a FASTAGenbankConverter class that operates in exactly the
> >>>>>>>> opposite way, implementing ThingConverter<Genbank> and
> >>>>>>>> FASTAReceiver.
> >>>>>>>>
> >>>>>>>> Then to convert you use ThingParser again:
> >>>>>>>>
> >>>>>>>> 1. From FASTA file to Genbank object: Instantiate ThingParser with
> >>>>>>>> a FASTAReader reader, a GenbankBuilder receiver, and add a
> >>>>>>>> FASTAGenbankConverter instance to the converter chain. Use the
> >>>>>>>> iterator to get your Genbank objects out of your FASTA file.
> >>>>>>>>
> >>>>>>>> 2. From FASTA file to Genbank file: Same as option 1, but provide a
> >>>>>>>> GenbankWriter instead and use parseAll() instead of the iterator
> >>>>>>>> methods.
> >>>>>>>>
> >>>>>>>> 3. From FASTA object to Genbank object: Same as option 1, but
> >>>>>>>> provide a FASTAEmitter wrapping your FASTA object as the reader
> >>>>>>>> instead.
> >>>>>>>>
> >>>>>>>> 4. From FASTA object to Genbank file: Same as option 1, but swap
> >>>>>>>> both the reader and the receiver as per options 2 and 3.
> >>>>>>>>
> >>>>>>>> 5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all
> >>>>>>>> mentions of FASTA and Genbank, and use GenbankFASTAConverter
> >>>>>>>> instead.
> >>>>>>>>
> >>>>>>>> One last and very important feature of this approach is that if you
> >>>>>>>> discover that nobody has written the appropriate converter for your
> >>>>>>>> chosen pair of formats A and C, but converters do exist to map A to
> >>>>>>>> some other format B and that other format B on to C, then you can
> >>>>>>>> just put the two converters A-B and B-C into the ThingParser chain
> >>>>>>>> and it'll work perfectly.
> >>>>>>>>
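The A-B-C chaining idea can be sketched as receivers that wrap other receivers. The three receiver interfaces, the AToB and BToC classes and the sample records are invented for this example. Plain constructor nesting stands in for the ThingParser converter chain, whose real API is not shown in this thread:

```java
import java.util.ArrayList;
import java.util.List;

// Three hypothetical formats, each with its own receiver vocabulary.
interface ReceiverA { void record(String id, String sequence); }
interface ReceiverB { void entry(String id, int length); }
interface ReceiverC { void line(String text); }

// Converter A -> B: looks like an A receiver from upstream, forwards B events downstream.
final class AToB implements ReceiverA {
    private final ReceiverB next;
    AToB(ReceiverB next) { this.next = next; }
    @Override public void record(String id, String sequence) {
        next.entry(id, sequence.length());
    }
}

// Converter B -> C: looks like a B receiver from upstream, forwards C events downstream.
final class BToC implements ReceiverB {
    private final ReceiverC next;
    BToC(ReceiverC next) { this.next = next; }
    @Override public void entry(String id, int length) {
        next.line(id + "\t" + length);
    }
}

final class ChainSketch {
    // Feeds A-style records through A->B->C and collects the C-style lines.
    static List<String> convert(String[][] records) {
        List<String> out = new ArrayList<>();
        ReceiverC sink = out::add;                  // final receiver in format C
        ReceiverA chain = new AToB(new BToC(sink)); // no direct A->C converter needed
        for (String[] r : records) {
            chain.record(r[0], r[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] records = { { "seq1", "ACGT" }, { "seq2", "GG" } };
        for (String line : convert(records)) {
            System.out.println(line);
        }
    }
}
```

One caveat the sketch makes visible: format B here only carries a length, so the sequence itself never reaches C. A chain can only pass on what the intermediate format is able to represent, so it is worth checking B before relying on an A-B-C route.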
> >>>>>>>> Enjoy!
> >>>>>>>>
> >>>>>>>> cheers,
> >>>>>>>> Richard
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Richard Holland, BSc MBCS
> >>>>>>>> Finance Director, Eagle Genomics Ltd
> >>>>>>>> M: +44 7500 438846 | E: holland at eaglegenomics.com
> >>>>>>>> http://www.eaglegenomics.com/
> >>>>>>>> _______________________________________________
> >>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>>>
>



-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


