[Biojava-l] File parsing in BJ3

Richard Holland holland at eaglegenomics.com
Mon Oct 20 17:52:08 UTC 2008


(From now on I will only be posting these development messages to
biojava-dev, which is the intended purpose of that list. Those of you who
wish to keep track of things but are currently only subscribed to biojava-l
should also subscribe to biojava-dev in order to keep up to date.)

As promised, I've committed a new package in the biojava-core module that
should help you understand how to do file parsing, conversion, and writing
in the new BJ3 modules. Here's an example of how to use it to write a
Genbank parser (note that no parsers actually exist yet!):

1. Design yourself a Genbank class which implements the interface Thing and
can fully represent all the data that might possibly occur inside a Genbank
file.

2. Write an interface called GenbankReceiver, which extends ThingReceiver
and defines all the methods you might need in order to construct a Genbank
object in an asynchronous fashion.

3. Write a GenbankBuilder class which implements GenbankReceiver and
ThingBuilder. Its job is to receive data via method calls, use that data to
construct a Genbank object, then provide that object on demand.

4. Write a GenbankWriter class which implements GenbankReceiver and
ThingWriter. Its job is similar to GenbankBuilder, but instead of
constructing new Genbank objects, it writes Genbank records to file that
reflect the data it receives.

5. Write a GenbankReader class which implements ThingReader. It can read
Genbank files and output the data to the methods of the ThingReceiver
provided to it, which in this case could be anything which implements the
interface GenbankReceiver.

6. Write a GenbankEmitter class which implements ThingEmitter. It takes a
Genbank object and will fire off data from it to the provided ThingReceiver
(a GenbankReceiver instance) as if the Genbank object was being read from a
file or some other source.
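
To make the six steps a bit more concrete, here is a minimal sketch of
steps 1, 2, 3 and 6. It is only an illustration: the Thing, ThingReceiver
and ThingBuilder interfaces below are toy stand-ins declared inline, and
every method name and signature (startThing, finishThing, getFinishedThing,
setAccession, appendSequence, emitTo) is an assumption rather than the
actual biojava-core API. The Genbank class is reduced to two fields.

```java
// Toy stand-ins for the biojava-core interfaces named above. The method
// names and signatures here are assumptions, not the real BJ3 API.
interface Thing {}

interface ThingReceiver {
    void startThing();
    void finishThing();
}

interface ThingBuilder<T extends Thing> extends ThingReceiver {
    T getFinishedThing();
}

// Step 1: the data object (reduced to two fields for illustration).
final class Genbank implements Thing {
    final String accession;
    final String sequence;

    Genbank(String accession, String sequence) {
        this.accession = accession;
        this.sequence = sequence;
    }
}

// Step 2: one callback per piece of data a Genbank record can carry.
interface GenbankReceiver extends ThingReceiver {
    void setAccession(String accession);
    void appendSequence(String chunk);
}

// Step 3: collects the callbacks and assembles a Genbank object.
final class GenbankBuilder implements GenbankReceiver, ThingBuilder<Genbank> {
    private String accession;
    private StringBuilder sequence = new StringBuilder();
    private Genbank finished;

    public void startThing() {
        accession = null;
        sequence = new StringBuilder();
        finished = null;
    }
    public void setAccession(String accession) { this.accession = accession; }
    public void appendSequence(String chunk) { sequence.append(chunk); }
    public void finishThing() {
        finished = new Genbank(accession, sequence.toString());
    }
    public Genbank getFinishedThing() { return finished; }
}

// Step 6: replays an existing object as a series of receiver calls.
final class GenbankEmitter {
    private final Genbank source;

    GenbankEmitter(Genbank source) { this.source = source; }

    void emitTo(GenbankReceiver receiver) {
        receiver.startThing();
        receiver.setAccession(source.accession);
        receiver.appendSequence(source.sequence);
        receiver.finishThing();
    }
}
```

Feeding a GenbankEmitter into a GenbankBuilder in this sketch simply copies
the object, which is a handy sanity check that the receiver interface
carries everything the Genbank class can hold.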

That's it! OK so it's a minimum of 6 classes instead of the original 1 or 2,
but the additional steps are necessary for flexibility in converting between
formats.

Now to use it (you'll probably want a GenbankTools class to wrap these steps
up for user-friendliness, including various options for opening files,
etc.):

1. To read a file - instantiate ThingParser with your GenbankReader as the
reader, and GenbankBuilder as the receiver. Use the iterator methods on
ThingParser to get the objects out.

2. To write a file - instantiate ThingParser with a GenbankEmitter wrapping
your Genbank object, and a GenbankWriter as the receiver. Use the parseAll()
method on the ThingParser to dump the whole lot to your chosen output.
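
The two usage patterns above might look roughly like the sketch below. All
of the Toy* names are hypothetical and declared inline; they are not the
real ThingParser API, whose constructor and iterator methods may well
differ. The input is a made-up "accession<TAB>sequence" line format rather
than real Genbank, and for brevity the writing case reuses the toy reader
as its source where the description above uses an emitter.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical receiver interface; method names are assumptions.
interface ToyReceiver {
    void startRecord();
    void accession(String value);
    void sequence(String value);
    void endRecord();
}

// Reads a made-up "accession<TAB>sequence" line format, not real Genbank.
final class ToyReader {
    private final Iterator<String> lines;
    ToyReader(List<String> lines) { this.lines = lines.iterator(); }
    boolean hasNext() { return lines.hasNext(); }
    void readNext(ToyReceiver receiver) {
        String[] parts = lines.next().split("\t", 2);
        receiver.startRecord();
        receiver.accession(parts[0]);
        receiver.sequence(parts[1]);
        receiver.endRecord();
    }
}

// Builder: turns the events into an {accession, sequence} pair.
final class ToyBuilder implements ToyReceiver {
    private String accession;
    private String sequence;
    private String[] finished;
    public void startRecord() { accession = null; sequence = null; finished = null; }
    public void accession(String value) { accession = value; }
    public void sequence(String value) { sequence = value; }
    public void endRecord() { finished = new String[] { accession, sequence }; }
    String[] getFinished() { return finished; }
}

// Writer: turns the same events into text instead of objects.
final class ToyWriter implements ToyReceiver {
    final StringBuilder out = new StringBuilder();
    public void startRecord() {}
    public void accession(String value) { out.append("ACCESSION ").append(value).append('\n'); }
    public void sequence(String value) { out.append("ORIGIN ").append(value).append('\n'); }
    public void endRecord() { out.append("//\n"); }
}

// Parser: only wires a source of events to a receiver.
final class ToyParser {
    private final ToyReader reader;
    private final ToyReceiver receiver;
    ToyParser(ToyReader reader, ToyReceiver receiver) {
        this.reader = reader;
        this.receiver = receiver;
    }
    boolean hasNext() { return reader.hasNext(); }
    void parseNext() { reader.readNext(receiver); }
    void parseAll() { while (reader.hasNext()) reader.readNext(receiver); }
}

final class ToyParserDemo {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("X1\tACGT", "X2\tGGCC");

        // Reading: builder as receiver, one object pulled out per record.
        ToyBuilder builder = new ToyBuilder();
        ToyParser reading = new ToyParser(new ToyReader(input), builder);
        while (reading.hasNext()) {
            reading.parseNext();
            System.out.println(String.join(" ", builder.getFinished()));
        }

        // Writing: writer as receiver, everything dumped with parseAll().
        ToyWriter writer = new ToyWriter();
        new ToyParser(new ToyReader(input), writer).parseAll();
        System.out.print(writer.out);
    }
}
```

The point of the design is that the parser itself knows nothing about the
format: swapping the receiver is all it takes to switch between building
objects and writing output.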

The clever bit comes when you want to convert between files. Imagine you've
done all the above for Genbank, and you've also done it for FASTA. How to
convert between them? What you need to do is this:

1. Implement all the classes for both Genbank and FASTA.

2. Write a GenbankFASTAConverter class that implements ThingConverter<FASTA>
and GenbankReceiver, and will internally convert the data received and pass
it on out to the receiver provided, which will be a FASTAReceiver instance.

3. Write a FASTAGenbankConverter class that operates in exactly the opposite
way, implementing ThingConverter<Genbank> and FASTAReceiver.
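
As a rough illustration of step 2, a converter is just a receiver for one
format that forwards translated calls to a receiver for the other. The
sketch below assumes simplified GenbankReceiver and FASTAReceiver
interfaces whose method names are invented for this example, and it leaves
out the ThingConverter<FASTA> interface because its methods are not
described here. The FASTAWriter is likewise a toy that writes to a string.

```java
// Hypothetical receiver interfaces; all method names are assumptions.
interface GenbankReceiver {
    void startGenbank();
    void setAccession(String accession);
    void setDefinition(String definition);
    void appendSequence(String chunk);
    void finishGenbank();
}

interface FASTAReceiver {
    void startFASTA();
    void setHeader(String header);
    void appendSequence(String chunk);
    void finishFASTA();
}

// Receives Genbank events and re-emits them as FASTA events.
final class GenbankFASTAConverter implements GenbankReceiver {
    private final FASTAReceiver target;
    private String accession = "";
    private String definition = "";
    private boolean headerSent;

    GenbankFASTAConverter(FASTAReceiver target) { this.target = target; }

    public void startGenbank() {
        accession = "";
        definition = "";
        headerSent = false;
        target.startFASTA();
    }
    public void setAccession(String accession) { this.accession = accession; }
    public void setDefinition(String definition) { this.definition = definition; }
    public void appendSequence(String chunk) {
        sendHeaderOnce();
        target.appendSequence(chunk);
    }
    public void finishGenbank() {
        sendHeaderOnce();
        target.finishFASTA();
    }

    // FASTA wants a single header line before any sequence, so the
    // accession and definition are buffered until sequence data arrives
    // or the record ends.
    private void sendHeaderOnce() {
        if (!headerSent) {
            target.setHeader((accession + " " + definition).trim());
            headerSent = true;
        }
    }
}

// Toy FASTA writer that collects its output in memory.
final class FASTAWriter implements FASTAReceiver {
    final StringBuilder out = new StringBuilder();
    public void startFASTA() {}
    public void setHeader(String header) { out.append('>').append(header).append('\n'); }
    public void appendSequence(String chunk) { out.append(chunk).append('\n'); }
    public void finishFASTA() {}
}

final class ConverterDemo {
    public static void main(String[] args) {
        FASTAWriter writer = new FASTAWriter();
        GenbankReceiver converter = new GenbankFASTAConverter(writer);
        converter.startGenbank();
        converter.setAccession("X1");
        converter.setDefinition("toy record");
        converter.appendSequence("ACGT");
        converter.finishGenbank();
        System.out.print(writer.out);
    }
}
```

Note that the converter has to buffer a little state, because the two
formats do not deliver their data in the same order or granularity. Any
real converter will probably need the same kind of care.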

Then to convert you use ThingParser again:

1. From FASTA file to Genbank object: Instantiate ThingParser with a
FASTAReader reader, a GenbankBuilder receiver, and add a
FASTAGenbankConverter instance to the converter chain. Use the iterator to
get your Genbank objects out of your FASTA file.

2. From FASTA file to Genbank file: Same as option 1, but provide a
GenbankWriter instead and use parseAll() instead of the iterator methods.

3. From FASTA object to Genbank object: Same as option 1, but provide a
FASTAEmitter wrapping your FASTA object as the reader instead.

4. From FASTA object to Genbank file: Same as option 1, but swap both the
reader and the receiver as per options 2 and 3.

5/6/7/8. From Genbank * to FASTA * - same as 1,2,3,4 but swap all mentions
of FASTA and Genbank, and use GenbankFASTAConverter instead.

One last and very important feature of this approach is that if you discover
that nobody has written the appropriate converter for your chosen pair of
formats A and C, but converters do exist to map A to some other format B and
that other format B on to C, then you can just put the two converters A-B
and B-C into the ThingParser chain and it'll work perfectly.
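
The reason chaining works can be seen in a small sketch: each converter is
a receiver for its input format that holds a receiver for its output
format, so an A-B converter can hand its output straight to a B-C
converter. The three formats and all names below are invented for
illustration, and the converters are nested by hand here instead of being
added to a ThingParser chain.

```java
import java.util.ArrayList;
import java.util.List;

// Three made-up formats, each with a one-method receiver (assumptions).
interface AReceiver { void recordA(String id, String sequence); }
interface BReceiver { void recordB(String header, String sequence); }
interface CReceiver { void recordC(String line); }

// A-B: turns an id into a '>'-prefixed header.
final class ABConverter implements AReceiver {
    private final BReceiver next;
    ABConverter(BReceiver next) { this.next = next; }
    public void recordA(String id, String sequence) {
        next.recordB(">" + id, sequence);
    }
}

// B-C: strips the '>' and joins header and sequence with a tab.
final class BCConverter implements BReceiver {
    private final CReceiver next;
    BCConverter(CReceiver next) { this.next = next; }
    public void recordB(String header, String sequence) {
        next.recordC(header.substring(1) + "\t" + sequence);
    }
}

// Final receiver: collects the C-format lines.
final class CCollector implements CReceiver {
    final List<String> lines = new ArrayList<>();
    public void recordC(String line) { lines.add(line); }
}

final class ChainDemo {
    public static void main(String[] args) {
        CCollector collector = new CCollector();
        // No direct A-C converter exists, so chain A-B into B-C.
        AReceiver chain = new ABConverter(new BCConverter(collector));
        chain.recordA("X1", "ACGT");
        System.out.println(collector.lines);
    }
}
```

One caveat worth keeping in mind with any such chain: data that format B
cannot represent is lost on the way from A to C, even if C could have
held it.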

Enjoy!

cheers,
Richard

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


