[Biojava-l] [newio] Code landed

Thomas Down td2@sanger.ac.uk
Fri, 17 Nov 2000 18:12:45 +0000


--sm4nu43k4a2Rpi4c
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

I've just checked in the first revision of my new sequence I/O
implementation.  There's still more work left to be done, but
hopefully most of the framework is now in place.  Please everyone
test this, read the code, shout at me if I've got something wrong,
etc., etc.

What's new:

  - Event-notification based sequence input, with full
    decoupling of the parsing from Sequence object creation.

  - A standard way to filter sequence and feature-table
    data as it is read into BioJava -- just implement the
    SequenceBuilder interface (see FastaDescriptionLineParser
    and EmblProcessor for examples).

  - Faster and more memory-efficient parsing of large sequences.

  - The irritating FASTA line-length bug dead and gone 
    forever :).
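
To make the decoupling idea concrete, here is a minimal, self-contained
sketch of event-notification parsing.  All names in it (SeqListener,
parseFasta, LengthCounter) are hypothetical and are not the BioJava
interfaces; it only illustrates the general pattern, in which the parser
emits events and a separate listener decides what, if anything, to store.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout: this is NOT the BioJava API,
// only an illustration of the event-notification pattern.
class EventParseSketch {

    /** Receives parse events; knows nothing about the file format. */
    interface SeqListener {
        void startSequence(String description);
        void addSymbols(String chunk);
        void endSequence();
    }

    /** Emits events for FASTA-like input; knows nothing about storage. */
    static void parseFasta(BufferedReader in, SeqListener listener)
            throws IOException {
        boolean open = false;
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith(">")) {
                // A new header closes any record that is still open
                if (open) listener.endSequence();
                listener.startSequence(line.substring(1).trim());
                open = true;
            } else if (open) {
                // Symbols are passed on one line at a time, whatever its length
                listener.addSymbols(line);
            }
        }
        if (open) listener.endSequence();
    }

    /** One possible listener: counts symbols per record without keeping them. */
    static class LengthCounter implements SeqListener {
        final List<String> report = new ArrayList<>();
        private String name;
        private int length;

        public void startSequence(String description) { name = description; length = 0; }
        public void addSymbols(String chunk) { length += chunk.length(); }
        public void endSequence() { report.add(name + "=" + length); }
    }

    /** Convenience wrapper that parses a string and returns the report. */
    static List<String> lengths(String fasta) {
        LengthCounter counter = new LengthCounter();
        try {
            parseFasta(new BufferedReader(new StringReader(fasta)), counter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.report;
    }

    public static void main(String[] args) {
        System.out.println(lengths(">a first\nACGT\nAC\n>b\nGG\n"));
    }
}
```

Because the listener here never holds more than one line of symbols,
swapping in a different listener changes what gets built without
touching the parser.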

What's currently missing:

  - No GENBANK parser.  If anyone else wants to take this
    on, feel free (look at the new EmblLikeFormat and
    EmblProcessor classes for ideas), otherwise I'll try
    to revive the old implementation.

  - IndexedSequenceDB was clobbered by one of the internal
    API changes -- it's not a hard fix, but I've temporarily
    disabled it until we've worked out the neatest way to fit
    this functionality onto the new framework.

How to use it:

From the outside, I've tried to make the minimum possible API
changes.  If you just use the I/O framework via the StreamReader
class, the only major change you'll see is that you now need to
provide a SequenceBuilderFactory in place of the old SequenceFactory.
The `standard' implementation is at SimpleSequenceBuilder.FACTORY.
But in practice, you may want to wrap this up in one or more extra
layers of sequence processing.
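
The layering can be pictured as factories that wrap other factories, in
the same shape as the
new FastaDescriptionLineParser.Factory(SimpleSequenceBuilder.FACTORY)
call in the attached demo.  The sketch below is self-contained and uses
made-up interfaces (Builder, BuilderFactory, DescriptionSplitter) and a
made-up "description_line" key; the real SequenceBuilder signatures are
not shown in this message and may well differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interfaces for illustration only; not the BioJava signatures.
class BuilderLayersSketch {

    interface Builder {
        void setProperty(String key, String value);
        Map<String, String> build();
    }

    interface BuilderFactory {
        Builder makeBuilder();
    }

    /** Innermost layer: stores whatever it is told. */
    static class SimpleBuilder implements Builder {
        static final BuilderFactory FACTORY = new BuilderFactory() {
            public Builder makeBuilder() { return new SimpleBuilder(); }
        };
        private final Map<String, String> props = new LinkedHashMap<>();

        public void setProperty(String key, String value) { props.put(key, value); }
        public Map<String, String> build() { return props; }
    }

    /** Wrapper layer: splits a raw description line, forwards everything else. */
    static class DescriptionSplitter implements Builder {
        /** Factory that wraps whatever factory it is given. */
        static class Factory implements BuilderFactory {
            private final BuilderFactory delegate;
            Factory(BuilderFactory delegate) { this.delegate = delegate; }
            public Builder makeBuilder() {
                return new DescriptionSplitter(delegate.makeBuilder());
            }
        }

        private final Builder next;
        DescriptionSplitter(Builder next) { this.next = next; }

        public void setProperty(String key, String value) {
            if (key.equals("description_line")) {
                // First token becomes the name, the remainder the description
                String[] parts = value.trim().split("\\s+", 2);
                next.setProperty("name", parts[0]);
                if (parts.length > 1) next.setProperty("description", parts[1]);
            } else {
                next.setProperty(key, value);
            }
        }

        public Map<String, String> build() { return next.build(); }
    }

    /** Builds one record through the two-layer chain. */
    static Map<String, String> demo() {
        BuilderFactory fact = new DescriptionSplitter.Factory(SimpleBuilder.FACTORY);
        Builder b = fact.makeBuilder();
        b.setProperty("description_line", "seq1 example description");
        b.setProperty("source", "demo");
        return b.build();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Further layers would be added the same way, by passing one factory to
the constructor of the next.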

As a quick example, I've attached a newio version of the GCContent
demo program.  I'm in the process of updating the other demo programs
in the repository.

For people who were previously using EmblParser, this has now
been replaced by a lighter-weight EmblLikeParser (which should
also work for formats like SwissProt, Transfac, UTRdb, and
so on).  Output from this is converted into something resembling
the old parser using the EmblProcessor filter class.
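
As a rough illustration of why one lightweight parser can serve several
of these formats: EMBL-style flat files put a two-character line code at
the start of each line and end a record with "//".  The sketch below
splits a record into (tag, value) pairs and leaves all interpretation to
later stages.  It is a guess at the general approach only; the class and
method names are hypothetical, the sample record is made up, and how
EmblLikeParser really tokenizes its input is not described here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the EmblLikeParser implementation.
class TagValueSketch {

    /** Splits EMBL-style lines into {tag, value} pairs, stopping at "//". */
    static List<String[]> parseRecord(String record) {
        List<String[]> out = new ArrayList<>();
        for (String line : record.split("\n")) {
            if (line.startsWith("//")) break;   // end-of-record marker
            if (line.length() < 2) continue;    // too short to carry a line code
            String tag = line.substring(0, 2);
            String value = line.length() > 2 ? line.substring(2).trim() : "";
            out.add(new String[] { tag, value });
        }
        return out;
    }

    public static void main(String[] args) {
        String record = "ID   EXAMPLE1\nXX\nDE   A made-up record\n//\nID   IGNORED\n";
        for (String[] pair : parseRecord(record)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A downstream filter, in the role EmblProcessor plays above, would then
decide which tags become features, annotations, or sequence data.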

Happy hacking!

   Thomas.


PS. For anyone who wants a copy of the last BioJava without newio,
    a checkout at 17:00 UTC today should be safe.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett

--sm4nu43k4a2Rpi4c
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="GCContent.java"

package seq;

import java.io.*;

import org.biojava.bio.seq.io.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class GCContent {
    public static void main(String[] args)
        throws Exception
    {
        if (args.length != 1)
            throw new Exception("usage: java GCContent filename.fa");
        String fileName = args[0];

        // Set up stream reader

        Alphabet dna = DNATools.getDNA();
        SymbolParser dnaParser = dna.getParser("token");
        BufferedReader br = new BufferedReader(
                                new FileReader(fileName));
        // Layer the FastaDescriptionLineParser filter over the
        // standard builder factory
        SequenceBuilderFactory fact = new FastaDescriptionLineParser.Factory(
                                          SimpleSequenceBuilder.FACTORY);
        StreamReader stream = new StreamReader(br,
                                               new FastaFormat(),
                                               dnaParser,
                                               fact);

        // Iterate over all sequences in the stream

        while (stream.hasNext()) {
            Sequence seq = stream.nextSequence();
            int gc = 0;
            // Walk positions 1 to seq.length() inclusive, counting G and C
            for (int pos = 1; pos <= seq.length(); ++pos) {
                Symbol sym = seq.symbolAt(pos);
                if (sym == DNATools.g() || sym == DNATools.c())
                    ++gc;
            }
            // Report GC content as a percentage of the sequence length
            System.out.println(seq.getName() + ": " +
                               ((gc * 100.0) / seq.length()) +
                               "%");
        }
    }
}

--sm4nu43k4a2Rpi4c--