[Biojava-dev] Sequence interface - exceptions

Tue Jul 20 15:25:36 UTC 2010

Mark makes some very good points. It will be a challenge to come up with robust, appropriate error reporting while keeping the flexibility that makes code easy to write as long as everything works. Currently, you can pass a class that implements the Sequence interface to the constructor of a DNASequence, ProteinSequence, etc. If that class throws an exception when it is created, that is outside the API design of the abstract sequence. In the following example, UniprotProxySequenceReader would call the appropriate URL on creation and retrieve the sequence. If an error occurs, that class should throw the appropriate exception. We don't need to force a particular exception on classes that implement an interface.

            UniprotProxySequenceReader<AminoAcidCompound> uniprotSequence = new UniprotProxySequenceReader<AminoAcidCompound>("YA745_GIBZE", AminoAcidCompoundSet.getAminoAcidCompoundSet());
            ProteinSequence proteinSequence = new ProteinSequence(uniprotSequence);

We do have an API/exception design problem if UniprotProxySequenceReader does lazy instantiation, where it doesn't retrieve the sequence data until a call to proteinSequence.getSequence() is made. This allows us to create applications that load a large number of sequences without consuming memory for sequences that will never be used. If you have a web-based application where the user queries a sequence in response to some event, this is a nice design element. If you are writing code to examine the GC content of every gene sequence, it is not a big memory saver.
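
A minimal sketch of the lazy behaviour described above. LazySequenceReader and its loader are hypothetical stand-ins, not the real UniprotProxySequenceReader; the Supplier is assumed to be where the remote fetch would happen. The point is only that nothing is loaded until the first accessor call, which is also where any failure would first surface.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a reader that fetches sequence data on demand.
final class LazySequenceReader {
    private final Supplier<String> loader; // assumed: performs the remote fetch
    private String cached;                 // null until the first successful load
    private int loadCount;                 // how many times the loader has run

    LazySequenceReader(Supplier<String> loader) {
        this.loader = loader;
    }

    // The first call triggers the load; later calls reuse the cached value.
    synchronized String getSequenceAsString() {
        if (cached == null) {
            cached = loader.get();
            loadCount++;
        }
        return cached;
    }

    synchronized boolean isLoaded() {
        return cached != null;
    }

    synchronized int getLoadCount() {
        return loadCount;
    }
}
```

Constructing the reader costs nothing; the price is that an exception from the loader appears at the first getSequenceAsString() call rather than at construction.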

The easy solution is to have every sequence method that depends on a class implementing the Sequence interface declare that it throws an exception. This would add exception handling code for users of the API, which adds complexity and can introduce a performance penalty if the try/catch is not done once for a whole block of code.

The reality is that of the X methods that depend on a Sequence interface class, if one fails they will all fail. We could add an isInit() method to AbstractSequence, which either throws an exception or returns a boolean, and which is designed to force the Sequence interface class to load sequence data from external sources. Under that contract, the user of the API can program defensively and make sure the sequence is ready before using it. If it is not ready and a method that depends on the Sequence interface is called, we simply return the appropriate null/not-defined object.
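
A rough sketch of the isInit() idea, using the boolean-returning variant. GuardedSequence, NOT_LOADED and the loader are hypothetical names for illustration only; this is not the real AbstractSequence, and the empty string is just one assumed choice of "not defined" value.

```java
import java.util.function.Supplier;

// Hypothetical sketch; not the real BioJava AbstractSequence.
final class GuardedSequence {
    static final String NOT_LOADED = ""; // assumed "not defined" placeholder

    private final Supplier<String> loader; // assumed: throws RuntimeException on failure
    private String data;

    GuardedSequence(Supplier<String> loader) {
        this.loader = loader;
    }

    // Forces the load and reports whether the data is available.
    boolean isInit() {
        if (data == null) {
            try {
                data = loader.get();
            } catch (RuntimeException e) {
                return false; // load failed; the caller decides what to do
            }
        }
        return data != null;
    }

    // Never throws: returns the placeholder if the data could not be loaded.
    String getSequenceAsString() {
        return isInit() ? data : NOT_LOADED;
    }

    int getLength() {
        return getSequenceAsString().length();
    }
}
```

A defensive caller would check isInit() once up front and only then use the accessors; a caller that skips the check gets the placeholder instead of an exception.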

The last use case, which still makes this difficult, is being able to define a ChromosomeSequence(new NCBISequenceReader("NC_000019.9")) where a call to get a collection of gene sequences from the chromosome sequence is done lazily, without retrieving the entire chromosome sequence. If I call geneSequence.getProteinSequence().toString(), that would make the appropriate getSubString(2000,5000) call, corresponding to the gene's location, on the NCBISequenceReader, which then retrieves that subsequence from NCBI. To allow this option we cannot depend on isInit() being correct. In this particular example we have three types of errors: the internet connection is not working; NCBI is not working, or is refusing your connection because you went over the three-requests-per-second rule; or there is something wrong with your accession id. If the internet is down or NCBI is refusing your connection, there is not a great deal the application can do to recover. An invalid accession id could be handled when you instantiate new NCBISequenceReader("NC_000019.9"), by making some sort of call to NCBI to see if the id is valid and throwing an exception if it is not.
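
The accession id case could look roughly like this. CheckedAccessionReader is a hypothetical class, and the 'exists' predicate is an assumed placeholder for whatever lightweight lookup against NCBI would be used; no real NCBI call is shown.

```java
import java.util.function.Predicate;

// Hypothetical sketch of failing fast on a bad accession id at construction time.
final class CheckedAccessionReader {
    private final String accession;

    // 'exists' is an assumed validity check, e.g. a cheap lookup against the remote service.
    CheckedAccessionReader(String accession, Predicate<String> exists) {
        if (accession == null || accession.trim().isEmpty()) {
            throw new IllegalArgumentException("accession id must not be empty");
        }
        if (!exists.test(accession)) {
            throw new IllegalArgumentException("unknown accession id: " + accession);
        }
        this.accession = accession;
    }

    String getAccession() {
        return accession;
    }
}
```

This moves one of the three error types to construction time, while the sequence data itself can still be fetched lazily later.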

We do have options when a particular service is down or slow to respond. Uniprot implements DNS-based load distribution, which I had a problem with one weekend: it was very slow and often did not respond. It turned out that if I changed my URL to point to http://pir.uniprot.org, located in the US, everything worked great. This could be something implemented by UniprotProxySequenceReader if it gets an IO exception or determines that queries are taking a long time.
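
A sketch of that fallback, under the assumption that the reader can be given a list of base URLs to try in order. MirrorFallback and Fetcher are hypothetical names; the Fetcher interface exists only so the sketch does not depend on real network calls. A read timeout would normally also arrive as an IOException (java.net.SocketTimeoutException is a subclass), so the same loop covers the slow-response case if a timeout is set on the connection.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of trying alternative mirrors when one fails.
final class MirrorFallback {
    // Assumed abstraction over "fetch the entry from this base URL".
    interface Fetcher {
        String fetch(String baseUrl) throws IOException;
    }

    // Tries each base URL in order and returns the first successful response.
    static String fetchFromFirstAvailable(List<String> baseUrls, Fetcher fetcher)
            throws IOException {
        IOException last = null;
        for (String baseUrl : baseUrls) {
            try {
                return fetcher.fetch(baseUrl);
            } catch (IOException e) {
                last = e; // remember the failure and try the next mirror
            }
        }
        // Every mirror failed (or the list was empty); keep the last cause for diagnosis.
        throw new IOException("all mirrors failed", last);
    }
}
```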

In summary, we probably should throw exceptions from each method that depends on the Sequence interface and/or return a set of appropriate null/not-init objects. Given that we are working with imperfect data models and data relationships, I think defensive programming on return values is not a bad option. It is a shame to have getSequenceLength() throw an exception or return a null Integer if an IO exception occurs. These are only problems when using a Sequence interface class that has a higher risk of failure because it is remote, and that would be the "exception", not the rule. For hard-core developers we can resolve these issues when they occur. If the Biojava-core code makes its way into an end-user application, then we need to give the application developer a way to deal with error conditions. Using the NCBI chromosome example, I think we can create a very powerful API to work with large amounts of sequence data, but at the expense of making the API very exception happy!

We have also begun the very exciting step of writing wiki docs specific to Biojava3. It is a work in progress: http://biojava.org/wiki/BioJava:CookBook3.0

Thanks

Scooter

On Jul 20, 2010, at 5:46 AM, jake at researchtogether.com wrote:

See comments in line.

Thanks,
Jake

On Tue, Jul 20, 2010 at 10:20:10AM +0800, Mark Schreiber wrote:
I don't think it is a great idea to hide IO exceptions but you can
reduce the burden of them.

I would normally agree with you, but as I shall point out later, this will have a lot of knock-on effects on the interface which may not be desirable.

You can copy the Groovy model which handles a lot of the
try/catch/finally boilerplate code for you. Basically you make a
helper class with methods to perform common IO operations and which
will do its very best to connect, read/write and clean up.

You can also think about what might actually cause an error. If you
are reading from a local disk cache where the file address is known
(such as a temp file) you can very nearly guarantee that the IO
operation will succeed. So much so that you could rethrow an IO
Exception as an error because there is very little that can be done
about it (other than improving the cache code or getting more reliable
hard-drives).
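
A minimal sketch of such a helper, written with Java's try-with-resources rather than copied from Groovy. IoHelper and ReaderSource are hypothetical names. The first method keeps the IOException visible but takes over the cleanup; the second is the "trusted local cache" case, where a failure is rethrown as an unchecked exception because the caller can do little about it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper that hides the open/read/close boilerplate.
final class IoHelper {
    // Assumed abstraction: anything that can open a Reader (file, URL, cache).
    interface ReaderSource {
        Reader open() throws IOException;
    }

    // General case: cleanup is handled here, but the IOException stays checked.
    static String readAll(ReaderSource source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source.open())) {
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } // the reader is closed here even if read() throws
        return text.toString();
    }

    // Trusted local cache: a failure is treated as unrecoverable and rethrown unchecked.
    static String readAllFromTrustedCache(ReaderSource source) {
        try {
            return readAll(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```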

And this is the issue - the Sequence interface is used by a lot of different readers: some are reading from disk, others from a database, and in my particular case I am reading from a URL. It is also possible that I will run into a lot of exceptions around XML parsing (the data from the URL) as well as HTTP errors (page not found, service unavailable, etc.).

Now, normally I would want to deal with some of the errors myself and only log them - e.g. on a 503 I might retry a few times, and if there is a problem with the XML I might try to fetch it again.

However, I don't fully understand how the caller will expect these SequenceReaders to behave, which is why I asked the question :) An IOException on a file is probably fatal, but an IOException on a network call is possibly recoverable, or at least worth retrying.
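
The "retry a few times" idea can be sketched as follows. Retry and IoCall are hypothetical names, and the attempt count and delay are assumptions the caller would choose; the sketch retries on any IOException and does not distinguish a 503 from other failures.

```java
import java.io.IOException;

// Hypothetical sketch of retrying a network call a bounded number of times.
final class Retry {
    // Assumed abstraction over one attempt at the remote call.
    interface IoCall<T> {
        T call() throws IOException;
    }

    // Runs 'call' up to maxAttempts times, sleeping between failed attempts.
    // If every attempt fails, the last IOException is rethrown to the caller.
    static <T> T withRetries(int maxAttempts, long delayMillis, IoCall<T> call)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // wait before the next attempt
                }
            }
        }
        throw last; // non-null here: at least one attempt ran and failed
    }
}
```

The caller still sees an IOException when the retries are exhausted, so this reduces the burden of transient failures without hiding persistent ones.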

As for what can cause errors:
1. Invalid URL
2. Page(s) unavailable (4xx, 5xx)
3. Invalid/unexpected data returned (XML badly formed, FASTA invalid)
4. Change to service (if the service has changed and the parser is effectively broken)
5. Network interruption (i.e. network timeout)

Reading a file from disk? The most likely problem is an incorrect file
name. Other problems can probably be turned into runtime exceptions
because they are probably disk errors.

Reading from a URL? Lots of things can go wrong here, so you probably
need to expose all the possible exceptions.

I will work on this assumption and change the interface accordingly, though I expect that the decision will be re-visited.

Reading from SQL? Kind of depends on the expected DB availability and
latency. Also, if the query code (or JPA query) is coming from the
BioJava source then an error is appropriate (the developer can't do
much about the mistake). If the code is coming from the app developer
then you should notify them of SQL errors.

- Mark

On Mon, Jul 19, 2010 at 11:02 PM, Richard Holland
<holland at eaglegenomics.com> wrote:

I often wonder what the best way of handling multiple possible internal exceptions is - particularly in cases like this when you've got HTTP and IO and many other types of exceptions which could be thrown.

SequenceException maybe if there's something wrong with the sequence itself - but possibly otherwise a form of IOException may be more appropriate? Trouble is that then almost every BioJava3 method would throw it, as all of them potentially have IO exposure.

I don't know. There must be experts on this in the list who can help!

cheers,
Richard

On 19 Jul 2010, at 14:49, jake at researchtogether.com wrote:

Hi All,

I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview

One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException.

Any opinions? :)

Cheers,
Jake
_______________________________________________
biojava-dev mailing list
biojava-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-dev

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/
