[Biojava-dev] [Biojava-l] file i/o with ArrayList

Paolo Pavan paolo.pavan at gmail.com
Tue Feb 10 10:56:25 UTC 2015


Hi Stefan, thank you for the review.
That actually surprises me: while I'm not sure that the reader's parser
supports multiple concatenated GenBank files, I thought that all the
information is now filled into the sequence object.
There are just a few tags that are not imported (KEYWORDS_TAG, SOURCE_TAG,
REFERENCE_TAG, BASE_COUNT_TAG); the documentation says that this is because
they can be inferred from other fields anyway. I can add that, as in the
case of the authors and reference tags, there is no such property in the
AbstractSequence class, and I see little point in having one unless you are
doing this sort of swapping job. Anyway, it could certainly be added.

About the writer and its failure to write the reported accession: although I
can't look into it in depth right now, it may well be failing to write
InsdcLocations (also known as split locations, for example
*join(58474..59052,59052..59279)* in your genome file), since the recently
updated GenBank reader uses them more systematically. It may need a quick
review, along with the db_xref qualifiers.
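
To make the split-location notation concrete, here is a minimal sketch in
plain Java. It is not BioJava code: the class and method names are made up
for illustration, and it only handles join(...) with simple a..b ranges (no
complement, order, or partial-end markers). It parses a location into its
segments and writes it back, which is the kind of round trip the writer
would need to get right.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a tiny parser/formatter for simple "join" locations. */
public class SplitLocationSketch {

    /** One contiguous piece of a split location (1-based, inclusive). */
    public static final class Segment {
        public final int start;
        public final int end;

        Segment(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return start + ".." + end;
        }
    }

    private static final Pattern RANGE = Pattern.compile("(\\d+)\\.\\.(\\d+)");

    /** Parses "join(a..b,c..d,...)" or a plain "a..b" into its segments. */
    public static List<Segment> parse(String location) {
        String body = location.trim();
        if (body.startsWith("join(") && body.endsWith(")")) {
            body = body.substring("join(".length(), body.length() - 1);
        }
        List<Segment> segments = new ArrayList<>();
        for (String part : body.split(",")) {
            Matcher m = RANGE.matcher(part.trim());
            if (!m.matches()) {
                throw new IllegalArgumentException("Unsupported location: " + part);
            }
            segments.add(new Segment(Integer.parseInt(m.group(1)),
                                     Integer.parseInt(m.group(2))));
        }
        return segments;
    }

    /** Writes the segments back in GenBank notation. */
    public static String format(List<Segment> segments) {
        if (segments.size() == 1) {
            return segments.get(0).toString();
        }
        StringBuilder sb = new StringBuilder("join(");
        for (int i = 0; i < segments.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(segments.get(i));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        String original = "join(58474..59052,59052..59279)";
        List<Segment> segments = parse(original);
        System.out.println(segments.size() + " segments: " + segments);
        // The round trip should reproduce the input string.
        System.out.println(format(segments).equals(original));
    }
}
```

A writer test along these lines could compare the formatted location with
the one read from the original file.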

Finally, about alignment: if you are using NeedlemanWunsch or similar from
the alignment package, be sure to load the sequences with the proper
AmbiguityDNACompoundSet.
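
As an illustration of what ambiguity-aware comparison means, here is a
minimal sketch in plain Java. It does not use any BioJava classes, and the
class and method names are hypothetical. It uses the standard IUPAC
nucleotide codes and treats two symbols as compatible when the sets of bases
they stand for overlap.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: compatibility of IUPAC nucleotide symbols. */
public class AmbiguityMatchSketch {

    /** Maps each IUPAC symbol to the concrete bases it can represent. */
    private static final Map<Character, String> IUPAC = new HashMap<>();

    static {
        IUPAC.put('A', "A");
        IUPAC.put('C', "C");
        IUPAC.put('G', "G");
        IUPAC.put('T', "T");
        IUPAC.put('R', "AG");
        IUPAC.put('Y', "CT");
        IUPAC.put('S', "CG");
        IUPAC.put('W', "AT");
        IUPAC.put('K', "GT");
        IUPAC.put('M', "AC");
        IUPAC.put('B', "CGT");
        IUPAC.put('D', "AGT");
        IUPAC.put('H', "ACT");
        IUPAC.put('V', "ACG");
        IUPAC.put('N', "ACGT");
    }

    /** Returns true if the two symbols share at least one concrete base. */
    public static boolean compatible(char a, char b) {
        String basesA = IUPAC.get(Character.toUpperCase(a));
        String basesB = IUPAC.get(Character.toUpperCase(b));
        if (basesA == null || basesB == null) {
            throw new IllegalArgumentException("Not an IUPAC nucleotide symbol");
        }
        for (char base : basesA.toCharArray()) {
            if (basesB.indexOf(base) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(compatible('A', 'R')); // R stands for A or G
        System.out.println(compatible('C', 'R')); // C is not covered by R
    }
}
```

A compound set that only knows unambiguous symbols cannot represent codes
such as R or Y at all, which is why both sequences need to be created with
an ambiguity-aware set before they are aligned.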

Cheers,
Paolo

2015-02-10 10:28 GMT+01:00 stefan harjes <stefanharjes at yahoo.de>:

> Hi Paolo, biojava-dev
>
> I had a look myself. First I noticed that GenbankWriter was actually more
> sophisticated than the reader, as it was able to write more than one
> sequence. I submitted a pull request to patch GenbankReader which enables
> reading more than one GenBank sequence from one file. As for full
> GenBank reading capability, there are still at least 5 sectionKeys which
> are just ignored by the reader. I think there should be a way of simply
> storing them in a List instead of asking for each one of them; maybe I will
> look into that later.
>
> The writer is doing pretty well, but you should try to write
> 'NC_000913.gb', which failed in my case (it wrote nothing and threw no
> exception).
>
> I added two more test cases, but I think that in order to really test the
> reader/writer capabilities we need a test where several sequences/proteins
> are read, merged into an array and written to a stream. Upon reading this
> stream again, we should check that they are still identical.
>
> I also noticed that you cannot compare (align) a DNA sequence with
> non-ambiguous nucleotides to a sequence with ambiguous nucleotide compounds,
> even though a matrix dedicated to exactly that comparison exists.
>
> Cheers
> Stefan
>
>
>   Paolo Pavan <paolo.pavan at gmail.com> wrote at 4:43 on Saturday,
> 7 February 2015:
>
>
> Hi Stefan,
> I had a look at the GenbankWriter because I might also need it in the
> future. Can you please specify what issues you are running into? I ran a
> few quick tests and everything seemed to work for me.
>
> Just in case: if you are reading and then writing a GenBank file, are you
> using the last release, biojava version 4.0.0? That would explain empty
> GenBank files in the output (if I have understood correctly what you have done).
>
> Paolo
>
> 2015-02-06 11:03 GMT+01:00 stefan harjes <stefanharjes at yahoo.de>:
>
> @Andreas: Yes I understand, thanks anyhow.
>
> @Paolo: I will have another look at GenbankWriter if I find some time.
>
> Cheers
> Stefan
>
>
>
>   Andreas Prlic <andreas at sdsc.edu> wrote at 7:01 on Friday,
> 6 February 2015:
>
>
> Hi Stefan,
>
> thanks for your reply. You are trying to use the code base in a way that
> has not been done before. While I share your desire that this should work
> in principle, I think it is also important to point out that we never
> promised that serialization would be a supported feature. We started a
> thread to add better support on this here:
> https://github.com/biojava/biojava/issues/249 .
>
> Regarding your project: It seems it would make sense to split your array
> of sequences into two: DNA sequences and protein sequences. Dealing with
> each of those separately might be easier.
>
> Andreas
>
>
> On Wed, Feb 4, 2015 at 3:42 PM, stefan harjes <stefanharjes at yahoo.de>
> wrote:
>
> Hi Andreas,
>
> yes, I took a look at FastaWriterHelper as well as GenbankWriter, and they
> only seem to implement writing the name and sequence as FASTA. They also do
> not allow reading/writing a mixed array of protein and DNA sequences. I asked
> myself what the point is of constructing a complicated sequence with
> annotations, features and links if I can only write FASTA.
>
> This led me to check why one of the most basic classes of biojava,
> Sequence (i.e. AbstractSequence), is not serializable.
> (Isn't it like String for Java?)
>
> The first thing I noticed is that for some reason every sequence has a
> proxyloader. As far as I understand, the proxy is implemented so that the
> entire sequence does not have to be loaded in case it is very big. Sure,
> then you can load sequences of gigabase length. But in my 25 years of
> biochemistry I have never actually worked with a single sequence of > 1GB.
> While there are some plant chromosomes which might fit this description, I
> would argue that the vast majority of biological sequences are much smaller
> and thus do not need a proxy for a single sequence. I would conclude that
> only a small subset of ChromosomeSequence might need a proxyreader
> implementation.
> So shouldn't it be implemented there rather than in the most basic class?
>
> The first class which prevents serialization is, as you mentioned,
> NucleotideCompound. I lack the biojava experience to say what is essential
> in NucleotideCompound and why it does not allow an empty constructor. But I
> saw for example in biojava 3.1 that compounds are allowed to have flexible
> name length, which I have never seen in actual sequence data, where it is
> always one or three characters. Is it not a better strategy to keep basic
> classes such as Sequence and Compound more basic in order to allow
> serialization? Implementation of more complex features could then be moved
> to classes which extend the basic classes.
>
> In my humble opinion one could instantiate a compound without a 'base'
> name, and once this compound is added to the compound set, one could check
> that it actually has a base name.
>
> I do not want to sound like a know-it-all and am not trying to reinvent
> biojava. However, to be honest, the (unsuccessful) effort of trying to
> serialize an ArrayList<Sequence<?>>, whether to send it over TCP/IP,
> to JSON or to disk, has been so frustrating and time-consuming that I
> am actually considering switching to jython/biopython or biojavaX, or
> writing my own implementation.
>
> Cheers
> Stefan
>
>
>
>
>
>
>   Andreas Prlic <andreas at sdsc.edu> wrote at 4:32 on Thursday,
> 5 February 2015:
>
>
>
>
> Hi Stefan,
>
> just another quick follow up. You took a look at FastaWriterHelper and it
> was not useful, right? You need to serialize some header information as
> well, or what was the problem with it?
>
>
> http://www.biojava.org/docs/api/org/biojava/nbio/core/sequence/io/FastaWriterHelper.html
>
> Thanks,
>
> Andreas
>
>
> On Wed, Feb 4, 2015 at 7:13 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>
> Thanks for pointing this out, Stefan. The problem is that the
> NucleotideCompound class does not have a zero-args constructor. That means
> you need to tweak kryo a bit. Kryo can be configured to use an
> InstantiatorStrategy to handle creating instances of a class.
> https://github.com/EsotericSoftware/kryo/blob/master/README.md
>
> Having said that, we need to improve the API and make something like this
> easier.
>
> Andreas
>
>
>
> On Wed, Feb 4, 2015 at 2:54 AM, stefan harjes <stefanharjes at yahoo.de>
> wrote:
>
> I finally had some time to try the serialization/deserialization library
> (Kryo) you mentioned, but I cannot get it to work. I cannot even
> save a DNASequence:
>
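> import java.io.FileInputStream;
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
>
> import org.biojava.nbio.core.exceptions.CompoundNotFoundException;
> import org.biojava.nbio.core.sequence.DNASequence;
>
> import com.esotericsoftware.kryo.Kryo;
> import com.esotericsoftware.kryo.io.Input;
> import com.esotericsoftware.kryo.io.Output;
>
> void test() {
>     Kryo kryo = new Kryo();
>     DNASequence dna = null;
>     try {
>         dna = new DNASequence("AGCT");
>     } catch (CompoundNotFoundException e1) {
>         // Thrown if the string contains a symbol unknown to the compound set
>         e1.printStackTrace();
>     }
>     // Write the sequence to disk; try-with-resources closes the stream
>     try (Output output = new Output(new FileOutputStream("test.ser"))) {
>         kryo.writeObject(output, dna);
>     } catch (FileNotFoundException e) {
>         e.printStackTrace();
>     }
>     // Read the sequence back from the same file
>     try (Input input = new Input(new FileInputStream("test.ser"))) {
>         dna = kryo.readObject(input, DNASequence.class);
>     } catch (FileNotFoundException e) {
>         System.out.println("file not found");
>         e.printStackTrace();
>     }
> }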
> I tried several Kryo calls and also class registration, but I cannot get it
> to work. Any ideas?
>
>
> Cheers
> Stefan
>
>
>   Andreas Prlic <andreas at sdsc.edu> wrote at 3:47 on Saturday,
> 31 January 2015:
>
>
> Hi Stefan,
>
> for your use case (save and load at server start/stop) I'd recommend the
> Kryo library. It will store your data in a binary format. It should take
> only two lines of code each to persist and load the data.
> https://github.com/EsotericSoftware/kryo
>
> You are right, writing is not very well developed, but then there are so
> many utility libraries in Java that can be used for efficient
> serialization/deserialization in many ways, once you have an object in
> memory.
>
> Andreas
>
>
>
> On Fri, Jan 30, 2015 at 3:01 AM, stefan harjes <stefanharjes at yahoo.de>
> wrote:
>
> Hi biojava-l
>
>
>
> I have a huge number of small sequences in a list
> (ArrayList<Sequence<?>>) which I would like to store on disk at server
> start and stop. Unfortunately Sequence is not serializable, so I searched and
> found that GenbankWriterHelper.writeSequences(OutputStream os,
> Collection<Sequence<?>> seqs) should be able to do the job.
> However, when looking at GenbankReaderHelper, I find no methods which
> correspond to the above writer method. Am I completely on the wrong track?
>
> When looking at the writer/reader helpers, I think I remember reading that
> they are rudimentary and save only the sequence (FASTA)? I would expect that
> such an advanced version of biojava (4.0 is being prepared?) has
> a standard way to serialize rich sequences/arrays of them in order
> to send them around on streams, as JSON, etc.
>
> Any help would be appreciated
>
> Cheers
> Stefan
>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biojava-dev