<p dir="ltr">I'm the author of the genbank writer and would greatly appreciate fixes/improvements... I initially ported it from BioPython and simply kept their function structure intact. I focused on the portions my application needed, so there are likely gaps. I'm sure there are bugs as well, and likely can help look over patches to it. </p>
<p dir="ltr">Thanks in advance and sorry for any issues/bugs that may exist in the code...</p>
<div class="gmail_quote">On Feb 12, 2015 6:06 AM, "Paolo Pavan" <<a href="mailto:paolo.pavan@gmail.com">paolo.pavan@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Stefan,<br><div><div class="gmail_quote"><div dir="ltr"><div><div><div><br>Don't misunderstand me! Anyone effort generally is more than welcome by lead developers and yes, the design is not simple and we have already talked about the idea of a refactoring that probably will be rediscussed and planned in future. Just try to be more graceful! <br></div><br></div><div>In the specific: <br></div><div>- patch for multiple loading is great (twin GenbankProxySequenceLoader still needs to be verified, if you want). <br></div><div>- Just review the tags you added, I added one note directly in github. Fell free to add examples in test files if you are conviced.<br></div><div>- about references, database and so, new properties need to be added in AbstractSequence, ask Andreas if there are concern with this<br></div><div>- the Writer subsystem needs to be verified according to recent modifications in loaders. My suspect is that will fail in situation described above, but will not take long to review. I you would like to do this I'm sure that Andreas will be very happy to include your patches<br></div><div>- You're right, initial coordinates seems to need a fix. Moreover the ordering of qualifiers is shuffled respect to the original, but I don't think this is an issue when reloading the file.<br></div><div><br></div><div>This will fix your original use case with limited work, in my opinion.<br></div><div><br></div>Bye bye and thank you if you will decide to use your time.<span><font color="#888888"><br>Paolo<br></font></span></div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">2015-02-12 2:32 GMT+01:00 stefan harjes <span dir="ltr"><<a href="mailto:stefanharjes@yahoo.de" target="_blank">stefanharjes@yahoo.de</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div style="color:#000;background-color:#fff;font-family:HelveticaNeue,Helvetica Neue,Helvetica,Arial,Lucida Grande,sans-serif;font-size:16px"><div><div><div style="background-color:rgb(255,255,255)"><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><span>Hi Paolo, Andreas,</span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><br clear="none"><span></span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><span>I am sorry if I sounded disrespectful. <br clear="none"></span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><span><br clear="none"></span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><span>I would like to point out that a new user gets confused by what has been published about the biojava library. There are several places where you strongly indicate that concatenated sequences are the intended design (see citations below).</span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><br clear="none"></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">I
interpreted the below citations, that the library is aimed at concatenated
sequences. However the actual reading reads only one record. I thought it
to be very helpful to correct this discrepancy and submit a patch. Usually patches are reviewed by a maintainer who either accepts or rejects the pull request. And I would like to mention, that I also spent a
lot of time to understand and correct the issue.<br clear="none"></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><br clear="none"></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">I would be glad if you find a way how to include contributions.</div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><br clear="none"></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">I would also like to mention, that there are errors during Genbank reading/writing. <span>When I compare an original Genbank sequence to one which has been first read and then written, I can see that there are several differences between the two files. </span><span>The most urgent of which is that </span>the Location
start of each feature is incremented by
one for each read/write cycle. There are also some minor issues like: the version field is shortened, references and organism are dropped, keywords and source are not copied etc. So it seems you are in need of additional contributions.</div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><br clear="none"></div>citations:<br clear="none"><span></span><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><span><br clear="none"></span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><span>from the cookbook:</span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><br clear="none"></div><pre style="color:rgb(0,0,0);font-family:monospace;font-size:16px">        <span style="color:#666666;font-style:italic">/*
         * Method 2: With the GenbankReaderHelper
         */</span>
        <span style="color:#666666;font-style:italic">//Try with the GenbankReaderHelper</span>
        <a rel="nofollow" shape="rect" href="http://www.google.com/search?hl=en&q=allinurl%3Afile+java.sun.com&btnI=I%27m%20Feeling%20Lucky" target="_blank"><span style="color:#003399">File</span></a> dnaFile <span style="color:#339933">=</span> <span style="color:#000000;font-weight:bold">new</span> <a rel="nofollow" shape="rect" href="http://www.google.com/search?hl=en&q=allinurl%3Afile+java.sun.com&btnI=I%27m%20Feeling%20Lucky" target="_blank"><span style="color:#003399">File</span></a><span style="color:#009900">(</span><span style="color:#0000ff">"src/test/resources/NM_000266.gb"</span><span style="color:#009900">)</span><span style="color:#339933">;</span>                
        <a rel="nofollow" shape="rect" href="http://www.google.com/search?hl=en&q=allinurl%3Afile+java.sun.com&btnI=I%27m%20Feeling%20Lucky" target="_blank"><span style="color:#003399">File</span></a> protFile <span style="color:#339933">=</span> <span style="color:#000000;font-weight:bold">new</span> <a rel="nofollow" shape="rect" href="http://www.google.com/search?hl=en&q=allinurl%3Afile+java.sun.com&btnI=I%27m%20Feeling%20Lucky" target="_blank"><span style="color:#003399">File</span></a><span style="color:#009900">(</span><span style="color:#0000ff">"src/test/resources/BondFeature.gb"</span><span style="color:#009900">)</span><span style="color:#339933">;</span>
        LinkedHashMap<span style="color:#339933"><</span>String, DNASequence<span style="color:#339933">></span> dnaSequences <span style="color:#339933">=</span> GenbankReaderHelper.<span style="color:#006633">readGenbankDNASequence</span><span style="color:#009900">(</span> dnaFile <span style="color:#009900">)</span><span style="color:#339933">;</span>
        <span style="color:#000000;font-weight:bold">for</span> <span style="color:#009900">(</span>DNASequence sequence <span style="color:#339933">:</span> dnaSequences.<span style="color:#006633">values</span><span style="color:#009900">(</span><span style="color:#009900">)</span><span style="color:#009900">)</span> <span style="color:#009900">{</span>
                 <a rel="nofollow" shape="rect" href="http://www.google.com/search?hl=en&q=allinurl%3Asystem+java.sun.com&btnI=I%27m%20Feeling%20Lucky" target="_blank"><span style="color:#003399">System</span></a>.<span style="color:#006633">out</span>.<span style="color:#006633">println</span><span style="color:#009900">(</span> sequence.<span style="color:#006633">getSequenceAsString</span><span style="color:#009900">(</span><span style="color:#009900">)</span> <span style="color:#009900">)</span><span style="color:#339933">;</span>
        <span style="color:#009900">}</span>
</pre><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><span> </span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><span>without knowing the contents of 'NM_000266.gb' the reader must assume, that there are several sequences in the file as first: </span><span></span><span>The LinkedHashMap is called 'dnaSequences" with emphasis on the plural. Second if you read only one DNASequence why would you have a LinkedHashMap and why would you loop over one! sequence? Correct me if I am wrong, but in my opinion the cookbook expects concatenated sequences per single file.</span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><br clear="none"><span></span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><span>For non concatenated sequences speaks, that the method itself is named 'readGenbankDNASequences'. So I looked into the method to gain more clarity.<br clear="none"></span></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><br clear="none"></div><div dir="ltr"><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">from the source code of GenbankReader:</div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">/**</div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">* This method tries to parse maximum <code>max</code> records from</div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">* the open File or InputStream, and leaves the underlying resource open.<br></div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">.</div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">.</div><div dir="ltr" style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px">.</div><div style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px"><br></div><div dir="ltr">The introducing comment of the method clearly speaks of multiple records. The</div><div dir="ltr">method is called with a parameter 'max=-1' to indicate that all records of the</div><div dir="ltr">file should be read. Interestingly the parameter max is not mentioned again in the following code </div><div dir="ltr">and thus not implemented. </div><div dir="ltr"><br></div><div dir="ltr">So do you not agree, that the design discussion of whether or not concatenated</div><div dir="ltr">sequence files are expected is not decided in your library?</div><div dir="ltr"><br></div><div style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px" dir="ltr">Best regards</div><div style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px" dir="ltr">Stefan</div><div style="color:rgb(0,0,0);font-family:HelveticaNeue,'Helvetica Neue',Helvetica,Arial,'Lucida Grande',sans-serif;font-size:16px" dir="ltr"><br></div></div><div style="display:block"> <div style="font-family:HelveticaNeue,Helvetica Neue,Helvetica,Arial,Lucida Grande,sans-serif;font-size:16px"> <div style="font-family:HelveticaNeue,Helvetica Neue,Helvetica,Arial,Lucida Grande,sans-serif;font-size:16px"> <div dir="ltr"> <font face="Arial"> Andreas Prlic <<a href="mailto:andreas@sdsc.edu" target="_blank">andreas@sdsc.edu</a>> schrieb am 9:10 Donnerstag, 12.Februar 2015:<br clear="none"> </font> </div><div><div> <br clear="none"><br clear="none"> <div><br><br></div><div><div><div><div><div dir="ltr">Thanks, Paolo, I would like to second what you said. <div><br clear="none"></div><div>We don't have many rules at BioJava, but "respect the work of others" is one that we used to send out to developers in the past before we granted SVN write access. I would like to put up this rule somewhere where it can be noticed. Nobody gets paid to contribute to BioJava and we are depending on contribution made on a volunteer basis. As such it is important to show respect for the work that people contributed previously. This does not mean we can't change things, it just means we do it in a way where everybody can feel respected.</div><div><br clear="none"></div><div>Andreas<br clear="none"><div><br clear="none"><div><br clear="none"></div><div><br clear="none"></div><div><div><br clear="none"><div>On Wed, Feb 11, 2015 at 3:12 AM, Paolo Pavan <span dir="ltr"><<a rel="nofollow" shape="rect" href="mailto:paolo.pavan@gmail.com" target="_blank">paolo.pavan@gmail.com</a>></span> wrote:<br clear="none"><blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div>Hi Stefan,<br clear="none"></div>I don't want to talk about the design of the system. I know what you mean because I have already had the need to work on that and, yes, there are some choices that I would have not taken. But my choice, entering in a collaborative project, was to mantain the planned design to be sure to add some extra features without run the risk of getting lost in huge modifications, minimize the impact of modifications on the API and moreover to have some kind of "respect" for the work of people before me.<br clear="none">Unless the system works in reasonable time, in my opinion this is the important thing. <br clear="none"><br clear="none">Moreover be aware that there is also a parallel GenbankProxyLoader system that I'm pretty sure that it is not receiving your patches (catenated sequence loading). To be checked.<br clear="none"><br clear="none"></div>About section keys, unknown tags and collecting them in a list, I partially agree with you (more no than yes) because the Genbank format is highly specified and formal, please have a look at the link below. The best think would have to throw an Exception in unsupported cases but this would definitely limit the parser usage.<br clear="none"><span><a rel="nofollow" shape="rect" href="http://www.insdc.org/files/feature_table.html" target="_blank">http://www.insdc.org/files/feature_table.html</a></span><br clear="none"><br clear="none"></div>About reference, authors and dblink, as prevously mentioned, I have no exceptions to add those properties to AbstractSequence even if I feel they make poor sense outside your specific use case.<br clear="none"></div>Andreas has the last word on this, however.<br clear="none"><br clear="none"></div>My suggestion is just to carefully check your new added tags since it seems to me that <span>SOURCE_TAG is the same of your </span><span>DBSOURCE</span><span> but I know that this topic is also debated in the others bio* projects so it is not trivial.<br clear="none"><br clear="none"></span></div><span>Happy biojava-ing,<br clear="none"></span></div><span>Paolo<br clear="none"></span><div><div><span><br clear="none"><br clear="none"></span> <span></span><div><div><div><div><br clear="none"><br clear="none"></div></div></div></div></div></div></div><div><div><div><br clear="none"><div>2015-02-11 0:13 GMT+01:00 stefan harjes <span dir="ltr"><<a rel="nofollow" shape="rect" href="mailto:stefanharjes@yahoo.de" target="_blank">stefanharjes@yahoo.de</a>></span>:<br clear="none"><blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div style="color:#000;background-color:#fff;font-family:HelveticaNeue,Helvetica Neue,Helvetica,Arial,Lucida Grande,sans-serif;font-size:16px"><div dir="ltr"><span>Hi Paolo,</span></div><div dir="ltr"><br clear="none"><span></span></div><div dir="ltr"><span>come on now the reader returns a LinkedHashMap. If it would only read one sequence, it could simply return the sequence. Also the actual method call contains a integer parameter indicating how many sequences should be fetched. In my pull request I patched the reader so that it actually does that.</span></div><div dir="ltr"><br clear="none"><span></span></div><div dir="ltr"><span>The sectionKey tags I was talking about are actually 'PRIMARY', DBREFERENCE' and "DBLINK'. I think it would be better not to ask after every single key, but simply collect all 'not known' keys together with their values and store them in a list. Then you could just read/write them without silently forget about them. Right now if any unknown tag is read, it is simply recognized and then dropped, which I do not consider friendly.</span></div><div dir="ltr"><br clear="none"><span></span></div><div dir="ltr"><span>Cheers</span></div><div dir="ltr"><span>Stefan</span></div><div dir="ltr"><span><br clear="none"></span></div><div dir="ltr"><br clear="none"><span></span></div><div dir="ltr"><span></span></div> <div><br clear="none"><br clear="none"></div><div style="display:block"> <div style="font-family:HelveticaNeue,Helvetica Neue,Helvetica,Arial,Lucida Grande,sans-serif;font-size:16px"> <div style="font-family:HelveticaNeue,Helvetica Neue,Helvetica,Arial,Lucida Grande,sans-serif;font-size:16px"> <div dir="ltr"> <font face="Arial"> Paolo Pavan <<a rel="nofollow" shape="rect" href="mailto:paolo.pavan@gmail.com" target="_blank">paolo.pavan@gmail.com</a>> schrieb am 23:56 Dienstag, 10.Februar 2015:<br clear="none"> </font> </div><div><div> <br clear="none"><br clear="none"> <div><div><div><div dir="ltr"><div><div><div><div><div>Hi Stefan, thank you for the review.<br clear="none"></div>You are actually surprising me since if I'm not sure that the reader parser supports multiple genbank files catenated I tought instead that all the info now are full filled in the sequence object. <br clear="none">There are just few tags that are not imported (KEYWORDS_TAG, SOURCE_TAG, REFERENCE_TAG, BASE_COUNT_TAG), the documentation says that this is because they are anyway inferrable by different fields. I can also add this is because, as in the case of authors and reference tags, there is not such a property in the AbstractSequence class and I see poor sense to have it, unless you are doing this sort of swapping job. Anyway, it could be certainly added. <br clear="none"><br clear="none"></div></div>About writer and its failure in writing the reported accession: even if I can't go in deep now, it may well be that it is failing in writing InsdcLocations (also known as split locations, for example<span> <i>join(58474..59052,59052..59279)</i></span> reported in your genome file) since they have been used more sistematically by the last updated genbank reader. It may need a quick review along with db_xref qualifiers as well. <br clear="none"><br clear="none"></div>In the end about alignment, if you are using NedlemanWunsch or such in the alignment package, be sure to load them with the proper AmbiguityDNACompoundSet.<br clear="none"><br clear="none">Cheers,<br clear="none"></div>Paolo<br clear="none"></div><div><br clear="none"><div>2015-02-10 10:28 GMT+01:00 stefan harjes <span dir="ltr"><<a rel="nofollow" shape="rect" href="mailto:stefanharjes@yahoo.de" target="_blank">stefanharjes@yahoo.de</a>></span>:<br clear="none"><div><blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div style="color:#000;background-color:#fff;font-family:HelveticaNeue,Helvetica Neue,Helvetica,Arial,Lucida Grande,sans-serif;font-size:16px"><div dir="ltr"><span>Hi Paolo, biojava-dev<br clear="none"></span></div><div dir="ltr"><br clear="none"><span></span></div><div dir="ltr"><span>I had a look myself. First I noticed, that GenbankWriter was actually more sophisticated than the Reader, as it was able to write more than one sequence. I submitted a pull request to patch GenbankReader which enables reading more than one genbank sequence from one file. When we speak of full Genbank reading capability, there are still at least 5 sectionKeys which are just ignored in the reader. I think there should be a way of simply storing them in a List and not asking for each one of them, maybe I will look there later.</span></div><div dir="ltr"><br clear="none"><span></span></div><div dir="ltr"><span>The writer is doing pretty well, but you should try to write 'NC_000913.gb' which crashed it in my case (writing nothing/no exception).</span></div><div dir="ltr"><br clear="none"><span></span></div><div dir="ltr"><span>I added two more test cases, but I think in order to really test the reader/writer capabilities we need a test where several sequences/proteins are read, merge into an array and written to stream. Upon reading this stream again, we should compare if they are still identical.</span></div><div dir="ltr"><br clear="none"><span></span></div><div dir="ltr"><span>Also I noticed, that you can not compare (align) a DNA sequence with non ambiguous nucleotide to a sequence with ambiguous nucleotide compounds even though a matrix dedicated for that exact comparison exists.<br clear="none"></span></div><div dir="ltr"><br clear="none"><span></span></div><div dir="ltr"><span>Cheers</span></div><div dir="ltr"><span>Stefan<br clear="none"> </span></div><div><br></div></div></div></blockquote></div></div></div></div></div></div></div></div></div></div></div></div></div></blockquote></div></div></div></div></blockquote></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></blockquote></div></div></div></div></div></div></div>
<br>_______________________________________________<br>
biojava-dev mailing list<br>
<a href="mailto:biojava-dev@mailman.open-bio.org">biojava-dev@mailman.open-bio.org</a><br>
<a href="http://mailman.open-bio.org/mailman/listinfo/biojava-dev" target="_blank">http://mailman.open-bio.org/mailman/listinfo/biojava-dev</a><br></blockquote></div>