[Biojava-dev] Genbank read/write

stefan harjes stefanharjes at yahoo.de
Thu Feb 12 01:58:51 UTC 2015


Hi Paolo, Andreas,
I am sorry if I sounded disrespectful.
I would like to point out that a new user is easily confused by what has been published about the BioJava library. Several places strongly suggest that concatenated sequences are the intended design (see the citations below).
I read those citations as saying that the library is aimed at files of concatenated sequences. The actual reader, however, reads only one record. I thought it would be helpful to correct this discrepancy, so I submitted a patch. Usually patches are reviewed by a maintainer, who either accepts or rejects the pull request. I would also like to mention that I spent a lot of time understanding and correcting the issue.
I would be glad if you could find a way to include contributions.
I would also like to mention that there are errors in GenBank reading and writing. When I compare an original GenBank file with one that has first been read and then written, I see several differences between the two files. The most urgent is that the Location start of each feature is incremented by one on every read/write cycle. There are also some minor issues: the version field is shortened, references and organism are dropped, keywords and source are not copied, and so on. So it seems you are in need of additional contributions.
citations:
from the cookbook:
/*
 * Method 2: With the GenbankReaderHelper
 */
//Try with the GenbankReaderHelper
File dnaFile = new File("src/test/resources/NM_000266.gb");
File protFile = new File("src/test/resources/BondFeature.gb");

LinkedHashMap<String, DNASequence> dnaSequences =
        GenbankReaderHelper.readGenbankDNASequence( dnaFile );
for (DNASequence sequence : dnaSequences.values()) {
    System.out.println( sequence.getSequenceAsString() );
}

Without knowing the contents of 'NM_000266.gb', the reader must assume that the file contains several sequences. First, the LinkedHashMap is called 'dnaSequences', with emphasis on the plural. Second, if you read only one DNASequence, why would you need a LinkedHashMap, and why would you loop over a single sequence? Correct me if I am wrong, but in my opinion the cookbook expects several concatenated sequences in a single file.
On the other hand, the method itself is named 'readGenbankDNASequence' (singular), which speaks for non-concatenated sequences. So I looked into the method to gain more clarity.
from the source code of GenbankReader:
/**
 * This method tries to parse maximum <code>max</code> records from
 * the open File or InputStream, and leaves the underlying resource open.<br>
 ...
The introductory comment of the method clearly speaks of multiple records. The method is called with the parameter 'max=-1' to indicate that all records of the file should be read. Interestingly, the parameter max is not mentioned again in the code that follows and is thus not implemented.
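To make clear what I would expect from such a parameter, here is a minimal, self-contained sketch. It is not the BioJava code: the class MaxRecordsSketch and the method readRecords are hypothetical names of my own, and the only assumption about the format is that each GenBank record ends with a line containing just "//". With max=-1 all records are returned, otherwise reading stops after max records and the reader is left open.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class MaxRecordsSketch {

    /**
     * Hypothetical helper (not part of BioJava): returns up to max raw
     * records, or all of them when max is negative. A record is taken to
     * end at a line containing only "//". The reader is not closed, and a
     * trailing record without a terminator line is dropped.
     */
    public static List<String> readRecords(BufferedReader reader, int max) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String line;
        // Stop before reading further once max records have been collected.
        while ((max < 0 || records.size() < max) && (line = reader.readLine()) != null) {
            current.append(line).append('\n');
            if (line.trim().equals("//")) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String twoRecords = "LOCUS       A\nORIGIN\n//\nLOCUS       B\nORIGIN\n//\n";
        // max = -1 reads every record in the input.
        System.out.println(readRecords(new BufferedReader(new StringReader(twoRecords)), -1).size());
        // max = 1 stops after the first record.
        System.out.println(readRecords(new BufferedReader(new StringReader(twoRecords)), 1).size());
    }
}
```

Run on the two-record input above, this prints 2 and then 1. This is the behaviour I would expect from the comment in GenbankReader if concatenated files are supported.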
So do you not agree that the design question of whether concatenated sequence files are expected has not yet been settled in your library?
Best regards
Stefan