[Biojava-dev] Suggestions for the BioJava project

Sat Aug 13 05:29:56 EDT 2005

Dear BioJava developers,

I am currently working with BioSQL and BioJava. While trying to insert
the NCBI taxon hierarchy and different kinds of biosequences into the
database, I found that there are still some open problems in BioJava.
Since I wrote some classes anyway, I would like to contribute them to
the community.

The first thing I am missing in BioJava is a class that helps to parse
the information given in the NCBI names.dmp and nodes.dmp files, which
can be downloaded from the NCBI ftp site and contain all the relevant
information about the current taxon tree of known species.
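
To illustrate what such a parser has to handle, here is a minimal sketch
of splitting one line of a dump file. It assumes the usual layout of the
NCBI taxonomy dump, where fields are separated by tab, pipe, tab and each
row ends with tab, pipe. The class name DumpLine is made up for this
example; it is not part of BioJava and not one of the attached classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a BioJava class): splits one NCBI dump line. */
final class DumpLine {
    private DumpLine() {}

    /**
     * Splits a names.dmp or nodes.dmp line into trimmed fields.
     * Assumption: fields are separated by "\t|\t" and the row ends with "\t|".
     */
    static List<String> fields(String line) {
        String row = line;
        // Drop the row terminator so it does not end up in the last field.
        if (row.endsWith("\t|")) {
            row = row.substring(0, row.length() - 2);
        }
        List<String> out = new ArrayList<>();
        // Limit -1 keeps empty fields, e.g. an empty "unique name" column.
        for (String field : row.split("\t\\|\t", -1)) {
            out.add(field.trim());
        }
        return out;
    }
}
```

For a names.dmp line the fields would then be the taxon id, the name, the
unique name (often empty) and the name class, in that order.
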
Constructing Taxon objects containing all the information and inserting
them into the BioSQL database was very hard, especially because the
current TaxonSQL class only inserts the name, name_class, ncbi_taxon_id
and parent_id into the database.
The other information that the SQL schema already provides for, such as
genetic code, mitochondrial genetic code etc., is lost even if it is
included in the Annotation of the Taxon instance. This can be seen by
looking at the insert statements in the class TaxonSQL.
Additionally, a Map data structure is used to access the
EbiFormat.PROPERTY_TAXON_NAMES in class TaxonSQL.
This only allows us to store exactly one synonym, one includes, one
equivalent name and so on for any one species. Some species have more
than one synonym and so on, and this information is lost when using a
plain Map. I would suggest using a Map in which every key points to a
Set containing the names of that class, except when there is exactly
one name (as is supposed to be the case for the scientific name). This
could be realised by a simple case distinction.
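
A minimal sketch of the idea follows. The class TaxonNames is hypothetical
and only illustrates the data structure; it is not the BioJava API and not
the code in the attachment. Note that it deviates slightly from the
suggestion above: it always stores a Set, even for a single name, which
avoids the case distinction at the cost of one extra object per name class.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical container (not a BioJava class): name class -> all names. */
final class TaxonNames {
    private final Map<String, Set<String>> byClass = new LinkedHashMap<>();

    /** Adds a name under its name class; returns false if it was already stored. */
    boolean add(String nameClass, String name) {
        return byClass.computeIfAbsent(nameClass, k -> new LinkedHashSet<>()).add(name);
    }

    /** Returns a read-only view of all names of the class; empty if there are none. */
    Set<String> get(String nameClass) {
        Set<String> names = byClass.get(nameClass);
        return names == null
            ? Collections.<String>emptySet()
            : Collections.unmodifiableSet(names);
    }
}
```

With this structure a second synonym for the same species is simply added
to the existing Set instead of overwriting the first one.
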
On the other hand, it might be useful to retrieve data about the taxa from
the database not only by the NCBI ID. For some purposes it might make
sense to have a method to access the taxa by one of the names. This
is why I would suggest adding the method:

public static Taxon getTaxon(Connection conn, String name)
  throws BioRuntimeException

to retrieve the data by name. Attached to this mail you will find my
extended version of TaxonSQL including these functions. I also added a
function that returns an array of Strings in which every entry is one
scientific name, so you can easily see which species are already inserted
in the database.
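
As a rough sketch of what a lookup by name involves on the database side,
the following uses plain JDBC. It assumes the BioSQL tables taxon and
taxon_name with the columns taxon_id, ncbi_taxon_id and name. The class
TaxonNameLookup and its method are hypothetical and not the attached code.
To avoid guessing at BioJava internals, the sketch returns NCBI taxon ids
rather than Taxon objects.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a BioJava class): finds taxa by any of their names. */
final class TaxonNameLookup {
    private TaxonNameLookup() {}

    /** Assumed BioSQL layout: taxon_name(taxon_id, name, ...) joined to taxon. */
    static final String BY_NAME =
        "SELECT DISTINCT t.ncbi_taxon_id FROM taxon t "
        + "JOIN taxon_name n ON n.taxon_id = t.taxon_id "
        + "WHERE n.name = ?";

    /** Returns the NCBI ids of all taxa carrying the given name (may be several). */
    static List<Integer> ncbiIdsForName(Connection conn, String name) throws SQLException {
        List<Integer> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(BY_NAME)) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getInt(1));
                }
            }
        }
        return ids;
    }
}
```

A list is returned because a name such as a common name or synonym is not
guaranteed to identify exactly one taxon.
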
I modified the functions that put taxa into and retrieve taxa from the
database so that they take into account that the names Map can sometimes
point to a Set of names instead of a plain name String.
In addition, I took into account the other information that can be inserted
into the database, such as the genetic code.
However, some methods in TaxonSQL.java are declared private. If they were
protected, an extension of this class could call them.
But because they're private, I had to copy and paste them.
To read names.dmp and nodes.dmp from the NCBI ftp site I wrote
another parser, because there was none. This class constructs a whole
tree of all given species as Taxon objects in a TaxonFactory
instance with all the information mentioned above, which would normally not
be included. The SQL schema also has attributes called left value and right
value. These are assigned by a depth-first search through the taxon tree.
Whenever the algorithm visits a node for the first time, the left value is
set from an incrementing counter. Then all children are visited.
The right value is set when all children of the node have been visited.
This gives the property that all children of a node have left values and
right values between the values of this node.
The problem is that this function may cause a StackOverflowError. Maybe this
should already be implemented in the TaxonFactory containing all nodes. The
other problem is that the children of a node don't have further children
unless you use the search method of the TaxonFactory to find the child
again. Then the children of the children appear (if there are any). The same
holds for the parent of a node's parent. This is why we have to
call the search method again and again while traversing the taxon tree. I
don't know if this is very efficient, because I didn't really look at the
source code of the search method. It would be nice to find something more
efficient to traverse a whole taxon tree to set the left value and the right
value.
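
One possible direction, sketched below, is to replace the recursion by an
explicit stack and to build the child lists once from the parent ids, so
that no search call is needed per node. This is only an illustration under
assumptions: it works on plain taxon ids taken from a map of child id to
parent id (as nodes.dmp provides them), the class NestedSetNumbering is
hypothetical, and it does not use TaxonFactory at all. It also assumes that
the right value is taken from the same counter as the left value.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a BioJava class): left/right values without recursion. */
final class NestedSetNumbering {
    private NestedSetNumbering() {}

    /**
     * @param parentOf maps each taxon id to its parent id
     * @param root     id of the root taxon
     * @return taxon id -> {left value, right value}
     */
    static Map<Integer, int[]> number(Map<Integer, Integer> parentOf, int root) {
        // Build all child lists in one pass instead of searching per node.
        Map<Integer, List<Integer>> children = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : parentOf.entrySet()) {
            if (e.getKey().equals(e.getValue())) {
                continue; // the NCBI root lists itself as its own parent
            }
            children.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }

        Map<Integer, int[]> values = new HashMap<>();
        Deque<Integer> open = new ArrayDeque<>();              // nodes not yet closed
        Deque<Iterator<Integer>> pending = new ArrayDeque<>(); // their unvisited children
        int counter = 0;

        values.put(root, new int[] {++counter, 0});
        open.push(root);
        pending.push(childIterator(children, root));

        while (!open.isEmpty()) {
            Iterator<Integer> it = pending.peek();
            if (it.hasNext()) {
                int child = it.next();
                values.put(child, new int[] {++counter, 0}); // first visit: left value
                open.push(child);
                pending.push(childIterator(children, child));
            } else {
                values.get(open.pop())[1] = ++counter;       // children done: right value
                pending.pop();
            }
        }
        return values;
    }

    private static Iterator<Integer> childIterator(Map<Integer, List<Integer>> children, int id) {
        return children.getOrDefault(id, Collections.emptyList()).iterator();
    }
}
```

Because the pending work lives in two heap-allocated deques, the depth of
the tree no longer affects the Java call stack. For a tree with n nodes the
root ends up with left value 1 and right value 2n.
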
The stack overflow problem also occurs if one wants to add a whole subtree
of taxa to the database. It might cause a StackOverflowError if one does it
with a recursive depth-first search.

These are my suggestions to the BioJava-Project. I am looking forward to
your response and comments.
By the way, it would also be nice to tell users of BioJava that the
attribute 'synonym' in the table 'term_synonym' in the BioSQL schema should
be renamed to 'name'. I kept wondering why I got an error message
until I figured that out.

Yours sincerely
Andreas Dräger 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: NCBITaxonParser.java
Type: text/x-java
Size: 19410 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biojava-dev/attachments/20050813/2408f953/NCBITaxonParser-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: MyTaxonSQL.java
Type: text/x-java
Size: 17010 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biojava-dev/attachments/20050813/2408f953/MyTaxonSQL-0001.bin