[Biojava-l] File formats and other questions

Matthew Pocock mrp@sanger.ac.uk
Sat, 05 May 2001 15:18:14 +0100


Hi David,

David Waring wrote:

> I am new to the bioinformatics community and Biojava, so bear with me if
> these topics have been discussed before.
> 
> Among the many file formats for DNA sequence data, I see that there is Java
> support for Fasta, GFF, and EMBL. In addition to these, I will be working
> with a few other formats and am wondering if anyone in the community is
> working with these (listed below) either within the biojava model or not.
> 
> ace files (the output of phrap not AceDB)
> We have a very nice parser (based on jlex and cup) and all classes needed to
> work with this format. They are not, however, developed using the biojava
> API. I have not looked into what it would take to make them compliant, but I
> don't think it would be very difficult.

There is currently no direct support for ace files. However, it should 
be fairly easy for you to take your existing parser and get it to 
generate BioJava-compliant feature objects.
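
As a rough illustration of that bridging step, here is a minimal sketch of
an adapter that turns parser output into feature-like records. Every name in
it (AceBridgeSketch, AceRead, FeatureRecord, toFeature) is hypothetical and
stands in for whatever your parser and the BioJava feature types actually
provide; none of it is BioJava API. It assumes the parser already knows each
read's name, orientation, padded start position and length.

```java
import java.util.ArrayList;
import java.util.List;

final class AceBridgeSketch {

    /** Hypothetical parser output: one read placed on a contig. */
    static final class AceRead {
        final String name;
        final boolean complemented;
        final int paddedStart;   // 1-based start on the padded consensus
        final int length;        // padded read length

        AceRead(String name, boolean complemented, int paddedStart, int length) {
            this.name = name;
            this.complemented = complemented;
            this.paddedStart = paddedStart;
            this.length = length;
        }
    }

    /** Hypothetical stand-in for a feature description (not a BioJava class). */
    static final class FeatureRecord {
        final String type;
        final String source;
        final int start;   // inclusive
        final int end;     // inclusive
        final int strand;  // +1 forward, -1 reverse

        FeatureRecord(String type, String source, int start, int end, int strand) {
            this.type = type;
            this.source = source;
            this.start = start;
            this.end = end;
            this.strand = strand;
        }
    }

    /** Maps one read placement to a feature-like record. */
    static FeatureRecord toFeature(AceRead read) {
        int end = read.paddedStart + read.length - 1;
        int strand = read.complemented ? -1 : 1;
        return new FeatureRecord("read", read.name, read.paddedStart, end, strand);
    }

    /** Maps a whole contig's worth of reads, preserving order. */
    static List<FeatureRecord> toFeatures(List<AceRead> reads) {
        List<FeatureRecord> features = new ArrayList<>();
        for (AceRead read : reads) {
            features.add(toFeature(read));
        }
        return features;
    }
}
```

The point of keeping the adapter separate is that the existing parser stays
untouched; only the last step, building the target objects, would need to
change to produce real BioJava features.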

> 
> HTGS (the NCBI format for submission of High Throughput Genomic Sequencing)
> These are ASN.1 files, specifically Seq-submit. The NCBI toolkit has all
> sorts of tools for handling these but they are all in C. Are there Java
> tools available also?
> 
> Is anyone familiar with the OSS Java tools for ASN.1? They can read an ASN.1
> spec. and create Java classes for that data model and then read and write
> the ASN.1 files. Looks like it could be a useful tool but could be pricey.

I've never used any Java ASN.1 tools, but others on this list may have. 
Since we are open-source, anything more expensive than free is too 
pricey as a prerequisite for the main library. As always, it is usually 
fairly easy to bridge external data to the BioJava interfaces, at which 
point you can leverage all of the power and flexibility of BioJava for 
your ASN.1 data.

> 
> Other questions:
> Are there packages that can retrieve sequences from Genbank by accession
> number or gid?
> 

Jason? You were working on this, weren't you?

> Does anyone have classes for alignment of sequences using Smith-Waterman?
> 

There is a complete DP toolkit (in org.biojava.bio.dp), but it appears 
to be broken at the moment. The org.biojava.bio.program packages may 
contain parsers for programs like ssearch already, and if not, they 
would be nice to have.
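
For reference, Smith-Waterman itself is short to write directly. The
following is a minimal plain-Java sketch, independent of BioJava and of the
org.biojava.bio.dp toolkit. It assumes a linear gap penalty and a simple
match/mismatch score; the class and method names (SmithWatermanSketch,
align, Result) are made up for this example.

```java
final class SmithWatermanSketch {

    /** Best local alignment score and the two aligned substrings. */
    static final class Result {
        final int score;
        final String alignedA;
        final String alignedB;

        Result(int score, String alignedA, String alignedB) {
            this.score = score;
            this.alignedA = alignedA;
            this.alignedB = alignedB;
        }
    }

    /**
     * Local alignment with a linear gap penalty.
     * match and mismatch are added to the score; gap is subtracted per gap position.
     */
    static Result align(String a, String b, int match, int mismatch, int gap) {
        int[][] h = new int[a.length() + 1][b.length() + 1]; // row 0 and column 0 stay 0
        int best = 0, bestI = 0, bestJ = 0;

        // Fill the scoring matrix; cell values never drop below zero.
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int s = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diag = h[i - 1][j - 1] + s;
                int up = h[i - 1][j] - gap;
                int left = h[i][j - 1] - gap;
                int v = Math.max(0, Math.max(diag, Math.max(up, left)));
                h[i][j] = v;
                if (v > best) {
                    best = v;
                    bestI = i;
                    bestJ = j;
                }
            }
        }

        // Trace back from the best cell until a zero cell is reached.
        StringBuilder ra = new StringBuilder();
        StringBuilder rb = new StringBuilder();
        int i = bestI, j = bestJ;
        while (i > 0 && j > 0 && h[i][j] > 0) {
            int s = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
            if (h[i][j] == h[i - 1][j - 1] + s) {
                ra.append(a.charAt(i - 1));
                rb.append(b.charAt(j - 1));
                i--;
                j--;
            } else if (h[i][j] == h[i - 1][j] - gap) {
                ra.append(a.charAt(i - 1));
                rb.append('-');
                i--;
            } else {
                ra.append('-');
                rb.append(b.charAt(j - 1));
                j--;
            }
        }
        return new Result(best, ra.reverse().toString(), rb.reverse().toString());
    }
}
```

A call such as SmithWatermanSketch.align(seqA, seqB, 3, -3, 2) returns the
best local score together with the gapped substrings. The sketch uses
O(n*m) memory and reports only one optimal alignment, so it is meant for
illustration rather than for long genomic sequences.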

Matthew

> 
> 
> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> |   David Waring
> |   Systems Programmer
> |   University of Washington Genome Center
> |   dwaring@u.washington.edu
> |   (206) 221-6902
> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||