[Biojava-l] File formats and other questions

David Waring dwaring@u.washington.edu
Thu, 3 May 2001 12:54:14 -0700


I am new to the bioinformatics community and Biojava, so bear with me if
these topics have been discussed before.

Among the many file formats for DNA sequence data, I see that there is Java
support for Fasta, GFF, and EMBL. In addition to these, I will be working
with a few other formats (listed below) and am wondering whether anyone in
the community is working with them, either within the Biojava model or not.

ace files (the output of phrap, not AceDB)
We have a very nice parser (based on JLex and CUP) and all the classes needed
to work with this format. They are not, however, developed using the Biojava
API. I have not looked into what it would take to make them compliant, but I
don't think it would be very difficult.

HTGS (the NCBI format for submission of High Throughput Genomic Sequences)
These are ASN.1 files, specifically Seq-submit. The NCBI toolkit has all
sorts of tools for handling these, but they are all in C. Are there Java
tools available also?

Is anyone familiar with the OSS Java tools for ASN.1? They can read an ASN.1
specification, create Java classes for that data model, and then read and
write the ASN.1 files. It looks like a useful tool, but it could be pricey.

Other questions:
Are there packages that can retrieve sequences from GenBank by accession
number or gid?

Does anyone have classes for alignment of sequences using Smith-Waterman?
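
For reference, a minimal sketch of Smith-Waterman local alignment scoring in
plain Java follows. It is not Biojava code: the class name, the method name,
and the simple linear gap penalty are all assumptions made for illustration,
and it returns only the best local score, not the alignment itself.

```java
public class SmithWatermanSketch {

    /**
     * Returns the best local alignment score between a and b.
     * match is added for equal characters; mismatch and gap are
     * expected to be negative (linear gap penalty, an assumption here).
     */
    static int score(String a, String b, int match, int mismatch, int gap) {
        // h[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = (a.charAt(i - 1) == b.charAt(j - 1)) ? match : mismatch;
                int diag = h[i - 1][j - 1] + sub; // align a[i-1] with b[j-1]
                int up = h[i - 1][j] + gap;       // gap in b
                int left = h[i][j - 1] + gap;     // gap in a
                // Clamping at zero is what makes the alignment local.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(score("ACACACTA", "AGCACACA", 2, -1, -1));
    }
}
```

Recovering the aligned residues would need a traceback from the highest
scoring cell, and a real implementation would probably want a substitution
matrix and affine gap penalties rather than the fixed values used here.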



|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|   David Waring
|   Systems Programmer
|   University of Washington Genome Center
|   dwaring@u.washington.edu
|   (206) 221-6902
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||