[Biojava-l] biojavax GenbankFormat and legacy genbank records

mark.schreiber at novartis.com mark.schreiber at novartis.com
Wed Mar 15 07:11:55 UTC 2006


Hi -

I'm happy for the regexps in GenbankFormat, EMBLFormat, etc. to be 
relaxed a little, as long as the parsing of fully valid genbank files 
doesn't suffer. If someone wants to test this thoroughly, it would be a 
great benefit to the whole community.
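
To make the idea concrete, here is a minimal sketch of what "relaxed" could mean for the LOCUS line, using the missing division and modification date reported in the original message as the example. The class name LocusLineSketch, the parse method and the pattern itself are hypothetical illustrations, not the regexps GenbankFormat actually uses, and the sketch splits on whitespace rather than honouring fixed column positions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: NOT the pattern used by biojavax GenbankFormat.
final class LocusLineSketch {

    // Name, length and molecule type are required; topology, division
    // and date are each optional so that older records still match.
    static final Pattern TOLERANT_LOCUS = Pattern.compile(
            "^LOCUS\\s+(\\S+)\\s+(\\d+)\\s+bp\\s+(\\S+)"   // name, length, molecule type
            + "(?:\\s+(linear|circular))?"                 // optional topology
            + "(?:\\s+([A-Z]{3}))?"                        // optional division, e.g. PLN
            + "(?:\\s+(\\d{2}-[A-Z]{3}-\\d{4}))?\\s*$");   // optional date, e.g. 21-JUN-1999

    /**
     * Returns {name, length, moleculeType, topology, division, date},
     * with null for any optional field that is absent, or null if the
     * line is not a LOCUS line at all.
     */
    static String[] parse(String line) {
        Matcher m = TOLERANT_LOCUS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[6];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String full = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999";
        String legacy = "LOCUS       FOO          1234 bp    DNA     circular";
        System.out.println(java.util.Arrays.toString(parse(full)));
        System.out.println(java.util.Arrays.toString(parse(legacy)));
    }
}
```

The point of making the trailing fields optional, rather than loosening the required ones, is that a fully valid line still yields exactly the same captured values as before; only lines that previously failed outright are newly accepted.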

In some cases it may not be possible. For example, if a feature doesn't 
have sufficient information to build a proper RichFeature object, I don't 
think we should allow the file.

It might be good to make a collection in CVS of example files that are 
known to have broken the parser in the past (the files folder in the test 
suite would be an ideal place).
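
A rough sketch of how such a collection could be checked automatically is below. It uses only the JDK; RegressionCorpusSketch, RecordParser and failingFiles are hypothetical names, and the RecordParser interface is a stand-in for whichever real parser call the test suite would wrap, not an existing BioJava type.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: runs a parser over a folder of known-troublesome files.
final class RegressionCorpusSketch {

    /** Stand-in for the parser under test (an assumption, not a BioJava API). */
    interface RecordParser {
        void parse(String content) throws Exception;
    }

    /** Returns the sorted names of the files in dir that the parser still rejects. */
    static List<String> failingFiles(Path dir, RecordParser parser) throws IOException {
        List<String> failures = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and the like
                }
                String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                try {
                    parser.parse(content);
                } catch (Exception e) {
                    // Any exception counts as "this file still breaks the parser".
                    failures.add(file.getFileName().toString());
                }
            }
        }
        Collections.sort(failures);
        return failures;
    }
}
```

A test could then assert that the returned list is empty for files the parser is meant to accept, and that it contains exactly the files we have agreed to reject.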

- Mark

Mark Schreiber
Research Investigator (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910





"Bubba Puryear" <bubba.puryear at gmail.com>
Sent by: biojava-l-bounces at lists.open-bio.org
03/14/2006 01:27 AM

 
        To:     biojava-l at lists.open-bio.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] biojavax GenbankFormat and legacy genbank records


Hello,

I work on a webapp for a biotech company that uses biojava to parse
plasmid and feature maps (genbank flatfile format) and we store them in a
local database. I've wanted to update the version of biojava we use
because the current CVS parser handles features that cross the origin on
plasmid maps much better than the parser in 1.4.

However, we have a lot of data in various databases that have genbank
records formatted in some of the older incarnations of the genbank
flatfile format. In particular, some feature maps don't have ACCESSION
fields, and/or are missing modification dates and genbank divisions on the
LOCUS line. When I try to parse one of those maps with biojavax, I get
parse errors.

Should there perhaps be a LegacyGenbankFormat or should the GenbankFormat
class be made more tolerant? I know NCBI made several changes to their
flatfile format in part because writing parsers for the older specs was
tricky. So I'm not sure which direction the bio* folks would like to go
with this.

I've attached a small example map that causes parse problems. The data in
the map is completely bogus, but the structure was taken from a real map
file I have to deal with.

The following code snippet illustrates my problems:

    BufferedReader br = new BufferedReader(new StringReader(genbankContent));
    try {
        RichSequenceIterator sequences = IOTools.readGenbankDNA(br, null);
        if (sequences.hasNext()) {
            this.sequence = sequences.nextRichSequence();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }


where genbankContent is a String containing the contents of the attached
file.

Thanks much,
Bubba Puryear

_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

[ Attachment ''FOO.GB'' removed by Mark Schreiber ]




