[Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4)

Tue Jun 6 14:34:38 UTC 2006

I think it would be best to wait for the 'official response'.  I could
only locate the general changes detailed here:

http://www.bio.net/bionet/mm/genbankb/2005-December/000233.html

As for a solution to the ever-changing formats, I just don't see an
elegant way. :(  The only thing that comes to mind is creating a
separate format class, "INSDseq14Format.java", and building new readers
& writers on top of that.
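
To make the "one format class per version" idea a bit more concrete,
here is a rough sketch in plain Java.  Nothing in it is BioJava API:
InsdVersionDispatch, guessVersion and the READERS registry are made-up
names, and the readers are placeholders that only return a label.  The
version check rests on the assumption (taken from the two DTDs linked
further down) that <INSDReference_position> only shows up in v1.4
files.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: pick a reader per INSDSeq version instead of
// teaching one parser about every version of the format.
class InsdVersionDispatch {

    // Assumption: only v1.4 files contain the INSDReference_position tag.
    static String guessVersion(String xml) {
        return xml.contains("<INSDReference_position>") ? "1.4" : "1.3";
    }

    // One placeholder "reader" per version; a real one would build a sequence object.
    static final Map<String, Function<String, String>> READERS = new LinkedHashMap<>();
    static {
        READERS.put("1.3", xml -> "parsed with 1.3 reader");
        READERS.put("1.4", xml -> "parsed with 1.4 reader");
    }

    // Looks up the reader for the guessed version and hands the document to it.
    static String read(String xml) {
        return READERS.get(guessVersion(xml)).apply(xml);
    }
}
```

The point is only that the version decision lives in one place, so a
later version would mean one more entry in the registry rather than
more special cases inside an existing parser.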

#1: On that note, I wanted to ask about the differences between the
Genbank & INSDseq parsers and ways to retrieve certain values.  The
tutorial states that those two formats are essentially mirror images of
each other, with the latter being XML.  When parsing Genbank files,
"rs.getIdentifier()" retrieves the GI number; however, when the same
function is used on a RichSequence obtained by parsing the INSDseq
format, I get a 'null' value.  Moreover, I could not even locate that
number in the structure of the RichSequence object while debugging.  Is
this a bug, or should the GI number be obtained differently?
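
For reference, the GI is present in the XML itself, as a "gi|..."
entry under <INSDSeq_other-seqids>.  Below is a minimal sketch of
pulling it out with the JDK's own DOM parser, bypassing BioJava
entirely.  The class and method names (InsdGiExtractor, extractGi) are
made up for illustration, and the sketch assumes the GI always appears
as an <INSDSeqid> whose text starts with "gi|".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of BioJava.
class InsdGiExtractor {

    // Returns the GI number from the first "gi|..." INSDSeqid, or null if there is none.
    static String extractGi(String insdXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve the external DTD named in the DOCTYPE to an empty document
            // so that parsing does not need network access.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(insdXml)));

            NodeList ids = doc.getElementsByTagName("INSDSeqid");
            for (int i = 0; i < ids.getLength(); i++) {
                String id = ids.item(i).getTextContent().trim();
                if (id.startsWith("gi|")) {
                    return id.substring("gi|".length());
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse INSDSeq XML", e);
        }
    }
}
```

Run against the file below, this should hand back the number from the
<INSDSeqid>gi|17861571</INSDSeqid> line, so the value is there to be
picked up by the parser.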

#2: Also, what is the best way to obtain the "mol_type" value from a
RichSequence object?  The tutorial states that it's
"getNoteSet(Terms.getMolTypeTerm())".  I guess it's either a simplified
explanation or something has changed, since .getNoteSet() does not take
any parameters.  I used
"rs.getAnnotation().asMap().get(Terms.getMolTypeTerm())" and was
wondering if that's how it was intended to be retrieved.

As always, below is the INSDseq file I tried to parse:
================================
<?xml version="1.0"?>
<!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN"
"http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
<INSDSet>
<INSDSeq>
  <INSDSeq_locus>AY069118</INSDSeq_locus>
  <INSDSeq_length>1502</INSDSeq_length>
  <INSDSeq_strandedness>single</INSDSeq_strandedness>
  <INSDSeq_moltype>mRNA</INSDSeq_moltype>
  <INSDSeq_topology>linear</INSDSeq_topology>
  <INSDSeq_division>INV</INSDSeq_division>
  <INSDSeq_update-date>17-DEC-2001</INSDSeq_update-date>
  <INSDSeq_create-date>15-DEC-2001</INSDSeq_create-date>
  <INSDSeq_definition>Drosophila melanogaster GH13089 full length
cDNA</INSDSeq_definition>
  <INSDSeq_primary-accession>AY069118</INSDSeq_primary-accession>
  <INSDSeq_accession-version>AY069118.1</INSDSeq_accession-version>
  <INSDSeq_other-seqids>
    <INSDSeqid>gb|AY069118.1|</INSDSeqid>
    <INSDSeqid>gi|17861571</INSDSeqid>
  </INSDSeq_other-seqids>
  <INSDSeq_keywords>
    <INSDKeyword>FLI_CDNA</INSDKeyword>
  </INSDSeq_keywords>
  <INSDSeq_source>Drosophila melanogaster (fruit fly)</INSDSeq_source>
  <INSDSeq_organism>Drosophila melanogaster</INSDSeq_organism>
  <INSDSeq_taxonomy>Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta;
Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
Ephydroidea; Drosophilidae; Drosophila</INSDSeq_taxonomy>
  <INSDSeq_references>
    <INSDReference>
      <INSDReference_reference>1 (bases 1 to 1502)</INSDReference_reference>
      <INSDReference_position>1..1502</INSDReference_position>
      <INSDReference_authors>
        <INSDAuthor>Stapleton,M.</INSDAuthor>
        <INSDAuthor>Brokstein,P.</INSDAuthor>
        <INSDAuthor>Hong,L.</INSDAuthor>
        <INSDAuthor>Agbayani,A.</INSDAuthor>
        <INSDAuthor>Carlson,J.</INSDAuthor>
        <INSDAuthor>Champe,M.</INSDAuthor>
        <INSDAuthor>Chavez,C.</INSDAuthor>
        <INSDAuthor>Dorsett,V.</INSDAuthor>
        <INSDAuthor>Farfan,D.</INSDAuthor>
        <INSDAuthor>Frise,E.</INSDAuthor>
        <INSDAuthor>George,R.</INSDAuthor>
        <INSDAuthor>Gonzalez,M.</INSDAuthor>
        <INSDAuthor>Guarin,H.</INSDAuthor>
        <INSDAuthor>Li,P.</INSDAuthor>
        <INSDAuthor>Liao,G.</INSDAuthor>
        <INSDAuthor>Miranda,A.</INSDAuthor>
        <INSDAuthor>Mungall,C.J.</INSDAuthor>
        <INSDAuthor>Nunoo,J.</INSDAuthor>
        <INSDAuthor>Pacleb,J.</INSDAuthor>
        <INSDAuthor>Paragas,V.</INSDAuthor>
        <INSDAuthor>Park,S.</INSDAuthor>
        <INSDAuthor>Phouanenavong,S.</INSDAuthor>
        <INSDAuthor>Wan,K.</INSDAuthor>
        <INSDAuthor>Yu,C.</INSDAuthor>
        <INSDAuthor>Lewis,S.E.</INSDAuthor>
        <INSDAuthor>Rubin,G.M.</INSDAuthor>
        <INSDAuthor>Celniker,S.</INSDAuthor>
      </INSDReference_authors>
      <INSDReference_title>Direct Submission</INSDReference_title>
      <INSDReference_journal>Submitted (10-DEC-2001) Berkeley
Drosophila Genome Project, Lawrence Berkeley National Laboratory, One
Cyclotron Road, Berkeley, CA 94720, USA</INSDReference_journal>
    </INSDReference>
  </INSDSeq_references>
  <INSDSeq_comment>Sequence submitted by: Berkeley Drosophila Genome
Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This
clone was sequenced as part of a high-throughput process to sequence
clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000).
The sequence has been subjected to integrity checks for sequence
accuracy, presence of a polyA tail and contiguity within 100 kb in the
genome. Thus we believe the sequence to reflect accurately this
particular cDNA clone. However, there are artifacts associated with
the generation of cDNA clones that may have not been detected in our
initial analyses such as internal priming, priming from contaminating
genomic DNA, retained introns due to reverse transcription of
unspliced precursor RNAs, and reverse transcriptase errors that result
in single base changes. For further information about this sequence,
including its location and relationship to other sequences, please
visit our Web site (http://fruitfly.berkeley.edu) or send email to
cdna at fruitfly.berkeley.edu.</INSDSeq_comment>
  <INSDSeq_feature-table>
    <INSDFeature>
      <INSDFeature_key>source</INSDFeature_key>
      <INSDFeature_location>1..1502</INSDFeature_location>
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1</INSDInterval_from>
          <INSDInterval_to>1502</INSDInterval_to>
          <INSDInterval_accession>AY069118.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>organism</INSDQualifier_name>
          <INSDQualifier_value>Drosophila melanogaster</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>mol_type</INSDQualifier_name>
          <INSDQualifier_value>mRNA</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>strain</INSDQualifier_name>
          <INSDQualifier_value>y; cn bw sp</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>taxon:7227</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>map</INSDQualifier_name>
          <INSDQualifier_value>39B3-39B3</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
    <INSDFeature>
      <INSDFeature_key>gene</INSDFeature_key>
      <INSDFeature_location>1..1502</INSDFeature_location>
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1</INSDInterval_from>
          <INSDInterval_to>1502</INSDInterval_to>
          <INSDInterval_accession>AY069118.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>gene</INSDQualifier_name>
          <INSDQualifier_value>E2f2</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>note</INSDQualifier_name>
          <INSDQualifier_value>alignment with genomic scaffold
AE003669</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
    <INSDFeature>
      <INSDFeature_key>CDS</INSDFeature_key>
      <INSDFeature_location>189..1301</INSDFeature_location>
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>189</INSDInterval_from>
          <INSDInterval_to>1301</INSDInterval_to>
          <INSDInterval_accession>AY069118.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>gene</INSDQualifier_name>
          <INSDQualifier_value>E2f2</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>note</INSDQualifier_name>
          <INSDQualifier_value>Longest ORF</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>codon_start</INSDQualifier_name>
          <INSDQualifier_value>1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>transl_table</INSDQualifier_name>
          <INSDQualifier_value>1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>product</INSDQualifier_name>
          <INSDQualifier_value>GH13089p</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>protein_id</INSDQualifier_name>
          <INSDQualifier_value>AAL39263.1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>GI:17861572</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>translation</INSDQualifier_name>
          <INSDQualifier_value>MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
  </INSDSeq_feature-table>
  <INSDSeq_sequence>AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTATCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA</INSDSeq_sequence>
</INSDSeq>
</INSDSet>
================================
On 6/6/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> I can't find any document detailing the differences between INSDseq XML
> versions 1.3 and 1.4, so I've asked the guys over in the data library
> section here to see if they have one or can produce one for me. They
> wrote it so they should know!
>
> Once I have this I'll get the INSDseq parser up-to-date. (I could go
> through the DTDs by hand and work it all out manually, but that would
> take rather longer than I've got time for at the moment!).
>
> It's a bit of a pain trying to keep the parsers up-to-date all the time,
> especially when people start wanting backwards-compatibility. Does
> anyone have any bright ideas as to how to manage version changes in file
> formats?
>
> cheers,
> Richard
>
> On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote:
> > I agree with you on that one.  However, the problem might be a little
> > deeper.  The same '?' characters appear in the INSDseq format, bounded by
> > <INSDReference_reference> tags and cause the following exception.
> > This tells me that the '?' are actually values that are being
> > incorrectly parsed.  Further examination of the .dtd reveals that
> > INSDseqFormat.java is tailored towards the INSDSeq v. 1.3 whereas the
> > files I obtain are in the INSDSeq v. 1.4 (which among other things
> > contain a new tag <INSDReference_position>).  Here're links to both
> > .dtd's:
> >
> > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt
> >
> > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt
> >
> > I think it might be worth accommodating changes for the INSDseq
> > format, not sure how that would affect the '?' in Genbank.
> >
> > Seth
> >
> > ======================
> > org.biojava.bio.BioException: Could not read sequence
> >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > Caused by: org.biojava.bio.seq.io.ParseException:
> > org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> >         at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250)
> >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> >         ... 1 more
> > Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> >         at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901)
> >         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
> >         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
> >         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
> >         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
> >         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
> >         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
> >         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
> >         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
> >         at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
> >         at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97)
> >         at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246)
> >         ... 2 more
> > Java Result: -1
> > ======================
> >
> > ~~~~~~~~~~~~~~~~~~~~~~
> > <INSDSeq_references>
> >     <INSDReference>
> >       <INSDReference_reference>?</INSDReference_reference>
> >       <INSDReference_position>1..16732</INSDReference_position>
> >       <INSDReference_authors>
> >         <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> >         <INSDAuthor>Webster,M.T.</INSDAuthor>
> >         <INSDAuthor>Vila,C.</INSDAuthor>
> >       </INSDReference_authors>
> >       <INSDReference_title>Relaxation of Selective Constraint on Dog
> > Mitochondrial DNA Following Domestication</INSDReference_title>
> >       <INSDReference_journal>Unpublished</INSDReference_journal>
> >     </INSDReference>
> >     <INSDReference>
> >       <INSDReference_reference>?</INSDReference_reference>
> >       <INSDReference_position>1..16732</INSDReference_position>
> >       <INSDReference_authors>
> >         <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> >         <INSDAuthor>Webster,M.T.</INSDAuthor>
> >         <INSDAuthor>Vila,C.</INSDAuthor>
> >       </INSDReference_authors>
> >       <INSDReference_journal>Submitted (06-APR-2006) to the
> > EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary
> > Biology, Norbyvagen 18D, Uppsala 752 36,
> > Sweden</INSDReference_journal>
> >     </INSDReference>
> >   </INSDSeq_references>
> > ~~~~~~~~~~~~~~~~~~~~~~
> >
> > On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > Hmmm... interesting. I _could_ put in a special case that ignores the
> > > question marks, but that wouldn't be 'nice' really - this is more of a
> > > problem with the program that is producing the Genbank files than a
> > > problem with the parser trying to read them. '?' is not a valid tag in
> > > the official Genbank format, and has no meaning attached to it that I
> > > can work out, so I'm reluctant to make the parser recognise it.
> > >
> > > I'd suggest you contact the people who write the software you are using
> > > to produce the Genbank files and ask them if they could stick to the
> > > rules!
> > >
> > > In the meantime you could work around the problem by stripping the
> > > question marks in some kind of pre-processor before passing it onto
> > > BioJavaX for parsing.
> > >
> > > cheers,
> > > Richard
> > >
> > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote:
> > > > Removing '?' (or several of them in my case) avoids the following exception:
> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > org.biojava.bio.BioException: Could not read sequence
> > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957
> > > >         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > >         ... 1 more
> > > > Java Result: -1
> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > I don't know where that previous tokenization problem came from since
> > > > I can no longer reproduce it.  This time it's more or less straight
> > > > forward.
> > > > Here's the original file with question marks:
> > > > ============================
> > > > LOCUS       DQ415957                1437 bp    mRNA    linear   VRT 01-JUN-2006
> > > > DEFINITION  Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA,
> > > >             complete cds.
> > > > ACCESSION   DQ415957
> > > > VERSION     DQ415957.1  GI:89513612
> > > > KEYWORDS    .
> > > > SOURCE      Unknown.
> > > >   ORGANISM  Unknown.
> > > >             Unclassified.
> > > > ?
> > > > ?
> > > > FEATURES             Location/Qualifiers
> > > > ?
> > > >      gene            1..1437
> > > >                      /gene="cmg2a"
> > > >      CDS             1..1437
> > > >                      /gene="cmg2a"
> > > >                      /note="cell surface receptor; similar to anthrax toxin
> > > >                      receptor 2 (ANTXR2, ATR2, CMG2)"
> > > >                      /codon_start=1
> > > >                      /product="capillary morphogenesis protein 2A"
> > > >                      /protein_id="ABD74633.1"
> > > >                      /db_xref="GI:89513613"
> > > >                      /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS
> > > >                      GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS
> > > >                      EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY
> > > >                      GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS
> > > >                      SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT
> > > >                      VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC
> > > >                      CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS
> > > >                      TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL
> > > >                      RRQYDRVSVMRPTSADKGRCMNFSRTQH"
> > > > ORIGIN
> > > >         1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc
> > > >        61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg
> > > >       121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat
> > > >       181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga
> > > >       241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc
> > > >       301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact
> > > >       361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga
> > > >       421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat
> > > >       481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg
> > > >       541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc
> > > >       601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc
> > > >       661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa
> > > >       721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc
> > > >       781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa
> > > >       841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc
> > > >       901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc
> > > >       961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa
> > > >      1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc
> > > >      1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc
> > > >      1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag
> > > >      1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag
> > > >      1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga
> > > >      1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg
> > > >      1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa
> > > > //
> > > >
> > > > ============================
> > > >
> > > >
> > > > On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > > > Hi again.
> > > > >
> > > > > Could you remove the offending question mark from the GenBank file and
> > > > > try it again to see if that fixes it? The parser should just ignore it
> > > > > but apparently not. The error looks weird to me because the tokenization
> > > > > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's
> > > > > going on here.
> > > > ...
> > > > >
> > > > > cheers,
> > > > > Richard
> > > > >
> > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote:
> > > > > > > Hello again Richard,
> > > > > >
> > > > > > > No sooner had I mentioned the fix for the last parsing exception than
> > > > > > > another one came up with the Genbank format:
> > > > > > --------------------------------------
> > > > > > org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > >         at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
> > > > > >         at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
> > > > > >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
> > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > >         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > >         ... 3 more
> > > > > > org.biojava.bio.seq.io.ParseException:
> > > > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization
> > > > > > doesn't contain character: 't'
> > > > > > ----------------------------------------
> > > > > > The Genbank file that caused it is as follows:
> > > > > > =========================================
> > > > > > LOCUS       DQ431065                 425 bp    DNA     linear   INV 01-JUN-2006
> > > > > > DEFINITION  Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
> > > > > >             sequence; mitochondrial.
> > > > > > ACCESSION   DQ431065
> > > > > > VERSION     DQ431065.1  GI:90102206
> > > > > > KEYWORDS    .
> > > > > > SOURCE      Vaccinium corymbosum
> > > > > >   ORGANISM  Vaccinium corymbosum
> > > > > >             Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
> > > > > >             Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
> > > > > >             asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
> > > > > >             Vaccinium.
> > > > > > ?
> > > > > > REFERENCE   2  (bases 1 to 425)
> > > > > >   AUTHORS   Naik,L.D. and Rowland,L.J.
> > > > > >   TITLE     Expressed Sequence Tags of cDNA clones from subtracted library of
> > > > > >             Vaccinium corymbosum
> > > > > >   JOURNAL   Unpublished (2005)
> > > > > > FEATURES             Location/Qualifiers
> > > > > >      source          1..425
> > > > > >                      /organism="Vaccinium corymbosum"
> > > > > >                      /mol_type="genomic DNA"
> > > > > >                      /cultivar="Bluecrop"
> > > > > >                      /db_xref="taxon:69266"
> > > > > >                      /tissue_type="Flower buds"
> > > > > >                      /clone_lib="Subtracted cDNA library of Vaccinium
> > > > > >                      corymbosum"
> > > > > >                      /dev_stage="399 hour chill unit exposure"
> > > > > >                      /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I"
> > > > > >      rRNA            <1..>425
> > > > > >                      /product="16S ribosomal RNA"
> > > > > > ORIGIN
> > > > > >         1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
> > > > > >        61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
> > > > > >       121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
> > > > > >       181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
> > > > > >       241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
> > > > > >       301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
> > > > > >       361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > > > > >       421 cgtaa
> > > > > > //
> > > > > > ==================================
> > > > > > I think it's the presence of the '?' at the beginning of the line?!?!
> > > > > > > I'm not sure whether the information that was supposed to be present
> > > > > > instead of those question marks is absent from the original ASN.1
> > > > > > batch file or it's a bug in the NCBI ASN2GO software.  It looks to me
> > > > > > that the former is the case since the file from NCBI website contains
> > > > > > much more information than the batch file. Just bringing this to
> > > > > > everyone's attention.
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > >
> > > > > >
> > > > > > Seth Johnson
> > > > > > Senior Bioinformatics Associate
> > > > > >
> > > > > > Ph: (202) 470-0900
> > > > > > Fx: (775) 251-0358
> > > > > >
> > > > > > On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > > > > > Hi Seth.
> > > > > > >
> > > > > > > Your second point, about the authors string not being read correctly in
> > > > > > > Genbank format, has been fixed (or should have been if I got the code
> > > > > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > > > > and give it another go? Basically the parser did not recognise the
> > > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > > > > NCBI, which is what I based the parser on.
> > > > > > ...
> > > > > > >
> > > > > > > cheers,
> > > > > > > Richard
> > > > > > >
> > > > > > >
> > > > > --
> > > > > Richard Holland (BioMart Team)
> > > > > EMBL-EBI
> > > > > Wellcome Trust Genome Campus
> > > > > Hinxton
> > > > > Cambridge CB10 1SD
> > > > > UNITED KINGDOM
> > > > > Tel: +44-(0)1223-494416
> > > > >
> > > > >
> > > >
> > > >

-- 
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358