[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
Seth Johnson
johnson.biotech at gmail.com
Mon Jun 5 14:22:57 UTC 2006
I apologize again for not posting the stacktrace. Here it is:
==========================
org.biojava.bio.BioException: Could not read sequence
at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
at exonhit.parsers.GenBankParser.main(GenBankParser.java:347)
Caused by: java.lang.NullPointerException
at org.biojavax.bio.seq.io.SimpleRichSequenceBuilder.addFeatureProperty(SimpleRichSequenceBuilder.java:356)
at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:853)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97)
at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246)
at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
... 1 more
Java Result: -1
============================
Here's the XML that causes that exception (taken out of a bigger file
of several hundred sequences):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<INSDSeq>
<INSDSeq_locus>DQ485973</INSDSeq_locus>
<INSDSeq_length>1356</INSDSeq_length>
<INSDSeq_moltype>DNA</INSDSeq_moltype>
<INSDSeq_topology>linear</INSDSeq_topology>
<INSDSeq_division>ENV</INSDSeq_division>
<INSDSeq_update-date>08-MAY-2006</INSDSeq_update-date>
<INSDSeq_create-date>08-MAY-2006</INSDSeq_create-date>
<INSDSeq_definition>Uncultured Mollicutes bacterium clone P7 16S
ribosomal RNA gene, partial sequence</INSDSeq_definition>
<INSDSeq_primary-accession>DQ485973</INSDSeq_primary-accession>
<INSDSeq_accession-version>DQ485973.1</INSDSeq_accession-version>
<INSDSeq_other-seqids>
<INSDSeqid>gb|DQ485973.1|</INSDSeqid>
<INSDSeqid>gi|94482885</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_keywords>
<INSDKeyword>ENV</INSDKeyword>
</INSDSeq_keywords>
<INSDSeq_source>uncultured Mollicutes bacterium</INSDSeq_source>
<INSDSeq_organism>uncultured Mollicutes bacterium</INSDSeq_organism>
<INSDSeq_taxonomy>Bacteria; Firmicutes; Mollicutes; environmental
samples</INSDSeq_taxonomy>
<INSDSeq_references>
<INSDReference>
<INSDReference_reference>1 (bases 1 to 1356)</INSDReference_reference>
<INSDReference_position>1..1356</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Kostanjsek,R.</INSDAuthor>
<INSDAuthor>Strus,J.</INSDAuthor>
<INSDAuthor>Avgustin,G.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>A novel lineage of Mollicutes associated
with the hindgut wall of the terrestrial isopod Porcellio scaber
(Crustacea: Isopoda)</INSDReference_title>
<INSDReference_journal>Unpublished</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_reference>2 (bases 1 to 1356)</INSDReference_reference>
<INSDReference_position>1..1356</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Kostanjsek,R.</INSDAuthor>
<INSDAuthor>Strus,J.</INSDAuthor>
<INSDAuthor>Avgustin,G.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (07-APR-2006) Department of
Biology, Biotechnical Faculty, University of Ljubljana, Vecna Pot 111,
Ljubljana 1000, Slovenia</INSDReference_journal>
</INSDReference>
</INSDSeq_references>
<INSDSeq_feature-table>
<INSDFeature>
<INSDFeature_key>source</INSDFeature_key>
<INSDFeature_location>1..1356</INSDFeature_location>
<INSDFeature_intervals>
<INSDInterval>
<INSDInterval_from>1</INSDInterval_from>
<INSDInterval_to>1356</INSDInterval_to>
<INSDInterval_accession>DQ485973.1</INSDInterval_accession>
</INSDInterval>
</INSDFeature_intervals>
<INSDFeature_quals>
<INSDQualifier>
<INSDQualifier_name>organism</INSDQualifier_name>
<INSDQualifier_value>uncultured Mollicutes
bacterium</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>mol_type</INSDQualifier_name>
<INSDQualifier_value>genomic DNA</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>isolation_source</INSDQualifier_name>
<INSDQualifier_value>isopod gut</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>specific_host</INSDQualifier_name>
<INSDQualifier_value>Porcellio scaber</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>db_xref</INSDQualifier_name>
<INSDQualifier_value>taxon:220137</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>clone</INSDQualifier_name>
<INSDQualifier_value>P7</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>environmental_sample</INSDQualifier_name>
</INSDQualifier>
</INSDFeature_quals>
</INSDFeature>
<INSDFeature>
<INSDFeature_key>rRNA</INSDFeature_key>
<INSDFeature_location>&lt;1..&gt;1356</INSDFeature_location>
<INSDFeature_intervals>
<INSDInterval>
<INSDInterval_from>1</INSDInterval_from>
<INSDInterval_to>1356</INSDInterval_to>
<INSDInterval_accession>DQ485973.1</INSDInterval_accession>
</INSDInterval>
</INSDFeature_intervals>
<INSDFeature_partial5 value="true"/>
<INSDFeature_partial3 value="true"/>
<INSDFeature_quals>
<INSDQualifier>
<INSDQualifier_name>product</INSDQualifier_name>
<INSDQualifier_value>16S ribosomal RNA</INSDQualifier_value>
</INSDQualifier>
</INSDFeature_quals>
</INSDFeature>
</INSDSeq_feature-table>
<INSDSeq_sequence>AACGCTGGCGGCATGCCTAATACATGCAAGTCGAACGAACTGCCCCTGAACTAAAAGAAGTGCTTGCACGGAAGTTAGGGACGGAATTTGCAGTTAGTGGCGAACGGGTGAGTAACACGTGGGTAACCTACCATAGAGATTGGGATAACTGTTGGAAACGACAGCTAAAACCGAATAAGATTAATTCTACAAAGAGGAATAATTTAAATAGGCGTTTGCCTAGCTTTATGATGGGCCCGCGGTGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCGACGATGCATAGCCGGACTGAGAGGTTGAACGGCCACATTGGGACTGAGACACGGCCCAGACAACTACGGTTGGCAGCAGTAGGGAATTTTTCGCAATGGACGAAAGTCTGACGGAGCAATGCCGCGTGAGTGAAGACGGTTTTCGGATTGTAAAACTCTGTTGTGTGGGGGGAACACCTATATGAGAGGAATTGCTCATTAATTGACGCCACCACACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCGAGCGTTTTCCGGAATTATTGGGCGTAAAGAGCGTGTAGGCGGGTATGAATAAGTCTGGTGTGAAATCTAAGTGGCTCAACCACTTAAATTGCATTGGAAACTGCCAAACTAGAATACGGAGGGGTAAGTGGAATTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAGGGACACCAATGGCGAAGGCAGCTTAATGGACCCGAGATTGACGCTGAGACGCGAAAGCTTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTTAAACGATGAGTGCTAGGTATTGGATTAATTTCAGTGCCCGGAGTTAACGCATTAAGCCCTCCGCCTGAGGAGTACGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGTGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCAAAACTTGACATCCCCTGCGAAGCTATAGAAGTATAGTGGAGGTTATCAGGGTGACAGATGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTAGGTTAAGTCCTGCAACGAGCGCAACCCCTGTCTGCAGTTGCTACCATTAAGTTGAGGACTCTGCAGAGACTGCTAGTGTAAGCTAGAGGAAGGTGGGGATGACGTCAAATCATCATGCCTCTTACGTTTTGGGCTACACACGTGCTACAATGGCTGATACAAAGGGCTGCGAACTCGCGAGAGTAAGCGAATCCCAAAAAGTCAGTCTAAGTTCGGATTGAAGTTCTGCAACTCGACTTTCATGAAGTCGGAATGCNCTAGTAATACG</INSDSeq_sequence>
</INSDSeq>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
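
The last qualifier of the source feature above (environmental_sample) has a name but no INSDQualifier_value element, which may be the kind of input that leads to the NullPointerException in addFeatureProperty. A minimal sketch for locating such qualifiers in a larger file, using only the JDK's DOM parser, is below. The class and method names are hypothetical and not part of BioJava, and it assumes the input is well-formed XML without a DOCTYPE that needs resolving.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not a BioJava class): lists qualifiers that have no value element.
class ValuelessQualifierFinder {

    /** Returns the names of all INSDQualifier elements lacking an INSDQualifier_value child. */
    static List<String> findValueless(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList quals = doc.getElementsByTagName("INSDQualifier");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < quals.getLength(); i++) {
                Element q = (Element) quals.item(i);
                // A flag-style qualifier such as environmental_sample carries only a name.
                if (q.getElementsByTagName("INSDQualifier_value").getLength() == 0) {
                    NodeList n = q.getElementsByTagName("INSDQualifier_name");
                    names.add(n.getLength() > 0 ? n.item(0).getTextContent().trim() : "(unnamed)");
                }
            }
            return names;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked-exception handling.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Two qualifiers taken from the record above: one with a value, one without.
        String xml = "<INSDFeature_quals>"
                + "<INSDQualifier><INSDQualifier_name>clone</INSDQualifier_name>"
                + "<INSDQualifier_value>P7</INSDQualifier_value></INSDQualifier>"
                + "<INSDQualifier><INSDQualifier_name>environmental_sample</INSDQualifier_name>"
                + "</INSDQualifier>"
                + "</INSDFeature_quals>";
        System.out.println(findValueless(xml)); // prints [environmental_sample]
    }
}
```

A parser that tolerates such qualifiers would presumably need to treat a missing value as an empty string or a boolean flag instead of passing null along.
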
On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> This one should be fixed in CVS now. Typo on my behalf - I put in code
> to make it work with both 87+ and pre-87 versions of EMBL, then got the
> regexes the wrong way round!!
>
> Could you send the full stacktrace for the INSDseq format problem you're
> having? (The one where you say you've tracked it down to the qualifier
> value being missing). I can't see anything wrong there, so I need the
> stacktrace in order to know which exact sequence of events is throwing
> the exception.
>
> cheers,
> Richard
>
>
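
To illustrate the two ID line layouts mentioned above, here is a rough sketch that tries a release 87+ pattern first and falls back to a pre-87 pattern. The regexes and the EmblIdLine class are assumptions written for this example, not the ones in EMBLFormat.java, and the 87+ sample line follows the layout documented in the EMBL user manual as far as I know. In the pre-87 layout the version is not on the ID line at all (it comes from the separate SV line), so a parser that applies the wrong pattern can end up calling Integer.parseInt on a null group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for the fields of an EMBL ID line (not a BioJava class).
class EmblIdLine {
    // Release 87+ layout: ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
    static final Pattern NEW_ID = Pattern.compile(
            "^ID\\s+(\\S+);\\s+SV\\s+(\\d+);\\s+[^;]+;\\s+[^;]+;\\s+[^;]+;\\s+[^;]+;\\s+(\\d+)\\s+BP\\.$");
    // Pre-87 layout: ID   DQ472184   standard; DNA; INV; 546 BP.
    static final Pattern OLD_ID = Pattern.compile(
            "^ID\\s+(\\S+)\\s+[^;]+;\\s+[^;]+;\\s+[^;]+;\\s+(\\d+)\\s+BP\\.$");

    final String accession;
    final Integer version; // null for pre-87 lines, where the SV line carries the version
    final int length;

    EmblIdLine(String accession, Integer version, int length) {
        this.accession = accession;
        this.version = version;
        this.length = length;
    }

    /** Parses an ID line in either layout, or throws if neither pattern matches. */
    static EmblIdLine parse(String line) {
        String trimmed = line.trim();
        Matcher m = NEW_ID.matcher(trimmed);
        if (m.matches()) {
            return new EmblIdLine(m.group(1), Integer.valueOf(m.group(2)), Integer.parseInt(m.group(3)));
        }
        m = OLD_ID.matcher(trimmed);
        if (m.matches()) {
            // No version group here, so nothing is passed to parseInt for it.
            return new EmblIdLine(m.group(1), null, Integer.parseInt(m.group(2)));
        }
        throw new IllegalArgumentException("Unrecognised ID line: " + line);
    }

    public static void main(String[] args) {
        EmblIdLine oldStyle = parse("ID DQ472184 standard; DNA; INV; 546 BP.");
        System.out.println(oldStyle.accession + " " + oldStyle.version + " " + oldStyle.length);
        EmblIdLine newStyle = parse("ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.");
        System.out.println(newStyle.accession + " " + newStyle.version + " " + newStyle.length);
    }
}
```

Trying the stricter 87+ pattern first keeps the two layouts from being confused, since neither sample line matches the other pattern.
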
> On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote:
> > Hi Richard,
> >
> > I made sure I have the latest source code from CVS compiled
> > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy
> > to report that the GenBank issue is solved!!!!
> > As for EMBL parsing, I apologize for not providing the stack dump
> > for ISSUE #1. Here's the dump of the exception:
> > --------------------------------------------------------
> > org.biojava.bio.BioException: Could not read sequence
> > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
> > Caused by: java.lang.NumberFormatException: null
> > at java.lang.Integer.parseInt(Integer.java:415)
> > at java.lang.Integer.parseInt(Integer.java:497)
> > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
> > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > ... 1 more
> > Java Result: -1
> > -------------------------------------------------------
> > Here, again, is the code that I'm using to parse:
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > BufferedReader gbBR = null;
> > try {
> >     gbBR = new BufferedReader(new FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
> > } catch (FileNotFoundException fnfe) {
> >     fnfe.printStackTrace();
> >     System.exit(-1);
> > }
> > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> >         SimpleNamespace.class, new Object[]{"gbSpace"});
> > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(gbBR, gbNspace);
> > while (gbSeqs.hasNext()) {
> >     try {
> >         RichSequence rs = gbSeqs.nextRichSequence();
> >         NCBITaxon myTaxon = rs.getTaxon();
> >     } catch (BioException be) {
> >         be.printStackTrace();
> >         System.exit(-1);
> >     }
> > }
> > ~~~~~~~~~~~~~~~~~~~~~~~~~
> > And here's the EMBL file that I'm trying to parse:
> > +++++++++++++++++++++++++
> > ID DQ472184 standard; DNA; INV; 546 BP.
> > XX
> > AC DQ472184;
> > XX
> > SV DQ472184.1
> > DT 15-MAY-2006
> > XX
> > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > DE complete cds.
> > XX
> > KW .
> > XX
> > OS Trypanosoma cruzi strain CL Brener
> > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > OC Schizotrypanum.
> > XX
> > RN [1]
> > RP 1-546
> > RA De Melo L.D.B.;
> > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > RL Unpublished.
> > XX
> > RN [2]
> > RP 1-546
> > RA De Melo L.D.B.;
> > RT ;
> > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > RL 21949-900, Brazil
> > XX
> > FH Key Location/Qualifiers
> > FH
> > FT source 1..546
> > FT /organism="Trypanosoma cruzi strain CL Brener"
> > FT /mol_type="genomic DNA"
> > FT /strain="CL Brener"
> > FT /db_xref="taxon:353153"
> > FT gene <1..>546
> > FT /gene="ARC21"
> > FT /note="TcARC21"
> > FT mRNA <1..>546
> > FT /gene="ARC21"
> > FT /product="actin-related protein 3"
> > FT CDS 1..546
> > FT /gene="ARC21"
> > FT /note="actin-binding protein; ARPC3 21 kDa; putative
> > FT member of Arp2/3 complex"
> > FT /codon_start=1
> > FT /product="actin-related protein 3"
> > FT /protein_id="ABF13401.1"
> > FT /db_xref="GI:93360014"
> > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> > FT FPEKDGTGNKFWMAFAKRPFLASS"
> > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60
> > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120
> > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180
> > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240
> > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300
> > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360
> > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420
> > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480
> > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540
> > agttag 546
> > //
> > ID DQ472185 standard; DNA; INV; 543 BP.
> > XX
> > AC DQ472185;
> > XX
> > SV DQ472185.1
> > DT 15-MAY-2006
> > XX
> > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> > DE complete cds.
> > XX
> > KW .
> > XX
> > OS Trypanosoma cruzi strain CL Brener
> > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > OC Schizotrypanum.
> > XX
> > RN [1]
> > RP 1-543
> > RA De Melo L.D.B.;
> > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > RL Unpublished.
> > XX
> > RN [2]
> > RP 1-543
> > RA De Melo L.D.B.;
> > RT ;
> > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > RL 21949-900, Brazil
> > XX
> > FH Key Location/Qualifiers
> > FH
> > FT source 1..543
> > FT /organism="Trypanosoma cruzi strain CL Brener"
> > FT /mol_type="genomic DNA"
> > FT /strain="CL Brener"
> > FT /db_xref="taxon:353153"
> > FT gene <1..>543
> > FT /gene="ARC20"
> > FT /note="TcARC20"
> > FT mRNA <1..>543
> > FT /gene="ARC20"
> > FT /product="actin-related protein 4"
> > FT CDS 1..543
> > FT /gene="ARC20"
> > FT /note="actin-binding protein; ARPC4 20 kDa; putative
> > FT member of Arp2/3 complex"
> > FT /codon_start=1
> > FT /product="actin-related protein 4"
> > FT /protein_id="ABF13402.1"
> > FT /db_xref="GI:93360016"
> > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> > FT MKLNVNQRARRAAMEFFLALNFT"
> > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60
> > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120
> > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180
> > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240
> > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300
> > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360
> > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420
> > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
> > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
> > tga 543
> > //
> > +++++++++++++++++++++++++++++++++
> >
> > It looks to me like there's some kind of problem with parsing the
> > sequence version number. I even tried the sequence from the test
> > directory (AY069118.em) with the same outcome.
> >
> > Regards,
> >
> > Seth
> >
> > On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > Hi Seth.
> > >
> > > Your second point, about the authors string not being read correctly in
> > > Genbank format, has been fixed (or should have been if I got the code
> > > right!). Could you check the latest version of biojava-live out of CVS
> > > and give it another go? Basically the parser did not recognise the
> > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > NCBI, which is what I based the parser on.
> > >
> > > I've set it up now so that it reads the CONSRTM tag, but the value is
> > > merged with the authors tag with (consortium) appended. There will still
> > > be problems if the consortium value has commas in it - not sure how to
> > > fix this yet.
> > >
> > > Your first point is harder to solve because you did not provide a
> > > complete stack trace for the exceptions you are getting. The complete
> > > stack trace would enable me to identify exactly where things are going
> > > wrong and give me a better chance of fixing them. Could you send the
> > > stack trace, and I'll see what I can do.
> > >
> > > cheers,
> > > Richard
> > >
> > >
> > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote:
> > > > Hi All,
> > > >
> > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some
> > > > clarification on several issues that I'm having.
> > > > I am developing a parser that would take as input "NCBI Incremental
> > > > ASN.1 Sequence Updates to Genbank" files (
> > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the
> > > > ASN2GB converter (
> > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> > > > resulting sequences to a format parsable by BioJava(X) (
> > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> > > > my problems start.
> > > >
> > > > ISSUE #1:
> > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank
> > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank
> > > > format is recognized by the
> > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
> > > > some exceptions that I'll describe in issue #2. This is the code that
> > > > I'm using to parse, for example, the EMBL output:
> > > >
> > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
> > > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> > > >         SimpleNamespace.class, new Object[]{"gbSpace"});
> > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
> > > > while (gbSeqs.hasNext()) {
> > > >     try {
> > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > >         // Further processing of RichSequence object from here
> > > >
> > > >     } catch (BioException be) {
> > > >         be.printStackTrace();
> > > >     }
> > > > }
> > > >
> > > > The multi-sequence EMBL file looks like this:
> > > > ---------------------------------------------------------------------------------
> > > > ID DQ472184 standard; DNA; INV; 546 BP.
> > > > XX
> > > > AC DQ472184;
> > > > XX
> > > > SV DQ472184.1
> > > > DT 15-MAY-2006
> > > > XX
> > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > > > DE complete cds.
> > > > XX
> > > > KW .
> > > > XX
> > > > OS Trypanosoma cruzi strain CL Brener
> > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > > OC Schizotrypanum.
> > > > XX
> > > > RN [1]
> > > > RP 1-546
> > > > RA De Melo L.D.B.;
> > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > > RL Unpublished.
> > > > XX
> > > > RN [2]
> > > > RP 1-546
> > > > RA De Melo L.D.B.;
> > > > RT ;
> > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > > RL 21949-900, Brazil
> > > > XX
> > > > FH Key Location/Qualifiers
> > > > FH
> > > > FT source 1..546
> > > > FT /organism="Trypanosoma cruzi strain CL Brener"
> > > > FT /mol_type="genomic DNA"
> > > > FT /strain="CL Brener"
> > > > FT /db_xref="taxon:353153"
> > > > FT gene <1..>546
> > > > FT /gene="ARC21"
> > > > FT /note="TcARC21"
> > > > FT mRNA <1..>546
> > > > FT /gene="ARC21"
> > > > FT /product="actin-related protein 3"
> > > > FT CDS 1..546
> > > > FT /gene="ARC21"
> > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative
> > > > FT member of Arp2/3 complex"
> > > > FT /codon_start=1
> > > > FT /product="actin-related protein 3"
> > > > FT /protein_id="ABF13401.1"
> > > > FT /db_xref="GI:93360014"
> > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> > > > FT FPEKDGTGNKFWMAFAKRPFLASS"
> > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60
> > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120
> > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180
> > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240
> > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300
> > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360
> > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420
> > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480
> > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540
> > > > agttag 546
> > > > //
> > > > ID DQ472185 standard; DNA; INV; 543 BP.
> > > > XX
> > > > AC DQ472185;
> > > > XX
> > > > SV DQ472185.1
> > > > DT 15-MAY-2006
> > > > XX
> > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> > > > DE complete cds.
> > > > XX
> > > > KW .
> > > > XX
> > > > OS Trypanosoma cruzi strain CL Brener
> > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > > OC Schizotrypanum.
> > > > XX
> > > > RN [1]
> > > > RP 1-543
> > > > RA De Melo L.D.B.;
> > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > > RL Unpublished.
> > > > XX
> > > > RN [2]
> > > > RP 1-543
> > > > RA De Melo L.D.B.;
> > > > RT ;
> > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > > RL 21949-900, Brazil
> > > > XX
> > > > FH Key Location/Qualifiers
> > > > FH
> > > > FT source 1..543
> > > > FT /organism="Trypanosoma cruzi strain CL Brener"
> > > > FT /mol_type="genomic DNA"
> > > > FT /strain="CL Brener"
> > > > FT /db_xref="taxon:353153"
> > > > FT gene <1..>543
> > > > FT /gene="ARC20"
> > > > FT /note="TcARC20"
> > > > FT mRNA <1..>543
> > > > FT /gene="ARC20"
> > > > FT /product="actin-related protein 4"
> > > > FT CDS 1..543
> > > > FT /gene="ARC20"
> > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative
> > > > FT member of Arp2/3 complex"
> > > > FT /codon_start=1
> > > > FT /product="actin-related protein 4"
> > > > FT /protein_id="ABF13402.1"
> > > > FT /db_xref="GI:93360016"
> > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> > > > FT MKLNVNQRARRAAMEFFLALNFT"
> > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60
> > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120
> > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180
> > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240
> > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300
> > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360
> > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420
> > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
> > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
> > > > tga 543
> > > > //
> > > > -----------------------------------------------------------------------
> > > > I get the exception message "Could not read sequence". The same thing
> > > > happens if I use the readINSDSetDNA reader instead of the readEMBLDNA one
> > > > with the following INSDSet file (beginning of the file):
> > > >
> > > > <?xml version="1.0"?>
> > > > <!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
> > > > <INSDSeq>
> > > > <INSDSeq_locus>DQ022078</INSDSeq_locus>
> > > > <INSDSeq_length>16729</INSDSeq_length>
> > > > <INSDSeq_moltype>DNA</INSDSeq_moltype>
> > > > <INSDSeq_topology>linear</INSDSeq_topology>
> > > > <INSDSeq_division>ENV</INSDSeq_division>
> > > > <INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
> > > > <INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
> > > > <INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
> > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
> > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
> > > > class C (estA3), putative permease (a3.005), putative transmembrane
> > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
> > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
> > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
> > > > protein (a3.012), putative membrane protease subunit (a3.013),
> > > > putative haloalkane dehalogenase (a3.014), putative transcriptional
> > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
> > > > hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
> > > > <INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
> > > > <INSDSeq_other-seqids>
> > > > <INSDSeqid>gb|DQ022078.1|</INSDSeqid>
> > > > <INSDSeqid>gi|71842722</INSDSeqid>
> > > > </INSDSeq_other-seqids>
> > > > <INSDSeq_keywords>
> > > > <INSDKeyword>ENV</INSDKeyword>
> > > > </INSDSeq_keywords>
> > > > <INSDSeq_references>
> > > > <INSDReference>
> > > > <INSDReference_reference>?</INSDReference_reference>
> > > > <INSDReference_position>1..16729</INSDReference_position>
> > > > <INSDReference_authors>
> > > > <INSDAuthor>Schmeisser,C.</INSDAuthor>
> > > > <INSDAuthor>Elend,C.</INSDAuthor>
> > > > <INSDAuthor>Streit,W.R.</INSDAuthor>
> > > > </INSDReference_authors>
> > > > <INSDReference_title>Isolation and biochemical characterization
> > > > of two novel metagenome derived esterases</INSDReference_title>
> > > > <INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
> > > > (2006)</INSDReference_journal>
> > > > </INSDReference>
> > > > <INSDReference>
> > > > <INSDReference_reference>?</INSDReference_reference>
> > > > <INSDReference_position>1..16729</INSDReference_position>
> > > > <INSDReference_authors>
> > > > <INSDAuthor>Schmeisser,C.</INSDAuthor>
> > > > <INSDAuthor>Elend,C.</INSDAuthor>
> > > > <INSDAuthor>Streit,W.R.</INSDAuthor>
> > > > </INSDReference_authors>
> > > > <INSDReference_journal>Submitted (29-APR-2005) to the
> > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
> > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
> > > > Germany</INSDReference_journal>
> > > > </INSDReference>
> > > > </INSDSeq_references>
> > > >
> > > > So my question is whether ASN2GB produces output that's
> > > > incompatible with the BioJava parsers, whether there is a problem with
> > > > the sequences themselves, or whether most of the parsers have problems.
> > > > Could it be that I'm using the API wrongly for the above formats?
> > > > The GenBank parser works as advertised, with the exceptions
> > > > described below:
> > > >
> > > > ISSUE #2:
> > > > When I try to parse GenBank files using the following code:
> > > >
> > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
> > > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> > > >         SimpleNamespace.class, new Object[]{"gbSpace"});
> > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
> > > > while (gbSeqs.hasNext()) {
> > > >     try {
> > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > >         // Further processing of RichSequence object from here
> > > >
> > > >     } catch (BioException be) {
> > > >         be.printStackTrace();
> > > >     }
> > > > }
> > > >
> > > > Genbank file in question:
> > > >
> > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006
> > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
> > > > IMAGE:30915482), complete cds.
> > > > ACCESSION BC074905
> > > > VERSION BC074905.2 GI:50959825
> > > > KEYWORDS MGC.
> > > > SOURCE Homo sapiens (human)
> > > > ORGANISM Homo sapiens
> > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> > > > Catarrhini; Hominidae; Homo.
> > > > REFERENCE 1 (bases 1 to 838)
> > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
> > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
> > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
> > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
> > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
> > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
> > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
> > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
> > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
> > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
> > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
> > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
> > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
> > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
> > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
> > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
> > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
> > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
> > > > CONSRTM Mammalian Gene Collection Program Team
> > > > TITLE Generation and initial analysis of more than 15,000 full-length
> > > > human and mouse cDNA sequences
> > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
> > > > PUBMED 12477932
> > > > REFERENCE 2 (bases 1 to 838)
> > > > CONSRTM NIH MGC Project
> > > > TITLE Direct Submission
> > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian
> > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA
> > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov
> > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832.
> > > > Contact: MGC help desk
> > > > Email: cgapbs-r at mail.nih.gov
> > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
> > > > Center
> > > > cDNA Library Preparation: British Columbia Cancer Research Center
> > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
> > > > DNA Sequencing by: Genome Sequence Centre,
> > > > BC Cancer Agency, Vancouver, BC, Canada
> > > > info at bcgsc.bc.ca
> > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
> > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
> > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
> > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
> > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
> > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
> > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR
> > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
> > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.
> > > >
> > > > Clone distribution: MGC clone distribution information can be found
> > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
> > > > Series: IRBU Plate: 4 Row: C Column: 3.
> > > >
> > > > Differences found between this sequence and the human reference
> > > > genome (build 36) are described in misc_difference features below.
> > > > FEATURES Location/Qualifiers
> > > > source 1..838
> > > > /organism="Homo sapiens"
> > > > /mol_type="mRNA"
> > > > /db_xref="taxon:9606"
> > > > /clone="MGC:104038 IMAGE:30915482"
> > > > /tissue_type="Lung, PCR rescued clones"
> > > > /clone_lib="NIH_MGC_273"
> > > > /lab_host="DH10B"
> > > > /note="Vector: pCR4 Topo TA with reversed insert"
> > > > gene 1..838
> > > > /gene="KLK14"
> > > > /note="synonym: KLK-L6"
> > > > /db_xref="GeneID:43847"
> > > > /db_xref="HGNC:6362"
> > > > /db_xref="IMGT/GENE-DB:6362"
> > > > /db_xref="MIM:606135"
> > > > CDS 49..804
> > > > /gene="KLK14"
> > > > /codon_start=1
> > > > /product="KLK14 protein"
> > > > /protein_id="AAH74905.1"
> > > > /db_xref="GI:50959826"
> > > > /db_xref="GeneID:43847"
> > > > /db_xref="HGNC:6362"
> > > > /db_xref="IMGT/GENE-DB:6362"
> > > > /db_xref="MIM:606135"
> > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
> > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
> > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
> > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
> > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
> > > > misc_difference 98
> > > > /gene="KLK14"
> > > > /note="'G' in cDNA is 'A' in the human genome; amino acid
> > > > difference: 'R' in cDNA, 'Q' in the human genome."
> > > > misc_difference 133
> > > > /gene="KLK14"
> > > > /note="'T' in cDNA is 'C' in the human genome; amino acid
> > > > difference: 'Y' in cDNA, 'H' in the human genome."
> > > > ORIGIN
> > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
> > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
> > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
> > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
> > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
> > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
> > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
> > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
> > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
> > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
> > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
> > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
> > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
> > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
> > > > //
> > > >
> > > > I get the following exception:
> > > >
> > > > java.lang.IllegalArgumentException: Authors string cannot be null
> > > > org.biojava.bio.BioException: Could not read sequence
> > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
> > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
> > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
> > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
> > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
> > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
> > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > >
> > > > -----------------------------------------------------------------------
> > > >
> > > > I'm trying to see what could be the problem with this particular
> > > > sequence. Looks to me like the AUTHORS portion is not getting parsed
> > > > correctly. Any ideas would be greatly appreciated!
> > > >
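One way to narrow this down before handing the file to the parser is to check whether every REFERENCE block in the flat file actually has an AUTHORS line. In the part of the record quoted above, the Direct Submission reference shows a CONSRTM line immediately followed by TITLE; if that reference has no AUTHORS line at all, a null authors string reaching parseAuthorString would be consistent with the trace. That is only a guess from the quoted text, not a confirmed diagnosis. The sketch below uses only the Java standard library; the class and method names are hypothetical and not part of BioJava, and it matches keywords on trimmed lines because the whitespace in the quoted record has been collapsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-check helper; not part of BioJava. */
final class ReferenceAuthorCheck {

    /**
     * Returns the 1-based positions of REFERENCE blocks that contain no
     * AUTHORS line. Keywords are matched on trimmed lines, so a wrapped
     * continuation line that happens to start with a keyword would be
     * misread; on an unmodified flat file, matching on column position
     * would be stricter.
     */
    static List<Integer> referencesWithoutAuthors(List<String> lines) {
        List<Integer> missing = new ArrayList<>();
        int refIndex = 0;            // number of REFERENCE blocks seen so far
        boolean inReference = false; // currently inside a REFERENCE block?
        boolean sawAuthors = false;  // has the current block shown AUTHORS?

        for (String raw : lines) {
            String line = raw.trim();
            boolean startsReference = line.startsWith("REFERENCE");
            boolean endsReferences = line.startsWith("COMMENT")
                    || line.startsWith("FEATURES")
                    || line.startsWith("ORIGIN")
                    || line.equals("//");

            if (startsReference || endsReferences) {
                // Close the previous block, if any, and record it when AUTHORS was absent.
                if (inReference && !sawAuthors) {
                    missing.add(refIndex);
                }
                inReference = startsReference;
                sawAuthors = false;
                if (startsReference) {
                    refIndex++;
                }
            } else if (inReference && line.startsWith("AUTHORS")) {
                sawAuthors = true;
            }
        }
        // A block left open at end of input is checked as well.
        if (inReference && !sawAuthors) {
            missing.add(refIndex);
        }
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative input only: the first reference is made up, the second
        // mirrors the CONSRTM/TITLE lines of the record quoted above.
        List<String> sample = Arrays.asList(
                "REFERENCE 1 (bases 1 to 838)",
                "AUTHORS Example,A.",
                "TITLE Example title",
                "REFERENCE 2 (bases 1 to 838)",
                "CONSRTM NIH MGC Project",
                "TITLE Direct Submission",
                "COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832.",
                "//");
        System.out.println(referencesWithoutAuthors(sample)); // prints [2]
    }
}
```

If the check flags a reference in the failing record, that would support the idea that the parser is not coping with references that list only a consortium.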
> > > --
> > > Richard Holland (BioMart Team)
> > > EMBL-EBI
> > > Wellcome Trust Genome Campus
> > > Hinxton
> > > Cambridge CB10 1SD
> > > UNITED KINGDOM
> > > Tel: +44-(0)1223-494416
> > >
> > >
> >
> >
--
Best Regards,
Seth Johnson
Senior Bioinformatics Associate
Ph: (202) 470-0900
Fx: (775) 251-0358