[Biojava-l] Genbank parser error [biojavax]

mark.schreiber at novartis.com
Wed Feb 15 04:00:44 EST 2006


Hi Morgane -

Turned out to be a problem with a greedy regexp parsing the LOCUS tag. 
This is fixed in CVS. Let me know if something else is a problem.
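
To illustrate what a greedy regexp can do to a LOCUS line like the one in this thread, here is a minimal sketch. The class name LocusLineDemo and both patterns are made up for illustration; they are not the expressions used in GenbankFormat.java, and the actual fix in CVS may look different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LocusLineDemo {
    // Hypothetical patterns, for illustration only (not from GenbankFormat.java).
    // Greedy: (.*) takes as much as it can and leaves a single digit for (\d+).
    static final Pattern GREEDY =
            Pattern.compile("^LOCUS\\s+(.*)(\\d+)\\s+bp\\s+(\\S+)");
    // Explicit tokens: the name is one run of non-whitespace, word or number.
    static final Pattern EXPLICIT =
            Pattern.compile("^LOCUS\\s+(\\S+)\\s+(\\d+)\\s+bp\\s+(\\S+)");

    /** Returns {name, length, moleculeType}, or null if the line does not match. */
    static String[] parse(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // LOCUS line copied from the Ensembl export posted in this thread.
        String line = "LOCUS       6 3498 bp DNA HTG 14-FEB-2006";
        for (Pattern p : new Pattern[] { GREEDY, EXPLICIT }) {
            String[] f = parse(p, line);
            System.out.println("name=[" + f[0] + "] length=[" + f[1]
                    + "] type=[" + f[2] + "]");
        }
    }
}
```

With the greedy pattern the name comes out as "6 349" and the length as "8"; with the explicit pattern the name is "6" and the length is "3498". Matching whitespace-delimited tokens avoids depending on how far a wildcard happens to backtrack.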

- Mark





Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
Sent by: biojava-l-bounces at portal.open-bio.org
02/14/2006 09:33 PM

 
        To:     biojava-l at biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        Re: [Biojava-l] Genbank  parser error [biojavax]


Hello Mark,

My file is indeed too large to be posted.
So I have exported a smaller sequence from Ensembl and tested it with
the parser. The behavior is the same.
You will find this "Genbank" formatted file below.

Thanks for your help,

Morgane.

LOCUS       6 3498 bp DNA HTG 14-FEB-2006
DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
            52305503..52309000 reannotated via EnsEMBL
ACCESSION   chromosome:NCBIM34:6:52305503:52309000:1
VERSION     chromosome:NCBIM34:6:52305503:52309000:1
KEYWORDS    .
SOURCE      House mouse
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muridae; Murinae; Mus.
COMMENT     This sequence was annotated by the Ensembl system. Please visit the
            Ensembl web site, http://www.ensembl.org/ for more information.
COMMENT     All feature locations are relative to the first (5') base of the
            sequence in this file.  The sequence presented is always the
            forward strand of the assembly. Features that lie outside of the
            sequence contained in this file have clonal location coordinates in
            the format: .:..
COMMENT     The /gene indicates a unique id for a gene,
            /note="transcript_id=..." a unique id for a transcript, /protein_id
            a unique id for a peptide and note="exon_id=..." a unique id for an
            exon. These ids are maintained wherever possible between versions.
COMMENT     All the exons and transcripts in Ensembl are confirmed by
            similarity to either protein or cDNA sequences.
FEATURES             Location/Qualifiers
     source          1..3498
                     /organism="Mus musculus"
                     /db_xref="taxon:10090"
     gene            complement(506..2826)
                     /gene=ENSMUSG00000014704
     mRNA            join(complement(2261..2826),complement(506..1620))
                     /gene="ENSMUSG00000014704"
                     /note="transcript_id=ENSMUST00000014848"
     CDS             join(complement(2261..2639),complement(881..1620))
                     /gene="ENSMUSG00000014704"
                     /protein_id="ENSMUSP00000014848"
                     /note="transcript_id=ENSMUST00000014848"
                     /db_xref="MarkerSymbol:Hoxa2"
                     /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
                     /db_xref="RefSeq_peptide:NP_034581.1"
                     /db_xref="RefSeq_dna:NM_010451.1"
                     /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
                     /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
                     /db_xref="EntrezGene:15399"
                     /db_xref="AgilentProbe:A_51_P501803"
                     /db_xref="EMBL:AB039184"
                     /db_xref="EMBL:AB039185"
                     /db_xref="EMBL:AB039186"
                     /db_xref="EMBL:AB039187"
                     /db_xref="EMBL:AB039188"
                     /db_xref="EMBL:AB039189"
                     /db_xref="EMBL:AB039190"
                     /db_xref="EMBL:AB039191"
                     /db_xref="EMBL:AB039192"
                     /db_xref="EMBL:AK134501"
                     /db_xref="EMBL:M87801"
                     /db_xref="EMBL:M93148"
                     /db_xref="EMBL:M93292"
                     /db_xref="EMBL:M95599"
                     /db_xref="GO:GO:0003700"
                     /db_xref="GO:GO:0005634"
                     /db_xref="GO:GO:0006355"
                     /db_xref="GO:GO:0007275"
                     /db_xref="IPI:IPI00132242.1"
                     /db_xref="UniGene:Mm.131"
                     /db_xref="protein_id:AAA37827.1"
                     /db_xref="protein_id:AAA37834.1"
                     /db_xref="protein_id:AAA37835.1"
                     /db_xref="protein_id:AAA37836.1"
                     /db_xref="protein_id:BAB68708.1"
                     /db_xref="protein_id:BAB68709.1"
                     /db_xref="protein_id:BAB68710.1"
                     /db_xref="protein_id:BAB68711.1"
                     /db_xref="protein_id:BAB68712.1"
                     /db_xref="protein_id:BAB68713.1"
                     /db_xref="protein_id:BAB68714.1"
                     /db_xref="protein_id:BAB68715.1"
                     /db_xref="protein_id:BAB68716.1"
                     /db_xref="protein_id:BAE22163.1"
                     /db_xref="AFFY_MG_U74Av2:102643_at"
                     /db_xref="AFFY_MG_U74Cv2:171063_at"
                     /db_xref="AFFY_Mouse430A_2:1419602_at"
                     /db_xref="AFFY_Mouse430_2:1419602_at"
                     /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
                     STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
                     KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
                     NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
                     VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
                     EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
                     ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
     exon            complement(506..1620)
                     /note="exon_id=ENSMUSE00000387033"
     exon            complement(2261..2826)
                     /note="exon_id=ENSMUSE00000193269"
BASE COUNT  938 a 815 c 882 g 863 t
ORIGIN
        1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA ATTTTTGATA
       61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC ACTCCACTCG
      121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG CTTGGGCTAG
      181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG GCCTGAGTCT
      241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA AAAAAAAAAA
      301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT TTGTTGCAGG
      361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG TGACCAGACT
      421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT TGAGAAAGAG
      481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA CCAAAAATAC
      541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG ACAATTTATG
      601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA AGCTTGTTGG
      661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC TTTAAAACTG
      721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG GGTAGATCAA
      781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA CCTGGTCAAA
      841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC AGATGCTGTA
      901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG ATATCTACAG
      961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC AGGCAGGAAT
     1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG GGACTGTCAT
     1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA ACAGTGGGTG
     1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA ACTGGGAAAG
     1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA TTTTGCTGAA
     1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC TCAAAGAGTG
     1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA AATTTCCCTT
     1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC CGGTTCTGAA
     1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC ACCCTGCGGG
     1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC TGAGTGTTGG
     1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT TCCAGGGATT
     1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG GGTCCGAGCA
     1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA AATGGCCGCC
     1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG GGAAGCCCAG
     1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC ATCCGGGAGC
     1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT AGCTGAGCAA
     1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA CTAGACAAGA
     1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG AAAGTGCCCC
     2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC CCTCCACCAA
     2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC TCTCTCCCCC
     2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC CGGAGGGGGA
     2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC GAGGCAGGCA
     2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT CTTCTCCTTC
     2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC GCGACTGCCC
     2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG ACTGCCCGGG
     2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG TGAAAGCGTC
     2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA TGTCAGGCAC
     2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC GTAATTCATG
     2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG CTTTGGGGGG
     2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG AAGATCGCTG
     2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG CTACTATTAA
     2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA CATGATTGCT
     2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT GATTGATCCA
     2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC ACTTTTTTTC
     3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC GTGGGGGGCG
     3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA GTGTGTGTGT
     3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG CCTCCCCCGC
     3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA AATCATTTAA
     3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT CAAAGTTTTG
     3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG AAAGGAGCAG
     3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA GAGAGAGAGA
     3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC TCTTCCTCCT
     3481 CTTTTTCCAA AATCAGTT
//




mark.schreiber at novartis.com wrote:

>Hi Morgane -
>
>I have to say that doesn't look much like Genbank : )
>
>The biojavax parsers are possibly a bit brittle due to their use of regexps
>to recognize key elements. It should be fixable; I think the problem is
>that the parser expects a word after LOCUS, not a number. This may not be
>the only problem though. Could you post the entire file? Or, if it is large,
>a representative file of smaller size.
>
>- Mark
>
>
>
>
>
>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>Sent by: biojava-l-bounces at portal.open-bio.org
>02/14/2006 04:36 AM
>
> 
>        To:     biojava-l at biojava.org
>        cc:     (bcc: Mark Schreiber/GP/Novartis)
>        Subject:        [Biojava-l] Genbank  parser error [biojavax]
>
>
>Hello,
>
>I tried biojavax today with a view to using the Genbank file parser.
>
>My test file is a Genbank formatted file produced by the
>Ensembl export system.
>
>The head of the file is as follows:
>
>LOCUS       6 489671 bp DNA HTG 13-FEB-2006
>DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
>            52296503..52786173 reannotated via EnsEMBL
>ACCESSION   chromosome:NCBIM34:6:52296503:52786173:1
>VERSION     chromosome:NCBIM34:6:52296503:52786173:1
>
>I used the code provided in the biojavax docbook to parse this file.
>I get the following error:
>
>Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
>    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
>    at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
>Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
>    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
>    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
>    ... 1 more
>
>I had a look at GenbankFormat.java, and I guess the problem comes from
>the regular expression, which does not recognize the LOCUS line as a
>standard Genbank LOCUS tag.
>
>Am I wrong? Has the biojavax Genbank parser been tested on Ensembl
>exported files?
>
>Morgane.
>
> 
>

-- 
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)

Vrije Universiteit Brussels (VUB) 
Laboratory of Cell Genetics 
Pleinlaan 2 
1050 Brussels 
Belgium 

Tel : +32 2 629 15 22 
**********************************************************
Stop Using Internet Explorer, choose FIREFOX !
http://emmanuel.clement.free.fr/navigateurs/comparatif.htm
