[Biojava-l] Genbank parser error [biojavax]

Morgane THOMAS-CHOLLIER mthomasc at vub.ac.be
Wed Feb 15 05:04:22 EST 2006


Hi Mark,

I have downloaded the fixed version and tested it with my large file. 
Works great.

Thank you very much,

Morgane.

mark.schreiber at novartis.com wrote:

>Hi Morgane -
>
>Turned out to be a problem with a greedy regexp parsing the LOCUS tag. 
>This is fixed in CVS. Let me know if something else is a problem.
>
>- Mark
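
To illustrate how a greedy quantifier can mis-split a LOCUS line like the one in this thread, here is a minimal Java sketch. Both patterns are hypothetical and written only for this example; they are not the expressions used in GenbankFormat.java, and the sketch does not claim to reproduce the actual bug. It assumes the LOCUS tag itself has already been stripped, as the text in the exception message suggests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLocusDemo {
    // Hypothetical patterns for illustration; NOT taken from GenbankFormat.java.
    // Greedy: (.+) grabs as much as it can and gives back only what it must.
    static final Pattern GREEDY =
        Pattern.compile("^(.+)\\s*(\\d+)\\s+bp\\s+(.+)$");
    // Reluctant: (.+?) grows one character at a time until the rest matches.
    static final Pattern RELUCTANT =
        Pattern.compile("^(.+?)\\s*(\\d+)\\s+bp\\s+(.+)$");

    // Returns {name, length, remainder}, or null if the line does not match.
    static String[] split(Pattern p, String locus) {
        Matcher m = p.matcher(locus);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String locus = "6 3498 bp DNA HTG 14-FEB-2006";
        String[] greedy = split(GREEDY, locus);
        String[] reluctant = split(RELUCTANT, locus);
        System.out.println("greedy:    name=[" + greedy[0] + "] length=[" + greedy[1] + "]");
        System.out.println("reluctant: name=[" + reluctant[0] + "] length=[" + reluctant[1] + "]");
    }
}
```

For the Ensembl line, the greedy pattern reports the name as "6 349" and the length as "8", because `(.+)` keeps every character it can and leaves a single digit for `(\d+)`. The reluctant pattern reports "6" and "3498".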
>
>
>
>
>
>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>Sent by: biojava-l-bounces at portal.open-bio.org
>02/14/2006 09:33 PM
>
> 
>        To:     biojava-l at biojava.org
>        cc:     (bcc: Mark Schreiber/GP/Novartis)
>        Subject:        Re: [Biojava-l] Genbank  parser error [biojavax]
>
>
>Hello Mark,
>
>My file is indeed too large to be posted,
>so I exported a smaller sequence from Ensembl and tested it with
>the parser. The behavior is the same.
>The "Genbank"-formatted file is included below.
>
>Thanks for your help,
>
>Morgane.
>
>LOCUS       6 3498 bp DNA HTG 14-FEB-2006
>DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
>            52305503..52309000 reannotated via EnsEMBL
>ACCESSION   chromosome:NCBIM34:6:52305503:52309000:1
>VERSION     chromosome:NCBIM34:6:52305503:52309000:1
>KEYWORDS    .
>SOURCE      House mouse
>  ORGANISM  Mus musculus
>            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
>            Sciurognathi; Muridae; Murinae; Mus.
>COMMENT     This sequence was annotated by the Ensembl system. Please visit the
>            Ensembl web site, http://www.ensembl.org/ for more information.
>COMMENT     All feature locations are relative to the first (5') base of the
>            sequence in this file.  The sequence presented is always the
>            forward strand of the assembly. Features that lie outside of the
>            sequence contained in this file have clonal location coordinates in
>            the format: .:..
>COMMENT     The /gene indicates a unique id for a gene,
>            /note="transcript_id=..." a unique id for a transcript, /protein_id
>            a unique id for a peptide and note="exon_id=..." a unique id for an
>            exon. These ids are maintained wherever possible between versions.
>COMMENT     All the exons and transcripts in Ensembl are confirmed by
>            similarity to either protein or cDNA sequences.
>FEATURES             Location/Qualifiers
>     source          1..3498
>                     /organism="Mus musculus"
>                     /db_xref="taxon:10090"
>     gene            complement(506..2826)
>                     /gene=ENSMUSG00000014704
>     mRNA            join(complement(2261..2826),complement(506..1620))
>                     /gene="ENSMUSG00000014704"
>                     /note="transcript_id=ENSMUST00000014848"
>     CDS             join(complement(2261..2639),complement(881..1620))
>                     /gene="ENSMUSG00000014704"
>                     /protein_id="ENSMUSP00000014848"
>                     /note="transcript_id=ENSMUST00000014848"
>                     /db_xref="MarkerSymbol:Hoxa2"
>                     /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
>                     /db_xref="RefSeq_peptide:NP_034581.1"
>                     /db_xref="RefSeq_dna:NM_010451.1"
>                     /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
>                     /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
>                     /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
>                     /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
>                     /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
>                     /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
>                     /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
>                     /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
>                     /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
>                     /db_xref="EntrezGene:15399"
>                     /db_xref="AgilentProbe:A_51_P501803"
>                     /db_xref="EMBL:AB039184"
>                     /db_xref="EMBL:AB039185"
>                     /db_xref="EMBL:AB039186"
>                     /db_xref="EMBL:AB039187"
>                     /db_xref="EMBL:AB039188"
>                     /db_xref="EMBL:AB039189"
>                     /db_xref="EMBL:AB039190"
>                     /db_xref="EMBL:AB039191"
>                     /db_xref="EMBL:AB039192"
>                     /db_xref="EMBL:AK134501"
>                     /db_xref="EMBL:M87801"
>                     /db_xref="EMBL:M93148"
>                     /db_xref="EMBL:M93292"
>                     /db_xref="EMBL:M95599"
>                     /db_xref="GO:GO:0003700"
>                     /db_xref="GO:GO:0005634"
>                     /db_xref="GO:GO:0006355"
>                     /db_xref="GO:GO:0007275"
>                     /db_xref="IPI:IPI00132242.1"
>                     /db_xref="UniGene:Mm.131"
>                     /db_xref="protein_id:AAA37827.1"
>                     /db_xref="protein_id:AAA37834.1"
>                     /db_xref="protein_id:AAA37835.1"
>                     /db_xref="protein_id:AAA37836.1"
>                     /db_xref="protein_id:BAB68708.1"
>                     /db_xref="protein_id:BAB68709.1"
>                     /db_xref="protein_id:BAB68710.1"
>                     /db_xref="protein_id:BAB68711.1"
>                     /db_xref="protein_id:BAB68712.1"
>                     /db_xref="protein_id:BAB68713.1"
>                     /db_xref="protein_id:BAB68714.1"
>                     /db_xref="protein_id:BAB68715.1"
>                     /db_xref="protein_id:BAB68716.1"
>                     /db_xref="protein_id:BAE22163.1"
>                     /db_xref="AFFY_MG_U74Av2:102643_at"
>                     /db_xref="AFFY_MG_U74Cv2:171063_at"
>                     /db_xref="AFFY_Mouse430A_2:1419602_at"
>                     /db_xref="AFFY_Mouse430_2:1419602_at"
>                     /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
>                     STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
>                     KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
>                     NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
>                     VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
>                     EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
>                     ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
>     exon            complement(506..1620)
>                     /note="exon_id=ENSMUSE00000387033"
>     exon            complement(2261..2826)
>                     /note="exon_id=ENSMUSE00000193269"
>BASE COUNT  938 a 815 c 882 g 863 t
>ORIGIN
>        1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA ATTTTTGATA
>       61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC ACTCCACTCG
>      121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG CTTGGGCTAG
>      181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG GCCTGAGTCT
>      241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA AAAAAAAAAA
>      301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT TTGTTGCAGG
>      361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG TGACCAGACT
>      421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT TGAGAAAGAG
>      481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA CCAAAAATAC
>      541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG ACAATTTATG
>      601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA AGCTTGTTGG
>      661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC TTTAAAACTG
>      721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG GGTAGATCAA
>      781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA CCTGGTCAAA
>      841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC AGATGCTGTA
>      901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG ATATCTACAG
>      961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC AGGCAGGAAT
>     1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG GGACTGTCAT
>     1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA ACAGTGGGTG
>     1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA ACTGGGAAAG
>     1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA TTTTGCTGAA
>     1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC TCAAAGAGTG
>     1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA AATTTCCCTT
>     1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC CGGTTCTGAA
>     1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC ACCCTGCGGG
>     1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC TGAGTGTTGG
>     1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT TCCAGGGATT
>     1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG GGTCCGAGCA
>     1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA AATGGCCGCC
>     1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG GGAAGCCCAG
>     1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC ATCCGGGAGC
>     1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT AGCTGAGCAA
>     1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA CTAGACAAGA
>     1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG AAAGTGCCCC
>     2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC CCTCCACCAA
>     2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC TCTCTCCCCC
>     2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC CGGAGGGGGA
>     2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC GAGGCAGGCA
>     2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT CTTCTCCTTC
>     2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC GCGACTGCCC
>     2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG ACTGCCCGGG
>     2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG TGAAAGCGTC
>     2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA TGTCAGGCAC
>     2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC GTAATTCATG
>     2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG CTTTGGGGGG
>     2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG AAGATCGCTG
>     2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG CTACTATTAA
>     2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA CATGATTGCT
>     2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT GATTGATCCA
>     2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC ACTTTTTTTC
>     3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC GTGGGGGGCG
>     3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA GTGTGTGTGT
>     3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG CCTCCCCCGC
>     3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA AATCATTTAA
>     3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT CAAAGTTTTG
>     3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG AAAGGAGCAG
>     3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA GAGAGAGAGA
>     3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC TCTTCCTCCT
>     3481 CTTTTTCCAA AATCAGTT
>//
>
>
>
>
>mark.schreiber at novartis.com wrote:
>
>>Hi Morgane -
>>
>>I have to say that doesn't look much like Genbank : )
>>
>>The biojavax parsers are possibly a bit brittle due to their use of regexps
>>to recognize key elements. It should be fixable; I think the problem is
>>that the parser expects a word after LOCUS, not a number. This may not be
>>the only problem though. Could you post the entire file? Or, if it is large,
>>a representative file of smaller size.
>>
>>- Mark
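
As a sketch of what a more tolerant LOCUS-line check could look like, the Java snippet below accepts any run of non-whitespace characters as the locus name, so a purely numeric name such as "6" is treated like any other. The pattern, the class name `LocusLineSketch`, the optional topology field, and the second sample line ("EXAMPLE1 ...") are all assumptions made for illustration; none of them come from the biojavax source or from the fix that went into CVS.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocusLineSketch {
    // Hypothetical pattern: name, length, "bp", molecule type,
    // optional topology, division, date. Not the biojavax expression.
    static final Pattern LOCUS = Pattern.compile(
        "^(?<name>\\S+)\\s+(?<length>\\d+)\\s+bp\\s+(?<type>\\S+)\\s+"
        + "(?:(?<topology>linear|circular)\\s+)?"
        + "(?<division>\\S+)\\s+(?<date>\\d{1,2}-[A-Z]{3}-\\d{4})$");

    // Returns {name, length, type, topology, division, date};
    // topology is null when the line has no such field.
    static String[] parse(String locus) {
        Matcher m = LOCUS.matcher(locus.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Bad locus line found: " + locus);
        }
        return new String[] {
            m.group("name"), m.group("length"), m.group("type"),
            m.group("topology"), m.group("division"), m.group("date")
        };
    }

    public static void main(String[] args) {
        // First line is from this thread; second is an invented example.
        String[] lines = {
            "6 489671 bp DNA HTG 13-FEB-2006",
            "EXAMPLE1 1200 bp mRNA linear ROD 14-FEB-2006"
        };
        for (String line : lines) {
            String[] f = parse(line);
            System.out.println("name=" + f[0] + " length=" + f[1]
                + " topology=" + f[3] + " division=" + f[4]);
        }
    }
}
```

Using `\S+` for the name rather than a word-only class is the design choice that matters here: the name is delimited by whitespace, so the pattern never has to decide whether it "looks like" an identifier.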
>>
>>
>>
>>
>>
>>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>>Sent by: biojava-l-bounces at portal.open-bio.org
>>02/14/2006 04:36 AM
>>
>>
>>       To:     biojava-l at biojava.org
>>       cc:     (bcc: Mark Schreiber/GP/Novartis)
>>       Subject:        [Biojava-l] Genbank  parser error [biojavax]
>>
>>
>>Hello,
>>
>>I tried biojavax today with a view to using the Genbank file parser.
>>
>>My test file is a Genbank-formatted file produced by the Ensembl
>>export system.
>>
>>The head of the file is as follows:
>>
>>LOCUS       6 489671 bp DNA HTG 13-FEB-2006
>>DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
>>           52296503..52786173 reannotated via EnsEMBL
>>ACCESSION   chromosome:NCBIM34:6:52296503:52786173:1
>>VERSION     chromosome:NCBIM34:6:52296503:52786173:1
>>
>>I used the code provided in the biojavax docbook to parse this file
>>and got the following error:
>>
>>Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
>>   at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
>>   at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
>>Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
>>   at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
>>   at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
>>   ... 1 more
>>
>>I had a look at GenbankFormat.java, and I guess the problem comes from
>>the regular expression, which does not recognize this LOCUS line as a
>>standard Genbank LOCUS tag.
>>
>>Am I wrong? Has the biojavax Genbank parser been tested on
>>Ensembl-exported files?
>>
>>Morgane.
>>

-- 
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)

Vrije Universiteit Brussels (VUB)    
Laboratory of Cell Genetics          
Pleinlaan 2                          
1050 Brussels                        
Belgium                              

Tel : +32 2 629 15 22                		     
**********************************************************
Stop Using Internet Explorer, choose FIREFOX !
http://emmanuel.clement.free.fr/navigateurs/comparatif.htm


