[BioRuby] Genbank file parsing question

Josh Earl joshearl1 at hotmail.com
Mon Sep 17 15:39:44 UTC 2012





Hi Nick,
Yeah, sorry about the GenBank example; it appears to have lost all its formatting when I sent the email.  This might be more handy:
http://pastebin.com/N1D7jUuu 
I'm running into several issues.  The first is that whenever I load the file the above excerpt comes from and then call methods on it, this is what happens (for example):
bioruby> gb = Bio::FlatFile.new(Bio::GenBank, '/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk')
  ==> #<Bio::FlatFile:0x00000005237800 @stream=#<Bio::FlatFile::BufferedInputStream:0x000000051bd3c0 @io="/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk", @path=nil, @buffer="">, @dbclass=Bio::GenBank, @splitter=#<Bio::FlatFile::Splitter::Default:0x000000050f3778 @dbclass=Bio::GenBank, @stream=#<Bio::FlatFile::BufferedInputStream:0x000000051bd3c0 @io="/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk", @path=nil, @buffer="">, @entry_pos_flag=nil, @delimiter="\n//\n", @header="LOCUS ", @delimiter_overrun=nil>, @skip_leader_mode=:firsttime, @firsttime_flag=true, @raw=false>

But if I try to call any methods on this:

bioruby> gb.first
NoMethodError: private method `gets' called for "/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk":String
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/buffer.rb:251:in `gets'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/splitter.rb:161:in `skip_leader'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:283:in `next_entry'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:335:in `each_entry'
        from (irb):4:in `first'
        from (irb):4
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:41:in `block in <top (required)>'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `catch'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `<top (required)>'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `load'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `<main>'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `eval'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `<main>'
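
The trace shows `gets` being called on the path String itself (the inspect output has @io set to the path), which suggests the reader expects an already-open IO object rather than a filename. The plain-Ruby sketch below reproduces that failure mode without BioRuby. `RecordReader` and the `/tmp/example.gbk` path are hypothetical stand-ins for illustration, not part of the BioRuby API.

```ruby
require 'stringio'

# Hypothetical stand-in for a flat-file reader: like the object in the
# trace above, it stores whatever it is given and later calls #gets on it.
class RecordReader
  DELIMITER = "\n//\n" # same record delimiter shown in the inspect output

  def initialize(io)
    @io = io
  end

  # Returns the next record as a String, or nil at end of input.
  def next_record
    @io.gets(DELIMITER)
  end
end

data = "LOCUS       a\n//\nLOCUS       b\n//\n"

# Passing an IO-like object works.
reader = RecordReader.new(StringIO.new(data))
first_record = reader.next_record

# Passing a path String fails as soon as #gets is called on it,
# because String has no public #gets method.
path_error = nil
begin
  RecordReader.new('/tmp/example.gbk').next_record
rescue NoMethodError => e
  path_error = e
end
```

If the same reasoning applies to BioRuby, passing `File.open(path)` instead of the path, or using an entry point that accepts a filename, would avoid the error; that is an assumption to check against the BioRuby documentation.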
Opening these files with Bio::FlatFile.auto('Atopobium_vaginae.gbk') seems to work inconsistently, but this file opens OK that way.  Also, Bio::GenBank.new('Atopobium_vaginae.gbk') will open this file and seems to work the most consistently.
Loading into this object truncates the locus ID from ctg7180000000048 to ctg7180000, i.e.:
bioruby> gb.first.locus.entry_id
  ==> "ctg7180000"
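
The truncated value is exactly the first 10 characters of the contig name, which would be consistent with a parser that reads the LOCUS name from a fixed-width column instead of splitting on whitespace. That is a guess, not something confirmed against the BioRuby source. The plain-Ruby sketch below contrasts the two approaches; the column offsets, the helper names, and the example LOCUS line (laid out like the RAST excerpt) are assumptions for illustration.

```ruby
# Hypothetical LOCUS line in the layout of the RAST excerpt:
# keyword padded to 12 columns, then a 16-character contig name.
locus_line = 'LOCUS'.ljust(12) + 'ctg7180000000048' +
             '         4191 bp    DNA     linear   UNK'

# Fixed-width read: assumes the name occupies 10 columns starting at column 12.
def name_by_columns(line)
  line[12, 10].strip
end

# Whitespace read: takes the second whitespace-separated token, whatever its length.
def name_by_split(line)
  line.split[1]
end

fixed_name = name_by_columns(locus_line)
split_name = name_by_split(locus_line)
```

Here `fixed_name` comes out as the truncated "ctg7180000", while `split_name` keeps the full "ctg7180000000048".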
And if I attempt to say something like:
bioruby> gb.first.organism
  ==> ""
it is just an empty string.  Does this variable not get set for each GenBank entry?  The organism is listed under the "source" attribute in the file.
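
In the RAST excerpt the organism appears only as an /organism qualifier on the source feature, so one possible workaround is to read it from the raw record text. This is a minimal plain-Ruby sketch, assuming the qualifier value is double-quoted and fits on one line; `organism_from_qualifier` is a hypothetical helper, not a BioRuby method.

```ruby
# Hypothetical helper: return the value of the first /organism="..." qualifier
# found in raw GenBank text, or nil if there is none.
def organism_from_qualifier(record_text)
  match = record_text.match(/^\s*\/organism="([^"]*)"/)
  match ? match[1] : nil
end

# Shortened record modeled on the RAST excerpt.
record = <<~GENBANK
  LOCUS       ctg7180000000028         4191 bp    DNA     linear   UNK
  FEATURES             Location/Qualifiers
       source          1..4191
                       /mol_type="genomic DNA"
                       /organism="Atopobium vaginae B758"
GENBANK

organism = organism_from_qualifier(record)
```

A qualifier value wrapped across several lines would need extra handling that this sketch does not attempt.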
Not all of these are really errors per se, but odd behavior.
~josh

> Hi Josh,
> 
> I've used the Bio gem to parse several GenBank files from NCBI. The snippet you provided looks like it should be handled correctly, except that it is missing newlines.
> 
> Could you provide more specific details about the errors you are receiving?
> 
> -Nick
> 
> -- 
> Nick Thrower
> Information Technologist
> Michigan State University
> Great Lakes Bioenergy Research Center
> East Lansing MI 48824
> 
> On Sep 14, 2012, at 12:00 PM, bioruby-request at lists.open-bio.org wrote:
> 
> > From: Josh Earl <joshearl1 at hotmail.com>
> > Date: September 13, 2012 1:50:34 PM EDT
> > To: <bioruby at lists.open-bio.org>
> > Subject: [BioRuby] Genbank file parsing question
> > 
> > 
> > 
> > Hello all,
> > I'm trying to use BioRuby to write a small program that will let me rearrange the contigs in a GenBank file (based on a list of contig names, Mauve output, or whatever).  The annotation service that we use (RAST, http://rast.nmpdr.org) produces GenBank files, but not in exactly the format that BioRuby expects.  They seem to outright break Bio::FlatFile.new(Bio::GenBank, 'file') and confuse the Bio::GenBank.open('file') function.  My question is, what should I do?  Write my own parser, try to fiddle with the BioRuby implementation, or something else entirely?  I'm fairly new to Ruby, but I've been programming for a long time.
> > ~josh
> > 
> > P.S.  Here is a short section of what the RAST GenBank file looks like (just a single short contig):
> > LOCUS       ctg7180000000028         4191 bp    DNA     linear   UNK
> > DEFINITION  Contig ctg7180000000028 from Atopobium vaginae B758
> > ACCESSION   unknown
> > FEATURES             Location/Qualifiers
> >      source          1..4191
> >                      /mol_type="genomic DNA"
> >                      /db_xref="taxon: 82135"
> >                      /genome_md5=""
> >                      /project="earl_82135"
> >                      /genome_id="82135.3"
> >                      /organism="Atopobium vaginae B758"
> >      CDS             complement(10..1740)
> >                      /translation="MKLAQLKMVCRGENAGFACIQLCEKPALLKVHAHTKDTNMPCPA
> >                      RLVCLDELYGVRSSIDDAATGSWWVVIIPLLSVDCVVELSITRASQGLSSDTWSFVFG
> >                      PHTSRYMSRLLTLRHPQAAALLRRIVHSAAYVHHQLNLIGIWNAAAQTSDSPDVPMRI
> >                      WRFEARFTCDNSPAFEYFPISCCVLSSTGEPIRARVITLEEQTAAVPDDSAECVRRAV
> >                      FSIALPHACTHAVVCARLNFKHVASLSCDGGDRARALQKASASTYEAFYTIFPAAAAA
> >                      RIAEAERFSRDCAHDPHYERWFDEHAATSEQCAMQTRRYEEACACMEHREESTTHADQ
> >                      PAQPVQLAQPAQPAQTSFDDALAHMGISVVLPVFSASTTLLARSINAMIHQSFPAWQL
> >                      IVLDCTHMRTPNQQTDIARWLHSYTKTDARIMYVRMNVEQSQKNQAVAGESLSDSGAH
> >                      AAGVAQPTDDIIDASRDHYAHHSSYLAYACSFIQNPYVYIMSEGAAPTPDALWHIAQT
> >                      VAQHIAKGTPCDVVHVDEDELTPQGCTKPHVSYAASMIGLEGTNYLGHSLVLRTALLD
> >                      ELRAPCDVAT"
> >                      /product="hypothetical protein"
> >      CDS             complement(1759..1875)
> >                      /translation="MPRMYNAHADKVLQKRSKRRCTRLAPAEPHVVRVLLNL"
> >                      /product="hypothetical protein"
> >      CDS             complement(1844..2461)
> >                      /db_xref="GO:0008830"
> >                      /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET
> >                      YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR
> >                      AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM
> >                      WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG"
> >                      /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC
> >                      5.1.3.13)"
> >                      /EC_number="5.1.3.13"
> >      CDS             complement(2586..2741)
> >                      /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC
> >                      VALCAFP"
> >                      /product="hypothetical protein"
> >      CDS             complement(2798..3193)
> >                      /translation="MSAILAIVPAYNEQQCIQQTIDELRRVCPGVDYLIVNDGSRDET
> >                      AAICRRRHFNYINVPINCGLASGVQAGMKYAERNGYSAVVQFDADGQHKPEYIVPMYE
> >                      HMQKTGADVVIGSRFVDDALLLVVCRHNC"
> >                      /product="Glycosyltransferase involved in cell wall
> >                      biogenesis (EC 2.4.-.-)"
> >      CDS             3238..3393
> >                      /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC
> >                      VALCAFP"
> >                      /product="hypothetical protein"
> >      CDS             3518..4135
> >                      /db_xref="GO:0008830"
> >                      /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET
> >                      YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR
> >                      AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM
> >                      WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG"
> >                      /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC
> >                      5.1.3.13)"
> >                      /EC_number="5.1.3.13"
> > BASE COUNT     1077 a   1055 c   1036 g   1023 t
> > ORIGIN
> >         1 aatcgcgctt catgttgcaa catcgcatgg cgcgcgaagt tcatccaaaa gcgcagttct
> >        61 caacacgagt gaatgtccaa gatagttagt gccttcaagc ccaatcatgc ttgctgcata
> >       121 actcacgtga ggctttgtgc agccttgggg cgtgagctca tcttcatcaa catgtacaac
> >       181 atcgcagggt gtaccttttg ctatgtgctg tgctaccgtt tgtgcaatat gccacagggc
> >       241 atcgggcgtg ggagctgccc cctcactcat aatgtaaacg tacgggtttt gtataaacga
> >       301 acatgcatac gcaagatacg agctatggtg cgcataatgg tcgcgagatg catcgataat
> >       361 atcatctgta ggctgtgcta cgccagcagc atgcgcgcct gaatctgaca atgattcccc
> >       421 agctacagct tggtttttct gtgattgttc cacgttcata cgcacataca taatgcgcgc
> >       481 atcggtctta gtatagctat gaagccagcg tgcaatatct gtttgttggt tgggagtgcg
> >       541 catgtgtgta caatcgagca cgataagctg ccatgccgga aaactctgat gtatcatcgc
> >       601 gttaatactg cgcgcaagca gtgtagtcga tgctgaaaaa acgggcagta ccaccgaaat
> >       661 acccatgtgt gcaagcgcat catcaaacga tgtctgtgca ggttgtgcag gttgcgcaag
> >       721 ctggacaggt tgtgcaggct ggtcagcatg cgtggtgctc tcttcgcggt gttccataca
> >       781 cgcgcacgcc tcttcgtacc tgcgtgtttg catagcgcac tgctcagacg tagctgcatg
> >       841 ctcatcaaac cagcgctcat agtgaggatc gtgagcgcaa tcgcgactaa aacgctcggc
> >       901 ctcagcaatg cgcgcggccg cagcagcagg aaaaatggta taaaaggctt cgtaggtaga
> >       961 tgcagacgct ttttgcaacg cgcgggctcg atctccaccg tcgcagctca acgatgccac
> >      1021 atgcttaaag ttgaggcgcg cgcacacaac agcatgcgtg cacgcgtgcg gcaatgcaat
> >      1081 cgaaaacacc gcacgacgaa cgcattcagc gctgtcatcc ggaacagctg ccgtttgctc
> >      1141 ttctagcgta attacacgtg cacgtatggg ctcacctgta ctgctcagca cgcagcagct
> >      1201 tataggaaaa tattcaaacg caggcgaatt gtcgcaggta aaccgcgctt caaatcgcca
> >      1261 tatacgcatc ggcacatcgg gagagtcgct tgtttgcgca gcggcattcc aaataccaat
> >      1321 aagattaagc tgatggtgca catatgcagc actgtgaacg atgcggcgta gcaacgcggc
> >      1381 cgcttgagga tggcggagcg taagtaaacg cgacatgtag cgcgacgtat gaggaccaaa
> >      1441 aacaaacgac cacgtatcag aactcagacc ctgcgacgca cgtgttatgc tgagctcaac
> >      1501 cacacaatca acgctcaaca gcggaataat aacaacccac cacgaccccg ttgcagcatc
> >      1561 atcaatgctc gaacgaacgc catagagctc gtctaaacac acaagacgcg caggacacgg
> >      1621 catattcgta tctttggtat gtgcatgcac tttaagcaat gcgggctttt cgcacagctg
> >      1681 aatgcaggca aaccctgcgt tttcgccgcg gcaaaccatt tttaattgcg cgagtttcat
> >      1741 gcagtcccct tactgttgtt aaagattaag cagcacgcgc acaacatgcg gctctgcggg
> >      1801 cgcaagtcgt gtgcagcgcc tttttgagcg cttttggagc accttatccg cgtgtgcgtt
> >      1861 gtacatacgc ggcaaacgat tcatgatggg tatctttatc agacaaaata acctgcgagg
> >      1921 gcgaaaagct atcgcagcca caaggcgcag gccagctaat agcaagatcg ggatcgttcc
> >      1981 acataaggcc accttcatcg cctggatgat acacgtcgtc gcacttatag caaaattctg
> >      2041 cctcatctga gagtacaaaa aatccgtgag caaagccgcg cggtatatag aattgtcgat
> >      2101 gattttcggc cgataattca acgccttccc atgcaccaaa ggtctctgaa cctgcgcgca
> >      2161 agtctaccgc aacatcaaac acacagccac gcacaacacg aacgagtttt gcttgagggt
> >      2221 gttcaatctg aaaatgcagg ccacgaagca cgccttttgt ggagctcgat tggttatcct
> >      2281 gtacaaacgt agtagaaata ccacccgcag caaaatcgga tgctttgtac gtttccataa
> >      2341 agtacccgcg cgcgtcacca tactgtttgg tatcaacaat aataacgtcg cgaatagatg
> >      2401 taggtgtaaa aataaaattg cccgattgaa tagcctgaga tgtttctgta cgctgtgtca
> >      2461 taccttaact cctttagcgc gcgctccttt agcgcacaaa tatgcgctaa cgaattgtgc
> >      2521 aatacctgca acggtttatt tcattgtagc gcacgaatat acgcctacga tatgttcata
> >      2581 caatgctacg gaaatgcaca cagcgcgaca caccgtgcgc ccgcaggcgc atctgccagt
> >      2641 gcaatcttcc ttgccgccat gctacacgaa agaacaagtg cgcgccccaa tgatatgcag
> >      2701 atttgctcaa aaatagaggt atacgtgctc aaagtacgca atggtgcggt atacttgcac
> >      2761 agatatacca actttatgga gaactatgtc tgcaatatta gcaattgtgc ctgcatacaa
> >      2821 cgagcagcag tgcatcgtca acaaaacgcg aaccaataac aacgtcggca cctgtttttt
> >      2881 gcatgtgctc gtacatgggc acgatgtact cgggtttgtg ctgaccatcg gcatcaaatt
> >      2941 gcacaactgc agaatagcca ttgcgctctg catatttcat gccagcttga acgcccgaag
> >      3001 caaggccgca gttaatgggc acatttatgt agttaaaatg gcgcctacgg catatagctg
> >      3061 cggtttcgtc gcgcgagccg tcgtttacaa taaggtagtc tacgccgggg cacacgcggc
> >      3121 gcagctcatc gattgtttgt tgtatgcact gctgctcgtt gtatgcaggc acaattgcta
> >      3181 atattgcaga catagttctc cataaagttg gtatatctgt gcaagtatac cgcaccattg
> >      3241 cgtactttga gcacgtatac ctctattttt gagcaaatct gcatatcatt ggggcgcgca
> >      3301 cttgttcttt cgtgtagcat ggcggcaagg aagattgcac tggcagatgc gcctgcgggc
> >      3361 gcacggtgtg tcgcgctgtg tgcatttccg tagcattgta tgaacatatc gtaggcgtat
> >      3421 attcgtgcgc tacaatgaaa taaaccgttg caggtattgc acaattcgtt agcgcatatt
> >      3481 tgtgcgctaa aggagcgcgc gctaaaggag ttaaggtatg acacagcgta cagaaacatc
> >      3541 tcaggctatt caatcgggca attttatttt tacacctaca tctattcgcg acgttattat
> >      3601 tgttgatacc aaacagtatg gtgacgcgcg cgggtacttt atggaaacgt acaaagcatc
> >      3661 cgattttgct gcgggtggta tttctactac gtttgtacag gataaccaat cgagctccac
> >      3721 aaaaggcgtg cttcgtggcc tgcattttca gattgaacac cctcaagcaa aactcgttcg
> >      3781 tgttgtgcgt ggctgtgtgt ttgatgttgc ggtagacttg cgcgcaggtt cagagacctt
> >      3841 tggtgcatgg gaaggcgttg aattatcggc cgaaaatcat cgacaattct atataccgcg
> >      3901 cggctttgct cacggatttt ttgtactctc agatgaggca gaattttgct ataagtgcga
> >      3961 cgacgtgtat catccaggcg atgaaggtgg ccttatgtgg aacgatcccg atcttgctat
> >      4021 tagctggcct gcgccttgtg gctgcgatag cttttcgccc tcgcaggtta ttttgtctga
> >      4081 taaagatacc catcatgaat cgtttgccgc gtatgtacaa cgcacacgcg gataaggtgc
> >      4141 tccaaaagcg ctcaaaaagg cgctgcacac gacttgcgcc cggcagagcc g
> > //
> > Center for Genomic Sciences
> > (412)-359-8341 		 	   		  
> > 
> > 
> > 
> > _______________________________________________
> > BioRuby mailing list
> > BioRuby at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioruby
> 
> End of BioRuby Digest, Vol 84, Issue 6
> **************************************

 		 	   		  


