[BioRuby] Genbank file parsing question

Josh Earl joshearl1 at hotmail.com
Thu Sep 13 17:50:34 UTC 2012


Hello all,
I'm trying to use BioRuby to write a small program that will let me rearrange the contigs in a GenBank file (based on a list of contig names, Mauve output, or whatever).  The annotation service that we use (RAST - http://rast.nmpdr.org ) produces GenBank files, but not in exactly the format that BioRuby expects.  They seem to outright break Bio::FlatFile.new(Bio::GenBank, 'file') and confuse Bio::GenBank.open('file').  My question is: what should I do?  Write my own parser, try to fiddle with the BioRuby implementation, or something else entirely?  I'm fairly new to Ruby, but I've been programming for a long time.
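
To make the "write my own parser" option concrete: if the goal is only to reorder contigs, the records can be treated as opaque chunks of text and never parsed in detail. The following is a minimal sketch in plain Ruby (no BioRuby calls). It assumes the file is line-structured, that every record has a LOCUS line whose second whitespace-separated field is the contig name, and that every record ends with a line containing only "//". The method names are made up for illustration, and whether real RAST output satisfies these assumptions has not been checked here.

```ruby
# Reorder the records of a multi-record GenBank file by contig name.
# Sketch only: records are kept as raw text, nothing inside them is interpreted
# except the LOCUS line. All method names here are hypothetical.

# Split GenBank text into records. A record is assumed to end with a line
# that contains only "//".
def split_genbank_records(text)
  records = []
  current = []
  text.each_line do |line|
    current << line
    if line.strip == "//"
      records << current.join
      current = []
    end
  end
  # Keep a trailing chunk that lacks a terminator, unless it is only whitespace.
  leftover = current.join
  records << leftover unless leftover.strip.empty?
  records
end

# Contig name = second whitespace-separated field of the LOCUS line (assumption).
def locus_name(record)
  locus = record.each_line.find { |line| line.start_with?("LOCUS") }
  locus && locus.split[1]
end

# Return the records in the order given by wanted_names. Records that are not
# listed are appended afterwards in their original order.
def reorder_records(records, wanted_names)
  by_name = {}
  records.each { |record| by_name[locus_name(record)] = record }

  missing = wanted_names.reject { |name| by_name.key?(name) }
  raise ArgumentError, "unknown contigs: #{missing.join(', ')}" unless missing.empty?

  listed = wanted_names.map { |name| by_name[name] }
  rest = records.reject { |record| wanted_names.include?(locus_name(record)) }
  listed + rest
end
```

Given a text file with one contig name per line, something like `File.write(out, reorder_records(split_genbank_records(File.read(gb)), File.readlines(list).map(&:strip)).join)` would write the reordered file, where `gb`, `list` and `out` are placeholder paths. Because the record text is passed through untouched, any RAST-specific quirks in the header or feature table are preserved rather than "corrected".
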
~josh

P.S.  Here is a short section of what the RAST GenBank file looks like (just a single short contig):
LOCUS       ctg7180000000028         4191 bp    DNA     linear   UNK
DEFINITION  Contig ctg7180000000028 from Atopobium vaginae B758
ACCESSION   unknown
FEATURES             Location/Qualifiers
     source          1..4191
                     /mol_type="genomic DNA"
                     /db_xref="taxon: 82135"
                     /genome_md5=""
                     /project="earl_82135"
                     /genome_id="82135.3"
                     /organism="Atopobium vaginae B758"
     CDS             complement(10..1740)
                     /translation="MKLAQLKMVCRGENAGFACIQLCEKPALLKVHAHTKDTNMPCPA
                     RLVCLDELYGVRSSIDDAATGSWWVVIIPLLSVDCVVELSITRASQGLSSDTWSFVFG
                     PHTSRYMSRLLTLRHPQAAALLRRIVHSAAYVHHQLNLIGIWNAAAQTSDSPDVPMRI
                     WRFEARFTCDNSPAFEYFPISCCVLSSTGEPIRARVITLEEQTAAVPDDSAECVRRAV
                     FSIALPHACTHAVVCARLNFKHVASLSCDGGDRARALQKASASTYEAFYTIFPAAAAA
                     RIAEAERFSRDCAHDPHYERWFDEHAATSEQCAMQTRRYEEACACMEHREESTTHADQ
                     PAQPVQLAQPAQPAQTSFDDALAHMGISVVLPVFSASTTLLARSINAMIHQSFPAWQL
                     IVLDCTHMRTPNQQTDIARWLHSYTKTDARIMYVRMNVEQSQKNQAVAGESLSDSGAH
                     AAGVAQPTDDIIDASRDHYAHHSSYLAYACSFIQNPYVYIMSEGAAPTPDALWHIAQT
                     VAQHIAKGTPCDVVHVDEDELTPQGCTKPHVSYAASMIGLEGTNYLGHSLVLRTALLD
                     ELRAPCDVAT"
                     /product="hypothetical protein"
     CDS             complement(1759..1875)
                     /translation="MPRMYNAHADKVLQKRSKRRCTRLAPAEPHVVRVLLNL"
                     /product="hypothetical protein"
     CDS             complement(1844..2461)
                     /db_xref="GO:0008830"
                     /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET
                     YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR
                     AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM
                     WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG"
                     /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC
                     5.1.3.13)"
                     /EC_number="5.1.3.13"
     CDS             complement(2586..2741)
                     /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC
                     VALCAFP"
                     /product="hypothetical protein"
     CDS             complement(2798..3193)
                     /translation="MSAILAIVPAYNEQQCIQQTIDELRRVCPGVDYLIVNDGSRDET
                     AAICRRRHFNYINVPINCGLASGVQAGMKYAERNGYSAVVQFDADGQHKPEYIVPMYE
                     HMQKTGADVVIGSRFVDDALLLVVCRHNC"
                     /product="Glycosyltransferase involved in cell wall
                     biogenesis (EC 2.4.-.-)"
     CDS             3238..3393
                     /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC
                     VALCAFP"
                     /product="hypothetical protein"
     CDS             3518..4135
                     /db_xref="GO:0008830"
                     /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET
                     YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR
                     AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM
                     WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG"
                     /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC
                     5.1.3.13)"
                     /EC_number="5.1.3.13"
BASE COUNT     1077 a   1055 c   1036 g   1023 t
ORIGIN
        1 aatcgcgctt catgttgcaa catcgcatgg cgcgcgaagt tcatccaaaa gcgcagttct
       61 caacacgagt gaatgtccaa gatagttagt gccttcaagc ccaatcatgc ttgctgcata
      121 actcacgtga ggctttgtgc agccttgggg cgtgagctca tcttcatcaa catgtacaac
      181 atcgcagggt gtaccttttg ctatgtgctg tgctaccgtt tgtgcaatat gccacagggc
      241 atcgggcgtg ggagctgccc cctcactcat aatgtaaacg tacgggtttt gtataaacga
      301 acatgcatac gcaagatacg agctatggtg cgcataatgg tcgcgagatg catcgataat
      361 atcatctgta ggctgtgcta cgccagcagc atgcgcgcct gaatctgaca atgattcccc
      421 agctacagct tggtttttct gtgattgttc cacgttcata cgcacataca taatgcgcgc
      481 atcggtctta gtatagctat gaagccagcg tgcaatatct gtttgttggt tgggagtgcg
      541 catgtgtgta caatcgagca cgataagctg ccatgccgga aaactctgat gtatcatcgc
      601 gttaatactg cgcgcaagca gtgtagtcga tgctgaaaaa acgggcagta ccaccgaaat
      661 acccatgtgt gcaagcgcat catcaaacga tgtctgtgca ggttgtgcag gttgcgcaag
      721 ctggacaggt tgtgcaggct ggtcagcatg cgtggtgctc tcttcgcggt gttccataca
      781 cgcgcacgcc tcttcgtacc tgcgtgtttg catagcgcac tgctcagacg tagctgcatg
      841 ctcatcaaac cagcgctcat agtgaggatc gtgagcgcaa tcgcgactaa aacgctcggc
      901 ctcagcaatg cgcgcggccg cagcagcagg aaaaatggta taaaaggctt cgtaggtaga
      961 tgcagacgct ttttgcaacg cgcgggctcg atctccaccg tcgcagctca acgatgccac
     1021 atgcttaaag ttgaggcgcg cgcacacaac agcatgcgtg cacgcgtgcg gcaatgcaat
     1081 cgaaaacacc gcacgacgaa cgcattcagc gctgtcatcc ggaacagctg ccgtttgctc
     1141 ttctagcgta attacacgtg cacgtatggg ctcacctgta ctgctcagca cgcagcagct
     1201 tataggaaaa tattcaaacg caggcgaatt gtcgcaggta aaccgcgctt caaatcgcca
     1261 tatacgcatc ggcacatcgg gagagtcgct tgtttgcgca gcggcattcc aaataccaat
     1321 aagattaagc tgatggtgca catatgcagc actgtgaacg atgcggcgta gcaacgcggc
     1381 cgcttgagga tggcggagcg taagtaaacg cgacatgtag cgcgacgtat gaggaccaaa
     1441 aacaaacgac cacgtatcag aactcagacc ctgcgacgca cgtgttatgc tgagctcaac
     1501 cacacaatca acgctcaaca gcggaataat aacaacccac cacgaccccg ttgcagcatc
     1561 atcaatgctc gaacgaacgc catagagctc gtctaaacac acaagacgcg caggacacgg
     1621 catattcgta tctttggtat gtgcatgcac tttaagcaat gcgggctttt cgcacagctg
     1681 aatgcaggca aaccctgcgt tttcgccgcg gcaaaccatt tttaattgcg cgagtttcat
     1741 gcagtcccct tactgttgtt aaagattaag cagcacgcgc acaacatgcg gctctgcggg
     1801 cgcaagtcgt gtgcagcgcc tttttgagcg cttttggagc accttatccg cgtgtgcgtt
     1861 gtacatacgc ggcaaacgat tcatgatggg tatctttatc agacaaaata acctgcgagg
     1921 gcgaaaagct atcgcagcca caaggcgcag gccagctaat agcaagatcg ggatcgttcc
     1981 acataaggcc accttcatcg cctggatgat acacgtcgtc gcacttatag caaaattctg
     2041 cctcatctga gagtacaaaa aatccgtgag caaagccgcg cggtatatag aattgtcgat
     2101 gattttcggc cgataattca acgccttccc atgcaccaaa ggtctctgaa cctgcgcgca
     2161 agtctaccgc aacatcaaac acacagccac gcacaacacg aacgagtttt gcttgagggt
     2221 gttcaatctg aaaatgcagg ccacgaagca cgccttttgt ggagctcgat tggttatcct
     2281 gtacaaacgt agtagaaata ccacccgcag caaaatcgga tgctttgtac gtttccataa
     2341 agtacccgcg cgcgtcacca tactgtttgg tatcaacaat aataacgtcg cgaatagatg
     2401 taggtgtaaa aataaaattg cccgattgaa tagcctgaga tgtttctgta cgctgtgtca
     2461 taccttaact cctttagcgc gcgctccttt agcgcacaaa tatgcgctaa cgaattgtgc
     2521 aatacctgca acggtttatt tcattgtagc gcacgaatat acgcctacga tatgttcata
     2581 caatgctacg gaaatgcaca cagcgcgaca caccgtgcgc ccgcaggcgc atctgccagt
     2641 gcaatcttcc ttgccgccat gctacacgaa agaacaagtg cgcgccccaa tgatatgcag
     2701 atttgctcaa aaatagaggt atacgtgctc aaagtacgca atggtgcggt atacttgcac
     2761 agatatacca actttatgga gaactatgtc tgcaatatta gcaattgtgc ctgcatacaa
     2821 cgagcagcag tgcatcgtca acaaaacgcg aaccaataac aacgtcggca cctgtttttt
     2881 gcatgtgctc gtacatgggc acgatgtact cgggtttgtg ctgaccatcg gcatcaaatt
     2941 gcacaactgc agaatagcca ttgcgctctg catatttcat gccagcttga acgcccgaag
     3001 caaggccgca gttaatgggc acatttatgt agttaaaatg gcgcctacgg catatagctg
     3061 cggtttcgtc gcgcgagccg tcgtttacaa taaggtagtc tacgccgggg cacacgcggc
     3121 gcagctcatc gattgtttgt tgtatgcact gctgctcgtt gtatgcaggc acaattgcta
     3181 atattgcaga catagttctc cataaagttg gtatatctgt gcaagtatac cgcaccattg
     3241 cgtactttga gcacgtatac ctctattttt gagcaaatct gcatatcatt ggggcgcgca
     3301 cttgttcttt cgtgtagcat ggcggcaagg aagattgcac tggcagatgc gcctgcgggc
     3361 gcacggtgtg tcgcgctgtg tgcatttccg tagcattgta tgaacatatc gtaggcgtat
     3421 attcgtgcgc tacaatgaaa taaaccgttg caggtattgc acaattcgtt agcgcatatt
     3481 tgtgcgctaa aggagcgcgc gctaaaggag ttaaggtatg acacagcgta cagaaacatc
     3541 tcaggctatt caatcgggca attttatttt tacacctaca tctattcgcg acgttattat
     3601 tgttgatacc aaacagtatg gtgacgcgcg cgggtacttt atggaaacgt acaaagcatc
     3661 cgattttgct gcgggtggta tttctactac gtttgtacag gataaccaat cgagctccac
     3721 aaaaggcgtg cttcgtggcc tgcattttca gattgaacac cctcaagcaa aactcgttcg
     3781 tgttgtgcgt ggctgtgtgt ttgatgttgc ggtagacttg cgcgcaggtt cagagacctt
     3841 tggtgcatgg gaaggcgttg aattatcggc cgaaaatcat cgacaattct atataccgcg
     3901 cggctttgct cacggatttt ttgtactctc agatgaggca gaattttgct ataagtgcga
     3961 cgacgtgtat catccaggcg atgaaggtgg ccttatgtgg aacgatcccg atcttgctat
     4021 tagctggcct gcgccttgtg gctgcgatag cttttcgccc tcgcaggtta ttttgtctga
     4081 taaagatacc catcatgaat cgtttgccgc gtatgtacaa cgcacacgcg gataaggtgc
     4141 tccaaaagcg ctcaaaaagg cgctgcacac gacttgcgcc cggcagagcc g
//
Center for Genomic Sciences
(412)-359-8341 		 	   		  


