[BioRuby] Parsing a file in Swissprot format

n at bioruby.org n at bioruby.org
Fri Dec 2 11:16:22 EST 2005


Hi Urban,


Thank you for the error report. It seems, however, that your input data
is not in SwissProt format.


(1)
You can find the SwissProt SQ line format at 
<http://www.expasy.ch/sprot/userman.html#SQ_line>, and the sequence
data line format at
<http://www.expasy.ch/sprot/userman.html#Seq_line>.
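
As a quick illustration of the difference, here is a minimal plain-Ruby
sketch that checks whether an entry contains a SwissProt-style SQ header
line ("SQ   SEQUENCE   <length> AA; ..."). The helper name
swissprot_sq_header? and the sample strings are hypothetical and are not
part of BioRuby; the MW and CRC64 values in the sample are made-up
placeholders.

```ruby
# Hypothetical helper, not part of BioRuby.
# A SwissProt SQ header line looks like:
#   SQ   SEQUENCE   256 AA;  29735 MW;  B4840739BF7D4121 CRC64;
# and the residues follow on separate lines that start with spaces.
SQ_HEADER = /^SQ\s+SEQUENCE\s+\d+\s+AA;/

def swissprot_sq_header?(entry_text)
  entry_text.each_line.any? { |line| line =~ SQ_HEADER }
end

# SQ lines as in the posted file: residues directly after the SQ tag.
swissprot_like = "SQ   AAGCTTAATG\nSQ   ACTTTAGAGTC\n"

# SwissProt-style layout (placeholder MW and CRC64 values).
swissprot_style = "SQ   SEQUENCE   11 AA;  1234 MW;  0123456789ABCDEF CRC64;\n" \
                  "     MKTAYIAKQR Q\n"

puts swissprot_sq_header?(swissprot_like)   # false
puts swissprot_sq_header?(swissprot_style)  # true
```

A file that fails this check carries its sequence directly on the SQ
lines, which is why the stock parser reports "Invalid SQ Line".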


(2)
You can write a parser class for your SwissProt-like format, like this:

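  require 'bio/db'
  require 'bio/db/embl/common'

  module Bio
    # Parser for SwissProt-like flat files whose SQ lines carry the
    # sequence itself instead of a SwissProt SQ header line.
    class SwissProtLike < EMBLDB
      include Bio::EMBLDB::Common

      # Returns the sequence with spaces and digits removed.
      # Note: the posted entry is nucleotide data, so
      # Bio::Sequence::NA may be the more appropriate class.
      def sq
        Bio::Sequence::AA.new( fetch('SQ').gsub(/ |\d+/, '') )
      end
    end
  end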


Save the code above as "swissprotlike.rb"; you can then use it to parse
your files like this:

  require 'swissprotlike'
  entry = Bio::SwissProtLike.new( File.read("entry.file") )
  entry.accession #=> "SM0000001"
  entry.sq #=> "AAGCTTAATGTATATAATCTTTTAGAGGTAAAATCTACAGCCAGCAAAAGTCATGGTAAATATTCTTTGACTGAACTCTCACTAAACTCCTCTAAATTATATGTCATATTAACTGGTTAAATTAATATAAATTTGTGACATGACCTTAACTGGTTAGGTAGGATATTTTTCTTCATGCAAAAATATGACTAATAATAATTTAGCACAAAAATATTTCCCAATACTTTAATTCTGTGATAGAAAAATGTTTAACTCAGCTACTATAATCCCATAATTTTGAAAACTATTTATTAGCTTTTGTGTTTGACCCTTCCCTAGCCAAAGGCAACTATTTAAGGACCCTTTAAAACTCTTGAAACTACTTTAGAGTC"
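
If BioRuby is not at hand, the same idea can be sketched in plain Ruby.
This is only an approximation of what the sq method above does, on the
assumption that fetch('SQ') yields the content of the SQ-tagged lines
without the tag; extract_sq and the shortened sample entry are
hypothetical.

```ruby
# Hypothetical stand-alone helper, not part of BioRuby.
# Collects the lines tagged "SQ", drops the 5-character tag column,
# and removes whitespace and digits from what remains.
def extract_sq(entry_text)
  entry_text.each_line
            .select { |line| line.start_with?('SQ ') }
            .map { |line| line[5..-1].to_s }
            .join
            .gsub(/\s|\d+/, '')
end

# Shortened sample in the layout of the posted file.
sample = <<~ENTRY
  AC   SM0000001
  XX
  SQ   AAGCTTAATGTATATAATCT
  SQ   ACTTTAGAGTC
  SC   [7]
  //
ENTRY

puts extract_sq(sample)  # AAGCTTAATGTATATAATCTACTTTAGAGTC
```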



Best,
Mitsuteru 
-
Mitsuteru Nakao                            <nakao-mitsuteru at aist.go.jp>
                                         <http://seq.cbrc.jp/%7Enakao/>
Sequence Analysis Team                             <http://seq.cbrc.jp>
Computational Biology Research Center (CBRC)       <http://www.cbrc.jp>
National Institute of Advanced Industrial Science and Technology (AIST)


BioRuby Project <http://bioruby.org>
b-Src <http://b-Src.cbrc.jp/markup>



From: Urban Hafner <urban at bettong.net>
Subject: [BioRuby] Parsing a file in Swissprot format
Date: Fri, 02 Dec 2005 14:00:01 +0100

> Hej everybody,
> I'm new to BioRuby and I think I'm doing something wrong while parsing a
> file in Swissprot format. What I'm trying to do is to get the sequence
> out of it. I do it like this:
> 
> sequence = Bio::SPTR.new(File.new(f).read)
> p sequence.sq
> 
> But that doesn't work; it gives me this error message:
> 
> /home/users/hafner/lib/site_ruby/1.8/bio/db/embl/sptr.rb:706:in `sq':
> Invalid SQ Line:  (RuntimeError)
> 'AAGCTTAATGTATATAATCTTTTAGAGGTAAAATCTACAGCCAGCAAAAGTCATGGTAAA
> TATTCTTTGACTGAACTCTCACTAAACTCCTCTAAATTATATGTCATATTAACTGGTTAA
> ATTAATATAAATTTGTGACATGACCTTAACTGGTTAGGTAGGATATTTTTCTTCATGCAA
> AAATATGACTAATAATAATTTAGCACAAAAATATTTCCCAATACTTTAATTCTGTGATAG
> AAAAATGTTTAACTCAGCTACTATAATCCCATAATTTTGAAAACTATTTATTAGCTTTTG
> TGTTTGACCCTTCCCTAGCCAAAGGCAACTATTTAAGGACCCTTTAAAACTCTTGAAACT
> ACTTTAGAGTC'     from diplomarbeit/tools/smartdb-entries-without-
> sequence.rb:10
> 
> I'm not sure if this is BioRuby's (I'm using the version from CVS) fault
> or if the input file is faulty.
> 
> Does anybody have a clue what I'm doing wrong here?
> 
> Cheers, Urban
> 
> Here's my input file:
> 
> AC   SM0000001
> XX   
> DT   1.1.1999 00:00:00 (created); ili
> DT   8.12.2004 12:49:00 (updated); ili2
> XX   
> NA   MOUSE$kappa-MAR
> XX   
> OS   mouse, Mus spec.
> OC   eukaryota; animalia; metazoa; chordata; vertebrata;
> OC   tetrapoda; mammalia; eutheria; rodentia; myomorpha; muridae;
> OC   murinae
> XX   
> HO   human, rabbit [2]
> XX   
> SZ   371 bp
> XX   
> DE   G000538; immunoglobulin kappa light chain
> DP   Direction: 3'; Pos 1: ATG
> DN   Internal: y; 
> DC   between joining and constant regions [1]; ~200 bp
> DC   upstream of the kappa enhancer [1]
> XX   
> SQ   AAGCTTAATGTATATAATCTTTTAGAGGTAAAATCTACAGCCAGCAAAAGTCATGGTAAA
> SQ   TATTCTTTGACTGAACTCTCACTAAACTCCTCTAAATTATATGTCATATTAACTGGTTAA
> SQ   ATTAATATAAATTTGTGACATGACCTTAACTGGTTAGGTAGGATATTTTTCTTCATGCAA
> SQ   AAATATGACTAATAATAATTTAGCACAAAAATATTTCCCAATACTTTAATTCTGTGATAG
> SQ   AAAAATGTTTAACTCAGCTACTATAATCCCATAATTTTGAAAACTATTTATTAGCTTTTG
> SQ   TGTTTGACCCTTCCCTAGCCAAAGGCAACTATTTAAGGACCCTTTAAAACTCTTGAAACT
> SQ   ACTTTAGAGTC
> SC   [7]
> XX   
> FT   2 - 11: cleavage by topoisomerase II [3]
> FT   2 - 15: deleted in plasmacytoma PC 7183 [3]
> FT   5 - 14: cleavage by topoisomerase II [3]
> FT   5 - 14: 5'-recombination junction [3]
> FT   8 - 17: cleavage by topoisomerase II [3]
> FT   10 - 19: cleavage by topoisomerase II [3]
> FT   32 - 41: cleavage by topoisomerase II [3]
> FT   53 - 62: cleavage by Drosophila topoisomerase II only
> FT   [3]
> FT   68 - 77: cleavage by topoisomerase II [3]
> FT   69 - 78: cleavage by topoisomerase II [3]
> FT   73 - 82: cleavage by topoisomerase II [3]
> FT   98 - 107: cleavage by Drosophila topoisomerase II only
> FT   [3]
> FT   147 - 156: cleavage by topoisomerase II [3]
> FT   163 - 284: confers MAR-like features upon any DNA when
> FT   contiguously reiterated in the same molecule
> FT   [7]
> FT   164 - 170: similar motif found in human PARP MAR
> FT   SM0000116 [8]
> FT   182 - 191: cleavage by topoisomerase II [3]
> FT   189 - 198: cleavage by topoisomerase II [3]
> FT   219 - 228: cleavage by topoisomerase II [3]
> FT   242 - 251: cleavage by topoisomerase II [3]
> FT   248 - 257: cleavage by topoisomerase II [3]
> FT   253 - 253: G in [3]
> FT   256 - 265: cleavage by topoisomerase II [3]
> XX   
> SF   topoisomerase II sites [1]; AT-rich sites [1];
> SF   contains a breakpoint for chromosomal translocation [3];
> SF   several short stretches of homopolymeric adenine or
> SF   thymine [7]
> XX   
> BP   75% [J. Bode, direct submission]; 20% [7]
> TP   constitutive [1]
> XX   
> FF   prototype of a S/MAR; contributes to maximal expression of
> FF   the kappa gene [2]; contributes to hypermutation [9];
> FF   contributes to kappa expression as shown by flow cytometic
> FF   assay, but has little effect on accumulation of the
> FF   respective mRNA [9]
> XX   
> CP   liver, kidney, spleen, thymus, MPC-11, P-815, L-cell [1]
> XX   
> EV   in vitro selection of S/MAR 
> EC   [J. Bode, direct submission]
> XX   
> BF   SB000002; lamin A [6]
> MM   nitrocellulose filter binding; 
> SO   rl; rat
> QA   6
> BF   SB000003; lamin B1 [6]
> MM   nitrocellulose filter binding; 
> SO   rl; rat
> QA   6
> BF   SB000004; lamin C [6]
> MM   nitrocellulose filter binding; 
> SO   rl; rat
> QA   6
> BF   SB000018; SP120 [4]
> MM   nitrocellulose filter binding; 
> SO   brain; rat
> QA   6
> BF   SB000018; SP120 [4]
> MM   southwestern blotting; 
> SO   brain; rat
> QA   6
> BF   SB000022; topoisomerase II [3]
> MM   gel retardation; 
> SO   Drosophila; Drosophila melanogaster
> QA   6
> BF   SB000022; topoisomerase II [3]
> MM   topoisomerase II cleavage assay; 
> SO   Drosophila; Drosophila melanogaster
> QA   6
> BF   SB000043; topoisomerase II [3]
> MM   topoisomerase II cleavage assay; 
> SO   calf; calf
> QA   6
> BF   SB000045; SMI1 [5]
> MM   functional analysis; 
> PR   254 bp fragment
> SO   yeast, extract; baker's yeast, Saccharomyces cerevisiae
> QA   6
> BF   SB000052; topoisomerase II [3]
> MM   topoisomerase II cleavage assay; 
> SO   mouse; mouse
> QA   6
> BF   SB000053; topoisomerase II [3]
> MM   nitrocellulose filter binding; 
> SO   HeLa; human
> QA   6
> BF   SB000067; SMAR1 [10]
> MM   gel shift competition; 
> SO   rec(mouse-E.coli); mouse
> QA   6
> BF   SB000077; SAF-A [12]
> MM   supershift (antibody binding); 
> SO   liver; mouse
> QA   6
> BF   SB000077; SAF-A [12]
> MM   southwestern blotting; 
> SO   liver; mouse
> QA   6
> XX   
> RN   [1]
> RX   MEDLINE; 86106203 PubMed; 3002631
> RA   Cockerill, P. N., Garrard, W. T.
> RT   Chromosomal loop anchorage of the kappa immunoglobin gene
> RT   occurs next to the enhancer in a region containing
> RT   topoisomerase II sites
> RL   Cell 44:273-282 (1986)
> RN   [2]
> RX   MEDLINE; 90078219 PubMed; 2512290
> RA   Blasquez, V. C., Xu, M., Moses, S. C., Garrard, W. T.
> RT   Immunoglobulin kappa gene expression after stable
> RT   integration. I. Role of the intronic MAR and enhancer in
> RT   plasmacytoma cells
> RL   J. Biol. Chem. 264:21183-21189 (1989)
> RN   [3]
> RX   MEDLINE; 89315824 PubMed; 2546156
> RA   Sperry, A. O., Blasquez, V. C., Garrard, W. T.
> RT   Dysfunction of chromosomal llop attachment sites:
> RT   Illegitimate recombination linked to matrix association
> RT   regions and topoisomerase II
> RL   Proc. Natl. Acad. Sci. USA 86:5497-5501 (1989)
> RN   [4]
> RX   MEDLINE; 93286136 PubMed; 8509422
> RA   Tsutsui, K., Tsutsui, K., Okada, S., Watarai, S., Seki, S.,
> RA   Yasuda, T., Shohmori, T.
> RT   Identification and characterization of a nuclear scaffold
> RT   protein that binds the matrix attachment region DNA
> RL   J. Biol. Chem. 268:12886-12894 (1993)
> RN   [5]
> RX   MEDLINE; 93296190 PubMed; 8516310
> RA   Fishel, B. R., Sperry, A. O., Garrard, W. T.
> RT   Yeast calmodulin and a conserved nuclear protein
> RT   participate in the in vivo binding of a matrix associated
> RT   region
> RL   Proc. Natl. Acad. Sci. USA 90:5623-5627 (1993)
> RN   [6]
> RX   MEDLINE; 94344140 PubMed; 8065361
> RA   Luderus, M. E. E., den Blaauwen, J. L., de Smit, O. J. B.,
> RA   Compton, D. A., van Driel, R.
> RT   Binding of matrix attachment regions to lamin polymers
> RT   involves single-stranded regions and the minor groove
> RL   Mol. Cell. Biol. 14:6297-6305 (1994)
> RN   [7]
> RX   MEDLINE; 96222527 PubMed; 8670229
> RA   Okada, S., Tsutsui, K., Tsutsui, K., Seki, S., Shohmori, T.
> RT   Subdomain structure of the matrix attachment region located
> RT   within the mouse immunoglobulin kappa gene intron
> RL   Biochem. Biophys. Res. Commun. 222:472-477 (1996)
> RN   [8]
> RA   Boulikas, T., Kong, C. F., Brooks, D., Hsie, L.
> RT   The 3' untranslated region of the human
> RT   poly(ADP-ribose)polymerase gene is a nuclear matrix
> RT   anchoring site
> RL   Int. J. Oncol. 9:1287-1294 (1996)
> RN   [9]
> RX   MEDLINE; 97377037 PubMed; 9233808
> RA   Goyenechea, B., Klix, N., Williams, G. T., Riddell, A.,
> RA   Neuberger, M. S., Milstein, C.
> RT   Cells strongly expressing Ig kappa transgenes show clonal
> RT   recruitment of hypermutation: a role for both MAR and the
> RT   enhancers
> RL   EMBO J. 16:3987-3994 (1997)
> RN   [10]
> RX   MEDLINE; 20408892 PubMed; 10950932
> RA   Chattopadhyay, S., Kaul, R., Charest, A., Housman, D.,
> RA   Chen, J.
> RT   SMAR1, a novel, alternatively spliced gene product, binds
> RT   the scaffold/matrix-associated region at the T cell
> RT   receptor beta locus
> RL   Genomics 68:93-96 (2000)
> RN   [11]
> RX   MEDLINE; 20496822 PubMed; 11041885
> RA   Morisawa, G., Han-yama, A., Moda, I., Tamai, A., Iwabuchi,
> RA   M., Meshi, T.
> RT   AHM1, a novel type of nuclear matrix-localized, MAR binding
> RT   protein with a single AT hook and a J
> RT   domain-homologous region
> RL   Plant Cell 12:1903-1916 (2000)
> RN   [12]
> RX   MEDLINE; 21456956 PubMed; 11573239
> RA   Lobov, I. B., Tsutsui, K., Mitchell, A. R., Podgornaya, O.
> RA   I.
> RT   Specificity of SAF-A and lamin B binding in vitro
> RT   correlates with the satellite DNA bending state
> RL   J. Cell. Biochem. 83:218-229 (2001)
> //
> 