[Biojava-l] bioJava & swissProt parsing

Matthew Pocock matthew_pocock@yahoo.co.uk
Fri, 17 May 2002 16:35:41 +0100


Yes.

It works for over 90% of the entries in Swiss-Prot 40.
SeqIOTools.readSwissprot() should do the trick. You will get back a
normal SequenceIterator object over all the Swiss-Prot entries in the
file. The entries it fails on tend to have a domain at coordinate 0,
which is an initiation site. The feature creation logic in BioJava sees
this as a feature with illegal coordinates and barfs. I keep meaning to
find a work-around for this, but never get the time. The options are:

1) throw an informative exception

2) ignore the 'bad' features

3) put feature templates for 'bad' features in a well-known annotation
property

4) relax the feature creation checks

My vote is for 3. It keeps all of the valid data available, while
still allowing access to the rest (to facilitate round-tripping).
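
To make option 3 concrete, here is a rough sketch in plain Java. It is
not BioJava code and does not use the BioJava API: the class
FeatureTriage, the Template type and the property name
"rejected_feature_templates" are all made up for illustration, and the
parsing is a naive whitespace split that only handles plain integer
coordinates. The idea is simply that features with legal coordinates
are returned as usual, while the rest are parked under one well-known
key so that nothing is lost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of BioJava): separates usable FT lines from rejected ones. */
class FeatureTriage {
    /** Hypothetical "well-known" property name for the rejected feature templates. */
    static final String REJECTED_KEY = "rejected_feature_templates";

    /** Minimal stand-in for a feature template: type, start, end, description. */
    static final class Template {
        final String type;
        final int start;
        final int end;
        final String description;

        Template(String type, int start, int end, String description) {
            this.type = type;
            this.start = start;
            this.end = end;
            this.description = description;
        }

        /** Assumes 1-based inclusive coordinates, so a start of 0 is illegal. */
        boolean hasLegalCoordinates(int seqLength) {
            return start >= 1 && start <= end && end <= seqLength;
        }
    }

    /** Parses "FT KEY FROM TO [DESCRIPTION]"; returns null for anything else. */
    static Template parse(String line) {
        String[] parts = line.trim().split("\\s+", 5);
        if (parts.length < 4 || !parts[0].equals("FT")) {
            return null;
        }
        try {
            int start = Integer.parseInt(parts[2]);
            int end = Integer.parseInt(parts[3]);
            String description = parts.length == 5 ? parts[4] : "";
            return new Template(parts[1], start, end, description);
        } catch (NumberFormatException e) {
            return null; // non-integer coordinates are out of scope for this sketch
        }
    }

    /**
     * Returns the templates with legal coordinates and stores the others
     * in the annotation map under REJECTED_KEY (option 3).
     */
    static List<Template> triage(List<String> ftLines, int seqLength,
                                 Map<String, Object> annotation) {
        List<Template> accepted = new ArrayList<>();
        List<Template> rejected = new ArrayList<>();
        for (String line : ftLines) {
            Template t = parse(line);
            if (t == null) {
                continue; // lines this sketch cannot parse are skipped
            }
            if (t.hasLegalCoordinates(seqLength)) {
                accepted.add(t);
            } else {
                rejected.add(t);
            }
        }
        annotation.put(REJECTED_KEY, rejected);
        return accepted;
    }

    public static void main(String[] args) {
        // The first line is an invented example of a feature at coordinate 0.
        List<String> lines = Arrays.asList(
            "FT   INIT_MET      0      0",
            "FT   DOMAIN       77     88       ASP/GLU-RICH (ACIDIC).",
            "FT   BINDING     858    858       UBIQUITIN (BY SIMILARITY).");
        Map<String, Object> annotation = new HashMap<>();
        List<Template> accepted = triage(lines, 889, annotation);
        System.out.println("accepted: " + accepted.size());
        System.out.println("rejected: "
            + ((List<?>) annotation.get(REJECTED_KEY)).size());
    }
}
```

A writer that wants to round-trip the entry could then emit the
accepted features followed by whatever is stored under the rejected
key, without either list ever going through the coordinate check again.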

Matthew

wanner.de@pg.com wrote:
> Hi,
> 
> Has anyone used bioJava to parse a swissProt record? I've attached an
> example below -- its format looks much different from either genbank
> or refseq.
> 
> thx,
> Dave
> 
> 
> ID   100K_RAT       STANDARD;      PRT;   889 AA.
> AC   Q62671;
> DT   01-NOV-1997 (Rel. 35, Created)
> DT   01-NOV-1997 (Rel. 35, Last sequence update)
> DT   16-OCT-2001 (Rel. 40, Last annotation update)
> DE   100 kDa protein (ENZYME: 6.3.2.-).
> OS   Rattus norvegicus (Rat).
> OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> OC   Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
> OX   NCBI_TaxID=10116;
> RN   [1]
> RP   SEQUENCE FROM N.A.
> RC   STRAIN=WISTAR; TISSUE=Testis;
> RX   MEDLINE=92253337: PubMed=1533713;
> RA   Mueller D., Rehbein M., Baumeister H., Richter D.;
> RT   "Molecular characterization of a novel rat protein structurally
> RT   related to poly(A) binding proteins and the 70K protein of the U1
> RT   small nuclear ribonucleoprotein particle (snRNP).";
> RL   Nucleic Acids Res. 20:1471-1475(1992).
> RN   [2]
> RP   ERRATUM.
> RA   Mueller D., Rehbein M., Baumeister H., Richter D.;
> RL   Nucleic Acids Res. 20:2624-2624(1992).
> CC   -!- FUNCTION: E3 UBIQUITIN-PROTEIN LIGASE WHICH ACCEPTS UBIQUITIN FROM
> CC       AN E2 UBIQUITIN-CONJUGATING ENZYME IN THE FORM OF A THIOESTER AND
> CC       THEN DIRECTLY TRANSFERS THE UBIQUITIN TO TARGETED SUBSTRATES (BY
> CC       SIMILARITY). THIS PROTEIN MAY BE INVOLVED IN MATURATION AND/OR
> CC       POST-TRANSCRIPTIONAL REGULATION OF MRNA.
> CC   -!- TISSUE SPECIFICITY: HIGHEST LEVELS FOUND IN TESTIS. ALSO PRESENT
> CC       IN LIVER, KIDNEY, LUNG AND BRAIN.
> CC   -!- DEVELOPMENTAL STAGE: IN EARLY POST-NATAL LIFE, EXPRESSION IN
> CC       THE TESTIS INCREASES TO REACH A MAXIMUM AROUND DAY 28.
> CC   -!- MISCELLANEOUS: A CYSTEINE RESIDUE IS REQUIRED FOR
> CC       UBIQUITIN-THIOLESTER FORMATION.
> CC   -!- SIMILARITY: A CENTRAL REGION (AA 485-514) IS SIMILAR TO THE
> CC       C-TERMINAL DOMAINS OF MAMMALIAN AND YEAST POLY (A) RNA BINDING
> CC       PROTEINS (PABP).
> CC   -!- SIMILARITY: CONTAINS MIXED-CHARGE DOMAINS SIMILAR TO RNA-BINDING
> CC       PROTEINS.
> CC   -!- SIMILARITY: CONTAINS 1 HECT-TYPE E3 UBIQUITIN-PROTEIN LIGASE
> CC       DOMAIN.
> CC   --------------------------------------------------------------------------
> CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
> CC   between  the Swiss Institute of Bioinformatics  and the  EMBL outstation -
> CC   the European Bioinformatics Institute.  There are no  restrictions on  its
> CC   use  by  non-profit  institutions as long  as its content  is  in  no  way
> CC   modified and this statement is not removed.  Usage  by  and for commercial
> CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
> CC   or send an email to license@isb-sib.ch).
> CC   --------------------------------------------------------------------------
> DR   EMBL: X64411
> DR   InterPro: IPR000569
> DR    IPR002004
> DR   Pfam: PF00632
> DR    PF00658
> DR   SMART: SM00119
> DR    SM00517
> DR   PROSITE: PS50237
> KW   Ubiquitin conjugation; Ligase.
> FT   DOMAIN       77     88       ASP/GLU-RICH (ACIDIC).
> FT   DOMAIN      127    150       PRO-RICH.
> FT   DOMAIN      420    439       ARG/GLU-RICH (MIXED CHARGE).
> FT   DOMAIN      448    457       ARG/ASP-RICH (MIXED CHARGE).
> FT   DOMAIN      485    514       PABP-LIKE.
> FT   DOMAIN      579    590       ASP/GLU-RICH (ACIDIC).
> FT   DOMAIN      786    889       HECT.
> FT   DOMAIN      827    847       PRO-RICH.
> FT   BINDING     858    858       UBIQUITIN (BY SIMILARITY).
> SQ   SEQUENCE   889 AA;  100368 MW;  ABD7E3CD53961B78 CRC64;
>      MMSARGDFLN YALSLMRSHN DEHSDVLPVL DVCSLKHVAY VFQALIYWIK AMNQQTTLDT
>      PQLERKRTRE LLELGIDNED SEHENDDDTS QSATLNDKDD ESLPAETGQN HPFFRRSDSM
>      TFLGCIPPNP FEVPLAEAIP LADQPHLLQP NARKEDLFGR PSQGLYSSSA GSGKCLVEVT
>      MDRNCLEVLP TKMSYAANLK NVMNMQNRQK KAGEDQSMLA EEADSSKPGP SAHDVAAQLK
>      SSLLAEIGLT ESEGPPLTSF RPQCSFMGMV ISHDMLLGRW RLSLELFGRV FMEDVGAEPG
>      SILTELGGFE VKESKFRREM EKLRNQQSRD LSLEVDRDRD LLIQQTMRQL NNHFGRRCAT
>      TPMAVHRVKV TFKDEPGEGS GVARSFYTAI AQAFLSNEKL PNLDCIQNAN KGTHTSLMQR
>      LRNRGERDRE REREREMRRS SGLRAGSRRD RDRDFRRQLS IDTRPFRPAS EGNPSDDPDP
>      LPAHRQALGE RLYPRVQAMQ PAFASKITGM LLELSPAQLL LLLASEDSLR ARVEEAMELI
>      VAHGRENGAD SILDLGLLDS SEKVQENRKR HGSSRSVVDM DLDDTDDGDD NAPLFYQPGK
>      RGFYTPRPGK NTEARLNCFR NIGRILGLCL LQNELCPITL NRHVIKVLLG RKVNWHDFAF
>      FDPVMYESLR QLILASQSSD ADAVFSAMDL AFAVDLCKEE GGGQVELIPN GVNIPVTPQN
>      VYEYVRKYAE HRMLVVAEQP LHAMRKGLLD VLPKNSLEDL TAEDFRLLVN GCGEVNVQML
>      ISFTSFNDES GENAEKLLQF KRWFWSIVER MSMTERQDLV YFWTSSPSLP ASEEGFQPMP
>      SITIRPPDDQ HLPTANTCIS RLYVPLYSSK QILKQKLLLA IKTKNFGFV
> //