[Biojava-dev] Writing Swissprot/Uniprot formatted files

Richard Holland holland at eaglegenomics.com
Fri Oct 17 20:08:25 UTC 2008


Hello.

I'm not sure how you're getting your uniprot records out of your swissprot
database, or what format that database is in. If it's BioSQL, then the way
BioJava interacts with it has changed significantly with BioJavaX: previous
versions basically stuffed everything in as comments, hence all the XX lines
you got when writing it back out again. If it's not BioSQL and you've written
something custom of your own, then I couldn't really comment!

BioJavaX will attempt to convert the old sequence objects into rich sequence
objects, but there's not much in common between the way uniprot data is
stored in the old object model and the new one. Therefore the enrich method
can't do a very good job - especially for stuff which the original parser
stored as comments instead of properly distributing it across the object
model. Data which the original parser stored in this comment format will
mostly get ignored by the conversion process, because the conversion process
has no idea where the record came from and therefore what to do with the
comments inside it.
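
To make the mismatch concrete, here is a minimal sketch in plain Java. It is
a toy model, not BioJava code: EnrichSketch, RichRecord, enrich,
writeComments, promoteNoteToComment and the "comment" key are all hypothetical
names made up for illustration. It assumes only what is described in this
thread, namely that the generic conversion parks legacy data in notes while
the writer looks at the comment list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these classes are hypothetical stand-ins, not BioJava types.
class EnrichSketch {

    /** Stand-in for a rich sequence: typed comments plus free-form notes. */
    static class RichRecord {
        final List<String> comments = new ArrayList<>();
        final Map<String, String> notes = new LinkedHashMap<>();
    }

    /** Generic conversion: it cannot tell what a legacy key means, so every entry becomes a note. */
    static RichRecord enrich(Map<String, String> legacyAnnotation) {
        RichRecord rich = new RichRecord();
        rich.notes.putAll(legacyAnnotation);
        return rich;
    }

    /** Writer that only consults the comment list and never reads the notes. */
    static String writeComments(RichRecord rich) {
        StringBuilder out = new StringBuilder();
        for (String comment : rich.comments) {
            out.append("CC   ").append(comment).append('\n');
        }
        return out.toString();
    }

    /** Manual step: move the note stored under a known key into the comment list. */
    static void promoteNoteToComment(RichRecord rich, String key) {
        String value = rich.notes.remove(key);
        if (value != null) {
            rich.comments.add(value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> legacy = new LinkedHashMap<>();
        // "comment" is an assumed key name, chosen for this example only.
        legacy.put("comment", "-!- SIMILARITY: Belongs to the UPF0085 family.");

        RichRecord rich = enrich(legacy);
        System.out.println("CC lines before manual step: " + writeComments(rich).length() + " chars");

        promoteNoteToComment(rich, "comment");
        System.out.print(writeComments(rich));
    }
}
```

In the toy model the data is never lost, it is simply in a place the writer
does not read. Only code that knows where the record came from can decide
which note belongs in which field.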

Your best bet is to read your data out of your database directly as rich
sequence objects or, if that's not possible, to do the conversion manually.
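
Whichever route you take, it is worth checking that the record you write back
out still carries the same line types as the original. The following is a
rough sketch of such a check in plain Java. LineCodeDiff and its methods are
hypothetical helpers, not part of BioJava, and the two sample records are
abridged from the ones quoted in this thread. It only detects line types that
vanish completely, not values that degrade to "null".

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of BioJava: lists the line codes that a
// rewritten flat-file record has lost relative to the original.
class LineCodeDiff {

    /** Collects the two-letter line codes (ID, AC, DT, ...) in order of first appearance. */
    static Set<String> lineCodes(String record) {
        Set<String> codes = new LinkedHashSet<>();
        for (String line : record.split("\\R")) {
            // A coded line is two upper-case letters, then a space or end of line.
            // Sequence lines (leading spaces) and the "//" terminator do not match.
            if (line.matches("[A-Z]{2}( .*)?")) {
                codes.add(line.substring(0, 2));
            }
        }
        return codes;
    }

    /** Returns the codes present in the original record but absent from the rewritten one. */
    static Set<String> missingCodes(String original, String rewritten) {
        Set<String> missing = lineCodes(original);
        missing.removeAll(lineCodes(rewritten));
        return missing;
    }

    public static void main(String[] args) {
        String original = String.join("\n",
            "ID   Y1953_XANC8             Reviewed;         273 AA.",
            "AC   Q4UVA7;",
            "DE   UPF0085 protein XC_1953.",
            "GN   OrderedLocusNames=XC_1953;",
            "KW   ATP-binding; Complete proteome; Nucleotide-binding.",
            "SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;",
            "     MSTIRPVFYV SDGTGITAET",
            "//");
        String rewritten = String.join("\n",
            "ID   Q4UVA7_null             STANDARD;         273 AA.",
            "AC   Q4UVA7;",
            "DE   null.",
            "SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;",
            "     MSTIRPVFYV SDGTGITAET",
            "//");
        System.out.println(missingCodes(original, rewritten)); // prints [GN, KW]
    }
}
```

An empty result means no line type disappeared entirely. The field values
themselves still need to be compared separately.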

cheers,
Richard


2008/10/17 Franklin Bristow <fbristow at gmail.com>

> Hello everyone,
> I've been doing some work with swissprot, and I've been needing to make use
> of the file reading and writing facilities in biojava.
>
> I was using biojava 1.5, but I've recently moved to using biojava-live so
> that I can actually step through the code to see what's going on.
>
> I have successfully created an index of my swissprot database and I can read
> my sequences out of that indexed database.  All of the appropriate
> information is loaded from the records in the file into the appropriate
> objects.  I am quite happy with this.
>
> The problem that I am having has to do with writing swissprot records.
>
> When I started using biojava, the recommended way to do this was using
> SeqIOTools:
> SeqIOTools.writeSwissprot(byteStream, swissSequence);
>
> While this works (i.e. no exceptions are thrown), the record that is printed
> to the byteStream looks pretty ugly (it's littered with XX lines) and is not
> valid as per the current swissprot file spec
> (http://www.expasy.ch/sprot/userman.html).  While this record is invalid, it
> does contain all of the information that was originally in the swissprot
> file.  I would include what I get as an output here, but it's irrelevant.
>
> SeqIOTools became deprecated in favour of this:
> RichSequence.IOTools.writeUniProt(byteStream, swissSequence, null);
>
> Once again, while this works (and this time the record is valid), the record
> that is printed contains almost none of the original information that is
> contained in the swissprot record.  This is the output that I get when I
> call this method (the spacing may not look right because of fonts, but
> that is not the problem):
>
> > ID   Q4UVA7_null             STANDARD;         273 AA.
> > AC   Q4UVA7;
> > DT   null, integrated into UniProtKB/?.
> > DT   null, sequence version 0.
> > DT   null, entry version 0.
> > DE   null.
> > FT   any           1    273
> > FT   any         153    160
> > SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;
> >      MSTIRPVFYV SDGTGITAET IGHSLLTQFS GFNFVTDRMS FIDDADKARD AALRVRAAGE
> >      RYQVRPVVVN SCVDPQLSMI LAESGALMLD VFAPFIEPLE RELNAPRHSR VGRAHGMVDF
> >      ETYHRRINAM NFALSHDDGI ALNYDEADVI LVAVSRAGKT PTCIYLALHY GIRAANYPLT
> >      EEDLESERLP PRLRNYRSKL FGLTIDPERL QQIRQERRAN SRYSAAETCR REVATAERMF
> >      QMERIPTLST TNTSIEEISS KVLSTLGLQR EMF
> > //
> >
>
> But what I am expecting to see looks like this (again, the spacing is the
> fault of the font, not the output):
>
> > ID   Y1953_XANC8             Reviewed;         273 AA.
> > AC   Q4UVA7;
> > DT   10-JAN-2006, integrated into UniProtKB/Swiss-Prot.
> > DT   05-JUL-2005, sequence version 1.
> > DT   06-FEB-2007, entry version 12.
> > DE   UPF0085 protein XC_1953.
> > GN   OrderedLocusNames=XC_1953;
> > OS   Xanthomonas campestris pv. campestris (strain 8004).
> > OC   Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales;
> > OC   Xanthomonadaceae; Xanthomonas.
> > OX   NCBI_TaxID=314565;
> > RN   [1]
> > RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
> > RX   PubMed=15899963; DOI=10.1101/gr.3378705;
> > RA   Qian W., Jia Y., Ren S.-X., He Y.-Q., Feng J.-X., Lu L.-F., Sun Q.,
> > RA   Ying G., Tang D.-J., Tang H., Wu W., Hao P., Wang L., Jiang B.-L.,
> > RA   Zeng S., Gu W.-Y., Lu G., Rong L., Tian Y., Yao Z., Fu G., Chen B.,
> > RA   Fang R., Qiang B., Chen Z., Zhao G.-P., Tang J.-L., He C.;
> > RT   "Comparative and functional genomic analyses of the pathogenicity of
> > RT   phytopathogen Xanthomonas campestris pv. campestris.";
> > RL   Genome Res. 15:757-767(2005).
> > CC   -!- SIMILARITY: Belongs to the UPF0085 family.
> > CC   -----------------------------------------------------------------------
> > CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
> > CC   Distributed under the Creative Commons Attribution-NoDerivs License
> > CC   -----------------------------------------------------------------------
> > DR   EMBL; CP000050; AAY49016.1; -; Genomic_DNA.
> > DR   GenomeReviews; CP000050_GR; XC_1953.
> > DR   KEGG; xcb:XC_1953; -.
> > DR   GO; GO:0005524; F:ATP binding; IEA:HAMAP.
> > DR   HAMAP; MF_01062; -; 1.
> > DR   InterPro; IPR005177; DUF299.
> > DR   Pfam; PF03618; DUF299; 1.
> > KW   ATP-binding; Complete proteome; Nucleotide-binding.
> > FT   CHAIN         1    273       UPF0085 protein XC_1953.
> > FT                                /FTId=PRO_0000196744.
> > FT   NP_BIND     153    160       ATP (Potential).
> > SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;
> >      MSTIRPVFYV SDGTGITAET IGHSLLTQFS GFNFVTDRMS FIDDADKARD AALRVRAAGE
> >      RYQVRPVVVN SCVDPQLSMI LAESGALMLD VFAPFIEPLE RELNAPRHSR VGRAHGMVDF
> >      ETYHRRINAM NFALSHDDGI ALNYDEADVI LVAVSRAGKT PTCIYLALHY GIRAANYPLT
> >      EEDLESERLP PRLRNYRSKL FGLTIDPERL QQIRQERRAN SRYSAAETCR REVATAERMF
> >      QMERIPTLST TNTSIEEISS KVLSTLGLQR EMF
> > //
> >
>
> Needless to say, there is a considerable loss of information.
>
> At first I wasn't sure if this was a problem with parsing the database that
> I had, so I inspected the object that was retrieved from the database.  As I
> mentioned before, the parsing seems to be working fine.  I get a
> SimpleSequence object that has all of the correct annotations and other
> information loaded into it.
>
> I then continued to step through the writeUniProt method in
> RichSequence.IOTools and found that this method first calls "enrich" on
> SimpleSequence which turns it into a SimpleRichSequence.  There appears to
> be some loss of information at this point, specifically in the feature set
> where the 'key name' is lost -- it just becomes 'any'.
>
> It is when we get to the actual process of writing to the stream in
> UniprotFormat.writeSequence that we have the problems.  All of the code
> appears to be there for printing the information out that I'm expecting.  I
> think the problem is that in the process of "enrich"-ing the sequence, the
> data is still stored in the object, but it is no longer where it is expected
> to be.  For example, when we get to writing the comments out:
>        // comments - if any
>        if (!rs.getComments().isEmpty()) {
>
> The List of comments IS empty, but there are comments in the
> SimpleRichSequence; they are stored in the notes data member.
>
> So.  After this lengthy explanation of my problem, I am wondering if I am
> merely not doing this correctly.  Is there a better way to pass my
> information to the writeUniProt method -- should I be transforming my
> SimpleSequence objects into a SimpleRichSequence manually?  Am I just going
> about this entirely the wrong way?
>
> If I am going about this correctly and the functionality to do this is
> merely not there or hasn't been implemented correctly, I would be more than
> happy to help out...  I can supply patches, create bug reports, or anything
> else that is necessary.
>
> Any guidance in this matter would be greatly appreciated!
>
> --
> Franklin
>



-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


