[Biojava-dev] Writing Swissprot/Uniprot formatted files

Franklin Bristow fbristow at gmail.com
Fri Oct 17 18:58:08 UTC 2008


Hello everyone,
I've been doing some work with swissprot, and I've been needing to make use
of the file reading and writing facilities in biojava.

I was using biojava 1.5, but I've recently moved to using biojava-live so
that I can actually step through the code to see what's going on.

I have successfully created an index of my swissprot database and I can read
my sequences out of that indexed database.  All of the appropriate
information is loaded from the records in the file into the appropriate
objects.  I am quite happy with this.

The problem that I am having has to do with writing swissprot records.

When I started using biojava, the recommended way to do this was using
SeqIOTools:
SeqIOTools.writeSwissprot(byteStream, swissSequence);

While this works (i.e. no exceptions are thrown), the record that is printed
to the byteStream looks pretty ugly (it's littered with XX lines) and is not
valid as per the current swissprot file spec (
http://www.expasy.ch/sprot/userman.html).  While this record is invalid, it
does contain all of the information that was originally in the swissprot
file.  I would include what I get as an output here, but it's irrelevant.

SeqIOTools became deprecated in favour of this:
RichSequence.IOTools.writeUniProt(byteStream, swissSequence, null);

Once again, while this works (and this time the record is valid), the record
that is printed contains almost none of the original information that is
contained in the swissprot record.  This is the output that I get when I
call this method (the spacing may not look right because of fonts, but
that is not the problem):

> ID   Q4UVA7_null             STANDARD;         273 AA.
> AC   Q4UVA7;
> DT   null, integrated into UniProtKB/?.
> DT   null, sequence version 0.
> DT   null, entry version 0.
> DE   null.
> FT   any           1    273
> FT   any         153    160
> SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;
>      MSTIRPVFYV SDGTGITAET IGHSLLTQFS GFNFVTDRMS FIDDADKARD AALRVRAAGE
>      RYQVRPVVVN SCVDPQLSMI LAESGALMLD VFAPFIEPLE RELNAPRHSR VGRAHGMVDF
>      ETYHRRINAM NFALSHDDGI ALNYDEADVI LVAVSRAGKT PTCIYLALHY GIRAANYPLT
>      EEDLESERLP PRLRNYRSKL FGLTIDPERL QQIRQERRAN SRYSAAETCR REVATAERMF
>      QMERIPTLST TNTSIEEISS KVLSTLGLQR EMF
> //
>

But what I am expecting to see looks like this (again, the spacing is the
fault of the font, not the output):

> ID   Y1953_XANC8             Reviewed;         273 AA.
> AC   Q4UVA7;
> DT   10-JAN-2006, integrated into UniProtKB/Swiss-Prot.
> DT   05-JUL-2005, sequence version 1.
> DT   06-FEB-2007, entry version 12.
> DE   UPF0085 protein XC_1953.
> GN   OrderedLocusNames=XC_1953;
> OS   Xanthomonas campestris pv. campestris (strain 8004).
> OC   Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales;
> OC   Xanthomonadaceae; Xanthomonas.
> OX   NCBI_TaxID=314565;
> RN   [1]
> RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
> RX   PubMed=15899963; DOI=10.1101/gr.3378705;
> RA   Qian W., Jia Y., Ren S.-X., He Y.-Q., Feng J.-X., Lu L.-F., Sun Q.,
> RA   Ying G., Tang D.-J., Tang H., Wu W., Hao P., Wang L., Jiang B.-L.,
> RA   Zeng S., Gu W.-Y., Lu G., Rong L., Tian Y., Yao Z., Fu G., Chen B.,
> RA   Fang R., Qiang B., Chen Z., Zhao G.-P., Tang J.-L., He C.;
> RT   "Comparative and functional genomic analyses of the pathogenicity of
> RT   phytopathogen Xanthomonas campestris pv. campestris.";
> RL   Genome Res. 15:757-767(2005).
> CC   -!- SIMILARITY: Belongs to the UPF0085 family.
> CC   -----------------------------------------------------------------------
> CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
> CC   Distributed under the Creative Commons Attribution-NoDerivs License
> CC   -----------------------------------------------------------------------
> DR   EMBL; CP000050; AAY49016.1; -; Genomic_DNA.
> DR   GenomeReviews; CP000050_GR; XC_1953.
> DR   KEGG; xcb:XC_1953; -.
> DR   GO; GO:0005524; F:ATP binding; IEA:HAMAP.
> DR   HAMAP; MF_01062; -; 1.
> DR   InterPro; IPR005177; DUF299.
> DR   Pfam; PF03618; DUF299; 1.
> KW   ATP-binding; Complete proteome; Nucleotide-binding.
> FT   CHAIN         1    273       UPF0085 protein XC_1953.
> FT                                /FTId=PRO_0000196744.
> FT   NP_BIND     153    160       ATP (Potential).
> SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;
>      MSTIRPVFYV SDGTGITAET IGHSLLTQFS GFNFVTDRMS FIDDADKARD AALRVRAAGE
>      RYQVRPVVVN SCVDPQLSMI LAESGALMLD VFAPFIEPLE RELNAPRHSR VGRAHGMVDF
>      ETYHRRINAM NFALSHDDGI ALNYDEADVI LVAVSRAGKT PTCIYLALHY GIRAANYPLT
>      EEDLESERLP PRLRNYRSKL FGLTIDPERL QQIRQERRAN SRYSAAETCR REVATAERMF
>      QMERIPTLST TNTSIEEISS KVLSTLGLQR EMF
> //
>

Needless to say, there is a considerable loss of information.
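
To make the loss easier to quantify, here is a minimal sketch in plain Java
(no biojava calls) that lists which two-letter line codes appear in the
expected record but not in the written one.  The class and method names
(LineCodeDiff, lineCodes, missingCodes) are made up for illustration, the
sample records are abbreviated from the ones above, and the input is assumed
to have any leading "> " quoting already stripped:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of biojava.
class LineCodeDiff {

    // Collect the two-letter line codes (ID, AC, DT, ...) in order of first appearance.
    static Set<String> lineCodes(String record) {
        Set<String> codes = new LinkedHashSet<>();
        for (String line : record.split("\n")) {
            // Sequence data lines start with spaces and "//" ends the record; skip both.
            if (line.length() >= 2
                    && Character.isUpperCase(line.charAt(0))
                    && Character.isUpperCase(line.charAt(1))) {
                codes.add(line.substring(0, 2));
            }
        }
        return codes;
    }

    // Line codes present in the expected record but absent from the actual one.
    static Set<String> missingCodes(String expected, String actual) {
        Set<String> missing = lineCodes(expected);
        missing.removeAll(lineCodes(actual));
        return missing;
    }

    public static void main(String[] args) {
        // Abbreviated versions of the two records shown above.
        String expected = String.join("\n",
                "ID   Y1953_XANC8             Reviewed;         273 AA.",
                "AC   Q4UVA7;",
                "DE   UPF0085 protein XC_1953.",
                "GN   OrderedLocusNames=XC_1953;",
                "OS   Xanthomonas campestris pv. campestris (strain 8004).",
                "CC   -!- SIMILARITY: Belongs to the UPF0085 family.",
                "SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;",
                "     MSTIRPVFYV",
                "//");
        String actual = String.join("\n",
                "ID   Q4UVA7_null             STANDARD;         273 AA.",
                "AC   Q4UVA7;",
                "DE   null.",
                "SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;",
                "     MSTIRPVFYV",
                "//");
        System.out.println(missingCodes(expected, actual)); // prints [GN, OS, CC]
    }
}
```

Note that this only catches line types that vanish entirely; lines such as DT
and DE are still present in my output but carry "null" values, so they would
need a content comparison as well.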

At first I wasn't sure if this was a problem with parsing the database that
I had, so I inspected the object that was retrieved from the database.  As I
mentioned before, the parsing seems to be working fine.  I get a
SimpleSequence object that has all of the correct annotations and other
information loaded into it.

I then continued to step through the writeUniProt method in
RichSequence.IOTools and found that this method first calls "enrich" on
SimpleSequence which turns it into a SimpleRichSequence.  There appears to
be some loss of information at this point, specifically in the feature set
where the 'key name' is lost -- it just becomes 'any'.

It is when we get to the actual process of writing to the stream in
UniprotFormat.writeSequence that we have the problems.  All of the code
appears to be there for printing the information out that I'm expecting.  I
think the problem is that in the process of "enrich"-ing the sequence, the
data is still stored in the object, but it is no longer where it is expected
to be.  For example, when we get to writing the comments out:
        // comments - if any
        if (!rs.getComments().isEmpty()) {

The List of comments IS empty, but there are comments in the
SimpleRichSequence; they are stored in the notes data member.
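
To illustrate what I think is going on, here is a small self-contained model
in plain Java.  None of these classes are biojava classes: ModelSequence,
write, promoteCommentNotes and the note key "swissprot.comment" are all
hypothetical stand-ins, and the real note keys may well differ.  The sketch
only shows the shape of the mismatch (a writer that reads one field while the
data sits in another) and one possible way to bridge it:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the mismatch; NOT biojava code.
class NotesVersusComments {

    // Stand-in for a converted sequence: comments and notes live in separate fields.
    static class ModelSequence {
        final List<String> comments = new ArrayList<>();
        final Map<String, String> notes = new LinkedHashMap<>();
    }

    // Mimics a writer that only looks at the comment list when emitting CC lines.
    static String write(ModelSequence seq) {
        StringBuilder out = new StringBuilder();
        for (String comment : seq.comments) {
            out.append("CC   -!- ").append(comment).append('\n');
        }
        return out.toString();
    }

    // Possible bridge: move a note stored under the given key into the comment list.
    static void promoteCommentNotes(ModelSequence seq, String key) {
        String value = seq.notes.remove(key);
        if (value != null) {
            seq.comments.add(value);
        }
    }

    public static void main(String[] args) {
        ModelSequence seq = new ModelSequence();
        // After conversion the comment text ends up in the notes, under an assumed key.
        seq.notes.put("swissprot.comment", "SIMILARITY: Belongs to the UPF0085 family.");

        // The writer finds no comments, so no CC lines are produced.
        System.out.println(write(seq).isEmpty());

        // Once the note is promoted, the CC line is written.
        promoteCommentNotes(seq, "swissprot.comment");
        System.out.print(write(seq));
    }
}
```

Whether the right fix is something like this (copying the data across before
writing) or having the conversion put it in the right place to begin with is
exactly what I am unsure about.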

So.  After this lengthy explanation of my problem, I am wondering if I am
merely not doing this correctly.  Is there a better way to pass my
information to the writeUniProt method -- should I be transforming my
SimpleSequence objects into a SimpleRichSequence manually?  Am I just going
about this entirely the wrong way?

If I am going about this correctly and the functionality to do this is
merely not there or hasn't been implemented correctly, I would be more than
happy to help out...  I can supply patches, create bug reports, or anything
else that is necessary.

Any guidance in this matter would be greatly appreciated!

-- 
Franklin


