[Biojava-dev] Writing Swissprot/Uniprot formatted files

Mon Oct 20 13:36:15 UTC 2008

Hi Richard,
I'm getting my records from an indexed flat file.  I indexed the file using
IndexTools.indexSwissprot().  I am then retrieving the records from the flat
file "database" using the SequenceDBLite interface which is being provided
to me using the Registry and SystemRegistry classes.  The following is a
simple example of what I am doing:

First I index the flat file:

> File[] files = new File[] { new File("/home/fbristow/db/uniprot_sprot.dat") };
> try {
>       IndexTools.indexSwissprot("uniprot_sprot",
>             new File("/home/fbristow/db/index/uniprot_sprot"), files);
> } catch (BioException bioE) {
>       bioE.printStackTrace();
> } catch (ParserException parseE) {
>       parseE.printStackTrace();
> } catch (IOException ioE) {
>       ioE.printStackTrace();
> }

Then I get a handle on that file by doing:

> Registry registry = SystemRegistry.instance();
> setSwissDatabase(registry.getDatabase("swissprot"));

And I have a file in /etc that tells the registry how to find the indexes
for the swissprot identifier, as described at
http://biojava.org/docs/api/org/biojava/directory/SystemRegistry.html

Ultimately, this gives me a class that implements the interface
SequenceDBLite, and when I query this interface for sequences it returns to
me Sequence objects.  I can't seem to see anything that would give me a
RichSequence, so I think that I'll continue to get them in this manner, but
I'll convert the Sequence objects into RichSequence objects myself.
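
A rough sketch of what I mean by converting them myself is below.  It does
not use the BioJava API: OldRecord, RichRecord and the property keys "DE",
"CC" and "KW" are hypothetical stand-ins for a Sequence that keeps everything
as flat properties and a RichSequence that has dedicated fields.  The only
point is that each property has to be routed explicitly, because a generic
copy cannot know where it belongs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT BioJava classes.
class ManualEnrichSketch {

    /** Old-style record: everything lives in one flat property map. */
    static class OldRecord {
        final String accession;
        final Map<String, List<String>> properties = new LinkedHashMap<>();

        OldRecord(String accession) {
            this.accession = accession;
        }

        void add(String key, String value) {
            properties.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
    }

    /** Rich-style record: dedicated fields that a writer reads directly. */
    static class RichRecord {
        String accession;
        String description;
        final List<String> comments = new ArrayList<>();
        final List<String> keywords = new ArrayList<>();
        // Anything we do not know how to route ends up here.
        final Map<String, List<String>> notes = new LinkedHashMap<>();
    }

    /** Routes each known property to its field; unknown keys become notes. */
    static RichRecord convert(OldRecord old) {
        RichRecord rich = new RichRecord();
        rich.accession = old.accession;
        for (Map.Entry<String, List<String>> e : old.properties.entrySet()) {
            switch (e.getKey()) {
                case "DE":
                    rich.description = String.join(" ", e.getValue());
                    break;
                case "CC":
                    rich.comments.addAll(e.getValue());
                    break;
                case "KW":
                    rich.keywords.addAll(e.getValue());
                    break;
                default:
                    rich.notes.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        return rich;
    }

    public static void main(String[] args) {
        OldRecord old = new OldRecord("Q4UVA7");
        old.add("DE", "UPF0085 protein XC_1953.");
        old.add("CC", "-!- SIMILARITY: Belongs to the UPF0085 family.");
        RichRecord rich = convert(old);
        System.out.println(rich.description);
        System.out.println(rich.comments);
    }
}
```

With the real classes, the same routing would have to target whatever the
UniProt writer actually reads, which I still need to work out from the
source.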

Thanks for your attention!

On Fri, Oct 17, 2008 at 3:08 PM, Richard Holland
<holland at eaglegenomics.com> wrote:

> Hello.
>
> I'm not sure how you're getting your uniprot records out of your swissprot
> database, or what format your swissprot database is in? If it's BioSQL, then
> the way BioJava interacts with it has altered significantly with BioJavaX -
> previous versions basically stuffed everything in as comments, hence all the
> XX lines you got when writing it back out again. However if it's not BioSQL
> and you've written something custom of your own, then I couldn't really
> comment!
>
> BioJavaX will attempt to convert the old sequence objects into rich
> sequence objects, but there's not much in common between the way uniprot
> data is stored in the old object model and the new one. Therefore the enrich
> method can't do a very good job - especially for stuff which the original
> parser stored as comments instead of properly distributing it across the
> object model. Data which the original parser stored in this comment format
> will mostly get ignored by the conversion process, because the conversion
> process has no idea where the record came from and therefore what to do with
> the comments inside it.
>
> Your best bet is to read your data out of your database directly as rich
> sequence objects, or if not possible, then do the conversion manually.
>
> cheers,
> Richard
>
>
> 2008/10/17 Franklin Bristow <fbristow at gmail.com>
>
>> Hello everyone,
>> I've been doing some work with swissprot, and I've been needing to make
>> use of the file reading and writing facilities in biojava.
>>
>> I was using biojava 1.5, but I've recently moved to using biojava-live so
>> that I can actually step through the code to see what's going on.
>>
>> I have successfully created an index of my swissprot database and I can
>> read my sequences out of that indexed database.  All of the appropriate
>> information is loaded from the records in the file into the appropriate
>> objects.  I am quite happy with this.
>>
>> The problem that I am having has to do with writing swissprot records.
>>
>> When I started using biojava, the recommended way to do this was using
>> SeqIOTools:
>> SeqIOTools.writeSwissprot(byteStream, swissSequence);
>>
>> While this works (i.e. no exceptions are thrown), the record that is
>> printed to the byteStream looks pretty ugly (it's littered with XX lines)
>> and is not valid as per the current swissprot file spec
>> (http://www.expasy.ch/sprot/userman.html).  While this record is invalid,
>> it does contain all of the information that was originally in the
>> swissprot file.  I would include what I get as an output here, but it's
>> irrelevant.
>>
>> SeqIOTools became deprecated in favour of this:
>> RichSequence.IOTools.writeUniProt(byteStream, swissSequence, null);
>>
>> Once again, while this works (and this time the record is valid), the
>> record that is printed contains almost none of the original information
>> that is contained in the swissprot record.  This is the output that I get
>> when I call this method (the spacing may not look right because of fonts,
>> but that is not the problem):
>>
>> > ID   Q4UVA7_null             STANDARD;         273 AA.
>> > AC   Q4UVA7;
>> > DT   null, integrated into UniProtKB/?.
>> > DT   null, sequence version 0.
>> > DT   null, entry version 0.
>> > DE   null.
>> > FT   any           1    273
>> > FT   any         153    160
>> > SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;
>> >      MSTIRPVFYV SDGTGITAET IGHSLLTQFS GFNFVTDRMS FIDDADKARD AALRVRAAGE
>> >      RYQVRPVVVN SCVDPQLSMI LAESGALMLD VFAPFIEPLE RELNAPRHSR VGRAHGMVDF
>> >      ETYHRRINAM NFALSHDDGI ALNYDEADVI LVAVSRAGKT PTCIYLALHY GIRAANYPLT
>> >      EEDLESERLP PRLRNYRSKL FGLTIDPERL QQIRQERRAN SRYSAAETCR REVATAERMF
>> >      QMERIPTLST TNTSIEEISS KVLSTLGLQR EMF
>> > //
>> >
>>
>> But what I am expecting to see looks like this (again, the spacing is the
>> fault of the font, not the output):
>>
>> > ID   Y1953_XANC8             Reviewed;         273 AA.
>> > AC   Q4UVA7;
>> > DT   10-JAN-2006, integrated into UniProtKB/Swiss-Prot.
>> > DT   05-JUL-2005, sequence version 1.
>> > DT   06-FEB-2007, entry version 12.
>> > DE   UPF0085 protein XC_1953.
>> > GN   OrderedLocusNames=XC_1953;
>> > OS   Xanthomonas campestris pv. campestris (strain 8004).
>> > OC   Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales;
>> > OC   Xanthomonadaceae; Xanthomonas.
>> > OX   NCBI_TaxID=314565;
>> > RN   [1]
>> > RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
>> > RX   PubMed=15899963; DOI=10.1101/gr.3378705;
>> > RA   Qian W., Jia Y., Ren S.-X., He Y.-Q., Feng J.-X., Lu L.-F., Sun Q.,
>> > RA   Ying G., Tang D.-J., Tang H., Wu W., Hao P., Wang L., Jiang B.-L.,
>> > RA   Zeng S., Gu W.-Y., Lu G., Rong L., Tian Y., Yao Z., Fu G., Chen B.,
>> > RA   Fang R., Qiang B., Chen Z., Zhao G.-P., Tang J.-L., He C.;
>> > RT   "Comparative and functional genomic analyses of the pathogenicity of
>> > RT   phytopathogen Xanthomonas campestris pv. campestris.";
>> > RL   Genome Res. 15:757-767(2005).
>> > CC   -!- SIMILARITY: Belongs to the UPF0085 family.
>> > CC   -----------------------------------------------------------------------
>> > CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
>> > CC   Distributed under the Creative Commons Attribution-NoDerivs License
>> > CC   -----------------------------------------------------------------------
>> > DR   EMBL; CP000050; AAY49016.1; -; Genomic_DNA.
>> > DR   GenomeReviews; CP000050_GR; XC_1953.
>> > DR   KEGG; xcb:XC_1953; -.
>> > DR   GO; GO:0005524; F:ATP binding; IEA:HAMAP.
>> > DR   HAMAP; MF_01062; -; 1.
>> > DR   InterPro; IPR005177; DUF299.
>> > DR   Pfam; PF03618; DUF299; 1.
>> > KW   ATP-binding; Complete proteome; Nucleotide-binding.
>> > FT   CHAIN         1    273       UPF0085 protein XC_1953.
>> > FT                                /FTId=PRO_0000196744.
>> > FT   NP_BIND     153    160       ATP (Potential).
>> > SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;
>> >      MSTIRPVFYV SDGTGITAET IGHSLLTQFS GFNFVTDRMS FIDDADKARD AALRVRAAGE
>> >      RYQVRPVVVN SCVDPQLSMI LAESGALMLD VFAPFIEPLE RELNAPRHSR VGRAHGMVDF
>> >      ETYHRRINAM NFALSHDDGI ALNYDEADVI LVAVSRAGKT PTCIYLALHY GIRAANYPLT
>> >      EEDLESERLP PRLRNYRSKL FGLTIDPERL QQIRQERRAN SRYSAAETCR REVATAERMF
>> >      QMERIPTLST TNTSIEEISS KVLSTLGLQR EMF
>> > //
>> >
>>
>> Needless to say, there is a considerable loss of information.
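
One way to make that loss concrete is to compare which two-letter line codes
appear in each record.  The sketch below is plain Java and is not part of
BioJava; the class name LineCodeDiff and its methods are made up for this
example, and it assumes the records have been saved as clean flat-file text
with the mail quote markers stripped.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of BioJava.
class LineCodeDiff {

    /** Collects the two-letter line codes (ID, AC, DT, ...) in order of first appearance. */
    static Set<String> lineCodes(String record) {
        Set<String> codes = new LinkedHashSet<>();
        for (String line : record.split("\\R")) {
            // Data lines start with two capitals followed by a space (or nothing).
            // Sequence lines start with spaces and "//" ends the record, so both are skipped.
            if (line.length() >= 2
                    && Character.isUpperCase(line.charAt(0))
                    && Character.isUpperCase(line.charAt(1))
                    && (line.length() == 2 || line.charAt(2) == ' ')) {
                codes.add(line.substring(0, 2));
            }
        }
        return codes;
    }

    /** Returns the line codes present in the expected record but absent from the actual one. */
    static Set<String> missing(String expected, String actual) {
        Set<String> result = lineCodes(expected);
        result.removeAll(lineCodes(actual));
        return result;
    }

    public static void main(String[] args) {
        String expected = "ID   X\nAC   Q;\nDE   d.\nOS   org.\nKW   k.\nSQ   SEQUENCE\n     MSTI\n//\n";
        String actual = "ID   X\nAC   Q;\nDE   null.\nSQ   SEQUENCE\n     MSTI\n//\n";
        System.out.println(missing(expected, actual));
    }
}
```

For the two records quoted above, the written one only has ID, AC, DT, DE, FT
and SQ lines, so everything from GN through KW would be reported as missing.
Note that this only compares line codes; it would not notice that the FT and
DT lines are present but have lost their content.
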
>>
>> At first I wasn't sure if this was a problem with parsing the database
>> that I had, so I inspected the object that was retrieved from the
>> database.  As I mentioned before, the parsing seems to be working fine.
>> I get a SimpleSequence object that has all of the correct annotations and
>> other information loaded into it.
>>
>> I then continued to step through the writeUniProt method in
>> RichSequence.IOTools and found that this method first calls "enrich" on
>> SimpleSequence which turns it into a SimpleRichSequence.  There appears to
>> be some loss of information at this point, specifically in the feature set
>> where the 'key name' is lost -- it just becomes 'any'.
>>
>> It is when we get to the actual process of writing to the stream in
>> UniprotFormat.writeSequence that we have the problems.  All of the code
>> appears to be there for printing the information out that I'm expecting.
>> I think the problem is that in the process of "enrich"-ing the sequence,
>> the data is still stored in the object, but it is no longer where it is
>> expected to be.  For example, when we get to writing the comments out:
>>        // comments - if any
>>        if (!rs.getComments().isEmpty()) {
>>
>> The List of comments IS empty, but there are comments in the
>> SimpleRichSequence; they are stored in the notes data member instead.
>>
>> So.  After this lengthy explanation of my problem, I am wondering if I am
>> merely not doing this correctly.  Is there a better way to pass my
>> information to the writeUniProt method -- should I be transforming my
>> SimpleSequence objects into a SimpleRichSequence manually?  Am I just
>> going about this entirely the wrong way?
>>
>> If I am going about this correctly and the functionality to do this is
>> merely not there or hasn't been implemented correctly, I would be more
>> than happy to help out...  I can supply patches, create bug reports, or
>> anything else that is necessary.
>>
>> Any guidance in this matter would be greatly appreciated!
>>
>> --
>> Franklin
>>
>
>
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>

-- 
Franklin