[Biojava-dev] Writing Swissprot/Uniprot formatted files

Richard Holland holland at eaglegenomics.com
Mon Oct 20 14:17:34 UTC 2008


Wow, I didn't know anyone was actually using the registry thing. I certainly
never have! That's probably why it was left out of the whole update to
RichSequences. There will probably be equivalent functionality in BioJava3
at some point but I doubt anyone will backport the RichSequence updates to
the existing registry setup (unless there are any volunteers!).

Good luck with the conversion process.
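
If it helps while you do the conversion by hand, here is a minimal sketch of
a sanity check: it lists which two-letter line codes (GN, OS, CC, DR, ...)
are present in the original record but absent from the one written back out.
It is plain Java with no BioJava calls, and the class and method names
(LineCodeDiff, lineCodes, missingCodes) are made up for illustration. It
assumes the usual flat-file layout: a two-letter code at the start of each
line, sequence data on lines that start with spaces, and "//" ending the
record.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Compares the line codes present in two Swiss-Prot/UniProt flat-file records. */
final class LineCodeDiff {

    /** Collects the distinct two-letter line codes of one record, in order of first appearance. */
    static Set<String> lineCodes(String record) {
        Set<String> codes = new LinkedHashSet<>();
        for (String line : record.split("\\R")) {
            if (line.startsWith("//")) {
                break; // end-of-record terminator
            }
            // Sequence data lines start with spaces, so they carry no line code.
            if (line.length() >= 2
                    && Character.isUpperCase(line.charAt(0))
                    && Character.isUpperCase(line.charAt(1))) {
                codes.add(line.substring(0, 2));
            }
        }
        return codes;
    }

    /** Returns the codes that appear in the original record but not in the written one. */
    static List<String> missingCodes(String original, String written) {
        Set<String> missing = new LinkedHashSet<>(lineCodes(original));
        missing.removeAll(lineCodes(written));
        return new ArrayList<>(missing);
    }

    public static void main(String[] args) {
        // Abbreviated versions of the two records quoted in this thread.
        String original = String.join("\n",
                "ID   Y1953_XANC8             Reviewed;         273 AA.",
                "AC   Q4UVA7;",
                "DE   UPF0085 protein XC_1953.",
                "GN   OrderedLocusNames=XC_1953;",
                "KW   ATP-binding; Complete proteome; Nucleotide-binding.",
                "SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;",
                "     MSTIRPVFYV SDGTGITAET",
                "//");
        String written = String.join("\n",
                "ID   Q4UVA7_null             STANDARD;         273 AA.",
                "AC   Q4UVA7;",
                "DE   null.",
                "SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;",
                "     MSTIRPVFYV SDGTGITAET",
                "//");
        System.out.println(missingCodes(original, written)); // prints [GN, KW]
    }
}
```

Note that this only catches line types that vanish entirely. A line that is
still written but has lost its content (such as "DE   null.") will not show
up, so it is a first-pass check rather than a full comparison.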

cheers,
Richard

2008/10/20 Franklin Bristow <fbristow at gmail.com>

> Hi Richard,
> I'm getting my records from an indexed flat file.  I indexed the file using
> IndexTools.indexSwissprot().  I am then retrieving the records from the flat
> file "database" using the SequenceDBLite interface which is being provided
> to me using the Registry and SystemRegistry classes.  The following is a
> simple example of what I am doing:
>
> First I index the flat file:
>
>> File[] files = new File[] { new File("/home/fbristow/db/uniprot_sprot.dat") };
>> try {
>>       IndexTools.indexSwissprot("uniprot_sprot",
>>             new File("/home/fbristow/db/index/uniprot_sprot"), files);
>> } catch (BioException bioE) {
>>       bioE.printStackTrace();
>> } catch (ParserException parseE) {
>>       parseE.printStackTrace();
>> } catch (IOException ioE) {
>>       ioE.printStackTrace();
>> }
>
>
> Then I get a handle on that file by doing:
>
>> Registry registry = SystemRegistry.instance();
>> setSwissDatabase(registry.getDatabase("swissprot"));
>>
>
> And I have a file in /etc that tells the registry how to find the indexes
> with the swissprot identifier as per
> http://biojava.org/docs/api/org/biojava/directory/SystemRegistry.html
>
> Ultimately, this gives me a class that implements the interface
> SequenceDBLite, and when I query this interface for sequences it returns to
> me Sequence objects.  I can't seem to see anything that would give me a
> RichSequence, so I think that I'll continue to get them in this manner, but
> I'll convert the Sequence objects into RichSequence objects myself.
>
> Thanks for your attention!
>
>
> On Fri, Oct 17, 2008 at 3:08 PM, Richard Holland <holland at eaglegenomics.com> wrote:
>
>> Hello.
>>
>> I'm not sure how you're getting your uniprot records out of your swissprot
>> database, or what format your swissprot database is in? If it's BioSQL, then
>> the way BioJava interacts with it has altered significantly with BioJavaX -
>> previous versions basically stuffed everything in as comments, hence all the
>> XX lines you got when writing it back out again. However if it's not BioSQL
>> and you've written something custom of your own, then I couldn't really
>> comment!
>>
>> BioJavaX will attempt to convert the old sequence objects into rich
>> sequence objects, but there's not much in common between the way uniprot
>> data is stored in the old object model and the new one. Therefore the enrich
>> method can't do a very good job - especially for stuff which the original
>> parser stored as comments instead of properly distributing it across the
>> object model. Data which the original parser stored in this comment format
>> will mostly get ignored by the conversion process, because the conversion
>> process has no idea where the record came from and therefore what to do with
>> the comments inside it.
>>
>> Your best bet is to read your data out of your database directly as rich
>> sequence objects or, if that's not possible, to do the conversion manually.
>>
>> cheers,
>> Richard
>>
>>
>> 2008/10/17 Franklin Bristow <fbristow at gmail.com>
>>
>>> Hello everyone,
>>> I've been doing some work with swissprot, and I've been needing to make use
>>> of the file reading and writing facilities in biojava.
>>>
>>> I was using biojava 1.5, but I've recently moved to using biojava-live so
>>> that I can actually step through the code to see what's going on.
>>>
>>> I have successfully created an index of my swissprot database and I can read
>>> my sequences out of that indexed database.  All of the appropriate
>>> information is loaded from the records in the file into the appropriate
>>> objects.  I am quite happy with this.
>>>
>>> The problem that I am having has to do with writing swissprot records.
>>>
>>> When I started using biojava, the recommended way to do this was using
>>> SeqIOTools:
>>> SeqIOTools.writeSwissprot(byteStream, swissSequence);
>>>
>>> While this works (i.e. no exceptions are thrown), the record that is printed
>>> to the byteStream looks pretty ugly (it's littered with XX lines) and is not
>>> valid as per the current swissprot file spec
>>> (http://www.expasy.ch/sprot/userman.html).  While this record is invalid, it
>>> does contain all of the information that was originally in the swissprot
>>> file.  I would include what I get as an output here, but it's irrelevant.
>>>
>>> SeqIOTools became deprecated in favour of this:
>>> RichSequence.IOTools.writeUniProt(byteStream, swissSequence, null);
>>>
>>> Once again, while this works (and this time the record is valid), the record
>>> that is printed contains almost none of the original information that is
>>> contained in the swissprot record.  This is the output that I get when I
>>> call this method (the spacing may not look right because of fonts, but
>>> that is not the problem):
>>>
>>> > ID   Q4UVA7_null             STANDARD;         273 AA.
>>> > AC   Q4UVA7;
>>> > DT   null, integrated into UniProtKB/?.
>>> > DT   null, sequence version 0.
>>> > DT   null, entry version 0.
>>> > DE   null.
>>> > FT   any           1    273
>>> > FT   any         153    160
>>> > SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;
>>> >      MSTIRPVFYV SDGTGITAET IGHSLLTQFS GFNFVTDRMS FIDDADKARD AALRVRAAGE
>>> >      RYQVRPVVVN SCVDPQLSMI LAESGALMLD VFAPFIEPLE RELNAPRHSR VGRAHGMVDF
>>> >      ETYHRRINAM NFALSHDDGI ALNYDEADVI LVAVSRAGKT PTCIYLALHY GIRAANYPLT
>>> >      EEDLESERLP PRLRNYRSKL FGLTIDPERL QQIRQERRAN SRYSAAETCR REVATAERMF
>>> >      QMERIPTLST TNTSIEEISS KVLSTLGLQR EMF
>>> > //
>>> >
>>>
>>> But what I am expecting to see looks like this (again, the spacing is the
>>> fault of the font, not the output):
>>>
>>> > ID   Y1953_XANC8             Reviewed;         273 AA.
>>> > AC   Q4UVA7;
>>> > DT   10-JAN-2006, integrated into UniProtKB/Swiss-Prot.
>>> > DT   05-JUL-2005, sequence version 1.
>>> > DT   06-FEB-2007, entry version 12.
>>> > DE   UPF0085 protein XC_1953.
>>> > GN   OrderedLocusNames=XC_1953;
>>> > OS   Xanthomonas campestris pv. campestris (strain 8004).
>>> > OC   Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales;
>>> > OC   Xanthomonadaceae; Xanthomonas.
>>> > OX   NCBI_TaxID=314565;
>>> > RN   [1]
>>> > RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
>>> > RX   PubMed=15899963; DOI=10.1101/gr.3378705;
>>> > RA   Qian W., Jia Y., Ren S.-X., He Y.-Q., Feng J.-X., Lu L.-F., Sun Q.,
>>> > RA   Ying G., Tang D.-J., Tang H., Wu W., Hao P., Wang L., Jiang B.-L.,
>>> > RA   Zeng S., Gu W.-Y., Lu G., Rong L., Tian Y., Yao Z., Fu G., Chen B.,
>>> > RA   Fang R., Qiang B., Chen Z., Zhao G.-P., Tang J.-L., He C.;
>>> > RT   "Comparative and functional genomic analyses of the pathogenicity of
>>> > RT   phytopathogen Xanthomonas campestris pv. campestris.";
>>> > RL   Genome Res. 15:757-767(2005).
>>> > CC   -!- SIMILARITY: Belongs to the UPF0085 family.
>>> > CC   -----------------------------------------------------------------------
>>> > CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
>>> > CC   Distributed under the Creative Commons Attribution-NoDerivs License
>>> > CC   -----------------------------------------------------------------------
>>> > DR   EMBL; CP000050; AAY49016.1; -; Genomic_DNA.
>>> > DR   GenomeReviews; CP000050_GR; XC_1953.
>>> > DR   KEGG; xcb:XC_1953; -.
>>> > DR   GO; GO:0005524; F:ATP binding; IEA:HAMAP.
>>> > DR   HAMAP; MF_01062; -; 1.
>>> > DR   InterPro; IPR005177; DUF299.
>>> > DR   Pfam; PF03618; DUF299; 1.
>>> > KW   ATP-binding; Complete proteome; Nucleotide-binding.
>>> > FT   CHAIN         1    273       UPF0085 protein XC_1953.
>>> > FT                                /FTId=PRO_0000196744.
>>> > FT   NP_BIND     153    160       ATP (Potential).
>>> > SQ   SEQUENCE   273 AA;  30853 MW;  604FB6C6437A9D90 CRC64;
>>> >      MSTIRPVFYV SDGTGITAET IGHSLLTQFS GFNFVTDRMS FIDDADKARD AALRVRAAGE
>>> >      RYQVRPVVVN SCVDPQLSMI LAESGALMLD VFAPFIEPLE RELNAPRHSR VGRAHGMVDF
>>> >      ETYHRRINAM NFALSHDDGI ALNYDEADVI LVAVSRAGKT PTCIYLALHY GIRAANYPLT
>>> >      EEDLESERLP PRLRNYRSKL FGLTIDPERL QQIRQERRAN SRYSAAETCR REVATAERMF
>>> >      QMERIPTLST TNTSIEEISS KVLSTLGLQR EMF
>>> > //
>>> >
>>>
>>> Needless to say, there is a considerable loss of information.
>>>
>>> At first I wasn't sure if this was a problem with parsing the database that
>>> I had, so I inspected the object that was retrieved from the database.  As I
>>> mentioned before, the parsing seems to be working fine.  I get a
>>> SimpleSequence object that has all of the correct annotations and other
>>> information loaded into it.
>>>
>>> I then continued to step through the writeUniProt method in
>>> RichSequence.IOTools and found that this method first calls "enrich" on
>>> the SimpleSequence, which turns it into a SimpleRichSequence.  There appears
>>> to be some loss of information at this point, specifically in the feature
>>> set where the 'key name' is lost -- it just becomes 'any'.
>>>
>>> It is when we get to the actual process of writing to the stream in
>>> UniProtFormat.writeSequence that we have the problems.  All of the code
>>> appears to be there for printing the information out that I'm expecting.  I
>>> think the problem is that in the process of "enrich"-ing the sequence, the
>>> data is still stored in the object, but it is no longer where it is expected
>>> to be.  For example, when we get to writing the comments out:
>>>        // comments - if any
>>>        if (!rs.getComments().isEmpty()) {
>>>
>>> The List of comments IS empty, but there are comments in the
>>> SimpleRichSequence; they are stored in the notes data member.
>>>
>>> So.  After this lengthy explanation of my problem, I am wondering if I am
>>> merely not doing this correctly.  Is there a better way to pass my
>>> information to the writeUniProt method -- should I be transforming my
>>> SimpleSequence objects into a SimpleRichSequence manually?  Am I just going
>>> about this entirely the wrong way?
>>>
>>> If I am going about this correctly and the functionality to do this is
>>> merely not there or hasn't been implemented correctly, I would be more than
>>> happy to help out...  I can supply patches, create bug reports, or anything
>>> else that is necessary.
>>>
>>> Any guidance in this matter would be greatly appreciated!
>>>
>>> --
>>> Franklin
>>>
>>
>>
>>
>> --
>> Richard Holland, BSc MBCS
>> Finance Director, Eagle Genomics Ltd
>> M: +44 7500 438846 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>>
>
>
>
> --
> Franklin
>



-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/
