[Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein

George Waldon gwaldon at geneinfinity.org
Thu Sep 23 04:23:48 UTC 2010


Curious, this seems to be the only place to find this type of file. It is not really an official format, a little bit like GenPept. Your fix should probably work. Can you file a bug on Bugzilla (http://bugzilla.open-bio.org/)?

Best,
George

On Wed, Sep 22, 2010 at 4:10 PM, Deniz Koellhofer <deniz.koellhofer at cambia.org> wrote:

    Hi George,

    This entry is from the EMBL patent protein database: ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz

    Have you used the RichSequence.IOTools successfully for parsing EMBL protein files before? I assume this should always fail due to the "BP" in the regex?

    Deniz

    On Thu, Sep 23, 2010 at 7:18 AM, George Waldon <gwaldon at geneinfinity.org> wrote:

        Hi Deniz:

        I have a quick question that may be obvious, but which database do you get those protein files from?

        Thank you,

        George


        On Tue, Sep 21, 2010 at 3:59 PM, Deniz Koellhofer <deniz.koellhofer at cambia.org> wrote:

           Hi,

           I'm trying to parse EMBL formatted files
           with RichSequence.IOTools.readEMBLProtein(), but the pattern for the ID
           lines doesn't match.

           Looks like the parser utilises the EMBLFormat class with the following ID
           pattern:

           protected static final Pattern lp = Pattern.compile(
           "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$"
           );

           The ID lines in my files (retrieved from EMBL-EBI) look like:

           ID   A00197; SV 1; linear; protein; PRT; SYN; 602 AA.

           Looks like the pattern is written specifically for DNA/RNA and should look
           more like:

           protected static final Pattern lp = Pattern.compile(
           "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$"
           );
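
A minimal Java sketch for checking both patterns outside of BioJava. It assumes the parser has already stripped the leading "ID   " tag before applying the pattern (both patterns are anchored at ^ and would not match the tag itself). The class name EmblIdLineCheck, the helper unit(), and the nucleotide sample line are illustrative and not part of BioJava.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmblIdLineCheck {
    // ID-line pattern quoted above from EMBLFormat: accepts only a "BP" length unit.
    static final Pattern ORIGINAL = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
        + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    // Proposed pattern: accepts "BP" or "AA" and captures the unit as group 8.
    static final Pattern PROPOSED = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
        + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

    // Returns the length unit ("BP" or "AA") if the proposed pattern matches, else null.
    static String unit(String idLineBody) {
        Matcher m = PROPOSED.matcher(idLineBody);
        return m.matches() ? m.group(8) : null;
    }

    public static void main(String[] args) {
        // ID line from the patent protein file, with the "ID   " tag removed (assumption).
        String protein = "A00197; SV 1; linear; protein; PRT; SYN; 602 AA.";
        // Illustrative nucleotide ID line in the same layout.
        String nucleotide = "X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.";

        for (String line : new String[] {protein, nucleotide}) {
            System.out.println(line);
            System.out.println("  original pattern matches: " + ORIGINAL.matcher(line).matches());
            System.out.println("  proposed pattern matches: " + PROPOSED.matcher(line).matches());
            System.out.println("  length unit: " + unit(line));
        }
    }
}
```

The original pattern rejects the protein line and the proposed one accepts both lines. Because the new (BP|AA) group is appended at the end, groups 1 to 7 keep their numbers, so code that reads the existing groups should not need to change.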

           Or am I using the wrong RichSequence.IOTools function?

           Cheers,

           Deniz
           --
           Deniz Koellhofer
           Cambia
           Initiative for Open Innovation (IOI)
           Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia
           _______________________________________________
           biojava-dev mailing list
           biojava-dev at lists.open-bio.org
           http://lists.open-bio.org/mailman/listinfo/biojava-dev






