[Biojava-dev] [Bug 3137] New: RichSequence.IOTools.readEMBLProtein() fails on EMBL patent protein entries.

bugzilla-daemon at portal.open-bio.org
Thu Sep 23 04:35:54 UTC 2010


http://bugzilla.open-bio.org/show_bug.cgi?id=3137

           Summary: RichSequence.IOTools.readEMBLProtein() fails on EMBL
                    patent protein entries.
           Product: BioJava
           Version: unspecified
          Platform: Macintosh
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: seq.io
        AssignedTo: biojava-dev at biojava.org
        ReportedBy: dkoellhofer at gmail.com


Hi,

I'm trying to parse EMBL-formatted files with
RichSequence.IOTools.readEMBLProtein(), but the pattern for the ID lines
doesn't match.

Looks like the parser utilises the EMBLFormat class with the following ID
pattern:
protected static final Pattern lp =
Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

The ID lines in my files (retrieved from EMBL-EBI) look like:
ID   A00197; SV 1; linear; protein; PRT; SYN; 602 AA.

Looks like the pattern is written specifically for DNA/RNA entries, whose ID
lines end in "BP.", and should look more like:

protected static final Pattern lp =
Pattern.compile("^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");
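
A quick way to check the difference is to run both patterns against the ID
line above. This is only a minimal standalone sketch, not BioJava code: the
class name EmblIdLineCheck and the constant names are made up for
illustration, and it assumes the parser strips the leading "ID" tag before
applying the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone check; not part of BioJava.
public class EmblIdLineCheck {

    // Pattern as quoted from EMBLFormat: only accepts a length given in "BP".
    static final Pattern ORIGINAL = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
        + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    // Proposed pattern: accepts a length given in either "BP" or "AA".
    static final Pattern PROPOSED = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
        + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

    // ID line from the patent protein file, without the leading "ID" tag
    // (assumption: the tag is removed before the pattern is applied).
    static final String PROTEIN_ID = "A00197; SV 1; linear; protein; PRT; SYN; 602 AA.";

    public static void main(String[] args) {
        System.out.println("original matches: " + ORIGINAL.matcher(PROTEIN_ID).matches());

        Matcher m = PROPOSED.matcher(PROTEIN_ID);
        System.out.println("proposed matches: " + m.matches());
        if (m.matches()) {
            // Groups 1-7 keep their positions; group 8 is the new length unit.
            System.out.println("accession = " + m.group(1));
            System.out.println("molecule  = " + m.group(4));
            System.out.println("length    = " + m.group(7) + " " + m.group(8));
        }
    }
}
```

The original pattern rejects the line and the proposed one accepts it. Note
that the proposed pattern adds an eighth capture group for the unit, so any
code reading groups 1 to 7 is unaffected.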

The failing protein sequences come from the EMBL patent protein database:
ftp://ftp.ebi.ac.uk/pub/databases/embl/patent/epo_prt.dat.gz

Cheers,

Deniz


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


