[Biojava-dev] Parsing protein EMBL files with RichSequence.IOTools.readEMBLProtein

Deniz Koellhofer deniz.koellhofer at cambia.org
Tue Sep 21 22:59:27 UTC 2010


Hi,

I'm trying to parse EMBL formatted files
with RichSequence.IOTools.readEMBLProtein(), but the pattern for the ID
lines doesn't match.

Looks like the parser utilises the EMBLFormat class with the following ID
pattern:

protected static final Pattern lp = Pattern.compile(
    "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

The ID lines in my files (retrieved from EMBL-EBI) look like this:

ID   A00197; SV 1; linear; protein; PRT; SYN; 602 AA.

It looks like the pattern is written specifically for DNA/RNA and should
look more like:

protected static final Pattern lp = Pattern.compile(
    "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");
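
The difference between the two patterns can be checked outside BioJava with
plain java.util.regex. This is only a minimal sketch: the class name
EmblIdLineCheck and the constant names are made up for illustration, and it
assumes the parser hands the pattern the text that follows the leading "ID"
tag, which is not confirmed from the EMBLFormat source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone check; not part of BioJava.
public class EmblIdLineCheck {

    // ID pattern as quoted from EMBLFormat: only accepts a length in BP.
    static final Pattern ORIGINAL = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
        + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    // Proposed pattern: accepts a length in either BP or AA.
    static final Pattern PROPOSED = Pattern.compile(
        "^(\\S+);\\s+SV\\s+(\\d+);\\s+(linear|circular);\\s+(\\S+\\s?\\S+?);"
        + "\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+(BP|AA)\\.$");

    // Assumption: the "ID   " tag has already been stripped by the parser.
    static final String PROTEIN_ID =
        "A00197; SV 1; linear; protein; PRT; SYN; 602 AA.";

    public static void main(String[] args) {
        // The original pattern rejects the protein ID line.
        System.out.println(ORIGINAL.matcher(PROTEIN_ID).matches()); // false

        // The proposed pattern accepts it and exposes the unit as group 8.
        Matcher m = PROPOSED.matcher(PROTEIN_ID);
        if (m.matches()) {
            // prints "A00197 602 AA"
            System.out.println(m.group(1) + " " + m.group(7) + " " + m.group(8));
        }
    }
}
```

Note that the proposed pattern adds an eighth capture group for the unit, so
any code that reads groups by index after the length would need to be checked.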

Or am I using the wrong RichSequence.IOTools function?

Cheers,

Deniz
-- 
Deniz Koellhofer
Cambia
Initiative for Open Innovation (IOI)
Cambia at QUT, G301, 2 George Street, Brisbane Qld 4000, Australia


