[Biojava-l] How do I read a FASTA file containing protein sequences in lowercase?

Carl Mäsak cmasak at gmail.com
Mon Nov 9 16:26:00 UTC 2009


Richard (>):
> Ah OK I see what's going on.
>
> The convenience method you're using, RichSequence.IOTools.readStream(), uses
> FastaFormat to try and guess the alphabet to use based on the first line of
> the input sequence.
>
> In FastaFormat, it does this by searching for matching non-DNA symbols. The
> search is case-sensitive:
>
>        protected static final Pattern aminoAcids = Pattern.compile(".*[FLIPQE].*");
>
> FastaFormat needs patching to make this pattern non-case-sensitive.

Patch attached.
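
The attachment was scrubbed from the archive, so the following is not the patch itself. It is only a minimal sketch of the case-sensitivity problem Richard describes, assuming the pattern is applied to the first line of sequence data. The class name AminoAcidGuess and the method looksLikeProtein are hypothetical and do not exist in BioJava.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not BioJava code: isolates the case-sensitivity issue.
class AminoAcidGuess {
    // The pattern as quoted above: only uppercase F, L, I, P, Q or E count.
    static final Pattern CASE_SENSITIVE = Pattern.compile(".*[FLIPQE].*");

    // Same pattern with the CASE_INSENSITIVE flag, so lowercase residues count too.
    static final Pattern CASE_INSENSITIVE =
            Pattern.compile(".*[FLIPQE].*", Pattern.CASE_INSENSITIVE);

    // True if the line contains at least one symbol that the pattern treats as non-DNA.
    static boolean looksLikeProtein(Pattern pattern, String firstSequenceLine) {
        return pattern.matcher(firstSequenceLine).matches();
    }

    public static void main(String[] args) {
        String lowercaseProtein = "mkvlaaqipf"; // made-up example line
        System.out.println(looksLikeProtein(CASE_SENSITIVE, lowercaseProtein));
        System.out.println(looksLikeProtein(CASE_INSENSITIVE, lowercaseProtein));
    }
}
```

With the case-sensitive pattern the lowercase line is not recognised as protein, while the case-insensitive one recognises it. A lowercase DNA line such as "acgtacgt" is rejected by both.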

I also took the opportunity to remove the occurrences of .* in the
Pattern above. Generally, one should use Matcher.find() when one
is interested in matching part of a string. This is more efficient
than using Matcher.matches() and surrounding the desired regular
expression with .*, since the latter causes a lot of unnecessary
backtracking and can make the search considerably slower.
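
As an illustration of the two styles, here is a small sketch, again with hypothetical names (FindVersusMatches, viaMatches, viaFind) that are not taken from the patch. It assumes the input is a single line: since . does not match line terminators by default, the .* form would reject a string containing a newline, whereas find() would still succeed.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the two ways to ask "does the string contain one of these symbols?"
class FindVersusMatches {
    // matches() must consume the whole string, hence the surrounding .*
    static final Pattern WITH_DOT_STAR =
            Pattern.compile(".*[FLIPQE].*", Pattern.CASE_INSENSITIVE);

    // find() searches for the pattern anywhere in the string, so no .* is needed
    static final Pattern WITHOUT_DOT_STAR =
            Pattern.compile("[FLIPQE]", Pattern.CASE_INSENSITIVE);

    static boolean viaMatches(String s) {
        return WITH_DOT_STAR.matcher(s).matches();
    }

    static boolean viaFind(String s) {
        return WITHOUT_DOT_STAR.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"acgtacgt", "mkvlaaqipf", "ACGTL"}) {
            System.out.println(s + " " + viaMatches(s) + " " + viaFind(s));
        }
    }
}
```

For single-line input both methods give the same answer. The find() version simply scans forward and stops at the first hit.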

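This effect only shows up for very long strings, but long strings can
and do happen in bioinformatics. The measurements below compare the two
approaches for lengths from 100 to 1000000. At the largest length the
.* version (WithDotStar) takes about 2.6 times as long as the find()
version (WithoutDotStar).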

$ for length in 100 1000 10000 100000 1000000; do (time java WithDotStar $length) 2>&1 | grep real; done
real	0m0.371s
real	0m0.367s
real	0m0.577s
real	0m2.735s
real	0m25.275s

$ for length in 100 1000 10000 100000 1000000; do (time java WithoutDotStar $length) 2>&1 | grep real; done
real	0m0.309s
real	0m0.361s
real	0m0.468s
real	0m1.184s
real	0m9.703s
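
The two test programs were scrubbed from the archive along with the patch. For anyone who wants to repeat the comparison, the following is a rough sketch of a similar measurement, not a reconstruction of the attachments. The class name DotStarTiming, the helper dnaLike and the choice of a repeating "acgt" input with no match are all assumptions. It times the regex calls inside one JVM, so the figures will not include JVM startup, unlike the `time java ...` runs above.

```java
import java.util.regex.Pattern;

// Hypothetical timing harness; not the scrubbed WithDotStar/WithoutDotStar programs.
class DotStarTiming {
    static final Pattern WITH_DOT_STAR = Pattern.compile(".*[FLIPQE].*");
    static final Pattern WITHOUT_DOT_STAR = Pattern.compile("[FLIPQE]");

    // Builds a DNA-like string of the given length; it never matches [FLIPQE],
    // so both approaches have to examine the whole input.
    static String dnaLike(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append("acgt".charAt(i % 4));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int length = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        String input = dnaLike(length);

        long start = System.nanoTime();
        boolean viaMatches = WITH_DOT_STAR.matcher(input).matches();
        long middle = System.nanoTime();
        boolean viaFind = WITHOUT_DOT_STAR.matcher(input).find();
        long end = System.nanoTime();

        System.out.println("matches() with .*: " + viaMatches
                + " in " + (middle - start) / 1000 + " us");
        System.out.println("find() without .*: " + viaFind
                + " in " + (end - middle) / 1000 + " us");
    }
}
```

A single run like this is only indicative, since JIT warm-up affects whichever call runs first. Repeating each call several times and discarding the first few results would give steadier numbers.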

Kindly,
// Carl
-------------- next part --------------
A non-text attachment was scrubbed...
Name: aminoAcids.patch
Type: application/octet-stream
Size: 1995 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20091109/d9122878/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: WithDotStar.java
Type: application/octet-stream
Size: 634 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20091109/d9122878/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: WithoutDotStar.java
Type: application/octet-stream
Size: 633 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biojava-l/attachments/20091109/d9122878/attachment-0002.obj>

