[Biojava-l] How do I read a FASTA file containing protein sequences in lowercase?

Richard Holland holland at eaglegenomics.com
Fri Nov 13 15:04:27 UTC 2009


I've applied the patch to the trunk of biojava-live. Thanks!

Richard

On 9 Nov 2009, at 16:26, Carl Mäsak wrote:

> Richard (>):
>> Ah OK I see what's going on.
>> 
>> The convenience method you're using, RichSequence.IOTools.readStream(), uses
>> FastaFormat to try and guess the alphabet to use based on the first line of
>> the input sequence.
>> 
>> In FastaFormat, it does this by searching for matching non-DNA symbols. The
>> search is case-sensitive:
>> 
>>        protected static final Pattern aminoAcids =
>>            Pattern.compile(".*[FLIPQE].*");
>> 
>> FastaFormat needs patching to make this pattern non-case-sensitive.
> 
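To illustrate the case-sensitivity problem described above, here is a minimal, self-contained sketch. It uses only java.util.regex and the pattern quoted from FastaFormat; the class name CaseSensitiveGuess and the helper looksLikeProtein() are hypothetical and are not part of BioJava.

```java
import java.util.regex.Pattern;

public class CaseSensitiveGuess {
    // Pattern exactly as quoted from FastaFormat above (case-sensitive).
    static final Pattern AMINO_ACIDS = Pattern.compile(".*[FLIPQE].*");

    // Hypothetical helper: true if the line contains a symbol that is
    // treated as non-DNA, i.e. the line is guessed to be protein.
    static boolean looksLikeProtein(String line) {
        return AMINO_ACIDS.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Uppercase residues contain L and I, so the guess is "protein".
        System.out.println(looksLikeProtein("MKVLAAGIVGLL")); // true
        // The same residues in lowercase are not recognised.
        System.out.println(looksLikeProtein("mkvlaagivgll")); // false
    }
}
```

With this pattern, a lowercase protein sequence gives no protein match, which is consistent with the behaviour reported in this thread.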
> Patch attached.
> 
> I also took the opportunity to remove the occurrences of .* in the
> Pattern above. Generally, one should be using Matcher.find() when one
> is interested in matching a part of a string. This is more efficient
> than using Matcher.matches() and surrounding the desired regular
> expression with .*, since the latter will cause a lot of unnecessary
> backtracking and make the search quadratic.
> 
> This effect only shows up for very long strings, but long strings can
> and do happen in bioinformatics. The measurements below compare the
> two approaches; the version with .* is consistently slower on long
> inputs.
> 
> $ for length in 100 1000 10000 100000 1000000; do (time java WithDotStar $length) 2>&1 | grep real; done
> real	0m0.371s
> real	0m0.367s
> real	0m0.577s
> real	0m2.735s
> real	0m25.275s
> 
> $ for length in 100 1000 10000 100000 1000000; do (time java WithoutDotStar $length) 2>&1 | grep real; done
> real	0m0.309s
> real	0m0.361s
> real	0m0.468s
> real	0m1.184s
> real	0m9.703s
> 
> Kindly,
> // Carl
> <aminoAcids.patch><WithDotStar.java><WithoutDotStar.java>
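
The attached patch itself is not reproduced in this message, so the following is only a sketch of the kind of change described above, assuming it amounts to dropping the surrounding .* and adding Pattern.CASE_INSENSITIVE. The class and method names (FindVersusMatches, oldCheck, newCheck) are hypothetical and are not taken from BioJava or from the attachments.

```java
import java.util.regex.Pattern;

public class FindVersusMatches {
    // Original style: whole-string match with .* on both sides, case-sensitive.
    static final Pattern WITH_DOT_STAR = Pattern.compile(".*[FLIPQE].*");

    // Assumed patched style: bare character class, case-insensitive.
    static final Pattern WITHOUT_DOT_STAR =
        Pattern.compile("[FLIPQE]", Pattern.CASE_INSENSITIVE);

    // matches() requires the pattern to cover the entire input.
    static boolean oldCheck(String line) {
        return WITH_DOT_STAR.matcher(line).matches();
    }

    // find() succeeds as soon as any part of the input matches.
    static boolean newCheck(String line) {
        return WITHOUT_DOT_STAR.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = { "ACGTACGT", "acgtacgt", "MKVLAAGIVGLL", "mkvlaagivgll" };
        for (String line : lines) {
            System.out.println(line + " old=" + oldCheck(line) + " new=" + newCheck(line));
        }
    }
}
```

Under these assumptions the two checks agree on everything except the lowercase protein line, which only the case-insensitive find() version recognises.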

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/




