[Biojava-l] case-sensitive sequences

Wed Feb 28 19:56:36 UTC 2007

Yes, there is one.

I am writing a small program that my coworker will use. First it 
downloads some repeat-masked sequences, then it finds the cut sites 
with RestrictionSiteFinder and exports all the fragments. The fragments 
should be repeat-masked (case-sensitive) too. Once I have the cut 
positions, I'll extract the fragments with SequenceTools.subSequence() 
and write them out.

Here is the code sample:

// Read the first sequence with a case-preserving (soft-masked) DNA alphabet.
BufferedReader br = new BufferedReader(new FileReader("aSeq.fasta"));

Alphabet maskeddna = SoftMaskedAlphabet.getInstance(DNATools.getDNA());
SymbolTokenization dnaParser = maskeddna.getTokenization("token");

RichSequenceIterator iter =
        RichSequence.IOTools.readFasta(br, dnaParser, null);
RichSequence seq = iter.nextRichSequence();

// Map MseI sites on the soft-masked sequence; the exception below comes from here.
SimpleThreadPool threadPool = new SimpleThreadPool();
RestrictionEnzyme enzyme = RestrictionEnzymeManager.getEnzyme("MseI");
RestrictionMapper mapper = new RestrictionMapper(threadPool);

mapper.addEnzyme(enzyme);
mapper.annotate(seq);

This throws:

Exception in thread "Thread-3" java.lang.UnsupportedOperationException: Ambiguity should be handled at the level of the wrapped Alphabet
        at org.biojava.bio.symbol.SoftMaskedAlphabet.getAmbiguity(SoftMaskedAlphabet.java:183)
        at org.biojava.bio.symbol.AlphabetManager.getAllSymbols(AlphabetManager.java:223)
        at org.biojava.bio.seq.io.SymbolListCharSequence.<init>(SymbolListCharSequence.java:75)
        at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:73)
        at org.biojava.utils.SimpleThreadPool$PooledThread.run(SimpleThreadPool.java:295)
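
A possible workaround, sketched here in plain Java rather than with the BioJava API: since the masking only matters for the output, the sites can be searched on an upper-cased copy and the fragments taken from the original, case-preserving string at the same offsets. This follows the advice quoted further down of giving the standard routines a plain-DNA copy. The class SoftMaskedCutter and the method cutFragments are hypothetical names, not BioJava classes. The sketch assumes MseI recognises TTAA and cuts after the first T, scans one strand of a linear sequence only, and does not handle ambiguity symbols.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of BioJava. */
final class SoftMaskedCutter {

    /**
     * Cuts a soft-masked sequence at every occurrence of a recognition site.
     * The search runs on an upper-cased copy; the fragments are taken from
     * the original string, so lower-case (masked) regions are preserved.
     *
     * @param maskedSeq sequence as read from the file, in mixed case
     * @param site      recognition site, e.g. "TTAA" (assumed for MseI)
     * @param cutOffset number of bases of the site that lie left of the cut
     * @return the fragments in order; joined together they equal maskedSeq
     */
    static List<String> cutFragments(String maskedSeq, String site, int cutOffset) {
        // Case-insensitive view used only for searching.
        String plain = maskedSeq.toUpperCase(Locale.ROOT);
        String plainSite = site.toUpperCase(Locale.ROOT);

        List<String> fragments = new ArrayList<>();
        int fragmentStart = 0;
        int hit = plain.indexOf(plainSite);
        while (hit >= 0) {
            int cut = hit + cutOffset;
            // Slice the ORIGINAL string so the case information survives.
            fragments.add(maskedSeq.substring(fragmentStart, cut));
            fragmentStart = cut;
            hit = plain.indexOf(plainSite, hit + 1);
        }
        // Whatever follows the last cut is the final fragment.
        fragments.add(maskedSeq.substring(fragmentStart));
        return fragments;
    }
}
```

For example, cutFragments("TAATAACgagaggttaaCC", "TTAA", 1) gives the fragments TAATAACgagaggt and taaCC, so the lower-case masking survives the cut.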

Mark Schreiber wrote:
> Hi -
>
> Is there any reason why you need to be running the restriction finder
> over the soft masked sequence?
>
> Can you post some example code to replicate the bug/annoyance?
>
> If you think this is a genuine bug then please submit a biojava bug
> report to http://bugzilla.open-bio.org/
> Please also include the example code that demonstrates the bug.
>
> Thanks.
>
> - Mark
>
> On 2/28/07, Ilhami Visne <ilhami.visne at gmail.com> wrote:
>> I've changed my code and called the RestrictionSiteFinder with the new
>> sequence. It threw this exception.
>>
>> Exception in thread "Thread-25" java.lang.UnsupportedOperationException: Ambiguity should be handled at the level of the wrapped Alphabet
>>         at org.biojava.bio.symbol.SoftMaskedAlphabet.getAmbiguity(SoftMaskedAlphabet.java:183)
>>         at org.biojava.bio.symbol.AlphabetManager.getAllSymbols(AlphabetManager.java:223)
>>         at org.biojava.bio.seq.io.SymbolListCharSequence.<init>(SymbolListCharSequence.java:75)
>>         at org.biojava.bio.molbio.RestrictionSiteFinder.run(RestrictionSiteFinder.java:73)
>>         at org.biojava.utils.SimpleThreadPool$PooledThread.run(SimpleThreadPool.java:295)
>>
>> I understand why it didn't work (lower-case symbol 'a' versus upper-case
>> symbol 'A'), but I can't find a solution. Any ideas?
>>
>> On 2/28/07, ilhami visne <ilhami.visne at gmail.com> wrote:
>> > Thank you. It works now. I should have been able to find it myself,
>> > but I am really not a bioinformatician yet.
>> >
>> > My code (in case someone else has the same problem):
>> >
>> > BufferedReader br = new BufferedReader(new FileReader("seq.fasta"));
>> >
>> > Alphabet dna = SoftMaskedAlphabet.getInstance(DNATools.getDNA());
>> > SymbolTokenization dnaParser = dna.getTokenization("token");
>> >
>> > RichSequenceIterator iter =
>> > RichSequence.IOTools.readFasta(br,dnaParser,null);
>> > RichSequence rs = iter.nextRichSequence();
>> >
>> > Mark Schreiber wrote:
>> > > Hi -
>> > >
>> > > There are also the classes: SoftMaskedAlphabet and
>> > > SoftMaskedAlphabet.CaseSensitiveTokenization and
>> > > SoftMaskedAlphabet.MaskingDetector. Together these classes let you
>> > > read a sequence that contains case sensitive information and (if you
>> > > wish) make use of that information. You can also write out the
>> > > sequence in the original case sensitive format.
>> > >
>> > > It was originally designed for reading data that had been 'softmasked'
>> > > for low complexity regions (e.g. lower-case regions are low complexity
>> > > and would be ignored in subsequent analysis), but it could be used for
>> > > quality or any other distinction.
>> > >
>> > > - Mark
>> > >
>> > > On 2/28/07, ilhami visne <ilhami.visne at gmail.com> wrote:
>> > >> Thank you for the quick answer. Here is part of my code:
>> > >>
>> > >> BufferedReader br = new BufferedReader(new FileReader("seq.fasta"));
>> > >> RichSequenceIterator iter = RichSequence.IOTools.readFastaDNA(br, null);
>> > >> RichSequence rs = iter.nextRichSequence();
>> > >>
>> > >> Richard Holland wrote:
>> > >> > -----BEGIN PGP SIGNED MESSAGE-----
>> > >> > Hash: SHA1
>> > >> >
>> > >> > DNA is not case-sensitive. What I suspect you are parsing is the
>> > >> > output of some sequencing software which is using case as a rough
>> > >> > indicator of base calling quality?
>> > >> >
>> > >> > The case will have been lost when the file was parsed, not at the
>> > >> > moment you iterate over the resulting sequences. This means that you
>> > >> > have to modify your file parsing method to become case-sensitive.
>> > >> >
>> > >> > The default DNA alphabet is not case-sensitive. It makes no
>> > >> > distinction between the two, and will convert everything to one case.
>> > >> >
>> > >> > If you need to preserve case, you will need to use a custom alphabet
>> > >> > which treats the cases differently, and also specify a tokenizer
>> > >> > which is case-sensitive. See the help pages at http://biojava.org/
>> > >> > for help on creating new alphabets. Or, have a look at the
>> > >> > ABITools.QUALITY alphabet in BioJava, which interprets the case and
>> > >> > stores the quality scores separately.
>> > >> >
>> > >> > Note however that your custom alphabet is NOT the same as the
>> > >> > original DNA alphabet, and so you may not be able to use it in all
>> > >> > the standard transforms (RNA etc.). If you do want to use these then
>> > >> > you will have to make a second copy of each sequence using the normal
>> > >> > DNA alphabet and pass that copy to the routines.
>> > >> >
>> > >> > If you post to this list the code you are using to read the file,
>> > >> > then I can show you where to insert the reference to this new
>> > >> > alphabet.
>> > >> >
>> > >> > cheers,
>> > >> > Richard
>> > >> >
>> > >> > Ilhami Visne wrote:
>> > >> >
>> > >> >> My sequence files contain case-sensitive symbols (TAATAACgagagg)
>> > >> >> and I am currently using RichSequenceIterator to iterate over the
>> > >> >> sequences.
>> > >> >>
>> > >> >> How can I tell BioJava to parse them case-sensitively? If I call
>> > >> >> the seq.seqString() method, it should return the sequence exactly
>> > >> >> as it was in the file, with upper and lower case.
>> > >> >>
>> > >> >> Thanks.
>> > >> >> _______________________________________________
>> > >> >> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> > >> >> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> > >> >>
>> > >> >>
>> > >> > -----BEGIN PGP SIGNATURE-----
>> > >> > Version: GnuPG v1.4.2.2 (GNU/Linux)
>> > >> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>> > >> >
>> > >> > iD8DBQFF5Etv4C5LeMEKA/QRAnGBAJ45eeQhmb4AT0CLTQCVyn5HxFS/cQCfXXgv
>> > >> > uZKlrdE8y6vMfKcOlm9yBZA=
>> > >> > =2VZC
>> > >> > -----END PGP SIGNATURE-----
>> > >> >
>> > >> >