[Biojava-l] Masked regions, X's, No Call's N's

Matthew Pocock mrp@sanger.ac.uk
Thu, 05 Oct 2000 16:48:44 +0100


Kevin,

It turns out that it is easier to think of the special case where a 
symbol is atomic: it represents itself and no other symbol. So, Symbol has 
a getMatches method that returns the alphabet of all symbols that could 
match that one. Atomic symbols always return an alphabet that contains 
only themselves. Alphabets only really contain AtomicSymbol objects, 
but the contains and verify methods will accept an ambiguity symbol iff 
all symbols that match it are also members of the alphabet. This means 
that we can make an 'x' ambiguity symbol that is distinct from 'n' but 
is still recognized by DNA.
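
To make the rule concrete, here is a minimal toy model in plain Java. It
is only a sketch of the idea described above and does not use the BioJava
classes: Sym, Alpha and AmbiguitySketch are hypothetical names invented
for this illustration, and only getMatches and contains echo the method
names mentioned in the text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class AmbiguitySketch {
    /** Toy symbol (hypothetical): a name plus the atomic symbols it can match. */
    static final class Sym {
        final String name;
        private final Set<Sym> matches;

        /** Atomic symbol: matches itself and nothing else. */
        Sym(String name) {
            this.name = name;
            this.matches = Collections.singleton(this);
        }

        /** Ambiguity symbol: matches every atomic symbol passed in. */
        Sym(String name, Sym... atomics) {
            this.name = name;
            this.matches = Collections.unmodifiableSet(
                    new LinkedHashSet<>(Arrays.asList(atomics)));
        }

        Set<Sym> getMatches() {
            return matches;
        }
    }

    /** Toy alphabet (hypothetical): stores atomic symbols only. */
    static final class Alpha {
        private final Set<Sym> atomics;

        Alpha(Sym... atomics) {
            this.atomics = new LinkedHashSet<>(Arrays.asList(atomics));
        }

        /** Accepts a symbol iff everything it could match is a member. */
        boolean contains(Sym s) {
            return atomics.containsAll(s.getMatches());
        }
    }

    static final Sym A = new Sym("a"), G = new Sym("g"),
                     C = new Sym("c"), T = new Sym("t");
    static final Sym U = new Sym("u");            // atomic, but not in DNA
    static final Sym N = new Sym("n", A, G, C, T);
    static final Sym X = new Sym("x", A, G, C, T);
    static final Sym RNA_ANY = new Sym("rna-any", A, G, C, U);
    static final Alpha DNA = new Alpha(A, G, C, T);

    public static void main(String[] args) {
        System.out.println("DNA contains x: " + DNA.contains(X));
        System.out.println("DNA contains rna-any: " + DNA.contains(RNA_ANY));
        System.out.println("x and n are the same object: " + (X == N));
        System.out.println("x and n match the same symbols: "
                + X.getMatches().equals(N.getMatches()));
    }
}
```

In this model 'x' and 'n' are separate objects with equal match sets, so
the alphabet accepts both while code that cares can still tell them apart.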

If you have 1.01 or the latest build, then to add 'X' to the standard DNA 
alphabet you need to edit AlphabetManager.xml and add it back into 
biojava.jar. AlphabetManager.xml is the resource that sets up the 
standard alphabets, including well-known ambiguity symbols. Open it in a 
text editor:

resources/org/biojava/bio/symbol/AlphabetManager.xml

Scan down for the DNA section. You will see that the alphabet element 
for DNA contains four symbolref elements referring to a, g, c and t. This 
is followed by a set of ambiguity elements that map the ambiguity codes 
to a set of symbols.

If you scroll down further, you will find the ambiguity element for 'n'. 
Copy this element entirely and change n to x, e.g. paste in:

<ambiguity>
<symbol>
<short>x</short>
<long>agct</long>
</symbol>
<symbolref name="guanine"/>
<symbolref name="adenine"/>
<symbolref name="cytosine"/>
<symbolref name="thymine"/>
</ambiguity>

You will then need to suck this into biojava.jar using a command line like:

jar -uf biojava.jar -C resources .

From then on, x will be accepted by the DNA parser. It will resolve to 
a symbol that matches all DNA symbols, just like n, but will be a 
distinct object. That is, in the VM x != n, even though they are 
functionally equivalent. This lets you ignore them most of the time, and 
find them if you need to.
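
As an illustration of the "find them if you need to" case, the sketch
below locates runs of a marker character in a raw sequence string. It
works on plain Java strings and is independent of BioJava; the class
name MaskedRuns, the method runsOf and the example sequence are all
made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class MaskedRuns {
    /**
     * Returns one {start, end} pair (end exclusive) for each maximal run
     * of the marker character in seq, ignoring case.
     */
    static List<int[]> runsOf(String seq, char mark) {
        List<int[]> runs = new ArrayList<>();
        char m = Character.toLowerCase(mark);
        int start = -1; // -1 means we are not inside a run
        // Iterate one past the end so a run touching the end is closed.
        for (int i = 0; i <= seq.length(); i++) {
            boolean hit = i < seq.length()
                    && Character.toLowerCase(seq.charAt(i)) == m;
            if (hit && start < 0) {
                start = i;                         // run begins
            } else if (!hit && start >= 0) {
                runs.add(new int[] {start, i});    // run ends
                start = -1;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String seq = "acgtnnacxxxxgtn"; // made-up example sequence
        for (int[] r : runsOf(seq, 'x')) {
            System.out.println("masked (x) run: " + r[0] + ".." + r[1]);
        }
        for (int[] r : runsOf(seq, 'n')) {
            System.out.println("no-call (n) run: " + r[0] + ".." + r[1]);
        }
    }
}
```

Keeping x and n as separate symbols is what makes this kind of query
possible; if the Xs were rewritten to Ns up front, the two kinds of run
could no longer be told apart.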

Hope this helps. If x is used more than occasionally, I can make this 
change to the trunk.

Matthew


Kevin T. Pedretti wrote:

> Hi Matthew,
>   I did some more experimentation last night and wrote a little program
> that output a similar list to the one you gave below.  At the time I sent
> the email, I didn't think there was any notion of ambiguity symbols yet in
> the code -- looking through the list archives you had sent some messages
> about adding an AmbiguitySymbol interface but I didn't see that in the API
> docs.  Anyway, in our lab we use runs of Xs to signify repetitive regions
> and regions of low complexity as detected by programs like seg and
> repeatmasker.  Ns signify no calls.  X=N for all practical purposes, but
> the distinction is that the N's are put in the sequence by the base
> caller while the X's are added at a later stage in the pipeline and
> typically appear in runs.  I also found this on the NCBI BLAST faq:
> 
> Q: After running a search why do I see a string of "X"s (or "N"s) in my
> query sequence that I did not put there? 
> 
> You are seeing the result of automatic filtering of your query for
> low-complexity sequence that is performed to prevent artifactual hits. The
> filter substitutes any low-complexity sequence that it finds with the
> letter "N" in nucleotide sequence (e.g., "NNNNNNNNNNNNN") or the letter
> "X" in protein sequences (e.g., "XXXXXXXXX"). Low-complexity regions can
> result in high scores that reflect compositional bias rather than
> significant position-by-position alignment (Wootton & Federhen, 1996).
> Filter programs can eliminate these potentially confounding matches from
> the blast reports, leaving regions whose BLAST statistics reflect the
> specificity of their pairwise alignment. Queries searched with the blastn 
> program are filtered with DUST. The other BLAST programs use SEG. 
> 
> 
> So maybe the standard protocol is to use Ns... maybe the right thing for
> me to do is preprocess any fasta files to replace Xs with Ns.  Or do you
> think it would be better to add an X to the alphabet?
> 
> Kevin