BLAST, X vs. U, and EMBOSS

David Mathog mathog at mendel.bio.caltech.edu
Tue Jun 4 15:36:21 UTC 2002


In nr there is an entry with gi= 2018236 which ends with:

gcxg

When nr is made into a BLAST database with

%  formatdb -i nr -p T -o T

the database can be dumped out again with

% fastacmd -d nr -D

The problem is when it comes out that sequence fragment
is now

GCUG

This causes problems when piping the output of fastacmd into 
fuzzpro (for instance) because fuzzpro complains that this sequence
is not a protein -  because of the U.  This can be verified by
inserting a 

 tr 'U' 'X'

into the data stream, which lets fuzzpro digest that particular entry.

So what to do?  Modify EMBOSS to add a switch to allow "U" to be part
of the protein alphabet or lobby the NCBI to use 'X' for unknown in
blast databases?  I lean pretty strongly towards the latter option,
although
getting the NCBI to change this may be difficult.

In this regard it's also interesting that there is another problem with
any
sequence containing an X (U) in a BLAST database.  To see what
happens BLAST the NCBI sequence (with the x) against the BLAST
database containing the same entry.

The sequence alignment terminates at the C before the X/U.  That isn't
very informative because BLAST alignments terminate there no matter what
letter I substitute in for X (ie, Y,L,etc.)  So try another sequence
containing a U, this time towards the middle of the sequence, for
instance
emb|CAC39234.1|  (AJ312124) FdhA-I protein.  BLAST that against
nr and you find this odd alignment:

LX-H
L  H
L-UH

and it does that even if the query uses a U instead of an X. Here's
the relevant fragment if you'd like to verify this for yourself:
YLFQKLLRAVVGTNNVDHCARLXHASTVAGLATTLGSGAMT

Now, what happens here if the query sequence is mutated X->Y?
The exact same alignment, with X->Y. 

Are we all agreed that this is a BLAST bug?  X against any
valid amino acid or against  X should  be a mismatch (since
without further information there's a 19/20 probability
it is a mismatch and only a 1/20 change it is a match) but there's no
reason to insert gaps in the alignment at this point.  So it should have
been:

LXH
L H
LUH

This was with blast versions 2.2.2 (local) and 2.2.3 (NCBI server).

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech




More information about the EMBOSS mailing list