[EMBOSS] fuzzpro oddity
Steve Taylor
stephen.taylor at imm.ox.ac.uk
Fri Mar 13 15:08:27 UTC 2009
Thanks Peter.
Nice detective work. I wonder if anybody at NCBI/RefSeq is reading this? That is where the source file comes from. I also wonder how BLAST indexing handles not having a >, for example.
Steve
>>
>> I have an odd problem. I am trying to search a multi-fasta set of
>> proteins. If I do:
>>
>> fuzzpro -sequence invertebrate.protein.faa -pattern CXXXXFYPXXXXXW
>> -stdout -auto
>>
>> I get:
>>
>> Error: Sequence is not a protein
>>
>> fuzzpro -sequence invertebrate.protein.faa -pattern CXXXXFYPXXXXX
>> -stdout -auto
>>
>> returns results.
>>
>> Any thoughts why? Is this a cryptic way of saying it can't find the
>> motif or some other problem?
>
>
> Nothing to do with the pattern. There is a strange sequence there:
>
> >gi|170590912|ref|XP_001900215.1| hypothetical protein Bm1_43765 [Brugia malayi]
> MAAQKERLTGDIYJESDIRQKSALSSSATVPSPQMNSQASRSASERQNIWEHRLGIRAPEQNSEQKKYWEYRNIYHIPVP
> QGIEFWEDEDKKRWEMINIGGLDESEANRQIKKAKLQLARERQQENRGSRTPQTTHIFFIISLICFGLQIVLAAICIGFC
> IYQIFNNSQIEAGIAFLLLALMLLIGAAGGIFSALKRSENLAICTAVYNVTSAVGIIVAIINLYSFRVGQSGNLSAFIPI
> AGVVALVQNFNKLS
>
> This sequence has a 'J', which is a mass-spec ambiguity code for "I or L"
> and has somehow crept into a translation (perhaps with an ambiguous
> codon - there are several possibilities).
>
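For anyone who hits the same error, a small scan can show which records contain the code in question. This is only a sketch: the function name `find_residue` is made up for illustration, and it looks for one residue letter (J by default, as in Peter's diagnosis) rather than reproducing whatever check EMBOSS applies internally.

```python
def find_residue(fasta_lines, residue="J"):
    """Yield (header, positions) for each FASTA record containing `residue`.

    Positions are 1-based offsets into the record's sequence.
    This is an illustrative helper, not part of EMBOSS.
    """
    header, offset, hits = None, 0, []
    for raw in fasta_lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new record starts: report the previous one if it had hits.
            if header is not None and hits:
                yield header, hits
            header, offset, hits = line[1:], 0, []
        elif line:
            for i, ch in enumerate(line.upper()):
                if ch == residue:
                    hits.append(offset + i + 1)
            offset += len(line)
    # Report the last record in the input.
    if header is not None and hits:
        yield header, hits


def scan_file(path, residue="J"):
    """Print every record in the FASTA file at `path` that contains `residue`."""
    with open(path) as handle:
        for header, positions in find_residue(handle, residue):
            print(header, positions)
```

Calling `scan_file("invertebrate.protein.faa")` would list the affected records by header, which gives the sequence name that the error message itself does not.
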
> EMBOSS 3.0.0 refuses to read it. EMBOSS 4.0.0 also fails.
>
> EMBOSS 5.0.0 and 6.0.0 understand J and should be able to process it and
> convert it to X
>
> As for the difference between the patterns - they both fail, but without
> the W it gives some results before it reaches the bad sequence.
>
> As you are reporting the results to stdout, it is not so easy to spot ...
> but just before the tail of the report I get the "Error: Sequence is not
> a protein" line (as one is written to stdout and the other to stderr, you
> may not see it in exactly the same place).
>
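To see the report and the error message apart, the two streams can be captured separately. The sketch below is generic: `run_split` is a hypothetical helper name, and it only relies on Python's standard `subprocess` module, so it makes no assumptions about fuzzpro beyond the command line quoted earlier in the thread.

```python
import subprocess


def run_split(cmd):
    """Run `cmd` (a list of arguments) and return (returncode, stdout, stderr).

    stdout and stderr are captured independently, so an error message
    cannot get interleaved with the report text.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


# The command from the original post, split into an argument list.
FUZZPRO_CMD = [
    "fuzzpro",
    "-sequence", "invertebrate.protein.faa",
    "-pattern", "CXXXXFYPXXXXXW",
    "-stdout", "-auto",
]
```

With EMBOSS installed, `run_split(FUZZPRO_CMD)` should return the report in the second element and any "Error: ..." line in the third, assuming the report goes to stdout and the error to stderr as described above.
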
> Solutions are: edit the J (and any others) to X in the file,
> and of course update your EMBOSS installation to 6.0.0.
>
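The first workaround (editing J to X) can be scripted so that header lines are left alone. This is a minimal sketch: `mask_residue` and `mask_file` are names invented here, the output filename in the usage note is hypothetical, and the code assumes a plain FASTA file where every header line starts with ">".

```python
def mask_residue(fasta_lines, residue="J", replacement="X"):
    """Yield FASTA lines with `residue` replaced in sequence lines only.

    Header lines (starting with '>') pass through unchanged, so a 'J'
    in a protein description is not touched.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            yield line
        else:
            masked = line.replace(residue.upper(), replacement.upper())
            yield masked.replace(residue.lower(), replacement.lower())


def mask_file(src, dst, residue="J", replacement="X"):
    """Write a masked copy of the FASTA file `src` to `dst`."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(mask_residue(fin, residue, replacement))
```

For example, `mask_file("invertebrate.protein.faa", "invertebrate.protein.masked.faa")` would write a copy that older EMBOSS versions should be able to read, at least as far as the J code is concerned. Writing to a new file keeps the original download intact.
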
> As to the missing second message about the bad sequence ... if this were
> the only sequence in the file it would issue a message because it can read
> nothing. When reading through a file with many sequences it assumes the
> first failure is end of file. We need to do something about that - such
> as adding the sequence name to the message so you know where it stopped.
>
> Thanks for the report - it was fun to look into.
>
> Peter
>
>
>