[EMBOSS] db formatting (?) and parsing issue -- emboss version 5.0.0

Wed Jan 28 11:00:14 UTC 2009

Hi list,

We are still at emboss 5.0.0 (plus patches). We have a problem using 
seqret to parse normal IDs from a file that we cannot understand. Here 
is the story with details:

I have an .fna file from a 454 read in fasta format, that goes typically 
like this :

 >FLTU7OB01CIMST length=234 xy=0915_0859 region=1 run=R_2008_12_11_14_44_02_
TTTATTATTTAATCAATAATAAAGTGCTTTAGTCAAATCGTGATGTTTCAATTATTAACA
AGTTTATTATTTCTTCATTTTACCATAATACGCTTCAAAACGTCGATGAACATATGAATT
TGAGGGATTTTTGTAACCAGGTTTTATTTTTTAAAAATCATTAAAAAATGGTGAAGTTTC
TCGAATATCGTGTTCAAAATTCAATTCCGAAATAAGTCGCCCCTAATCTGATGA
 >FLTU7OB01DL726 length=211 xy=1366_0736 region=1 run=R_2008_12_11_14_44_02_
AAACAGATAGTCAGTATTGAATTACTTTATGTAGAGCCACAATTTAGAAACAGAGGTTTA
GCTACTATACTGAAGTGTGGTATTGAGACTTGGGCAAAAAGTATAAAAGCGAAACAAATC
ATTAGTACAGTACATAAAGACAACGTGACAATGATATCATTGAACAAGCGGTTAGGGTAT
CAATTAAGTCACGTGAAAATGTATAAAGATA
....

Length is:
cat 068_2023_454Reads.fna | grep ^">" | wc -l
288507

I convert this file to EMBL format using seqret and I get a properly 
formatted file with the same number of sequence entries:

cat staphyl68.dat | grep "^ID" | wc -l
288507

I now make a btree index of the id field with dbxflat:
$ dbxflat
Database b+tree indexing for flat file databases
Basename for index files: staphyl68
Resource name: staphyl68
      EMBL : EMBL
     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
        GB : Genbank, DDBJ
    REFSEQ : Refseq
Entry format [SWISS]: EMBL
....
Index fields [id,acc]: id
Processing file ./staphyl68.dat

(resource records and db defs also OK)

That seems to produce the right number of files:

tjonasse at dias ~/mrsa/454/068_reads $ ls  staphyl68.*
staphyl68.dat  staphyl68.ent  staphyl68.pxid  staphyl68.xid

And here starts the problem: We have an input text file 'ahits' with 
sequence IDs per line:

FLTU7OB01AH8CG
FLTU7OB01ASKRR
FLTU7OB01AUXQJ
FLTU7OB01DSL0N
FLTU7OB01BB9NP

(no fancy control characters, checking with od:
0000000   F   L   T   U   7   O   B   0   1   A   H   8   C   G  \n   F
0000020   L   T   U   7   O   B   0   1   A   S   K   R   R  \n   F   L
0000040   T   U   7   )

We extract the  'ahits' sequences (1000 sequences) from the emboss 
database by doing simply:

for seq in `cat ahits`; do seqret -filter staphyl68-id:$seq; done > 
multifasta68.fasta

And that produces exactly a 1000 seq multifasta file.

Now, then, we have a second file called 'bhits' (697 sequences). This 
file has exactly the same format as 'ahits', but when we try to extract 
the identified sequences, we get the following:

for seq in `cat bhits`; do seqret -filter staphyl68-id:$seq; done

Died: seqret terminated: Bad value for '-sequence' with -auto defined
'rror: Unable to read sequence 'staphyl68-id:FLTU7OB01AJHZO
(one error per sequence ID)

This is wrong. Why? I know that the seq identifiers of 'bhits' are in 
the original fna file, the .dat EMBL file and also on the *.xid entry:

cat 068_2023_454Reads.fna | grep FLTU7OB01AJHZO
 >FLTU7OB01AJHZO length=276 xy=0104_3906 region=1 run=R_2008_12_11_14_44_02_

cat staphyl68.dat | grep FLTU7OB01AJHZO
ID   FLTU7OB01AJHZO; SV 1; linear; unassigned DNA; STD; UNC; 276 BP.

strings staphyl68.xid | grep -i FLTU7OB01AJHZO
fltu7ob01ajhzo

In addition, if I try the single identifier on its own, it works:
seqret staphyl68-id:FLTU7OB01AJHZO
Reads and writes (returns) sequences
output sequence(s) [fltu7ob01ajhzo.fasta]:
cat fltu7ob01ajhzo.fasta
 >FLTU7OB01AJHZO FLTU7OB01AJHZO.1 length=276 xy=0104_3906 region=1 
run=R_2008_12_11_14_44_02_
TCGAATGATTAATCTTGAAAATAAAACCTTCGTAATTATGGGTATTGCTAATAAACGTAG
TATCGGATTTGGCGTTGCAAAGGTATTAGATCAATTAGGGGCTAAACTTGTTTTCACTTA
TCGTAAAGACCGTAGCCGCAAAGAATTAGAAAAATTATTAGAACAATTAAACCAAGAAGA
GCCAAAATTATATCAAATCGATGTTCAAAAAGATGAAGATGTAGTAAATGGTTTTGCTAA
AATTGGCGAAGAAGTAGGCAATATTGATGGCGTATA

so, my question is:

Why does the filter mode seqret invoked inside the for loop fails and 
this one works, and the problem does not exist for the 'afile' but only 
the 'bfile'?

Thanks for any answers.

GM

-- 
--
George Magklaras BSc Hons MPhil
RHCE:805008309135525

Senior Computer Systems Engineer/UNIX-Linux Systems Administrator
EMBnet Technical Management Board
The Biotechnology Centre of Oslo,
University of Oslo
http://folk.uio.no/georgios