[EMBOSS] db formatting (?) and parsing issue -- emboss version 5.0.0
George Magklaras
georgios at biotek.uio.no
Wed Jan 28 11:00:14 UTC 2009
Hi list,
We are still at emboss 5.0.0 (plus patches). We have a problem using
seqret to parse normal IDs from a file that we cannot understand. Here
is the story with details:
I have an .fna file from a 454 read in fasta format, that goes typically
like this :
>FLTU7OB01CIMST length=234 xy=0915_0859 region=1 run=R_2008_12_11_14_44_02_
TTTATTATTTAATCAATAATAAAGTGCTTTAGTCAAATCGTGATGTTTCAATTATTAACA
AGTTTATTATTTCTTCATTTTACCATAATACGCTTCAAAACGTCGATGAACATATGAATT
TGAGGGATTTTTGTAACCAGGTTTTATTTTTTAAAAATCATTAAAAAATGGTGAAGTTTC
TCGAATATCGTGTTCAAAATTCAATTCCGAAATAAGTCGCCCCTAATCTGATGA
>FLTU7OB01DL726 length=211 xy=1366_0736 region=1 run=R_2008_12_11_14_44_02_
AAACAGATAGTCAGTATTGAATTACTTTATGTAGAGCCACAATTTAGAAACAGAGGTTTA
GCTACTATACTGAAGTGTGGTATTGAGACTTGGGCAAAAAGTATAAAAGCGAAACAAATC
ATTAGTACAGTACATAAAGACAACGTGACAATGATATCATTGAACAAGCGGTTAGGGTAT
CAATTAAGTCACGTGAAAATGTATAAAGATA
....
Length is:
cat 068_2023_454Reads.fna | grep ^">" | wc -l
288507
I convert this file to EMBL format using seqret and I get a properly
formatted file with the same number of sequence entries:
cat staphyl68.dat | grep "^ID" | wc -l
288507
I now make a btree index of the id field with dbxflat:
$ dbxflat
Database b+tree indexing for flat file databases
Basename for index files: staphyl68
Resource name: staphyl68
EMBL : EMBL
SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
GB : Genbank, DDBJ
REFSEQ : Refseq
Entry format [SWISS]: EMBL
....
Index fields [id,acc]: id
Processing file ./staphyl68.dat
(resource records and db defs also OK)
That seems to produce the right number of files:
tjonasse at dias ~/mrsa/454/068_reads $ ls staphyl68.*
staphyl68.dat staphyl68.ent staphyl68.pxid staphyl68.xid
And here starts the problem: We have an input text file 'ahits' with
sequence IDs per line:
FLTU7OB01AH8CG
FLTU7OB01ASKRR
FLTU7OB01AUXQJ
FLTU7OB01DSL0N
FLTU7OB01BB9NP
(no fancy control characters, checking with od:
0000000 F L T U 7 O B 0 1 A H 8 C G \n F
0000020 L T U 7 O B 0 1 A S K R R \n F L
0000040 T U 7 )
We extract the 'ahits' sequences (1000 sequences) from the emboss
database by doing simply:
for seq in `cat ahits`; do seqret -filter staphyl68-id:$seq; done >
multifasta68.fasta
And that produces exactly a 1000 seq multifasta file.
Now, then, we have a second file called 'bhits' (697 sequences). This
file has exactly the same format as 'ahits', but when we try to extract
the identified sequences, we get the following:
for seq in `cat bhits`; do seqret -filter staphyl68-id:$seq; done
Died: seqret terminated: Bad value for '-sequence' with -auto defined
'rror: Unable to read sequence 'staphyl68-id:FLTU7OB01AJHZO
(one error per sequence ID)
This is wrong. Why? I know that the seq identifiers of 'bhits' are in
the original fna file, the .dat EMBL file and also on the *.xid entry:
cat 068_2023_454Reads.fna | grep FLTU7OB01AJHZO
>FLTU7OB01AJHZO length=276 xy=0104_3906 region=1 run=R_2008_12_11_14_44_02_
cat staphyl68.dat | grep FLTU7OB01AJHZO
ID FLTU7OB01AJHZO; SV 1; linear; unassigned DNA; STD; UNC; 276 BP.
strings staphyl68.xid | grep -i FLTU7OB01AJHZO
fltu7ob01ajhzo
In addition, if I try the single identifier on its own, it works:
seqret staphyl68-id:FLTU7OB01AJHZO
Reads and writes (returns) sequences
output sequence(s) [fltu7ob01ajhzo.fasta]:
cat fltu7ob01ajhzo.fasta
>FLTU7OB01AJHZO FLTU7OB01AJHZO.1 length=276 xy=0104_3906 region=1
run=R_2008_12_11_14_44_02_
TCGAATGATTAATCTTGAAAATAAAACCTTCGTAATTATGGGTATTGCTAATAAACGTAG
TATCGGATTTGGCGTTGCAAAGGTATTAGATCAATTAGGGGCTAAACTTGTTTTCACTTA
TCGTAAAGACCGTAGCCGCAAAGAATTAGAAAAATTATTAGAACAATTAAACCAAGAAGA
GCCAAAATTATATCAAATCGATGTTCAAAAAGATGAAGATGTAGTAAATGGTTTTGCTAA
AATTGGCGAAGAAGTAGGCAATATTGATGGCGTATA
so, my question is:
Why does the filter mode seqret invoked inside the for loop fails and
this one works, and the problem does not exist for the 'afile' but only
the 'bfile'?
Thanks for any answers.
GM
--
--
George Magklaras BSc Hons MPhil
RHCE:805008309135525
Senior Computer Systems Engineer/UNIX-Linux Systems Administrator
EMBnet Technical Management Board
The Biotechnology Centre of Oslo,
University of Oslo
http://folk.uio.no/georgios
More information about the EMBOSS
mailing list