[Bioperl-l] SeqIO PIR format broken
Hilmar Lapp
hilmarl@yahoo.com
Sun, 01 Apr 2001 12:35:13 -0700
Bug report #876 by Kris states that SeqIO::pir is unable to read
back in files it produced itself. As it turns out, this module
can't read PIR files it did not produce itself either. In other
words, it is completely broken.
As a first step towards a remedy, I've added a test input file
t/seqfile.pir. The pir parser doesn't choke on it, but it reads
only the first entry, and even then misses the sequence itself.
I have a few questions, some of which stem from the fact that I'm not
really familiar with the PIR format (that is, I hope others on the
list are).
1) I can't believe anyone is sensibly using this module (unless
he/she hacked it), yet no one has complained so far (except Kris). Do
we want to support this parser; i.e., is anyone interested in
using it?
2) The write_seq() method of SeqIO::pir prints according to the
following syntax ('<acc>' being the accession number, '<sequence>'
a multi-line sequence):
>P1;<acc1>
description
>P1;<acc1>
<sequence for acc1>
>P1;<acc2>
and so forth. The files I could download from PIR differ from
this format in the following points:
a) The '>P1;<acc>' is *not* repeated for the sequence. Instead,
the sequence follows directly after the description line.
b) Consequently, there is no empty line if there is a description.
Did the PIR format change some time ago, or am I missing something?
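To make points a) and b) concrete, here is a minimal sketch of a reader for the layout seen in the downloaded files: one '>XX;<acc>' header line, one description line, then sequence lines up to the next header. It is written in Python for illustration only and is not the SeqIO::pir code; the names PirEntry and parse_pir are hypothetical, and it assumes every entry has exactly one description line and that a trailing '*' ends the sequence.

```python
import re
from typing import Iterator, NamedTuple


class PirEntry(NamedTuple):
    """One entry as laid out in the downloaded files (hypothetical container)."""
    seq_type: str     # two-character code before the semicolon, e.g. "P1" or "F1"
    accession: str
    description: str
    sequence: str     # raw sequence text, punctuation kept, trailing "*" removed


# Header line such as ">P1;CCST": a two-character code, a semicolon, an accession.
HEADER = re.compile(r"^>([A-Za-z0-9]{2});(\S+)\s*$")


def parse_pir(text: str) -> Iterator[PirEntry]:
    """Yield entries, assuming header, one description line, then sequence lines."""
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        match = HEADER.match(lines[i])
        if not match:
            i += 1          # skip anything before the first header
            continue
        seq_type, accession = match.groups()
        # Assumption: the line right after the header is always the description.
        description = lines[i + 1].strip() if i + 1 < len(lines) else ""
        i += 2
        chunks = []
        # Everything up to the next header is treated as sequence.
        while i < len(lines) and not HEADER.match(lines[i]):
            chunks.append(lines[i].strip())
            i += 1
        sequence = "".join(chunks)
        if sequence.endswith("*"):
            sequence = sequence[:-1]   # drop the terminating asterisk
        yield PirEntry(seq_type, accession, description, sequence)
```

Note that this sketch would not read the output of the current write_seq(), where the header is repeated before the sequence; handling both layouts would need an extra check for a second header carrying the same accession.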
3) PIR sequences can contain some odd punctuation whose meaning
is not clear to me yet. Look at e.g.
>P1;CCST
cytochrome c - snapping turtle (tentative sequence)
GDVEK.GKKIF.VQKCAQCHTVEKGGKH.KTGPNLNGL.IGRKTGQAEGF.SYTEANKN.KGITWG.EETLM.EY.LENPKKY.IPGTKM.IF.AGIKKKAERADL.IAY.LKDATSK*
>P1;CCFG
cytochrome c - bullfrog (tentative sequence)
GDVEKGKKIF(V,Q.K.C.A.Q.C.H.T.C,E.K.G.G.K.H)KVGPNLYGLIGRKTGQAAGFSYTDANKNKGITW(G.E,D,T.L.M.E.Y)LENPKKYIPGTKMIFAGI(K.K.K.G.E.R.Q)DLIAY(L.K.S,A,C,S,K)*
>F1;C44264
ALL-1/AF-4 clone 25 mutant fusion protein - human (fragment)
/EKPPPVNKQENAGTLNIFSTLSNGNSSKQKIPADGVHRIRVDFKTYSNEVHCVEEILKEMTHSWPPPLTAIHTPSTAEPSKFPFPTKDSQHVSSVTQNQKQYDTSSKTHSNSQQGTSSMLEDDLQLSDSEDSDS/*
Can anyone briefly explain what the dots, parentheses, and slashes
mean? I haven't had the time yet to search through their website;
hopefully there is a document somewhere describing everything (if
anyone can provide the bookmark right away, please do so).
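Until the meaning of these characters is settled, a parser could at least separate the residue letters from the punctuation without interpreting either. The following Python sketch does only that; split_residues is a hypothetical helper, and treating every non-letter character as punctuation is an assumption, not something taken from PIR documentation.

```python
from collections import Counter
from typing import Tuple


def split_residues(raw: str) -> Tuple[str, Counter]:
    """Return (letters only, count of each non-letter, non-space character)."""
    # Keep alphabetic characters as the plain residue string.
    residues = "".join(ch for ch in raw if ch.isalpha())
    # Tally everything else (dots, commas, parentheses, slashes, asterisk)
    # so that no information is silently discarded.
    punctuation = Counter(
        ch for ch in raw if not ch.isalpha() and not ch.isspace()
    )
    return residues, punctuation
```

Keeping the raw string alongside the stripped one would let a later version attach meaning to the punctuation once the format description is found.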
4) The FASTA-like format shown above is not the only format they
have. In fact, you can't get that format from their web interface;
instead you can choose from CODATA, FASTA, and XML. CODATA is
actually a rich format somewhat resembling GenBank. Based on the
GenBank parsing nightmare experiences, I guess no one will want to
write a parser for that. The XML, however, looks good; is there
any interest in setting up a parser for that (possibly together
with a Bio::DB::PIR module enabling web queries)?
Please comment.
Hilmar
--
-----------------------------------------------------------------
Hilmar Lapp email: hilmarl@yahoo.com
GNF, San Diego, Ca. 92122 phone: +1 858 812 1757
-----------------------------------------------------------------