[Bioperl-l] SeqIO::swiss->write_seq

Karger, Amir AKarger@CuraGen.com
Thu, 28 Jun 2001 11:00:44 -0400


Heikki made the mistake of encouraging me.

Because I need to parse Swiss-Prot files (thanks for saving me a lot of
parsing work!), I'm using Bio::SeqIO::swiss.pm. I noticed that the output
from write_seq isn't quite the same as the input to next_seq. I don't know
whether a clean round trip is a design goal or not, but I think at least
some of the fixes are trivial. I ran next_seq and then write_seq on
bioperl's t/swiss.dat and diffed the result against the original file.
(I should mention that 0.7.1 had a significantly smaller diff than 0.7.)
In the diff, "<" lines are write_seq output and ">" lines are the original.
Here it is:

9c9
< GN   GC1QBP OR HABP1 OR SF2P32 OR C1QBP
---
> GN   GC1QBP OR HABP1 OR SF2P32 OR C1QBP.

Looks to me like a one-character bug-fix! (Ah. I just saw in CVS that this
was fixed.)

11,12c11,13
< OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
< OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.

Maybe _write_line_swissprot_regex should be called with length 78 or 79
instead of 80?

---
> OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
> OX   NCBI_TaxID=9606;

OX isn't in the most recent (May 2000!) manual, so I can understand why
bioperl wouldn't handle it.
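
To make the OC line-length question concrete, here is a minimal sketch of a greedy wrapper. It is not the code in _write_line_swissprot_regex; wrap_items is a hypothetical helper, and it assumes the length is a hard limit that includes the 5-character "OC   " prefix. With that assumption, the first OC line including "Mammalia;" is exactly 80 characters, so a limit of 80 keeps it on the first line and a limit of 79 pushes it to the second, as in the original file.

```perl
use strict;
use warnings;

# Hypothetical helper (not BioPerl's code): greedily pack
# whitespace-separated items onto lines of at most $width characters,
# counting the line-code prefix toward the width.
sub wrap_items {
    my ($prefix, $text, $width) = @_;
    my @lines;
    my $current = $prefix;
    for my $item (split /\s+/, $text) {
        my $sep = $current eq $prefix ? '' : ' ';
        # Start a new line if adding this item would exceed the width.
        if ($current ne $prefix
            && length($current . $sep . $item) > $width) {
            push @lines, $current;
            $current = $prefix;
            $sep     = '';
        }
        $current .= $sep . $item;
    }
    push @lines, $current if $current ne $prefix;
    return @lines;
}

my $oc = 'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; '
       . 'Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.';

my @at80 = wrap_items('OC   ', $oc, 80);   # "Mammalia;" stays on line 1
my @at79 = wrap_items('OC   ', $oc, 79);   # "Mammalia;" moves to line 2

print "$_\n" for @at80;
print "\n";
print "$_\n" for @at79;
```

Whether the real routine counts the prefix or a trailing space toward the limit is something I haven't checked, so treat the exact number as a guess.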

18,20c19,21
< RA   Leffers H.
< RT   "Cloning and expression of a cDNA covering the complete coding region of
< RT   the P32 subunit of human pre-mRNA splicing factor SF2."
---
> RA   Leffers H.;
> RT   "Cloning and expression of a cDNA covering the complete coding region
> RT   of the P32 subunit of human pre-mRNA splicing factor SF2.";

Semicolons are removed in next_seq (or actually in
_read_swissprot_References). But they aren't reapplied in write_seq.
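
One way to make the round trip symmetric would be to put the terminator back when writing. The sketch below only illustrates the idea; strip_terminator and add_terminator are hypothetical names, not functions in swiss.pm, and I'm assuming the parser drops exactly one trailing semicolon.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of BioPerl.

# What the reader side does in spirit: drop the trailing semicolon.
sub strip_terminator {
    my ($field) = @_;
    $field =~ s/;\s*$//;
    return $field;
}

# What the writer side would need: restore it if it is missing.
sub add_terminator {
    my ($field) = @_;
    $field .= ';' unless $field =~ /;$/;
    return $field;
}

my $authors = strip_terminator('Leffers H.;');     # stored without ';'
my $written = 'RA   ' . add_terminator($authors);  # written with ';'
print "$written\n";
```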

[several more RA/RT differences snipped]

60,62c61,63
< DR   EMBL; L04636; AAA16315.1.
< DR   EMBL; M69039; AAA73055.1.
< DR   EMBL; X75913; CAA53512.1.
---
> DR   EMBL; L04636; AAA16315.1; -.
> DR   EMBL; M69039; AAA73055.1; -.
> DR   EMBL; X75913; CAA53512.1; -.

This one baffled me for a while, since the - should be in the comment field.
I finally decided to copy some of the code from next_seq into a command-line
perl interpreter, and at that point realized that line 260 of swiss.pm says

    $comment = s///

instead of

    $comment =~ s///

Aha! 
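
For anyone who hasn't been bitten by this: without the ~, Perl runs the substitution against $_ and assigns its return value (the number of substitutions, or an empty string if nothing matched) to the variable. The sketch below shows the difference. The pattern and the contents of $_ are made up for illustration; they are not the ones in swiss.pm.

```perl
use strict;
use warnings;

# Buggy form: plain assignment. s/// operates on $_, and $buggy is
# overwritten with the substitution's return value.
my $buggy = '-.';
{
    local $_ = 'some other text';   # hypothetical contents of $_
    $buggy = s/\.$//;               # no match in $_, so $buggy becomes ''
}

# Fixed form: the binding operator applies s/// to $fixed itself.
my $fixed = '-.';
$fixed =~ s/\.$//;                  # strips the trailing period

print "buggy: '$buggy'\n";
print "fixed: '$fixed'\n";
```

In the buggy form the original comment text is lost no matter what $_ holds, which would explain a comment field disappearing from the output.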

69,70c70,71
< FT   CHAIN        74    282       COMPLEMENT COMPONENT 1, Q SUBCOMPONENTBINDING
< FT                                PROTEIN.
---
> FT   CHAIN        74    282       COMPLEMENT COMPONENT 1, Q SUBCOMPONENT
> FT                                BINDING PROTEIN.

I think this means line 929 of swiss.pm should read:

    $desc .= " $1"; # replace \n with a space
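
Here is a small standalone sketch of what I mean by joining FT continuation lines with a space. The regexes are my own simplification for this one example, not the ones in swiss.pm.

```perl
use strict;
use warnings;

my @lines = (
    'FT   CHAIN        74    282       COMPLEMENT COMPONENT 1, Q SUBCOMPONENT',
    'FT                                BINDING PROTEIN.',
);

my $desc = '';
for my $line (@lines) {
    if ($line =~ /^FT   \S+\s+\d+\s+\d+\s+(.*?)\s*$/) {
        # First line of a feature: key, start, end, description.
        $desc = $1;
    }
    elsif ($line =~ /^FT\s{4,}(.*?)\s*$/) {
        # Continuation line: the newline becomes a single space.
        $desc .= " $1";
    }
}

print "$desc\n";
```

Appending $1 without the leading space is what produces SUBCOMPONENTBINDING.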

Amir Karger
Curagen Corporation