[Bioperl-l] SeqIO::swiss->write_seq
Elia Stupka
elia@ebi.ac.uk
Thu, 28 Jun 2001 16:30:48 +0100 (BST)
Dear Amir,
Thank you so much for your detailed comments. I will try to put in the
fixes you have described. Thanks again, and do not hesitate to spot
more. In fact, we are aiming at "diffless" parsers, if such a thing
really exists. I look forward to the day when the parser.t file will
be:
    if ( ! diff($infile, $outfile) ) {
        print "ok 1\n";
    }
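
A minimal sketch of how that test might look in runnable form, assuming the core File::Compare module stands in for diff. The helper name files_identical and the two temporary files are hypothetical stand-ins for the real input file and the write_seq output:

```perl
use strict;
use warnings;
use File::Compare qw(compare);   # core module; compare() returns 0 when files are identical
use File::Temp qw(tempfile);     # core module; used only to make the sketch self-contained

# Hypothetical helper: true when the two files match byte for byte.
sub files_identical {
    my ($infile, $outfile) = @_;
    return compare($infile, $outfile) == 0;
}

# Stand-ins for the parser input and the write_seq output.
my ($in_fh,  $infile)  = tempfile(UNLINK => 1);
my ($out_fh, $outfile) = tempfile(UNLINK => 1);
print {$in_fh}  "ID   TEST_ENTRY\n//\n";
print {$out_fh} "ID   TEST_ENTRY\n//\n";
close $in_fh;
close $out_fh;

print files_identical($infile, $outfile) ? "ok 1\n" : "not ok 1\n";
```

compare() returns -1 on error (for example an unreadable file), so the helper treats an error as a failed test rather than a pass.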
Elia
> Because I need to parse Swiss-Prot files (thanks for saving me a lot of
> parsing work!) I'm using Bio::SeqIO::swiss. I noticed that the output from
> write_seq isn't quite the same as the input to next_seq. I don't know
> whether that's a design goal or not, but I think at least some of the fixes
> are trivial. I did a next_seq and a write_seq on bioperl's t/swiss.dat.
> (I should mention that 0.7.1 had a significantly smaller diff than 0.7.)
> Here it is:
>
> 9c9
> < GN GC1QBP OR HABP1 OR SF2P32 OR C1QBP
> ---
> > GN GC1QBP OR HABP1 OR SF2P32 OR C1QBP.
>
> Looks to me like a one-character bug-fix! (Ah. I just saw in CVS that this
> was fixed.)
>
> 11,12c11,13
> < OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
> < OC Eutheria; Primates; Catarrhini; Hominidae; Homo.
>
> Maybe _write_line_swissprot_regex should be called with length 78 or 79
> instead of 80?
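
A rough sketch of the width idea, not the actual _write_line_swissprot_regex code: a greedy wrapper that never lets a line (prefix included) exceed a given maximum. The function name wrap_with_prefix, the five-character "OC   " prefix and the limit of 78 are assumptions chosen for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: greedy word wrap so that prefix + text never exceeds $max.
sub wrap_with_prefix {
    my ($prefix, $text, $max) = @_;
    my @lines;
    my $current = '';
    for my $word (split /\s+/, $text) {
        my $candidate = $current eq '' ? $word : "$current $word";
        if ($current ne '' && length($prefix) + length($candidate) > $max) {
            push @lines, $prefix . $current;   # line is full, start a new one
            $current = $word;
        }
        else {
            $current = $candidate;
        }
    }
    push @lines, $prefix . $current if $current ne '';
    return @lines;
}

my $prefix   = "OC   ";
my $max      = 78;   # assumed limit, per the "78 or 79" guess
my $taxonomy = "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; "
             . "Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.";

my @wrapped = wrap_with_prefix($prefix, $taxonomy, $max);
print "$_\n" for @wrapped;
```

Lowering $max by one or two is the whole experiment: if the real writer counts the prefix or the newline differently, the off-by-one shows up as an overlong first line.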
>
> ---
> > OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> > OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
> > OX NCBI_TaxID=9606;
>
> OX isn't in the most recent (May 2000!) manual, so I can understand why
> bioperl wouldn't handle it.
>
> 18,20c19,21
> < RA Leffers H.
> < RT "Cloning and expression of a cDNA covering the complete coding region of
> < RT the P32 subunit of human pre-mRNA splicing factor SF2."
> ---
> > RA Leffers H.;
> > RT "Cloning and expression of a cDNA covering the complete coding region
> > RT of the P32 subunit of human pre-mRNA splicing factor SF2.";
>
> Semicolons are removed in next_seq (or actually in
> _read_swissprot_References). But they aren't reapplied in write_seq.
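
To illustrate the asymmetry, here is a small sketch in which the reader strips the trailing semicolon and the writer puts it back. The functions parse_ra_line and format_ra_line are hypothetical and are not the code in swiss.pm:

```perl
use strict;
use warnings;

# Hypothetical reader: drop the line code and the trailing semicolon.
sub parse_ra_line {
    my ($line) = @_;
    (my $authors = $line) =~ s/^RA\s+//;
    $authors =~ s/;\s*$//;
    return $authors;
}

# Hypothetical writer: restore the semicolon that the reader removed.
sub format_ra_line {
    my ($authors) = @_;
    return "RA   $authors;";
}

my $original  = "RA   Leffers H.;";
my $authors   = parse_ra_line($original);
my $rewritten = format_ra_line($authors);

print $rewritten eq $original ? "round trip ok\n" : "round trip differs\n";
```

The point is only that whatever the reader strips, the writer has to add again for the output to match the input.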
>
> [several more RA/RT differences snipped]
>
> 60,62c61,63
> < DR EMBL; L04636; AAA16315.1.
> < DR EMBL; M69039; AAA73055.1.
> < DR EMBL; X75913; CAA53512.1.
> ---
> > DR EMBL; L04636; AAA16315.1; -.
> > DR EMBL; M69039; AAA73055.1; -.
> > DR EMBL; X75913; CAA53512.1; -.
>
> This one baffled me for a while, since the - should be in the comment field.
> I finally decided to copy some of the code from next_seq into a command-line
> perl interpreter, and at that point realized that line 260 of swiss.pm says
>
> $comment = s///
>
> instead of
>
> $comment =~ s///
>
> Aha!
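
For anyone who has not been bitten by this before, a short demonstration of the difference. The substitution pattern is made up for the example, since the message elides the real one:

```perl
use strict;
use warnings;

# Without =~, s/// operates on $_ and its return value is assigned.
$_ = "unrelated text";

my $buggy = "-.";
$buggy = s/\.$//;      # BUG: substitutes in $_, then overwrites $buggy with the result

my $fixed = "-.";
$fixed =~ s/\.$//;     # binds the substitution to $fixed itself

printf "buggy: '%s'\n", $buggy;   # prints: buggy: ''
printf "fixed: '%s'\n", $fixed;   # prints: fixed: '-'
```

Because s/// returns the number of substitutions made (or false when nothing matched), the plain assignment silently replaces the string with a count or an empty value instead of editing it.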
>
> 69,70c70,71
> < FT CHAIN 74 282 COMPLEMENT COMPONENT 1, Q SUBCOMPONENTBINDING
> < FT PROTEIN.
> ---
> > FT CHAIN 74 282 COMPLEMENT COMPONENT 1, Q SUBCOMPONENT
> > FT BINDING PROTEIN.
>
> I think this means line 929 of swiss.pm should read:
>
> $desc .= " $1"; # replace \n with a space
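
A self-contained sketch of the same idea, joining the fragments of a multi-line FT description with a single space. The helper join_ft_description is hypothetical and only mirrors the one-line change suggested above:

```perl
use strict;
use warnings;

# Hypothetical helper: join continuation fragments, replacing each newline with a space.
sub join_ft_description {
    my @fragments = @_;
    my $desc = '';
    for my $fragment (@fragments) {
        $fragment =~ s/^\s+|\s+$//g;                        # trim stray whitespace
        $desc .= $desc eq '' ? $fragment : " $fragment";    # space between fragments
    }
    return $desc;
}

my $desc = join_ft_description(
    "COMPLEMENT COMPONENT 1, Q SUBCOMPONENT",
    "BINDING PROTEIN.",
);
print "$desc\n";   # prints: COMPLEMENT COMPONENT 1, Q SUBCOMPONENT BINDING PROTEIN.
```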
>
> Amir Karger
> Curagen Corporation