[Bioperl-l] SeqIO::swiss->write_seq

Elia Stupka elia@ebi.ac.uk
Thu, 28 Jun 2001 16:30:48 +0100 (BST)


Dear Amir,

Thank you so much for your detailed comments; I will try to put in the
fixes you have described. Thanks again, and do not hesitate to point out
more. In fact, we are aiming at "diffless" parsers, if such a thing
really exists. I look forward to the day when the parser.t file will
be:

if ( ! diff ($infile,$outfile)) {
   print "ok 1\n";
}
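
Perl has no built-in diff, so until that day a round-trip check would need
something like the core File::Compare module. A minimal sketch, assuming
$infile and $outfile are placeholders (here two temp files standing in for
the parser's input and whatever write_seq produced from it):

```perl
use strict;
use warnings;
use File::Compare qw(compare);
use File::Temp qw(tempfile);

# Hypothetical stand-ins: in a real test $infile would be t/swiss.dat
# and $outfile the file written by write_seq.
my ($in_fh,  $infile)  = tempfile(UNLINK => 1);
my ($out_fh, $outfile) = tempfile(UNLINK => 1);
print {$in_fh}  "GN   GC1QBP OR HABP1 OR SF2P32 OR C1QBP.\n";
print {$out_fh} "GN   GC1QBP OR HABP1 OR SF2P32 OR C1QBP.\n";
close $in_fh;
close $out_fh;

# compare() returns 0 when the files are identical, 1 when they differ,
# and -1 on error.
my $status = compare($infile, $outfile);
print $status == 0 ? "ok 1\n" : "not ok 1\n";
```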

Elia

> Because I need to parse Swiss-Prot files (thanks for saving me a lot of
> parsing work!) I'm using Bio::SeqIO::swiss. I noticed that the output from
> write_seq isn't quite the same as the input to next_seq. I don't know
> whether that's a design goal or not, but I think at least some of the fixes
> are trivial. I did a next_seq and a write_seq on bioperl's t/swiss.dat.
> (I should mention that 0.7.1 had a significantly smaller diff than 0.7.)
> Here it is:
> 
> 9c9
> < GN   GC1QBP OR HABP1 OR SF2P32 OR C1QBP
> ---
> > GN   GC1QBP OR HABP1 OR SF2P32 OR C1QBP.
> 
> Looks to me like a one-character bug-fix! (Ah. I just saw in CVS that this
> was fixed.)
> 
> 11,12c11,13
> < OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
> < OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
> 
> Maybe _write_line_swissprot_regex should be called with length 78 or 79
> instead of 80?
> 
> ---
> > OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> > OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
> > OX   NCBI_TaxID=9606;
> 
> OX isn't in the most recent (May 2000!) manual, so I can understand why
> bioperl wouldn't handle it.
> 
> 18,20c19,21
> < RA   Leffers H.
> < RT   "Cloning and expression of a cDNA covering the complete coding region of
> < RT   the P32 subunit of human pre-mRNA splicing factor SF2."
> ---
> > RA   Leffers H.;
> > RT   "Cloning and expression of a cDNA covering the complete coding region
> > RT   of the P32 subunit of human pre-mRNA splicing factor SF2.";
> 
> Semicolons are removed in next_seq (or actually in
> _read_swissprot_References). But they aren't reapplied in write_seq.
> 
> [several more RA/RT differences snipped]
> 
> 60,62c61,63
> < DR   EMBL; L04636; AAA16315.1.
> < DR   EMBL; M69039; AAA73055.1.
> < DR   EMBL; X75913; CAA53512.1.
> ---
> > DR   EMBL; L04636; AAA16315.1; -.
> > DR   EMBL; M69039; AAA73055.1; -.
> > DR   EMBL; X75913; CAA53512.1; -.
> 
> This one baffled me for a while, since the - should be in the comment field.
> I finally decided to copy some of the code from next_seq into a command-line
> perl interpreter, and at that point realized that line 260 of swiss.pm says
> 
>     $comment = s///
> 
> instead of
> 
>     $comment =~ s///
> 
> Aha! 
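
For anyone following along, the difference is easy to reproduce. A minimal
sketch with made-up values (the actual pattern on that line of swiss.pm is
not shown above, so the one used here is only an assumption):

```perl
use strict;
use warnings;

# Hypothetical comment value and pattern, for illustration only.
my $comment = '-.';
$_ = 'some unrelated text';

# Bug: plain assignment. s/// runs against $_ instead of the variable,
# and its return value (a count, or false when nothing matched)
# overwrites the comment.
my $buggy = $comment;
$buggy = s/\.$//;

# Fix: bind the substitution to the variable with =~, so only the
# trailing period is stripped.
my $fixed = $comment;
$fixed =~ s/\.$//;

print "buggy: '$buggy'\n";
print "fixed: '$fixed'\n";
```

With plain assignment the original comment is lost entirely, which would
explain the missing "-" in the DR lines.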
> 
> 69,70c70,71
> < FT   CHAIN        74    282       COMPLEMENT COMPONENT 1, Q SUBCOMPONENTBINDING
> < FT                                PROTEIN.
> ---
> > FT   CHAIN        74    282       COMPLEMENT COMPONENT 1, Q SUBCOMPONENT
> > FT                                BINDING PROTEIN.
> 
> I think this means line 929 of swiss.pm should read:
> 
> $desc .= " $1"; # replace \n with a space
> 
> Amir Karger
> Curagen Corporation 

**************************
tel:    +44 1223 49 44 31
mobile: +44 7971 59 03 69
fax:    +44 1223 49 44 68
**************************