[Bioperl-l] SeqIO::swiss->write_seq

Jason Stajich jason@chg.mc.duke.edu
Thu, 28 Jun 2001 11:53:07 -0400


I've been working towards this end (note the recent changes to
embl/genbank/swiss files which were essentially found by running a diff on
the in/out files).  I worked especially hard to make the species parsing
better although I think there are still some issues as you point out.

Currently one problem is the order of a feature's tags - they are not output
in the same order they are input.  I'm preparing a message about external perl
module dependencies for future releases. I'd like to embrace external modules
cautiously. Tie::IxHash is one I'd like to depend on; it preserves the order
in which items are added to a hash, so that keys %hash returns the keys in
insertion order. If we agree to add this dependency, it will be trivial to
preserve the tag order for a feature.
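
To make the idea concrete, here is a minimal sketch of how a tied hash keeps
insertion order. The tag names and values are made up for illustration and
this is not the actual feature-table code; it only assumes Tie::IxHash is
installed from CPAN.

```perl
use strict;
use warnings;
use Tie::IxHash;

# Hypothetical feature tags, listed in the order a parser might read them.
my @parsed = (
    [ note        => 'illustrative value' ],
    [ evidence    => 'illustrative value' ],
    [ description => 'illustrative value' ],
);

# Tie the hash so that keys() returns keys in insertion order
# rather than in Perl's internal hash order.
tie my %tags, 'Tie::IxHash';
$tags{ $_->[0] } = $_->[1] for @parsed;

# Writing the tags back out now follows the input order.
my @order = keys %tags;
print join( ", ", @order ), "\n";
```

With a plain (untied) hash the order of keys is not guaranteed, which is why
the tags currently come out shuffled.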

I'll work with Elia to see about rolling in your ideas and continuing to
test more formats.  Please feel free to continue providing input and/or
consider joining the coding fray as you learn more about the toolkit and
find you want to continue to help.

-jason

----- Original Message -----
From: "Elia Stupka" <elia@ebi.ac.uk>
To: "Karger, Amir" <AKarger@curagen.com>
Cc: "Bioperl Mailing List (E-mail)" <bioperl-l@bioperl.org>
Sent: Thursday, June 28, 2001 11:30 AM
Subject: Re: [Bioperl-l] SeqIO::swiss->write_seq


> Dear Amir,
>
> Thank you so much for your detailed comments, I will try and put in the
> fixes you have described. Thanks again, and do not hesitate to spot
> more... in fact we are aiming at "diffless" parsers, if such a thing
> really does exist... I look forward to the day when the parser.t file will
> be :
>
> if ( ! diff ($infile,$outfile)) {
>    print "ok 1\n";
> }
>
> Elia
>
> > Because I'm needing to parse Swiss-prot files (thanks for saving me a lot
> > of parsing work!) I'm using Bio::SeqIO::swiss.pm. I noticed that the
> > output from write_seq isn't quite the same as the input to next_seq. I
> > don't know whether that's a design goal or not. But I think at least some
> > of the fixes are trivial. I did a next_seq and a write_seq on bioperl's
> > t/swiss.dat.
> > (I should mention that 0.7.1 had a significantly smaller diff than 0.7.)
> > Here it is:
> >
> > 9c9
> > < GN   GC1QBP OR HABP1 OR SF2P32 OR C1QBP
> > ---
> > > GN   GC1QBP OR HABP1 OR SF2P32 OR C1QBP.
> >
> > Looks to me like a one-character bug-fix! (Ah. I just saw in CVS that
> > this was fixed.)
> >
> > 11,12c11,13
> > < OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
> > < OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
> >
> > Maybe _write_line_swissprot_regex should be called with length 78 or 79
> > instead of 80?
> >
> > ---
> > > OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> > > OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
> > > OX   NCBI_TaxID=9606;
> >
> > OX isn't in the most recent (May 2000!) manual, so I can understand why
> > bioperl wouldn't handle it.
> >
> > 18,20c19,21
> > < RA   Leffers H.
> > < RT   "Cloning and expression of a cDNA covering the complete coding region of
> > < RT   the P32 subunit of human pre-mRNA splicing factor SF2."
> > ---
> > > RA   Leffers H.;
> > > RT   "Cloning and expression of a cDNA covering the complete coding region
> > > RT   of the P32 subunit of human pre-mRNA splicing factor SF2.";
> >
> > Semicolons are removed in next_seq (or actually in
> > _read_swissprot_References). But they aren't reapplied in write_seq.
> >
> > [several more RA/RT differences snipped]
> >
> > 60,62c61,63
> > < DR   EMBL; L04636; AAA16315.1.
> > < DR   EMBL; M69039; AAA73055.1.
> > < DR   EMBL; X75913; CAA53512.1.
> > ---
> > > DR   EMBL; L04636; AAA16315.1; -.
> > > DR   EMBL; M69039; AAA73055.1; -.
> > > DR   EMBL; X75913; CAA53512.1; -.
> >
> > This one baffled me for a while, since the - should be in the comment
> > field. I finally decided to copy some of the code from next_seq into a
> > command-line perl interpreter, and at that point realized that line 260 of
> > swiss.pm says
> >
> >     $comment = s///
> >
> > instead of
> >
> >     $comment =~ s///
> >
> > Aha!
> >
> > 69,70c70,71
> > < FT   CHAIN        74    282       COMPLEMENT COMPONENT 1, Q SUBCOMPONENTBINDING
> > < FT                                PROTEIN.
> > ---
> > > FT   CHAIN        74    282       COMPLEMENT COMPONENT 1, Q SUBCOMPONENT
> > > FT                                BINDING PROTEIN.
> >
> > I think this means line 929 of swiss.pm should read:
> >
> > $desc .= " $1"; # replace \n with a space
> >
> > Amir Karger
> > Curagen Corporation
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
>
> **************************
> tel:    +44 1223 49 44 31
> mobile: +44 7971 59 03 69
> fax:    +44 1223 49 44 68
> **************************