[Biopython-dev] RE:Sequence format readers

Peter Wilkinson pewilkinson at informaxinc.com
Thu Sep 6 19:14:21 EDT 2001


I have almost completed code for reading in RefSeq data. I have finished
classes (first draft, but they function well) for the smaller organisms, and now
I am moving on to the Human records ....

Also, we need a parser for Derwent data, which should inherit from EMBL,
since its formatting is EMBL-like.

Next also is the expression data from different manufacturers ....

There are piles more, I am sure.

Peter Wilkinson

P.S. I am sitting on code for specific FASTA-formatted types .... how about
that?





> -----Original Message-----
> From: biopython-dev-admin at biopython.org
> [mailto:biopython-dev-admin at biopython.org]On Behalf Of
> biopython-dev-request at biopython.org
> Sent: Thursday, September 06, 2001 10:02 AM
> To: biopython-dev at biopython.org
> Subject: Biopython-dev digest, Vol 1 #207 - 7 msgs
>
>
> Send Biopython-dev mailing list submissions to
> 	biopython-dev at biopython.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://biopython.org/mailman/listinfo/biopython-dev
> or, via email, send a message with subject or body 'help' to
> 	biopython-dev-request at biopython.org
>
> You can reach the person managing the list at
> 	biopython-dev-admin at biopython.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Biopython-dev digest..."
>
>
> Today's Topics:
>
>    1. sequence format readers ? (thomas at cbs.dtu.dk)
>    2. localblast bug? (Chunlei Wu)
>    3. Re: localblast bug? (Brad Chapman)
>    4. Re: sequence format readers ? (Brad Chapman)
>    5. Re: [BioPython] Biopython 1.00a3 release now available (Jeffrey Chang)
>    6. Re: sequence format readers ? (Thomas Sicheritz-Ponten)
>    7. Re: Biopython 1.00a3 for the Mac (Johann Visagie)
>
> --__--__--
>
> Message: 1
> Date: Wed, 5 Sep 2001 20:45:40 +0200 (MDT)
> From: thomas at cbs.dtu.dk
> To: biopython-dev at biopython.org
> Reply-To: thomas at cbs.dtu.dk
> Subject: [Biopython-dev] sequence format readers ?
>
> Hej,
>
> To follow up one of the discussions and questions at ISMB in Copenhagen,
> - how are we going to proceed with the sequence format reader (the
> biopython variant of readseq ...)
>
> Currently we only have parsers for Fasta, Embl and GenBank.  What we
> need is an internal format and functions/modules which can read/write:
> Fasta
> Embl
> GenBank
> GCG
> Phylip
> PIR
> MSF
> Nexus
> Clustal
> Mase
> ??? - more suggestions ?
>
> I can write most of the rules, but I guess we have to define a smart base
> class/parser - where plugging in a new format should only take 5 seconds ...
> If we brainstorm on the design of the reader/writer, I could volunteer to
> implement the format rules ...
>
> Some things to consider:
> * some formats are alignment based (e.g. clustal, phylip, nexus)
> * some formats have loads of information which is lost when converted to a
>   less information-rich format (e.g. Embl -> Fasta). But Embl -> GenBank
>   should not lose any information
> * some formats allow multiple entries, some not
>
>
> back-in-the-sequence-format-jungle'ly yr's
> -thomas
>
>
> --
> Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
> thomas at biopython.org           The Technical University of Denmark
> CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
> Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas
>
> 	De Chelonian Mobile ... The Turtle Moves ...
>
>
> --__--__--
>
> Message: 2
> Date: Wed, 5 Sep 2001 12:50:25 -0700 (PDT)
> From: Chunlei Wu <reillywu at yahoo.com>
> To: biopython-dev at biopython.org
> Subject: [Biopython-dev] localblast bug?
>
> Hi,
>    I wrote a script for localblast. It always raised a TypeError:
>
>    File "e:\python21\Bio\Blast\NCBIStandalone.py", line 1447, in blastall
>    r, w, e = popen2.popen3([blastcmd] + params)
>    File "e:\python21\lib\popen2.py", line 129, in popen3
>    w, r, e = os.popen3(cmd, mode, bufsize)
>    TypeError: popen3() argument 1 must be string, not list
>
>    When I modified line 1447 as:
>
>    r, w, e = popen2.popen3(' '.join([blastcmd] + params))
>
>    then it works.
>
>
>    Chunlei Wu
>
> Python version: Activepython build 210
> Biopython version: 1.00a3
> OS:       WinNT
> source:
>
> def mylocalblast(input_file,output_file,db='nt'):
>     """mylocalblast"""
>
>     from Bio.Blast import NCBIStandalone
>
>     my_blast_db="r:\\blastdb\\"+db
>     my_blast_exe=r"r:\localblast\blastall.exe"
>
>     blast_out, error_info = NCBIStandalone.blastall(my_blast_exe,'blastn',my_blast_db,input_file)
>
>     output_f=open(output_file,'w')
>     blast_result=blast_out.read()
>     output_f.write(blast_result)
>     print blast_result
>     output_f.close()
>
>
>
>
> --__--__--
>
> Message: 3
> Date: Wed, 5 Sep 2001 17:27:45 -0400
> From: Brad Chapman <chapmanb at arches.uga.edu>
> To: Chunlei Wu <reillywu at yahoo.com>
> Cc: biopython-dev at biopython.org
> Subject: Re: [Biopython-dev] localblast bug?
>
> Hi Chunlei;
>
> >    I wrote a script for localblast. It always raised a TypeError:
> [...]
> >    When I modified line 1447 as:
> >
> >    r, w, e = popen2.popen3(' '.join([blastcmd] + params))
> >
> >    then it works.
>
> Thanks for the fix. I think you're probably the first to use the
> localblast module on Windows, so you get to run into the
> platform-specific problems (aren't you lucky :-). Your fix works fine for me
> on UNIX as well (with the Doc/examples/local_blast.py script), so I
> checked your change into CVS. It is available from anonymous CVS and
> should be in the next release.
>
> Thanks again!
> Brad
>
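For readers hitting the same TypeError today, here is a minimal sketch of the equivalent call in present-day Python. It is an aside under stated assumptions, not the change that went into CVS: the popen2 module used above no longer exists in Python 3, and subprocess.run accepts the argument list directly on both Windows and UNIX. Joining the arguments with ' '.join by hand works for the case reported here but breaks as soon as a path contains a space. The run_command helper is hypothetical, and the Python interpreter stands in for blastall so the example runs anywhere.

```python
import subprocess
import sys


def run_command(argv):
    """Run argv (a list of strings) and return (stdout, stderr) as text."""
    # A list is passed straight through: no shell and no manual quoting.
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout, result.stderr


# Stand-in for [blastcmd] + params; note the argument containing spaces.
out, err = run_command([sys.executable, "-c", "print('hello world')"])
print(out.strip())
```

With a real BLAST installation the list would hold the executable path followed by its parameters; those values are site-specific and are not shown here.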
> --__--__--
>
> Message: 4
> Date: Wed, 5 Sep 2001 17:46:00 -0400
> From: Brad Chapman <chapmanb at arches.uga.edu>
> To: biopython-dev at biopython.org
> Subject: Re: [Biopython-dev] sequence format readers ?
>
> Hi Thomas!
>
> > To follow up one of the discussions and questions at ISMB in Copenhagen,
> > - how are we going to proceed with the sequence format reader (the
> > biopython variant of readseq ...)
>
> It's great that you're going to work on this! It's definitely much
> desired by a lot o' people (in fact I was just having a conversation
> today about format conversion).
>
> > Currently we only have parsers for Fasta, Embl and GenBank.  What we
> > need is an internal format and functions/modules which can read/write:
> [...impressive list o' formats...]
> > ??? - more suggestions ?
>
> I think supporting this many would be an *excellent* start :-).
>
> > I can write most of the rules, but I guess we have to define a smart base
> > class/parser - where plugging in a new format should only take 5 seconds ...
> > If we brainstorm on the design of the reader/writer, I could volunteer to
> > implement the format rules ...
> >
> > Some things to consider:
> > * some formats are alignment based (e.g. clustal, phylip, nexus)
> > * some formats have loads of information which is lost when converted to a
> >   less information-rich format (e.g. Embl -> Fasta). But Embl -> GenBank
> >   should not lose any information
> > * some formats allow multiple entries, some not
>
> Just as a way of getting things started (I haven't done a lot of
> thinking about this), my opinion is that the best way to do this is
> to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
> system would be the standard SeqRecord object that we currently
> have. The advantage of this is that existing parsers (i.e. Fasta,
> GenBank) already parse into this, so all that would need to be done
> is to define a mapping that converts a generic SeqRecord object to
> and from the format's "native" Record-based representation. So to
> convert from GenBank to Fasta you could do:
>
> GenBank Record Format --> SeqRecord --> Fasta Record Format
>
> Since the Record formats already provide writing capabilities (and
> we have the parsers to parse into them) we would already get writing
> and parsing "for free." Also, we would make good use of our existing
> "generic" Sequence representations.
>
> The advantage of this is that it would help us avoid having to make
> a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
> specific converters. The disadvantage is that we may lose
> some information in the conversion process (but then again, what
> converters don't :-).
>
> The tricky part of doing it this way is that we would then need to
> define the Record --> SeqRecord mapping, which, as you mention,
> may take some thinking for alignment formats and other
> complications.
>
> Hopefully-rambling-on-and-on-about-this-helps-a-little-bit-ly yr's,
>
> Brad
>
>
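The hub idea above can be sketched roughly as follows. This is an illustration under assumptions, not Biopython code: GenericRecord is a hypothetical stand-in for SeqRecord, and "tab" (one name, a tab, one sequence) is a toy second format chosen only because it is short. The point is that each format needs one reader and one writer against the shared record, so N formats need 2N functions instead of N*(N-1) pairwise converters.

```python
from dataclasses import dataclass, field


@dataclass
class GenericRecord:
    """Hypothetical stand-in for the shared SeqRecord-like object."""
    id: str
    seq: str
    annotations: dict = field(default_factory=dict)


def read_fasta(text):
    """Parse a single FASTA entry into a GenericRecord."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    name, _, description = lines[0][1:].partition(" ")
    seq = "".join(line.strip() for line in lines[1:])
    return GenericRecord(name, seq, {"description": description})


def write_fasta(rec, width=60):
    """Write a GenericRecord as FASTA, wrapping the sequence at `width`."""
    chunks = [rec.seq[i:i + width] for i in range(0, len(rec.seq), width)]
    return "\n".join([">" + rec.id] + chunks) + "\n"


def read_tab(text):
    """Parse the toy tab format: name<TAB>sequence."""
    name, seq = text.strip().split("\t")
    return GenericRecord(name, seq)


def write_tab(rec):
    return rec.id + "\t" + rec.seq + "\n"


# Any conversion is reader -> GenericRecord -> writer.
fasta_text = ">seq1 demo\nACGT\nTTGA\n"
print(write_tab(read_fasta(fasta_text)))
```

Note that the FASTA description survives in the record's annotations but is dropped by write_tab, which is the kind of information loss mentioned above.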
>
> --__--__--
>
> Message: 5
> Date: Wed, 5 Sep 2001 16:08:49 -0700
> To: Johann Visagie <johann at egenetics.com>
> From: Jeffrey Chang <jchang at SMI.Stanford.EDU>
> Cc: biopython-dev at biopython.org
> Subject: [Biopython-dev] Re: [BioPython] Biopython 1.00a3 release now available
>
> At 3:46 PM +0200 9/5/01, Johann Visagie wrote:
> >Jeffrey Chang on 2001-09-04 (Tue) at 11:45:13 -0700:
> >>
> >>  A new release of Biopython is now available.
> >
> >Cool.  :-)
> >
> >A thought:  Shouldn't these announcements be cross-posted to
> >python-announce-list at python.org, a.k.a. comp.lang.python.announce?  :-)
>
> Yes.  Next time.  :)
>
> Thanks,
> Jeff
>
> --__--__--
>
> Message: 6
> To: Brad Chapman <chapmanb at arches.uga.edu>
> Cc: biopython-dev at biopython.org
> Subject: Re: [Biopython-dev] sequence format readers ?
> From: Thomas Sicheritz-Ponten <thomas at cbs.dtu.dk>
> Date: 06 Sep 2001 11:22:01 +0200
>
> Brad Chapman <chapmanb at arches.uga.edu> writes:
>
> > Hi Thomas!
> >
> > > To follow up one of the discussions and questions at ISMB in Copenhagen,
> > > - how are we going to proceed with the sequence format reader (the
> > > biopython variant of readseq ...)
> >
> > It's great that you're going to work on this! It's definitely much
> > desired by a lot o' people (in fact I was just having a conversation
> > today about format conversion).
> >
> > > Currently we only have parsers for Fasta, Embl and GenBank.  What we
> > > need is an internal format and functions/modules which can read/write:
> > [...impressive list o' formats...]
> > > ??? - more suggestions ?
> >
> > I think supporting this many would be an *excellent* start :-).
> >
> > > I can write most of the rules, but I guess we have to define a smart base
> > > class/parser - where plugging in a new format should only take 5 seconds ...
> > > If we brainstorm on the design of the reader/writer, I could volunteer to
> > > implement the format rules ...
> > >
> > > Some things to consider:
> > > * some formats are alignment based (e.g. clustal, phylip, nexus)
> > > * some formats have loads of information which is lost when converted to a
> > >   less information-rich format (e.g. Embl -> Fasta). But Embl -> GenBank
> > >   should not lose any information
> > > * some formats allow multiple entries, some not
> >
> > Just as a way of getting things started (I haven't done a lot of
> > thinking about this), my opinion is that the best way to do this is
> > to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
> > system would be the standard SeqRecord object that we currently
> > have. The advantage of this is that existing parsers (i.e. Fasta,
> > GenBank) already parse into this, so all that would need to be done
> > is to define a mapping that converts a generic SeqRecord object to
> > and from the format's "native" Record-based representation. So to
> > convert from GenBank to Fasta you could do:
> >
> > GenBank Record Format --> SeqRecord --> Fasta Record Format
> >
> > Since the Record formats already provide writing capabilities (and
> > we have the parsers to parse into them) we would already get writing
> > and parsing "for free." Also, we would make good use of our existing
> > "generic" Sequence representations.
> >
> > The advantage of this is that it would help us avoid having to make
> > a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
> > specific converters. The disadvantage is that we may lose
> > some information in the conversion process (but then again, what
> > converters don't :-).
>
> I think subclassing the Seq object as a SeqIOSeq object is enough.
> We just need to add a single dictionary (features) where all
> Swiss/EMBL/GenBank extra annotations can be added.
>
> e.g.
> class SeqIOSeq(Seq):
>     def __init__(self, data=""):
>         # Seq.__init__ requires the sequence data as its first argument
>         Seq.__init__(self, data)
>         # dictionary for extra annotations (e.g. Embl, GenBank)
>         self.features = {}
>
>
> In the case of
> GenBank Record Format --> SeqIOSeq --> Fasta Record Format
> we pick only the name and sequence ...
>
> but for
> GenBank Record Format --> SeqIOSeq --> EMBL Record Format
> the writer function should check if there are any additional features
> (self.features.keys())
> That way we shouldn't lose any information.
>
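A small sketch of the two writer behaviours just described, under loud assumptions: AnnotatedSeq is a hypothetical plain-Python stand-in for the proposed SeqIOSeq (so the example runs without Biopython), and the "rich" output is a toy layout with EMBL-style line codes, not valid EMBL. The Fasta writer deliberately ignores the features dictionary, while the rich writer emits one line per key so nothing stored there is dropped.

```python
class AnnotatedSeq:
    """Hypothetical stand-in for SeqIOSeq: a sequence plus a features dict."""

    def __init__(self, name, data):
        self.name = name
        self.data = data
        # dictionary for extra annotations (e.g. Embl, GenBank)
        self.features = {}


def write_fasta(seq):
    """Lossy writer: only the name and the sequence are kept."""
    return ">%s\n%s\n" % (seq.name, seq.data)


def write_rich(seq):
    """Toy annotation-preserving writer: one FT line per features key."""
    lines = ["ID   " + seq.name]
    for key in sorted(seq.features):
        lines.append("FT   %s=%s" % (key, seq.features[key]))
    lines.append("SQ   " + seq.data)
    lines.append("//")
    return "\n".join(lines) + "\n"


entry = AnnotatedSeq("X1", "ACGT")
entry.features["organism"] = "E. coli"
print(write_fasta(entry))
print(write_rich(entry))
```

A flat dictionary is enough for simple key/value annotations; features that carry locations or qualifiers would need a richer value type than a string.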
>
> It would be nice if a new format could be added by simply adding functions
> for reading, writing and recognizing the format.
> I'm not completely sure how to define these functions - any ideas ?
>
> example code ...
>
> import sys
> from Bio.Seq import Seq
> NO, YES = 0,1
>
> class SeqIOSeq(Seq):
>     def __init__(self, data=""):
>         # Seq.__init__ requires the sequence data as its first argument
>         Seq.__init__(self, data)
>         # dictionary for extra annotations (e.g. Embl, GenBank)
>         self.features = {}
>
>
> class SeqIO:
>     # dictionary to store functions for
>     # recognizing, reading and writing of different sequence formats
>     recognizers = {}
>     readers = {}
>     writers = {}
>
>     def __init__(self, **kwds):
>         self.name = None
>         self.format = None
>         self.sequence = SeqIOSeq()
>         self.is_an_alignment = NO
>         self.allow_multiple_entries = YES
>         for k, v in kwds.items(): setattr(self, k, v)
>
>     def AddFormat(self, name, recognizeF, readF, writeF):
>         self.recognizers[name] = recognizeF
>         self.readers[name] = readF
>         self.writers[name] = writeF
>
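One possible answer to the "any ideas?" question, offered as a sketch rather than a settled design: each format plugs in three plain functions, where the recognizer takes the text and returns true or false, the reader takes the text and returns a record, and the writer takes a record and returns text. Everything here is an assumption for illustration: FormatRegistry, guess_format and the toy "raw" format are hypothetical names, and records are bare (name, sequence) tuples instead of SeqIOSeq objects so the example runs without Biopython.

```python
class FormatRegistry:
    """Hypothetical registry; records are plain (name, sequence) tuples."""

    def __init__(self):
        self.recognizers = {}
        self.readers = {}
        self.writers = {}

    def add_format(self, name, recognize, read, write):
        self.recognizers[name] = recognize
        self.readers[name] = read
        self.writers[name] = write

    def guess_format(self, text):
        # Return the first registered format whose recognizer accepts the text.
        for name, recognize in self.recognizers.items():
            if recognize(text):
                return name
        raise ValueError("unrecognized sequence format")

    def read(self, text, fmt=None):
        return self.readers[fmt or self.guess_format(text)](text)

    def write(self, record, fmt):
        return self.writers[fmt](record)


def read_fasta(text):
    lines = text.strip().splitlines()
    return lines[0][1:].split()[0], "".join(lines[1:])


def write_fasta(record):
    return ">%s\n%s\n" % record


registry = FormatRegistry()
registry.add_format("fasta", lambda t: t.startswith(">"), read_fasta, write_fasta)
# "raw" is a toy format (bare sequence, no name), shown only as a second plug-in.
registry.add_format("raw", lambda t: t.strip().isalpha(),
                    lambda t: ("unnamed", t.strip()),
                    lambda rec: rec[1] + "\n")

record = registry.read(">seq1 demo\nACGT\nTTGA\n")
print(record)
print(registry.write(record, "raw"))
```

Recognizers are tried in registration order, so a format with a loose recognizer should be registered after stricter ones.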
>
> needing-a-machete-for-the-sequence-format-jungle'ly yr's
> -thomas
>
> --
> Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
> thomas at biopython.org           The Technical University of Denmark
> CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
> Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas
>
> 	De Chelonian Mobile ... The Turtle Moves ...
>
> --__--__--
>
> Message: 7
> Date: Thu, 6 Sep 2001 14:49:29 +0200
> From: Johann Visagie <johann at egenetics.com>
> To: biopython-dev at biopython.org
> Subject: Re: [Biopython-dev] Biopython 1.00a3 for the Mac
>
> Yair Benita on 2001-09-05 (Wed) at 11:28:00 +0200:
> >
> > I have compiled the new release for the Mac.
>
> FreeBSD port has also just been updated:
>   http://www.freebsd.org/cgi/cvsweb.cgi/ports/biology/py-biopython/
>
> Pre-built package (minus CORBA) should appear here in a couple of days:
>
> ftp://ftp.freebsd.org/pub/FreeBSD/ports/i386/packages-stable/All/py-biopython-1.00.a3.tgz
>
> Unfortunately, this all comes a day or two too late to make it into
> 4.4-RELEASE, and hence onto the distribution CDs.  :-(
>
> -- V
>
>
> --__--__--
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
>
>
> End of Biopython-dev Digest



