[Bioperl-l] Reading a XML sequence (UniParc) into a BioSeq object
Chris Fields
cjfields at uiuc.edu
Wed Mar 14 14:59:30 UTC 2007
That's probably the best short-term fix, though I'm sure it's quite a
bit slower than parsing UniParc XML directly into a Bio::Seq via SeqIO.
I am looking into adding a few more XML::SAX-based parsers (INSDSeqXML,
GBSeqXML, EMBLXML, etc.), so we could add UniProt XML to the list
(which I think UniParc uses, correct?).
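
For what it's worth, here is a rough sketch of the direct route. It is
not the XML::SAX parser mentioned above, only a regex-based illustration
in core Perl, and it assumes entries shaped like the sample quoted at
the bottom of this thread. The function name parse_uniparc_entries and
the hash keys are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: extract UniParc <entry> records into plain hashes.
# Assumes one <sequence> per entry and double-quoted attributes, as in the
# sample in this thread; a real XML parser would be more robust.
sub parse_uniparc_entries {
    my ($xml) = @_;
    my @entries;
    while ($xml =~ m{<entry\s+accession="([^"]+)"\s*>(.*?)</entry>}sg) {
        my ($acc, $body) = ($1, $2);

        # Each dbReference is a self-closing tag; collect its attributes.
        my @xrefs;
        while ($body =~ m{<dbReference\s+([^>]*?)/>}sg) {
            my $attrs = $1;
            my %attr  = $attrs =~ /(\w+)="([^"]*)"/g;
            push @xrefs, \%attr;
        }

        # Pull out the sequence element and its length/crc64 attributes.
        my ($seq_attrs, $seq) =
            $body =~ m{<sequence\s+([^>]*)>(.*?)</sequence>}s;
        next unless defined $seq;
        my %sattr = $seq_attrs =~ /(\w+)="([^"]*)"/g;
        $seq =~ s/\s+//g;    # drop line breaks and indentation

        push @entries, {
            accession => $acc,
            length    => $sattr{length},
            crc64     => $sattr{crc64},
            seq       => $seq,
            xrefs     => \@xrefs,
        };
    }
    return @entries;
}
```

Each returned hash could then be handed to something like
Bio::Seq->new(-display_id => $e->{accession}, -seq => $e->{seq},
-alphabet => 'protein'), with the cross-references attached as
annotations if needed.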
chris
On Mar 14, 2007, at 8:19 AM, Samuel GRANJEAUD - IR/IFR137 wrote:
> Hi,
>
> Since nobody gave me a clue or told me that my question is silly (it
> must be ;-) ), I ended up writing a hack: an object that inherits
> from BioFetch and overrides the postprocess_data method to convert
> UniParc XML to Swiss format. The really nice approach, parsing UniParc
> XML and creating an object directly, was too hard for me.
>
> It's amazing what BioPerl can do.
>
> Regards,
> --Samuel
>
> =head1 NAME
>
> ICIM::Bio::DB::BioFetch - Database object interface to BioFetch retrieval
>
> =head1 SYNOPSIS
>
> see Bio::DB::BioFetch
>
> =head1 DESCRIPTION
>
> See Bio::DB::BioFetch for main description.
>
> The BEGIN block registers two additional databases, ipi and uniparc.
>
> The post_process method converts UniParc XML format to Swiss format
> for string transfer type.
>
> =head1 SEE ALSO
>
> This module inherits from BioFetch.
> http://doc.bioperl.org/bioperl-live/Bio/DB/BioFetch.html
>
> This module is a light copy of BioFetch.
>
> =head1 AUTHOR
>
> Email Samuel Granjeaud, E<lt>granjeau at tagc.univ-mrs.frE<gt>
>
> =head1 APPENDIX
>
> The rest of the documentation details each of the object
> methods. Internal methods are usually preceded with a _
>
> =cut
>
> # Let the code begin...
>
> package ICIM::Bio::DB::BioFetch;
>
> use strict;
> use warnings;
>
> use Bio::Root::IO;
>
> use base qw(Bio::DB::BioFetch);
>
> BEGIN {
>
>     $Bio::DB::BioFetch::FORMATMAP{ipi} = {
>         default   => 'swiss',    # default BioFetch format/SeqIO module pair
>         swissprot => 'swiss',    # alternative BioFetch format/module pair
>         fasta     => 'fasta',    # alternative BioFetch format/module pair
>         namespace => 'ipi',
>     };
>     $Bio::DB::BioFetch::FORMATMAP{uniparc} = {
>         default   => 'swiss',    # default BioFetch format/SeqIO module pair
>         swissprot => 'swiss',    # alternative BioFetch format/module pair
>         fasta     => 'fasta',    # alternative BioFetch format/module pair
>         namespace => 'uniparc',
>     };
> }
>
> =head2 postprocess_data
>
>  Title   : postprocess_data
>  Usage   : $self->postprocess_data ( 'type' => 'string',
>                                      'location' => \$datastr);
>  Function: process downloaded data before loading into a Bio::SeqIO
>  Returns : void
>  Args    : hash with two keys - 'type' can be 'string' or 'file'
>                               - 'location' either file location or
>                                 string reference containing data
>
> =cut
>
> sub postprocess_data {
>     my ($self,%args) = @_;
>
>     # check for errors in the stream
>     if ($args{'type'} eq 'string') {
>         my $stringref = $args{'location'};
>         if ($$stringref =~ /^ERROR (\d+) (.+)/m) {
>             $self->throw("BioFetch Error $1: $2");
>         }
>
>         # Post-process: convert UniParc XML format to Swiss format
>         if ($$stringref =~ /^<entry accession=/m) {
>
>             my @pSeq = ();
>             while ($$stringref =~ /^(<entry accession=.+?)<\/entry>$/msg) {
>                 # Get an entry
>                 my $seqEntry = $1;
>                 $seqEntry =~ s/[\n\r]+/\n/g;
>                 # Get ID
>                 my ($id) = ( $seqEntry =~ /<entry accession="(UPI\w+)">/m );
>                 # Get DR, database cross-references
>                 my @dr = ();
>                 while ($seqEntry =~ /<dbReference db="(\S+).*?" id="(\S+)" .+? active="(\S+)" created="(\S+)" last="(\S+)"\/>$/mg) {
>                     push (@dr, "DR $1; $2; $3; $4; $5.");
>                 }
>                 # Get SQ, sequence itself
>                 my ($len,$crc,$seq) = ( $seqEntry =~ /<sequence length="(\S+)" crc64="(\S+)">$(.+?)<\/sequence>/ms );
>                 $seq =~ s/^/ /mg;
>                 $seq =~ s/(\w{10})/ $1/mg;
>                 $seq =~ s/(\w{10})(\w{1,9})$/$1 $2/m;
>                 $seq =~ s/^( \w{1,9})$/ $1/m;
>                 # Append to results
>                 push( @pSeq,
>                       sprintf("ID %-20s Reviewed; % 5d AA.\n", $id,$len),
>                       join("\n",@dr),
>                       "\nSQ SEQUENCE $len AA; $crc CRC64;$seq//\n" );
>             }
>             # Replace input string by results
>             $$stringref = join('',@pSeq);
>
>         }
>     }
>
>     elsif ($args{'type'} eq 'file') {
>         open (my $fh, '<', $args{'location'})
>             or $self->throw("Couldn't open $args{location}: $!");
>         # this is dumb, but the error may be anywhere on the first three
>         # lines because the CGI headers are sometimes printed out by the
>         # server...
>         my @data = (scalar <$fh>,scalar <$fh>,scalar <$fh>);
>         # a short file yields undef lines; treat them as empty
>         if (join('', map { defined $_ ? $_ : '' } @data) =~ /^ERROR (\d+) (.+)/m) {
>             $self->throw("BioFetch Error $1: $2");
>         }
>         close $fh;
>     }
>
>     else {
>         $self->throw("Don't know how to postprocess data of type $args{'type'}");
>     }
> }
>
> 1;
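
One note on the sequence layout step: the chain of substitutions on $seq
depends on exact whitespace, which may not have survived in the archived
copy of this message. A minimal alternative sketch, assuming the usual
Swiss-Prot layout of 60 residues per line in blocks of 10 with a
five-space indent, could use unpack instead. The helper name
swiss_sequence_block is hypothetical and not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper: lay out a raw protein sequence as Swiss-Prot style
# sequence lines (60 residues per line, blocks of 10, five-space indent),
# followed by the "//" record terminator.
sub swiss_sequence_block {
    my ($seq) = @_;
    $seq =~ s/\s+//g;    # remove embedded newlines and spaces

    my @lines;
    for my $chunk (unpack '(A60)*', $seq) {      # 60 residues per line
        my @blocks = unpack '(A10)*', $chunk;    # groups of 10 residues
        push @lines, '     ' . join(' ', @blocks);
    }
    return join("\n", @lines) . "\n//\n";
}
```

The returned string could be appended directly after the SQ header line
in place of "$seq//\n".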
>
>
> Samuel GRANJEAUD - IR/IFR137 wrote:
>> Hello !
>>
>> I would like to fill a BioSeq object with the output of a dbfetch
>> request at EBI on the UniParc database (which returns only XML, as I
>> am interested in references). If somebody could tell me which BioPerl
>> object to use, suggest a way to convert the XML to Swiss format, or
>> share a piece of code, I would appreciate it a lot. (Is
>> http://doc.bioperl.org/bioperl-live/Bio/SeqIO/interpro.html a good
>> starting point?)
>>
>> Best regards,
>> --Samuel
>>
>> <entry accession="UPI00004A0D4A">
>> <dbReferenceList>
>> <dbReference db="EMBL" id="CAI39485" version="1" version_i="1" active="Y" created="04-Jan-2005" last="15-Dec-2006"/>
>> <dbReference db="UniProtKB/TrEMBL" id="Q5JVT0" version="1" version_i="1" active="N" created="15-Feb-2005" last="06-Feb-2007"/>
>> <dbReference db="ENSEMBL" id="ENSP00000352958" version_i="2" active="Y" created="03-Apr-2006" last="27-Nov-2006"/>
>> <dbReference db="IPI" id="IPI00418471" version="4" version_i="4" active="N" created="07-Mar-2005" last="07-Mar-2005"/>
>> <dbReference db="IPI" id="IPI00646867" version="1" version_i="1" active="N" created="06-Sep-2005" last="06-Oct-2006"/>
>> <dbReference db="VEGA" id="OTTHUMP00000019225" version_i="1" active="N" created="15-Aug-2005" last="02-Dec-2005"/>
>> </dbReferenceList>
>> <sequence length="431" crc64="8913D1F04A71CCFB">
>> MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV
>> YATRSSAVRLRSSVPGVRLLQDSVDFSLADAINTEFKNTRTNEKVELQELNDRFANYIDK
>> VRFLEQQNKILLAELEQLKGQGKSRLGDLYEEEMRELRRQVDQLTNDKARVEVERDNLAE
>> DIMRLREKLQEEMLQREEAENTLQSFRQDVDNASLARLDLERKVESLQEEIAFLKKLHEE
>> EIQELQAQIQEQHVQIDVDVSKPDLTAALRDVRQQYESVAAKNLQEAEEWYKSKFADLSE
>> AANRNNDALRQAKQESTEYRRQVQSLTCEVDALKGTNESLERQMREMEENFAVEAANYQD
>> TIGRLQDEIQNMKEEMARHLREYQDLLNVKMALDIEIATYRKLLEGEESRISLPLPNFSS
>> LNLRGKHFISL
>> </sequence>
>> </entry>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign