[Bioperl-l] Reading a XML sequence (UniParc) into a BioSeq object

Wed Mar 14 14:59:30 UTC 2007

That's probably the best short-term fix, though I'm sure it's quite a
bit slower than a direct UniParc XML-to-Bio::Seq conversion via SeqIO.
I am looking into adding a few more XML::SAX-based parsers (INSDSeqXML,
GBSeqXML, EMBLXML, etc.), so we could add UniProt XML to the list
(which I think UniParc uses, correct?).
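
As a rough illustration of the direct route, here is a minimal sketch in
plain Perl that pulls the accession, cross-references and sequence out of
one UniParc entry. It uses regexes as a stand-in, not XML::SAX, and it is
not BioPerl code. The sub name parse_uniparc_entry and the hash keys are
made up for this example; the sample data is the entry from the original
post, trimmed to two dbReference lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): extract the accession,
# cross-references and sequence from one UniParc <entry> element.
# Regex-based stand-in for a real XML parser.
sub parse_uniparc_entry {
    my ($xml) = @_;

    my ($acc) = $xml =~ /<entry\s+accession="([^"]+)"/
        or return;

    # Collect each <dbReference .../> as a hash of its attributes.
    my @xrefs;
    while ($xml =~ /<dbReference\s+([^>]*?)\s*\/>/g) {
        my $attrs = $1;
        my %attr  = $attrs =~ /(\w+)="([^"]*)"/g;
        push @xrefs, \%attr;
    }

    my ($len, $crc, $seq) =
        $xml =~ /<sequence\s+length="(\d+)"\s+crc64="(\w+)"\s*>(.*?)<\/sequence>/s
        or return;
    $seq =~ s/\s+//g;    # drop the line breaks inside the sequence

    return {
        accession => $acc,
        xrefs     => \@xrefs,
        length    => $len,
        crc64     => $crc,
        seq       => $seq,
    };
}

# Sample entry from the original post (only two dbReference lines kept).
my $entry = <<'XML';
<entry accession="UPI00004A0D4A">
<dbReferenceList>
    <dbReference db="EMBL" id="CAI39485" version="1" version_i="1" active="Y" created="04-Jan-2005" last="15-Dec-2006"/>
    <dbReference db="ENSEMBL" id="ENSP00000352958" version_i="2" active="Y" created="03-Apr-2006" last="27-Nov-2006"/>
</dbReferenceList>
<sequence length="431" crc64="8913D1F04A71CCFB">
MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV
YATRSSAVRLRSSVPGVRLLQDSVDFSLADAINTEFKNTRTNEKVELQELNDRFANYIDK
VRFLEQQNKILLAELEQLKGQGKSRLGDLYEEEMRELRRQVDQLTNDKARVEVERDNLAE
DIMRLREKLQEEMLQREEAENTLQSFRQDVDNASLARLDLERKVESLQEEIAFLKKLHEE
EIQELQAQIQEQHVQIDVDVSKPDLTAALRDVRQQYESVAAKNLQEAEEWYKSKFADLSE
AANRNNDALRQAKQESTEYRRQVQSLTCEVDALKGTNESLERQMREMEENFAVEAANYQD
TIGRLQDEIQNMKEEMARHLREYQDLLNVKMALDIEIATYRKLLEGEESRISLPLPNFSS
LNLRGKHFISL
</sequence>
</entry>
XML

my $rec = parse_uniparc_entry($entry)
    or die "no UniParc entry found\n";
printf "%s  %d aa  %d xrefs\n",
    $rec->{accession}, length($rec->{seq}), scalar @{ $rec->{xrefs} };
```

The resulting hash could then be handed to something like
Bio::Seq->new(-display_id => $rec->{accession}, -seq => $rec->{seq},
-alphabet => 'protein'), with the cross-references attached as
annotations. That step is left out here to keep the sketch free of
dependencies.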

chris

On Mar 14, 2007, at 8:19 AM, Samuel GRANJEAUD - IR/IFR137 wrote:

> Hi,
>
> Since nobody gave me a clue or told me that my question is silly (it
> should be ;-) ), I ended up writing a hack: a class that inherits
> from BioFetch and overrides the postprocess_data method to convert
> UniParc XML to Swiss format. The nicer approach of parsing the UniParc
> XML and building an object directly was too hard for me.
>
> It's amazing what BioPerl can do.
>
> Regards,
> --Samuel
>
> =head1 NAME
>
> ICIM::Bio::DB::BioFetch - Database object interface to BioFetch retrieval
>
> =head1 SYNOPSIS
>
>  see Bio::DB::BioFetch
>
> =head1 DESCRIPTION
>
> See Bio::DB::BioFetch for main description.
>
> The BEGIN block adds a few databases (ipi and uniparc).
>
> The postprocess_data method converts UniParc XML format to Swiss format
> for the string transfer type.
>
> =head1 SEE ALSO
>
> This module inherits from BioFetch.
> http://doc.bioperl.org/bioperl-live/Bio/DB/BioFetch.html
>
> This module is a light copy of BioFetch.
>
> =head1 AUTHOR
>
> Email Samuel Granjeaud, E<lt>granjeau at tagc.univ-mrs.frE<gt>
>
> =head1 APPENDIX
>
> The rest of the documentation details each of the object
> methods. Internal methods are usually preceded with a _
>
> =cut
>
> # Let the code begin...
>
> package ICIM::Bio::DB::BioFetch;
>
> use strict;
> use warnings;
>
> use Bio::Root::IO;
>
> use base qw(Bio::DB::BioFetch);
>
> BEGIN {
>
>     $Bio::DB::BioFetch::FORMATMAP{ipi} = {
>         default   => 'swiss', # default BioFetch format/SeqIOmodule pair
>         swissprot => 'swiss', # alternative BioFetch format/module pair
>         fasta     => 'fasta', # alternative BioFetch format/module pair
>         namespace => 'ipi',
>     };
>     $Bio::DB::BioFetch::FORMATMAP{uniparc} = {
>         default   => 'swiss', # default BioFetch format/SeqIOmodule pair
>         swissprot => 'swiss', # alternative BioFetch format/module pair
>         fasta     => 'fasta', # alternative BioFetch format/module pair
>         namespace => 'uniparc',
>     };
> }
>
> =head2 postprocess_data
>
>  Title   : postprocess_data
>  Usage   : $self->postprocess_data ( 'type' => 'string',
>                      'location' => \$datastr);
>  Function: process downloaded data before loading into a Bio::SeqIO
>  Returns : void
>  Args    : hash with two keys - 'type' can be 'string' or 'file'
>                               - 'location' either file location or
>                                  string reference containing data
>
> =cut
>
> sub postprocess_data {
>     my ($self,%args) = @_;
>
>     # check for errors in the stream
>     if ($args{'type'} eq 'string') {
>         my $stringref = $args{'location'};
>         if ($$stringref =~ /^ERROR (\d+) (.+)/m) {
>             $self->throw("BioFetch Error $1: $2");
>         }
>
>         # Post-process: convert UniParc XML format to Swiss format
>         if ($$stringref =~ /^<entry accession=/m) {
>
>         my @pSeq = ();
>         while ($$stringref =~ /^(<entry accession=.+?)<\/entry>$/msg) {
>             # Get an entry
>             my $seqEntry = $1;
>             $seqEntry =~ s/[\n\r]+/\n/g;
>             # Get ID
>             my ($id)  = ( $seqEntry =~ /<entry accession="(UPI\w+)">/m );
>             # Get DR, database cross-references
>             my @dr = ();
>             while ($seqEntry =~ /<dbReference db="(\S+).*?" id="(\S+)" .+? active="(\S+)" created="(\S+)" last="(\S+)"\/>$/mg) {
>                 push (@dr, "DR   $1; $2; $3; $4; $5.");
>             }
>             # Get SQ, sequence itself
>             my ($len,$crc,$seq) = ( $seqEntry =~ /<sequence length="(\S+)" crc64="(\S+)">$(.+?)<\/sequence>/ms );
>             $seq =~ s/^/    /mg;
>             $seq =~ s/(\w{10})/ $1/mg;
>             $seq =~ s/(\w{10})(\w{1,9})$/$1 $2/m;
>             $seq =~ s/^(    \w{1,9})$/ $1/m;
>             # Append to results
>             push( @pSeq,
>                 sprintf("ID   %-20s   Reviewed;       % 5d AA.\n", $id,$len),
>                 join("\n",@dr),"\nSQ   SEQUENCE   $len AA;  $crc CRC64;$seq//\n" );
>         }
>         # Replace input string by results
>         $$stringref = join('',@pSeq);
>
>         }
>     }
>
>     elsif ($args{'type'} eq 'file') {
>         open (F,$args{'location'}) or $self->throw("Couldn't open $args{location}: $!");
>         # this is dumb, but the error may be anywhere on the first three
>         # lines because the CGI headers are sometimes printed out by the
>         # server...
>         my @data = (scalar <F>,scalar <F>,scalar <F>);
>         if (join('',@data) =~ /^ERROR (\d+) (.+)/m) {
>             $self->throw("BioFetch Error $1: $2");
>         }
>         close F;
>     }
>
>     else {
>         $self->throw("Don't know how to postprocess data of type $args{'type'}");
>     }
> }
>
> 1;
>
>
> Samuel GRANJEAUD - IR/IFR137 wrote:
>> Hello !
>>
>> I would like to fill a BioSeq object with the output from a dbfetch
>> request at EBI on the UniParc database (which replies only with XML,
>> as I am interested in the references). If somebody could tell me
>> which BioPerl object to use, suggest a way to convert it to Swiss
>> format, or share a piece of code (is
>> http://doc.bioperl.org/bioperl-live/Bio/SeqIO/interpro.html a good
>> starting point?), I would appreciate it a lot.
>>
>> Best regards,
>> --Samuel
>>
>> <entry accession="UPI00004A0D4A">
>> <dbReferenceList>
>>     <dbReference db="EMBL" id="CAI39485" version="1" version_i="1"
>> active="Y" created="04-Jan-2005" last="15-Dec-2006"/>
>>     <dbReference db="UniProtKB/TrEMBL" id="Q5JVT0" version="1"
>> version_i="1" active="N" created="15-Feb-2005" last="06-Feb-2007"/>
>>     <dbReference db="ENSEMBL" id="ENSP00000352958" version_i="2"
>> active="Y" created="03-Apr-2006" last="27-Nov-2006"/>
>>     <dbReference db="IPI" id="IPI00418471" version="4" version_i="4"
>> active="N" created="07-Mar-2005" last="07-Mar-2005"/>
>>     <dbReference db="IPI" id="IPI00646867" version="1" version_i="1"
>> active="N" created="06-Sep-2005" last="06-Oct-2006"/>
>>     <dbReference db="VEGA" id="OTTHUMP00000019225" version_i="1"
>> active="N" created="15-Aug-2005" last="02-Dec-2005"/>
>> </dbReferenceList>
>> <sequence length="431" crc64="8913D1F04A71CCFB">
>> MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV
>> YATRSSAVRLRSSVPGVRLLQDSVDFSLADAINTEFKNTRTNEKVELQELNDRFANYIDK
>> VRFLEQQNKILLAELEQLKGQGKSRLGDLYEEEMRELRRQVDQLTNDKARVEVERDNLAE
>> DIMRLREKLQEEMLQREEAENTLQSFRQDVDNASLARLDLERKVESLQEEIAFLKKLHEE
>> EIQELQAQIQEQHVQIDVDVSKPDLTAALRDVRQQYESVAAKNLQEAEEWYKSKFADLSE
>> AANRNNDALRQAKQESTEYRRQVQSLTCEVDALKGTNESLERQMREMEENFAVEAANYQD
>> TIGRLQDEIQNMKEEMARHLREYQDLLNVKMALDIEIATYRKLLEGEESRISLPLPNFSS
>> LNLRGKHFISL
>> </sequence>
>> </entry>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign