[Bioperl-l] Reading a XML sequence (UniParc) into a BioSeq object

Wed Mar 14 13:19:45 UTC 2007

Hi,

Since nobody gave me a hint or told me that my question was silly (it
probably is ;-) ), I finally wrote a hack: a class that inherits from
BioFetch and overrides the postprocess_data method to convert UniParc
XML to Swiss format. The really nice approach, parsing the UniParc XML
and creating an object from it directly, was too hard for me.

It's amazing what BioPerl can do.

Regards,
--Samuel
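
For readers who want to see the extraction step in isolation, here is a minimal sketch in plain Perl (no BioPerl needed). The function name parse_uniparc_entry and the hash layout are assumptions made for illustration; they are not part of BioPerl or of the module in this message. The sample data is the entry quoted at the end of this message, with the mail line wraps undone.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: pull the accession, the
# cross-references and the sequence out of one UniParc <entry> element.
sub parse_uniparc_entry {
    my ($xml) = @_;
    my %entry = (xrefs => []);
    ($entry{accession}) = $xml =~ /<entry\s+accession="([^"]+)"/;
    # Grab each <dbReference .../> tag, then read its attributes by name,
    # so that optional attributes (such as version) cannot shift anything.
    while ($xml =~ /<dbReference\s+([^>]*?)\/>/g) {
        my $attrs = $1;
        my %attr  = $attrs =~ /(\w+)="([^"]*)"/g;
        push @{ $entry{xrefs} }, \%attr;
    }
    if ($xml =~ /<sequence\s+length="(\d+)"\s+crc64="(\w+)">(.*?)<\/sequence>/s) {
        @entry{qw(length crc64)} = ($1, $2);
        ($entry{seq} = $3) =~ s/\s+//g;    # strip line breaks from the residues
    }
    return \%entry;
}

my $xml = <<'XML';
<entry accession="UPI00004A0D4A">
<dbReferenceList>
    <dbReference db="EMBL" id="CAI39485" version="1" version_i="1" active="Y" created="04-Jan-2005" last="15-Dec-2006"/>
    <dbReference db="UniProtKB/TrEMBL" id="Q5JVT0" version="1" version_i="1" active="N" created="15-Feb-2005" last="06-Feb-2007"/>
    <dbReference db="ENSEMBL" id="ENSP00000352958" version_i="2" active="Y" created="03-Apr-2006" last="27-Nov-2006"/>
    <dbReference db="IPI" id="IPI00418471" version="4" version_i="4" active="N" created="07-Mar-2005" last="07-Mar-2005"/>
    <dbReference db="IPI" id="IPI00646867" version="1" version_i="1" active="N" created="06-Sep-2005" last="06-Oct-2006"/>
    <dbReference db="VEGA" id="OTTHUMP00000019225" version_i="1" active="N" created="15-Aug-2005" last="02-Dec-2005"/>
</dbReferenceList>
<sequence length="431" crc64="8913D1F04A71CCFB">
MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV
YATRSSAVRLRSSVPGVRLLQDSVDFSLADAINTEFKNTRTNEKVELQELNDRFANYIDK
VRFLEQQNKILLAELEQLKGQGKSRLGDLYEEEMRELRRQVDQLTNDKARVEVERDNLAE
DIMRLREKLQEEMLQREEAENTLQSFRQDVDNASLARLDLERKVESLQEEIAFLKKLHEE
EIQELQAQIQEQHVQIDVDVSKPDLTAALRDVRQQYESVAAKNLQEAEEWYKSKFADLSE
AANRNNDALRQAKQESTEYRRQVQSLTCEVDALKGTNESLERQMREMEENFAVEAANYQD
TIGRLQDEIQNMKEEMARHLREYQDLLNVKMALDIEIATYRKLLEGEESRISLPLPNFSS
LNLRGKHFISL
</sequence>
</entry>
XML

my $entry = parse_uniparc_entry($xml);
printf "%s: %d cross-references, %d residues\n",
    $entry->{accession}, scalar @{ $entry->{xrefs} }, length $entry->{seq};
# One line per cross-reference, in the field order used for the DR lines.
print join('; ', @{$_}{qw(db id active created last)}), "\n"
    for @{ $entry->{xrefs} };
```

Reading the attributes by name rather than by position is only one possible design; it avoids depending on the attribute order, which the regular expressions in the module do assume.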

=head1 NAME

ICIM::Bio::DB::BioFetch - Database object interface to BioFetch retrieval

=head1 SYNOPSIS

 see Bio::DB::BioFetch

=head1 DESCRIPTION

See Bio::DB::BioFetch for main description.

The BEGIN block registers two additional databases, ipi and uniparc.

The postprocess_data method converts UniParc XML to Swiss format when
the data arrives as a string (the 'string' transfer type).

=head1 SEE ALSO

This module inherits from BioFetch.
http://doc.bioperl.org/bioperl-live/Bio/DB/BioFetch.html

This module is a light copy of BioFetch.

=head1 AUTHOR

Email Samuel Granjeaud, E<lt>granjeau at tagc.univ-mrs.frE<gt>

=head1 APPENDIX

The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _

=cut

# Let the code begin...

package ICIM::Bio::DB::BioFetch;

use strict;
use warnings;

use Bio::Root::IO;

use base qw(Bio::DB::BioFetch);

BEGIN {

    $Bio::DB::BioFetch::FORMATMAP{ipi} = {
        default   => 'swiss', # default BioFetch format/SeqIOmodule pair
        swissprot => 'swiss', # alternative BioFetch format/module pair
        fasta     => 'fasta', # alternative BioFetch format/module pair
        namespace => 'ipi',
    };
    $Bio::DB::BioFetch::FORMATMAP{uniparc} = {
        default   => 'swiss', # default BioFetch format/SeqIOmodule pair
        swissprot => 'swiss', # alternative BioFetch format/module pair
        fasta     => 'fasta', # alternative BioFetch format/module pair
        namespace => 'uniparc',
    };
}

=head2 postprocess_data

 Title   : postprocess_data
 Usage   : $self->postprocess_data ( 'type' => 'string',
                     'location' => \$datastr);
 Function: process downloaded data before loading into a Bio::SeqIO
 Returns : void
 Args    : hash with two keys - 'type' can be 'string' or 'file'
                              - 'location' either file location or string
                                 reference containing data

=cut

sub postprocess_data {
    my ($self,%args) = @_;

    # check for errors in the stream
    if ($args{'type'} eq 'string') {
        my $stringref = $args{'location'};
        if ($$stringref =~ /^ERROR (\d+) (.+)/m) {
            $self->throw("BioFetch Error $1: $2");
        }

        # Post-process: convert UniParc XML format to Swiss format
        if ($$stringref =~ /^<entry accession=/m) {

        my @pSeq = ();
        while ($$stringref =~ /^(<entry accession=.+?)<\/entry>$/msg) {
            # Get an entry
            my $seqEntry = $1;
            $seqEntry =~ s/[\n\r]+/\n/g;
            # Get ID
            my ($id)  = ( $seqEntry =~ /<entry accession="(UPI\w+)">/m );
            # Get DR, database cross-references
            my @dr = ();
            while ($seqEntry =~ /<dbReference db="(\S+).*?" id="(\S+)" .+? active="(\S+)" created="(\S+)" last="(\S+)"\/>$/mg) {
                push (@dr, "DR   $1; $2; $3; $4; $5.");
            }
            # Get SQ, sequence itself
            my ($len,$crc,$seq) = ( $seqEntry =~ /<sequence length="(\S+)" crc64="(\S+)">$(.+?)<\/sequence>/ms );
            $seq =~ s/^/    /mg;
            $seq =~ s/(\w{10})/ $1/mg;
            $seq =~ s/(\w{10})(\w{1,9})$/$1 $2/m;
            $seq =~ s/^(    \w{1,9})$/ $1/m;
            # Append to results
            push( @pSeq,
                sprintf("ID   %-20s   Reviewed;       % 5d AA.\n",$id,$len),
                join("\n",@dr),"\nSQ   SEQUENCE   $len AA;  $crc CRC64;$seq//\n" );
        }
        # Replace input string by results
        $$stringref = join('',@pSeq);

        }
    }

    elsif ($args{'type'} eq 'file') {
        open(my $fh, '<', $args{'location'})
            or $self->throw("Couldn't open $args{location}: $!");
        # This is dumb, but the error may be anywhere in the first three
        # lines because the server sometimes prints CGI headers.
        # grep drops undef entries when the file has fewer than three lines.
        my @data = grep { defined } (scalar <$fh>, scalar <$fh>, scalar <$fh>);
        if (join('',@data) =~ /^ERROR (\d+) (.+)/m) {
            $self->throw("BioFetch Error $1: $2");
        }
        close $fh;
    }

    else {
        $self->throw("Don't know how to postprocess data of type $args{'type'}");
    }
}

1;
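
The four substitutions that lay out the sequence in postprocess_data can be hard to follow. As a rough sketch of the same layout (lines of 60 residues in blocks of 10, each line indented by five spaces), the helper below chunks the string explicitly. The name swiss_seq_lines is hypothetical; it is not part of the module or of BioPerl, and the layout is my reading of what the substitutions produce.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: lay out a raw residue string
# as 60 residues per line, in blocks of 10, indented by five spaces.
sub swiss_seq_lines {
    my ($seq) = @_;
    $seq =~ s/\s+//g;                            # drop line breaks left over from the XML
    my @lines;
    while (length $seq) {
        my $chunk  = substr($seq, 0, 60, '');    # take and remove up to 60 residues
        my @blocks = $chunk =~ /(\w{1,10})/g;    # blocks of at most 10 residues
        push @lines, '     ' . join(' ', @blocks);
    }
    return join("\n", @lines) . "\n";
}

# First 71 residues of the entry quoted in this thread.
print swiss_seq_lines(
    "MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV\n"
  . "YATRSSAVRLR\n"
);
```

Explicit chunking trades the compactness of the regular expressions for a version where the line length and block size are visible as plain numbers.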

Samuel GRANJEAUD - IR/IFR137 wrote:
> Hello !
>
> I would like to fill a BioSeq object with the output of a dbfetch
> request at EBI against the UniParc database (which replies only with
> XML; I am interested in the references). Could somebody tell me which
> BioPerl object to use, how to convert the XML to Swiss format, or
> share a piece of code? Is
> http://doc.bioperl.org/bioperl-live/Bio/SeqIO/interpro.html a good
> starting point? I would appreciate it a lot.
>
> Best regards,
> --Samuel
>
> <entry accession="UPI00004A0D4A">
> <dbReferenceList>
>     <dbReference db="EMBL" id="CAI39485" version="1" version_i="1" active="Y" created="04-Jan-2005" last="15-Dec-2006"/>
>     <dbReference db="UniProtKB/TrEMBL" id="Q5JVT0" version="1" version_i="1" active="N" created="15-Feb-2005" last="06-Feb-2007"/>
>     <dbReference db="ENSEMBL" id="ENSP00000352958" version_i="2" active="Y" created="03-Apr-2006" last="27-Nov-2006"/>
>     <dbReference db="IPI" id="IPI00418471" version="4" version_i="4" active="N" created="07-Mar-2005" last="07-Mar-2005"/>
>     <dbReference db="IPI" id="IPI00646867" version="1" version_i="1" active="N" created="06-Sep-2005" last="06-Oct-2006"/>
>     <dbReference db="VEGA" id="OTTHUMP00000019225" version_i="1" active="N" created="15-Aug-2005" last="02-Dec-2005"/>
> </dbReferenceList>
> <sequence length="431" crc64="8913D1F04A71CCFB">
> MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV
> YATRSSAVRLRSSVPGVRLLQDSVDFSLADAINTEFKNTRTNEKVELQELNDRFANYIDK
> VRFLEQQNKILLAELEQLKGQGKSRLGDLYEEEMRELRRQVDQLTNDKARVEVERDNLAE
> DIMRLREKLQEEMLQREEAENTLQSFRQDVDNASLARLDLERKVESLQEEIAFLKKLHEE
> EIQELQAQIQEQHVQIDVDVSKPDLTAALRDVRQQYESVAAKNLQEAEEWYKSKFADLSE
> AANRNNDALRQAKQESTEYRRQVQSLTCEVDALKGTNESLERQMREMEENFAVEAANYQD
> TIGRLQDEIQNMKEEMARHLREYQDLLNVKMALDIEIATYRKLLEGEESRISLPLPNFSS
> LNLRGKHFISL
> </sequence>
> </entry>