[Bioperl-l] Reading a XML sequence (UniParc) into a BioSeq object
Samuel GRANJEAUD - IR/IFR137
granjeau at tagc.univ-mrs.fr
Wed Mar 14 13:19:45 UTC 2007
Hi,
Since nobody gave me a clue or told me that my question is silly (it
must be ;-) ), I finally wrote a hack: an object that inherits from
BioFetch and overrides the postprocess_data method to convert UniParc
XML to Swiss format. The really nice approach, parsing the UniParc XML
and creating an object directly, was too hard for me.
It's amazing what BioPerl can do.
Regards,
--Samuel
=head1 NAME

ICIM::Bio::DB::BioFetch - Database object interface to BioFetch retrieval

=head1 SYNOPSIS

see Bio::DB::BioFetch

=head1 DESCRIPTION

See Bio::DB::BioFetch for main description.

The BEGIN block adds a few databases.

The postprocess_data method converts UniParc XML format to Swiss format
for the string transfer type.

=head1 SEE ALSO

This module inherits from BioFetch.
http://doc.bioperl.org/bioperl-live/Bio/DB/BioFetch.html

This module is a light copy of BioFetch.

=head1 AUTHOR

Email Samuel Granjeaud, E<lt>granjeau at tagc.univ-mrs.frE<gt>

=head1 APPENDIX

The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _

=cut
# Let the code begin...
package ICIM::Bio::DB::BioFetch;
use strict;
use warnings;
use Bio::Root::IO;
use base qw(Bio::DB::BioFetch);
BEGIN {
    $Bio::DB::BioFetch::FORMATMAP{ipi} = {
        default   => 'swiss',  # default BioFetch format/SeqIO module pair
        swissprot => 'swiss',  # alternative BioFetch format/module pair
        fasta     => 'fasta',  # alternative BioFetch format/module pair
        namespace => 'ipi',
    };
    $Bio::DB::BioFetch::FORMATMAP{uniparc} = {
        default   => 'swiss',  # default BioFetch format/SeqIO module pair
        swissprot => 'swiss',  # alternative BioFetch format/module pair
        fasta     => 'fasta',  # alternative BioFetch format/module pair
        namespace => 'uniparc',
    };
}
=head2 postprocess_data

 Title   : postprocess_data
 Usage   : $self->postprocess_data ( 'type' => 'string',
                                     'location' => \$datastr);
 Function: process downloaded data before loading into a Bio::SeqIO
 Returns : void
 Args    : hash with two keys - 'type' can be 'string' or 'file'
                              - 'location' either file location or string
                                reference containing data

=cut
sub postprocess_data {
    my ($self,%args) = @_;
    # check for errors in the stream
    if ($args{'type'} eq 'string') {
        my $stringref = $args{'location'};
        if ($$stringref =~ /^ERROR (\d+) (.+)/m) {
            $self->throw("BioFetch Error $1: $2");
        }
        # Post-process: convert UniParc XML format to Swiss format
        if ($$stringref =~ /^<entry accession=/m) {
            my @pSeq = ();
            while ($$stringref =~ /^(<entry accession=.+?)<\/entry>$/msg) {
                # Get an entry
                my $seqEntry = $1;
                $seqEntry =~ s/[\n\r]+/\n/g;
                # Get ID
                my ($id) = ( $seqEntry =~ /<entry accession="(UPI\w+)">/m );
                # Get DR, database cross-references
                my @dr = ();
                while ($seqEntry =~ /<dbReference db="(\S+).*?" id="(\S+)" .+? active="(\S+)" created="(\S+)" last="(\S+)"\/>$/mg) {
                    push (@dr, "DR $1; $2; $3; $4; $5.");
                }
                # Get SQ, sequence itself
                my ($len,$crc,$seq) = ( $seqEntry =~ /<sequence length="(\S+)" crc64="(\S+)">$(.+?)<\/sequence>/ms );
                $seq =~ s/^/ /mg;
                $seq =~ s/(\w{10})/ $1/mg;
                $seq =~ s/(\w{10})(\w{1,9})$/$1 $2/m;
                $seq =~ s/^( \w{1,9})$/ $1/m;
                # Append to results
                push( @pSeq,
                      sprintf("ID %-20s Reviewed; % 5d AA.\n",$id,$len),
                      join("\n",@dr),
                      "\nSQ SEQUENCE $len AA; $crc CRC64;$seq//\n" );
            }
            # Replace input string by results
            $$stringref = join('',@pSeq);
        }
    }
    elsif ($args{'type'} eq 'file') {
        open (F,$args{'location'}) or $self->throw("Couldn't open $args{location}: $!");
        # this is dumb, but the error may be anywhere on the first three
        # lines because the CGI headers are sometimes printed out by the
        # server...
        my @data = (scalar <F>,scalar <F>,scalar <F>);
        if (join('',@data) =~ /^ERROR (\d+) (.+)/m) {
            $self->throw("BioFetch Error $1: $2");
        }
        close F;
    }
    else {
        $self->throw("Don't know how to postprocess data of type $args{'type'}");
    }
}
1;
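To check the dbReference pattern without contacting the server, here is a minimal sketch that runs the same regular expression over two of the elements from the sample entry quoted further down. It uses core Perl only. The helper name dr_lines is hypothetical and not part of the module, and the sketch assumes the server sends each dbReference element on a single line (the wrapping visible in the quoted sample comes from the mail client).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module above): collect Swiss-style
# DR lines from one UniParc entry, using the same pattern as the module.
sub dr_lines {
    my ($entry) = @_;
    my @dr;
    while ($entry =~ /<dbReference db="(\S+).*?" id="(\S+)" .+? active="(\S+)" created="(\S+)" last="(\S+)"\/>$/mg) {
        push @dr, "DR $1; $2; $3; $4; $5.";
    }
    return @dr;
}

# Two elements taken from the sample entry, unwrapped to one line each
# (assumption: this is how the server delivers them).
my $sample = <<'XML';
<dbReferenceList>
<dbReference db="EMBL" id="CAI39485" version="1" version_i="1" active="Y" created="04-Jan-2005" last="15-Dec-2006"/>
<dbReference db="ENSEMBL" id="ENSP00000352958" version_i="2" active="Y" created="03-Apr-2006" last="27-Nov-2006"/>
</dbReferenceList>
XML

print "$_\n" for dr_lines($sample);
```

Since the pattern has no /s modifier and expects literal spaces between attributes, an element that is split across two lines would be skipped silently; that is worth keeping in mind if the server output ever changes.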
Samuel GRANJEAUD - IR/IFR137 wrote:
> Hello !
>
> I would like to fill a BioSeq object with the output of a dbfetch
> request to the UniParc database at EBI (which returns only XML; I am
> interested in the cross-references). Could somebody tell me which
> BioPerl object to use, suggest a way to convert the XML to Swiss
> format, or share a piece of code? Is
> http://doc.bioperl.org/bioperl-live/Bio/SeqIO/interpro.html a good
> starting point? I would appreciate it a lot.
>
> Best regards,
> --Samuel
>
> <entry accession="UPI00004A0D4A">
> <dbReferenceList>
> <dbReference db="EMBL" id="CAI39485" version="1" version_i="1"
> active="Y" created="04-Jan-2005" last="15-Dec-2006"/>
> <dbReference db="UniProtKB/TrEMBL" id="Q5JVT0" version="1"
> version_i="1" active="N" created="15-Feb-2005" last="06-Feb-2007"/>
> <dbReference db="ENSEMBL" id="ENSP00000352958" version_i="2"
> active="Y" created="03-Apr-2006" last="27-Nov-2006"/>
> <dbReference db="IPI" id="IPI00418471" version="4" version_i="4"
> active="N" created="07-Mar-2005" last="07-Mar-2005"/>
> <dbReference db="IPI" id="IPI00646867" version="1" version_i="1"
> active="N" created="06-Sep-2005" last="06-Oct-2006"/>
> <dbReference db="VEGA" id="OTTHUMP00000019225" version_i="1"
> active="N" created="15-Aug-2005" last="02-Dec-2005"/>
> </dbReferenceList>
> <sequence length="431" crc64="8913D1F04A71CCFB">
> MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGV
> YATRSSAVRLRSSVPGVRLLQDSVDFSLADAINTEFKNTRTNEKVELQELNDRFANYIDK
> VRFLEQQNKILLAELEQLKGQGKSRLGDLYEEEMRELRRQVDQLTNDKARVEVERDNLAE
> DIMRLREKLQEEMLQREEAENTLQSFRQDVDNASLARLDLERKVESLQEEIAFLKKLHEE
> EIQELQAQIQEQHVQIDVDVSKPDLTAALRDVRQQYESVAAKNLQEAEEWYKSKFADLSE
> AANRNNDALRQAKQESTEYRRQVQSLTCEVDALKGTNESLERQMREMEENFAVEAANYQD
> TIGRLQDEIQNMKEEMARHLREYQDLLNVKMALDIEIATYRKLLEGEESRISLPLPNFSS
> LNLRGKHFISL
> </sequence>
> </entry>