[Bioperl-l] Tag handling on SeqFeature::Generic
Marco Aurelio Valtas Cunha
mavcunha@gordon.fmrp.usp.br
Wed, 28 Aug 2002 14:41:12 -0300
This is a multi-part message in MIME format.
--------------070400030709020403030008
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Hi Hilmar,
I avoid post scripts to not load you guys, cause I know that everybody
is working and when there's a time you respond to the list.
Probably you won't be able to run this script cause will be missing some
files, but in short what I try to do is:
Data: A megablast result (Human NT contigs vs 900K ESTS) and a DB::GFF
(human mapview information from NCBI)
First problem is that GFF format is not a hierarchical format, so if I
blasted with NT then I need retrieve the position of the NT on the
Chromosome so I can discover the position of my ESTs on the chromosome.
Then I parse the megablast using Bonnie module MBPlite.pm, with DB::GFF
( and a help from Lincoln Stein) I convert blast coordenates to absolute
coordenates so I can save the GFF file with the right information, and
this file will be uploaded to my MySQL (DB::GFF) database.
To do all I use Bioperl, but I'm little experienced on Bioperl so
sometimes I choose not the best way to solve my problems.
Marco.
Hilmar Lapp wrote:
>
>>-----Original Message-----
>>From: Jason Stajich [mailto:jason@cgt.mc.duke.edu]
>>Sent: Wednesday, August 28, 2002 7:32 AM
>>To: Marco Aurelio Valtas Cunha
>>Cc: Bioperl
>>Subject: Re: [Bioperl-l] Tag handling on SeqFeature::Generic
>
> [...]
>
>>>I felling that using SeqFeature::Generic is wrong. But I'm really
>>>confused with Tools:GFF and SeqFeature::Generic in both
>>
>>modules you can
>>
>>To clarify:
>>
>>Bio::Tools::GFF is probably badly named -- it is just and Input/Output
>>mechanism for Bio::SeqFeatureI objects and GFF format.
>
>
> In fact internally calling $seqfeature->gff_string() delegates to an instance of Bio::Tools::GFF to do the job. So there's really no difference between the output.
>
> Bio::Tools::GFF is indeed badly named -- it should be in a Bio::SeqFeatureIO directory to be consistent with Bioperl's other IO module layouts.
>
> Marco, if you want comments on how you accomplished the job in your script you should post the script, or its relevant parts.
>
> -hilmar
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
>
--
##############################################################
# Atenção meu email mudou para mavcunha@bit.fmrp.usp.br #
# Veja porque http://scarecrow.fmrp.usp.br/~mavcunha/public #
# Attention my email changed to mavcunha@bit.fmrp.usp.br #
# See why here http://scarecrow.fmrp.usp.br/~mavcunha/public #
##############################################################
Marco Aurélio Valtas Cunha
Laboratório de Bioinformática
Hemocentro de Ribeirão Preto
Faculdade de Medicina de Ribeirão Preto
Universidade de São Paulo
Tel 55 16 3963-9300 R: 9603
http://bit.fmrp.usp.br
http://scarecrow.fmrp.usp.br/~mavcunha/public/
email: mavcunha@bit.fmrp.usp.br
--------------070400030709020403030008
Content-Type: text/plain;
name="megabl2gff.pl"
Content-Transfer-Encoding: 8bit
Content-Disposition: inline;
filename="megabl2gff.pl"
#!/usr/bin/perl
#
# INGLÊS/ENGLISH
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# http://www.gnu.org/copyleft/gpl.html
#
#
# PORTUGUÊS/PORTUGUESE
# Este programa é distribuído na expectativa de ser útil aos seus
# usuários, porém NÃO TEM NENHUMA GARANTIA, EXPLÍCITAS OU IMPLÍCITAS,
# COMERCIAIS OU DE ATENDIMENTO A UMA DETERMINADA FINALIDADE. Consulte
# a Licença Pública Geral GNU para maiores detalhes.
# http://www.gnu.org/copyleft/gpl.html
#
# Copyright (C) 2002 Fundação Hemocentro de Ribeirão Preto
#
# Laboratório de Bioinformática
# BiT - Bioinformatic Team
# Fundação Hemocentro de Ribeirão Preto
# Rua Tenente Catão Roxo, 2501
# Ribeirão Preto - São Paulo
# Brasil
# CEP 14051-140
# Fone: 55 16 39639300 Ramal 9603
#
# Marco Aurelio Valtas Cunha
# mavcunha@bit.fmrp.usp.br
# http://bit.fmrp.usp.br
#
# $Id: megabl2gff.pl,v 1.1 2002/08/26 22:24:59 mavcunha Exp $
=head1 NAME
=head1 SYNOPSIS
=head1 ABSTRACT
=head1 DESCRIPTION
=head1 AUTHOR
Marco Valtas <mavcunha@bit.fmrp.usp.br>
Copyright (c) 2002 Regional Blood Center of Ribeirao Preto
=head1 LICENSE
GNU General Public License
http://www.gnu.org/copyleft/gpl.html
=cut
use strict;
use warnings;
use Getopt::Long;
use Bio::SeqFeature::Generic; # Working with sequence features;
use Bio::Tools::MBPlite; # Handle megablast;
use Bio::Tools::GFF; # Handle GFF with NT reports;
use Bio::DB::GFF;
use vars qw($DSN $HOST $USER $PASS $CONNECTION $nt_seqref_file $megablast_file);
Usage("Too few arguments") if $#ARGV<0;
GetOptions("help"=> sub { Usage(); },
"dsn|database=s"=>\$DSN,
"u|user=s"=>\$USER,
"p|password=s"=>\$PASS,
"h|host=s"=>\$HOST,
"c=s"=>\$CONNECTION,
"ref=s"=>\$nt_seqref_file,
"megabl=s"=>\$megablast_file,
) or Usage();
# Easy way to write parameters.
if($CONNECTION){
($USER,$PASS,$DSN,$HOST) = $CONNECTION =~/(\S+):(\S+):(\S+)@(\S+)/;
}
# My NT file, to calc absolute coordenates.
open NT_SEQREF,$nt_seqref_file or die;
my $ntgff_io = new Bio::Tools::GFF(-fh=>\*NT_SEQREF, -gff_version=>2);
print STDERR "Parsing Contig information\n";
my %contig_ref; # Variable that will map versions, if they are missing
while(my $feature = $ntgff_io->next_feature()){
my ($contig_acc) = $feature->each_tag_value("Contig");
my $contig_acc_noversion = $contig_acc;
$contig_acc_noversion =~s/^(\w+)\.\d+/$1/; # strips out the version;
$contig_ref{$contig_acc_noversion} = $contig_acc; # The reference for contigs.
}
close NT_SEQREF; undef($ntgff_io);
print STDERR "contigs parsing ended.\n";
# Let's open a megablast report
my $mega_h = new Bio::Tools::MBPlite($megablast_file);
# GFF database
my $gff_db = new Bio::DB::GFF(-dsn=>"$DSN\:$HOST",-user=>"$USER",-pass=>"$PASS")
or die "Can't open database: ",Bio::DB::GFF->error,"USER: $USER PASS: $PASS DSN: $DSN HOST: $HOST\n";
$gff_db->absolute(1);
#$gff_db->debug(1);
my $gffio = new Bio::Tools::GFF(-gff_version => 2 );
# Now the megablast file.
RESULTS:{
# This will set the refseq of GFF
my $db = $mega_h->database;
$db =~s/^.*chr(\w)"?$/Chr$1/; # This exp is relative to my Database information
#$gff_io->seqname($db);
# Finding the query name, it will be the Sequence tag on GFF
my $query = $mega_h->query;
$query=~s/^(\S+).*$/$1/;
while(my $sbjct = $mega_h->nextSbjct){
my $sbjct_name = $sbjct->name; # This will avoid future calls;
$sbjct_name=~s/^(\S+)\s*$/$1/;
while(my $hsp = $sbjct->nextHSP){
my $segment = $gff_db->segment(-class=>"Contig",
-name=>$contig_ref{$sbjct_name},
-start=>$hsp->hit->start,
-end=>$hsp->hit->end);
if(not $segment){
die "Can't create a segment! Check if $contig_ref{$sbjct_name} exists on $DSN\:$HOST\n";
}
# This will be my GFF conversor.
my $feature = new Bio::SeqFeature::Generic(
-seqname => $db,
-primary => "alignment",
-source => "HCGP.MEGABLAST",
-start => $hsp->hit->start,
-end => $hsp->hit->end,
-score => $hsp->P,
-strand => $hsp->hit->strand == "1" ? "+" : "-",
-tag => {
Target => "Sequence:$query ".$hsp->query->start." ".$hsp->query->end,
}
);
#$gff_io->start( $hsp->hit->start);
#$gff_io->end( $hsp->hit->end );
# Handling strand
#$hsp->hit->strand == "1" ? $gff_io->strand("+") : $gff_io->strand("-");
#$gff_io->score($hsp->P);
#$gff_io->add_tag_value("Target","Sequence:$query");
#print $gff_io->gff_string().$hsp->query->start." ".$hsp->query->end."\n";
#$gff_io->remove_tag("Target");
$gffio->write_feature($feature);
}
}
# Loop for each megablast result.
last RESULTS if ($mega_h->_parseHeader == -1);
redo RESULTS;
}
exit(0);
# Subs
sub Usage {
my($msg) = @_;
print STDERR "\nERR: $msg\n\n" if $msg;
print STDERR qq[$0 ].q[$Revision$].qq[\n];
print STDERR<<EOU;
Marco Valtas (mavcunha\@bit.fmrp.usp.br)
(c)2002 Regional Blood Center of Ribeirao Preto
Usage --dsn DATABASE -u USER -p PASSWORD -h HOST --ref NT_REF --megabl MEGABLAST [-c USER:PASS:DSN\@HOST]
EOU
exit(1);
}
--------------070400030709020403030008--