[Bioperl-l] Tag handling on SeqFeature::Generic

Marco Aurelio Valtas Cunha mavcunha@gordon.fmrp.usp.br
Wed, 28 Aug 2002 14:41:12 -0300


This is a multi-part message in MIME format.
--------------070400030709020403030008
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit

Hi Hilmar,

I avoid post scripts to not load you guys, cause I know that everybody 
is working and when there's a time you respond to the list.

Probably you won't be able to run this script cause will be missing some 
files, but in short what I try to do is:

Data: A megablast result (Human NT contigs vs 900K ESTS) and a DB::GFF 
(human mapview information from NCBI)

First problem is that GFF format is not a hierarchical format, so if I 
blasted with NT then I need retrieve the position of the NT on the 
Chromosome so I can discover the position of my ESTs on the chromosome.

Then I parse the megablast using Bonnie module MBPlite.pm, with DB::GFF 
( and a help from Lincoln Stein) I convert blast coordenates to absolute 
coordenates so I can save the GFF file with the right information, and 
this file will be uploaded to my MySQL (DB::GFF) database.

To do all I use Bioperl, but I'm little experienced on Bioperl so 
sometimes I choose not the best way to solve my problems.

Marco.


Hilmar Lapp wrote:
> 
>>-----Original Message-----
>>From: Jason Stajich [mailto:jason@cgt.mc.duke.edu]
>>Sent: Wednesday, August 28, 2002 7:32 AM
>>To: Marco Aurelio Valtas Cunha
>>Cc: Bioperl
>>Subject: Re: [Bioperl-l] Tag handling on SeqFeature::Generic
> 
> [...]
> 
>>>I felling that using SeqFeature::Generic is wrong. But I'm really
>>>confused with Tools:GFF and SeqFeature::Generic in both 
>>
>>modules you can
>>
>>To clarify:
>>
>>Bio::Tools::GFF is probably badly named -- it is just and Input/Output
>>mechanism for Bio::SeqFeatureI objects and GFF format.
> 
> 
> In fact internally calling $seqfeature->gff_string() delegates to an instance of Bio::Tools::GFF to do the job. So there's really no difference between the output.
> 
> Bio::Tools::GFF is indeed badly named -- it should be in a Bio::SeqFeatureIO directory to be consistent with Bioperl's other IO module layouts.
> 
> Marco, if you want comments on how you accomplished the job in your script you should post the script, or its relevant parts.
> 
> 	-hilmar
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
> 
> 


-- 
##############################################################
# Atenção meu email mudou para  mavcunha@bit.fmrp.usp.br     #
# Veja porque http://scarecrow.fmrp.usp.br/~mavcunha/public  #
# Attention my email changed to mavcunha@bit.fmrp.usp.br     #
# See why here http://scarecrow.fmrp.usp.br/~mavcunha/public #
##############################################################
Marco Aurélio Valtas Cunha
Laboratório de Bioinformática
Hemocentro de Ribeirão Preto
Faculdade de Medicina de Ribeirão Preto
Universidade de São Paulo
Tel 55 16 3963-9300 R: 9603
http://bit.fmrp.usp.br
http://scarecrow.fmrp.usp.br/~mavcunha/public/
email: mavcunha@bit.fmrp.usp.br

--------------070400030709020403030008
Content-Type: text/plain;
 name="megabl2gff.pl"
Content-Transfer-Encoding: 8bit
Content-Disposition: inline;
 filename="megabl2gff.pl"

#!/usr/bin/perl
#
#              INGLÊS/ENGLISH
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#  http://www.gnu.org/copyleft/gpl.html
#
#
#             PORTUGUÊS/PORTUGUESE
#  Este programa é distribuído na expectativa de ser útil aos seus
#  usuários, porém NÃO TEM NENHUMA GARANTIA, EXPLÍCITAS OU IMPLÍCITAS,
#  COMERCIAIS OU DE ATENDIMENTO A UMA DETERMINADA FINALIDADE.  Consulte
#  a Licença Pública Geral GNU para maiores detalhes.
#  http://www.gnu.org/copyleft/gpl.html
#
#  Copyright (C) 2002  Fundação Hemocentro de Ribeirão Preto
#
#  Laboratório de Bioinformática
#  BiT -  Bioinformatic Team
#  Fundação Hemocentro de Ribeirão Preto
#  Rua Tenente Catão Roxo, 2501
#  Ribeirão Preto - São Paulo
#  Brasil
#  CEP 14051-140
#  Fone: 55 16 39639300 Ramal 9603
#
#  Marco Aurelio Valtas Cunha
#  mavcunha@bit.fmrp.usp.br
#  http://bit.fmrp.usp.br
#
# $Id: megabl2gff.pl,v 1.1 2002/08/26 22:24:59 mavcunha Exp $

=head1 NAME 

=head1 SYNOPSIS

=head1 ABSTRACT

=head1 DESCRIPTION

=head1 AUTHOR

Marco Valtas <mavcunha@bit.fmrp.usp.br>

Copyright (c) 2002 Regional Blood Center of Ribeirao Preto

=head1 LICENSE

GNU General Public License

http://www.gnu.org/copyleft/gpl.html


=cut
use strict;
use warnings;
use Getopt::Long;
use Bio::SeqFeature::Generic; # Working with sequence features;
use Bio::Tools::MBPlite;      # Handle megablast;
use Bio::Tools::GFF;          # Handle GFF with NT reports;
use Bio::DB::GFF;
use vars qw($DSN $HOST $USER $PASS $CONNECTION $nt_seqref_file $megablast_file);

Usage("Too few arguments") if $#ARGV<0;
GetOptions("help"=> sub { Usage(); },
"dsn|database=s"=>\$DSN,
"u|user=s"=>\$USER,
"p|password=s"=>\$PASS,
"h|host=s"=>\$HOST,
"c=s"=>\$CONNECTION,
"ref=s"=>\$nt_seqref_file,
"megabl=s"=>\$megablast_file,
) or Usage();


# Easy way to write parameters.
if($CONNECTION){
        ($USER,$PASS,$DSN,$HOST) = $CONNECTION =~/(\S+):(\S+):(\S+)@(\S+)/;
}

# My NT file, to calc absolute coordenates.
open NT_SEQREF,$nt_seqref_file or die;
my $ntgff_io = new Bio::Tools::GFF(-fh=>\*NT_SEQREF, -gff_version=>2);

print STDERR "Parsing Contig information\n";
my %contig_ref; # Variable that will map versions, if they are missing
while(my $feature = $ntgff_io->next_feature()){
        my ($contig_acc) = $feature->each_tag_value("Contig");
        my $contig_acc_noversion = $contig_acc;
        $contig_acc_noversion =~s/^(\w+)\.\d+/$1/; # strips out the version;
        $contig_ref{$contig_acc_noversion} = $contig_acc; # The reference for contigs.
}
close NT_SEQREF; undef($ntgff_io); 
print STDERR "contigs parsing ended.\n";

# Let's open a megablast report
my $mega_h = new Bio::Tools::MBPlite($megablast_file);

# GFF database
my $gff_db = new Bio::DB::GFF(-dsn=>"$DSN\:$HOST",-user=>"$USER",-pass=>"$PASS") 
        or die "Can't open database: ",Bio::DB::GFF->error,"USER: $USER PASS: $PASS DSN: $DSN HOST: $HOST\n";
$gff_db->absolute(1);
#$gff_db->debug(1);

my $gffio = new Bio::Tools::GFF(-gff_version => 2 );

# Now the megablast file.
RESULTS:{
        # This will set  the refseq of GFF
        my $db = $mega_h->database;
        $db =~s/^.*chr(\w)"?$/Chr$1/; # This exp is relative to my Database information
        #$gff_io->seqname($db);
        
        # Finding the query name, it will be the Sequence tag on GFF
        my $query = $mega_h->query;
        $query=~s/^(\S+).*$/$1/;

        while(my $sbjct = $mega_h->nextSbjct){
                my $sbjct_name = $sbjct->name; # This will avoid future calls;

                $sbjct_name=~s/^(\S+)\s*$/$1/;

                while(my $hsp = $sbjct->nextHSP){

                        my $segment = $gff_db->segment(-class=>"Contig",
                                                -name=>$contig_ref{$sbjct_name},
                                                -start=>$hsp->hit->start,
                                                -end=>$hsp->hit->end);

                        if(not $segment){
                                die "Can't create a segment! Check if $contig_ref{$sbjct_name} exists on $DSN\:$HOST\n";
                        }
                        
                        # This will be my GFF conversor.
                        my $feature = new Bio::SeqFeature::Generic(
                                -seqname => $db,
                                -primary => "alignment",
                                -source  => "HCGP.MEGABLAST",
                                -start   => $hsp->hit->start,
                                -end     => $hsp->hit->end,
                                -score   => $hsp->P,
                                -strand  => $hsp->hit->strand == "1" ? "+" : "-",
                                -tag     => {
                                        Target => "Sequence:$query ".$hsp->query->start." ".$hsp->query->end,
                                }
                        );
                        
                        #$gff_io->start( $hsp->hit->start);
                        #$gff_io->end( $hsp->hit->end );
                        
                        # Handling strand
                        #$hsp->hit->strand == "1" ? $gff_io->strand("+") : $gff_io->strand("-");
                        
                        #$gff_io->score($hsp->P);
                        
                        #$gff_io->add_tag_value("Target","Sequence:$query");
                        
                        #print $gff_io->gff_string().$hsp->query->start." ".$hsp->query->end."\n";
                        
                        #$gff_io->remove_tag("Target");
                        
                        $gffio->write_feature($feature);
                }
        }
        # Loop for each megablast result.
        last RESULTS if ($mega_h->_parseHeader == -1);
        redo RESULTS;
}

exit(0);


# Subs
sub Usage {
	my($msg) = @_;
print STDERR "\nERR: $msg\n\n" if $msg;
print STDERR qq[$0  ].q[$Revision$].qq[\n];
print STDERR<<EOU;
Marco Valtas (mavcunha\@bit.fmrp.usp.br) 
(c)2002 Regional Blood Center of Ribeirao Preto

Usage --dsn DATABASE -u USER -p PASSWORD -h HOST --ref NT_REF --megabl MEGABLAST [-c USER:PASS:DSN\@HOST]

EOU
exit(1);
}

--------------070400030709020403030008--