[Bioperl-pipeline] any hope of getting the example to work?

Shawn shawnh@fugu-sg.org
02 Oct 2002 00:39:27 +0800


Ok, the problem with storing the GenBank features was that
Bio::EnsEMBL::DBSQL::FeatureAdaptor validates features by checking
$feat->isa("Bio::EnsEMBL::SeqFeatureI").

Naturally, SeqIO returns Bio::SeqFeature::Generic objects, which inherit from
Bio::SeqFeatureI instead. The simple solution is to convert each bioperl
seqfeature to an EnsEMBL seqfeature before storing.
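
The attached script does this in its convert_seq_features sub; the core of
the idea is just a field-by-field copy (a minimal sketch with a made-up sub
name, assuming the EnsEMBL modules are already on your PERL5LIB):

    use Bio::EnsEMBL::SeqFeature;

    # copy a Bio::SeqFeature::Generic into a Bio::EnsEMBL::SeqFeature so that
    # FeatureAdaptor's isa("Bio::EnsEMBL::SeqFeatureI") check passes
    sub bioperl_to_ensembl_feature {
        my ($feat, $analysis) = @_;
        return Bio::EnsEMBL::SeqFeature->new(-start       => $feat->start,
                                             -end         => $feat->end,
                                             -strand      => $feat->strand,
                                             -source_tag  => $feat->source_tag,
                                             -primary_tag => $feat->primary_tag,
                                             -analysis    => $analysis,
                                             -score       => 100,
                                             -seqname     => $feat->seqname);
    }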

I have attached a load_contigs.pl script that loads GenBank files; it works
with ensembl-4-1 and bioperl-live. Also, since each feature is associated
with an analysis in the EnsEMBL schema, I artificially set the analysis to
one that has the format of the file as the logic_name. Change it as you wish.
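
To load a GenBank file you would run it roughly like this (the connection
details and file name are placeholders for your own setup):

    perl load_contigs.pl -host yourhost -dbname yourdb -dbuser root \
         -format genbank -verbose your_bacs.gb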


hope this helps.

shawn

On Tue, 2002-10-01 at 22:39, Andy Nunberg wrote:
> At 09:58 AM 10/1/2002 +0800, you wrote:
> >Hi Andy,
> >
> >sorry we have been tied up with the ciona genome work.
> >
> >
> >> Hi all,
> >> since I can't get the first script in your example to work because of the
> >> XML module, do you have any ideas how to fix this?  Can you send an older
> >> version that works?
> >
> >give me a day or 2 (I know I have been saying this) and I will create a
> >dump of a blast pipeline for you. Kiran is working hard on the XML.
> >
> >I'm assuming you will want to store your features in Ensembl to view them.
> >
> >This will be the layout:
> >
> >You have your BACs stored in Ensembl (the reason why this is needed is
> >because the features need a reference sequence for viewing them on the
> >browser).
> >
> >Contigs will be retrieved from EnsEMBL and blasted against dbfiles in the
> >bioperl-pipeline (your dbfiles will be your proprietary data).
> >
> >Features are stored into the contig table in Ensembl.
> >
> >> website not theirs) and how to load genbank files into it.  
> >
> >in the ensembl-pipeline/scripts directory, there is a load_scaffolds.pl
> >script which works on fasta files. You could convert your genbank files to
> >fasta and load them or just replace the following line:
> >
> >my $seqio = new Bio::SeqIO(-format=>'Fasta',
> >                           -file=>$seqfile);
> >with
> >
> >my $seqio = new Bio::SeqIO(-format=>'genbank',
> >                           -file=>$seqfile);
> >
> >this should work with the stable release of ensembl.
> >
> >Note though, with genbank, the script will try and store the features as
> >well, and you will get a bunch of errors for that since the ensembl and
> >bioperl features aren't really compatible. You can ignore them since you
> >are only interested in the sequences. Or you could just comment out the
> >storing of features.
> Actually, I want the features from GenBank, because I want to see which
> features on the BAC my sequences are hitting. Is this possible to do?
> 
> >
> >> I'd love to see a course at the O'Reilly meeting where I can learn to do
> >> these basic things (i.e. how to set up an EnsEMBL database and populate it
> >> with your own data, how to set up a basic bioperl-pipeline... I'd be happy
> >> just setting up a couple of BLASTs!)
> >
> >when we get stable hopefully :)
> >
> >
> >shawn
> >
> >-- 
> >********************************
> >* Shawn Hoon
> >* http://www.fugu-sg.org/~shawnh
> >********************************
> 
> *******************************************************************
> Andy Nunberg, Ph.D
> Computational Biologist
> Orion Genomics, LLC 
> (314) 615-6989
> http://www.oriongenomics.com
> 
> _______________________________________________
> bioperl-pipeline mailing list
> bioperl-pipeline@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-pipeline
> 


Attachment: load_contigs.pl

#!/usr/local/bin/perl -w

=head1 NAME

load_contigs.pl

=head1 SYNOPSIS

  load_contigs.pl -host <host> -dbname <dbname> -dbuser <user> -dbpass <pass> -format <format> <seqfile>

=head1 DESCRIPTION

This script loads a sequence file into the ensembl clone/contig/dna schema.
BEWARE! It is for shotgun sequencing projects like fugu, where one clone =
one contig is the rule (and they are usually referred to as scaffolds). It
does not deal with multiple contigs per clone.

If the option -pipe is set it also fills the ensembl pipeline InputIdAnalysis
with each contig id and value 1 (i.e. ready to be analysed)

=head1 OPTIONS

    -host    host name for database (gets put as host= in locator)
    -dbname  For RDBs, what name to connect to (dbname= in locator)
    -dbuser  For RDBs, what username to connect as (dbuser= in locator)
    -dbpass  For RDBs, what password to use (dbpass= in locator)
    -format  file format (fasta, genbank, etc.)
    -help    displays this documentation with PERLDOC

=cut

use strict;
use vars qw($USER $PASS $DB $HOST $DSN);
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::SeqFeature;
use Bio::EnsEMBL::Analysis;
use Bio::SeqIO;
use Bio::EnsEMBL::PerlDB::Contig;
use Bio::EnsEMBL::PerlDB::Clone;
use Getopt::Long;

my $host   = 'mysql';
my $port   = '';
my $dbname = '';
my $dbuser = 'root';
my $dbpass = undef;
my $format = "fasta";

my $help = 0;
my $verbose = 0;

&GetOptions(
            'host:s'     => \$host,
            'port:n'     => \$port,
            'dbname:s'   => \$dbname,
            'dbuser:s'   => \$dbuser,
            'dbpass:s'   => \$dbpass,
            'format:s'   => \$format,
            'verbose'    => \$verbose,
            'h|help'     => \$help
            );
if ($help) {
    exec('perldoc', $0);
}

$dbname || exec('perldoc',$0);

$SIG{INT} = sub { my $sig = shift; die "exited after SIG$sig"; };

my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(-dbname => $dbname,
                                             -user   => $dbuser,
                                             -host   => $host,
                                             -pass   => $dbpass);


my ($seqfile) = @ARGV;

if( !defined $seqfile ) {
    die 'cannot load because sequence file to load sequences from was not passed in as argument';
}

my $seqio = Bio::SeqIO->new(-format => lc($format),
                            -file   => $seqfile);
my $count = 0;
while ( my $seq = $seqio->next_seq ) {
    # convert the bioperl features into EnsEMBL features so they can be stored
    $seq = &convert_seq_features($seq,$format);
    my $cloneid  = $seq->id;
    my $contigid = $cloneid.".1";
    $verbose && print STDERR ("Parsed contig $contigid : contig length ".$seq->length."\n");

    # one clone = one contig (scaffold-style loading)
    my $clone  = new Bio::EnsEMBL::PerlDB::Clone;
    my $contig = new Bio::EnsEMBL::PerlDB::Contig;
    $clone->htg_phase(3);
    $clone->id($cloneid);
    $contig->id($contigid);
    $contig->internal_id($count++);
    $contig->seq($seq);
    $clone->add_Contig($contig);

    eval {
        $db->write_Clone($clone);
        $verbose && print STDERR "Written ".$clone->id." scaffold into db\n";
    };
    if( $@ ) {
        print STDERR "Could not write clone into database, error was $@\n";
    }
}

sub convert_seq_features {
    my ($seq,$format) = @_;
    my @feat = $seq->all_SeqFeatures;

    # every feature stored in EnsEMBL needs an analysis; artificially use the
    # file format as the logic_name
    my $anal = Bio::EnsEMBL::Analysis->new(-logic_name => $format);
    $anal->dbID(1);

    my @ens;
    foreach my $feat (@feat){
        # copy each Bio::SeqFeature::Generic into a Bio::EnsEMBL::SeqFeature
        my $ens = Bio::EnsEMBL::SeqFeature->new(-start       => $feat->start,
                                                -end         => $feat->end,
                                                -strand      => $feat->strand,
                                                -source_tag  => $feat->source_tag,
                                                -primary_tag => $feat->primary_tag,
                                                -analysis    => $anal,
                                                -score       => 100,
                                                -seqname     => $feat->seqname);
        push @ens, $ens;
    }

    # replace the original features with the converted ones
    $seq->flush_SeqFeatures;
    $seq->add_SeqFeature(@ens);
    return $seq;
}


