[Bioperl-l] Re (3): Status of assembly modules

Fri Dec 10 23:53:38 UTC 2010

I am wondering if there is a way to optimize the BioPerl code for Assembly
IO.  Specifically, when I convert a 2.2 MB genome (~200 contigs) from a 454
ace file to a regular ace file, it takes a few hours to get through 30
contigs using the code below (I estimate more than a day to get through all
of it).

Is there a way to optimize it?  To convert a sequence file to another format
at most would take a minute and therefore converting an ace on the magnitude
of hours or days is too much.  I wish I understood bioperl better but I
think the best I can do is issue a challenge or a feature request:  who can
speed up Assembly::IO::ace?

# convert a Newbler ace to a standard ace
sub _newblerAceToAce($args){
  my($self,$args)=@_;
  my
$ace454=Bio::Assembly::IO->new(-file=>$$args{ace454Path},-format=>"ace",-variant=>'454');
  my $ace=Bio::Assembly::IO->new(-file=>">$$args{acePath}",-format=>"ace");
#output ace
  my $numContigs=`grep -c ^CO $$args{ace454Path}`+0;
  logmsg "Converting $$args{ace454Path} (454-ace) to $$args{acePath} (ace).
$numContigs contigs total.";
  while(my $contig=$ace454->next_contig){
    logmsg "Finished with ".$contig->id ." out of $numContigs";
    $ace->write_contig($contig);
  }
  return $$args{acePath};
}

Message: 3

Date: Mon, 22 Nov 2010 15:18:10 -0500

From: Lee Katz <lskatz at gatech.edu>

Subject: [Bioperl-l] Re(2): Status of assembly modules

To: bioperl-l at lists.open-bio.org

Message-ID:

       <AANLkTi=JShCLsHDxHK4eeWD3Da=vWmRkGN2rkuLwCjxn at mail.gmail.com>

Content-Type: text/plain; charset=UTF-8

I figured it out (I haven't tested much though).

To whoever works on Assembly::IO::ace.pm:

I changed a regular expression on line 231 because the contig object was not

initializing properly.  For some reason the 454 ace file had adopted the

reference assembly's ID and therefore there was a GI number followed by a

pipe.  The pipe was not captured with \w+.  I think that the regex will be

safe with \s(\S+)\s.

 if (/^CO\s(\S+)\s(\d+)\s(\d+)\s(\d+)\s(\w+)/xms) { # New contig starts!

#if (/^CO\s(\w+)\s(\d+)\s(\d+)\s(\d+)\s(\w+)/xms) { # New contig starts!

On Thu, Nov 18, 2010 at 12:04 PM, <bioperl-l-request at lists.open-bio.org
>wrote:

> Message: 3

> Date: Wed, 17 Nov 2010 22:20:03 -0500

> From: Lee Katz <lskatz at gatech.edu>

> Subject: Re: [Bioperl-l] Status of assembly modules

> To: bioperl-l at lists.open-bio.org

> Message-ID:

>        <AANLkTi=aFAnYrEXj3D4joZeYwxRT971M_ZYR0uFJOrxc at mail.gmail.com>

> Content-Type: text/plain; charset=UTF-8

>

> I have read on the BioPerl site that a 454 ace is not standardized due to

> its coordinate system.  How can I convert it to the standard ace file?

>

> When I run this code either by using contig or assembly objects, I get an

> error.

> Can't call method "get_consensus_sequence" on an undefined value at

> Bio/Assembly/IO/ace.pm line 280, <GEN0> line 93349.

>

>    sub _newblerAceToAce($args){

>      my($self,$args)=@_;

>      my

>

> $ace454=Bio::Assembly::IO->new(-file=>$args{ace454Path},-format=>"ace
",-variant=>'454');

>      my

> $ace=Bio::Assembly::IO->new(-file=>">$args{acePath}",-format=>"ace");

>      #while(my $contig=$ace454->next_contig){

>      while(my $scaffold=$ace454->next_assembly){

>        print Dumper $scaffold;

>      }

>      return $args{acePath};

>    }

>