[Bioperl-l] Query Unigene title from input a ACC number / BioPerl Object Creation

Tue Mar 25 09:21:09 EST 2003

Maybe it's just me, but I've never been too pleased with BioPerl's
ability to handle large amounts of data like these UniGene clusters.
You all might remember I recently proposed an FPC module for reading in
FPC data files.  Well, that is still in progress, but it is DOG slow,
and the only cause I can make out is that object creation is a bear.

I would really like some input myself from the BioPerl experts about
what I can do to speed up the creation of, say, 100k objects.  :-)
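
One way to check whether object construction itself is the bottleneck is to time it in isolation. The sketch below is only an illustration: `Toy::Feature` is a hypothetical stand-in class, not a BioPerl class, and its constructor does far less work than a real BioPerl one. It compares building 100k blessed objects through a constructor call with building 100k plain hash references holding the same data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);   # sub-second wall-clock timer (core module)

# Hypothetical stand-in class, NOT part of BioPerl: a minimal constructor
# that copies its named arguments into a blessed hash.
package Toy::Feature;
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

package main;

my $n = 100_000;

# Time N objects built through a constructor call.
my $t0      = time;
my @objects = map { Toy::Feature->new(id => $_, name => "clone$_") } 1 .. $n;
my $t_obj   = time - $t0;

# Time N plain hash references holding the same data.
$t0 = time;
my @hashes = map { +{ id => $_, name => "clone$_" } } 1 .. $n;
my $t_hash = time - $t0;

printf "%d objects: %.3fs   %d plain hashes: %.3fs\n",
    scalar @objects, $t_obj, scalar @hashes, $t_hash;
```

If the two timings are close, the cost is more likely in parsing or in what the real constructors do than in `bless` itself; a profiler run on the actual module would be the next step.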

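But, back to this question.  Yes, with that many objects the BioPerl
approach will take forever + 1 day.  You might consider this perl script
instead.  It reads Hs.data on STDIN, takes the ACC numbers to look up as
command-line arguments, and is pretty zippy.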

==============================
#!/usr/local/bin/perl -w

my @query = qw{BG618921};   ## default ACC number if none is given
my $title;
my %lookup;

if ($#ARGV >= 0) {
  ## if there are arguments on the command line, use them as input
  @query = @ARGV;
}

## initialize a lookup HASH so that all values in the query are
## key entries with value of 1
@lookup{@query} = (1) x @query;

while (<STDIN>) {
   ## remember the title for later
   $title = $1      if (/^TITLE\s+(.*)/);
   if (/^SEQUENCE.+ACC=(\w+);/) {
     ## print out the title if it matched
     print "$1\t$title\n" if ($lookup{$1});
   }
}
============================
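
For trying the same scan without the full Hs.data file, here is a minimal sketch that wraps the loop in a function and runs it on an in-memory sample. The function name `find_titles` and the sample records are assumptions made for illustration; the records are abbreviated and made up, apart from the accession BG618921 and the title of Hs.107 mentioned in this thread. The regular expressions are the ones from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scan UniGene-style records from a filehandle and return [ACC, title]
# pairs for every accession listed in @wanted.
# (find_titles is a hypothetical helper name, not part of BioPerl.)
sub find_titles {
    my ($fh, @wanted) = @_;
    my %lookup = map { $_ => 1 } @wanted;   # accession => 1
    my ($title, @hits);
    while (my $line = <$fh>) {
        # remember the most recent TITLE line
        $title = $1 if $line =~ /^TITLE\s+(.*)/;
        # same SEQUENCE pattern as the script above
        if ($line =~ /^SEQUENCE.+ACC=(\w+);/ && $lookup{$1}) {
            push @hits, [$1, $title];
        }
    }
    return @hits;
}

# Abbreviated, made-up sample records for illustration only.
my $sample = <<'END';
ID          Hs.107
TITLE       fibrinogen-like 1
SEQUENCE    ACC=BG618921; NID=g0000001
//
ID          Hs.999999
TITLE       hypothetical example cluster
SEQUENCE    ACC=XX000001; NID=g0000002
//
END

# Read the sample through an in-memory filehandle.
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my @hits = find_titles($fh, 'BG618921');
close $fh;

print join("\t", @$_), "\n" for @hits;
```

Passing a filehandle rather than reading STDIN directly makes it easy to run the same function on a real file, a pipe, or a test string.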

----------------------------------------------------------------------
Jamie Hatfield                              Room 541H, Marley Building
Systems Programmer                          University of Arizona
Arizona Genomics Computational              Tucson, AZ  85721
  Laboratory (AGCoL)                        (520) 626-9598

> -----Original Message-----
> From: bioperl-l-bounces at bioperl.org [mailto:bioperl-l-bounces at bioperl.org] On Behalf Of darson
> Sent: Tuesday, March 25, 2003 12:39 AM
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] Query Unigene title from input a ACC number
> 
> 
> Hello,
> 
> I'm trying to write a script that grabs the UniGene title from an
> Hs.data file given an ACC number.
> The following script is a preliminary test:
> 
> use Bio::Cluster::UniGene; use Bio::ClusterIO; use Bio::ClusterI;
> # location of human unigene file from NCBI FTP
> $stream=Bio::ClusterIO->new('-file'=>"/home/human_unigene/Hs.data",
>                             '-format'=>"unigene");
> while (my $in=$stream->next_cluster()){
>      while (my $sequence=$in->next_seq()){
>           # BG618921 is an ACC member of Hs.107 fibrinogen-like 1
>           if ($sequence->accession_number()=~/BG618921/){
>                print $hitid=$in->unigene_id()."\n";
>                print $hitti=$in->title()."\n";
>          }
>      }
> }
> 
> It reports the correct cluster, but the script takes over an hour to
> finish, which is extremely inefficient. Furthermore, I have thousands
> of ACC numbers to look up. I would be very appreciative of any
> suggestions or other methods to solve my problem. Thanks!
> Best regards,
> Darson Chung 2003/03/25