[Bioperl-l] Bioperl/Ensemble - getting 50.000 instead of 20.000 genes

Tue Jan 18 09:59:11 UTC 2011

Hello community,

I am a student working on a project. For this project i need the locations
of all genes in human and other genomes.
I posted my script below and it seems to work fine and do the things which i
need.
I now get the following results for human:

[quote]
ENSEMBLE DATA IMPORTER

Amount of genes on chr 1: 5232
Amount of genes on chr 2: 3938
Amount of genes on chr 3: 2979
Amount of genes on chr 4: 2544
Amount of genes on chr 5: 2829
Amount of genes on chr 6: 2870
Amount of genes on chr 7: 2895
Amount of genes on chr 8: 2247
Amount of genes on chr 9: 2372
Amount of genes on chr 10: 2245
Amount of genes on chr 11: 2149
Amount of genes on chr 12: 1786
Amount of genes on chr 13: 1207
Amount of genes on chr 14: 1523
Amount of genes on chr 15: 1329
Amount of genes on chr 16: 1430
Amount of genes on chr 17: 2109
Amount of genes on chr 18: 581
Amount of genes on chr 19: 2082
Amount of genes on chr 20: 1288
Amount of genes on chr 21: 705
Amount of genes on chr 22: 1221
Amount of genes on chr X: 2357
Amount of genes on chr Y: 561
Amount of total genes: 50479
[/quote]

I am just wondering because as far as I know human has currently 21,077
genes which you can see here:
http://www.ensembl.org/Homo_sapiens/Info/StatsTable?db=core
Basicaly for each chromosome i get too many genes, but the gene amount seems
to be in proportion. (For instance chromosome 18 does not have many genes)

Basicaly this code sums up how i get the data:
[code]
my $species = 'Human';
my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice' );
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $v);
my $genes = $slice->get_all_Genes();
[/code]
I found an option in the ENSEMBLE Api
(http://www.ensembl.org/info/docs/Pdoc/ensembl/index.html)
which additionally filters the results: 
[code]
$geneisknown = $gene->is_known();
[/code]
With this option i still get more than 30000 genes. 
All the genes i get have different IDs, so they obviously exist and there
are no duplicates, i checked this by using a hash.
So the big question is, why are there that many genes, are those really
genes, and if not, how can i get those 21077 listed genes instead of my
50000+ genes

I would appreciate any help or answer.

Daniel

(The entire code of the script:)
[code]
#!/usr/bin/perl
use strict;

use lib "src/ensembl/modules";
use lib "src/bioperl-life";

print "ENSEMBLE DATA IMPORTER\n\n";

#---------------------------------------------------------
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my @db_adaptors = @{ $registry->get_all_DBAdaptors() };

#----------------------------------------------------------
#get_all_Genes()
#my %testhash = ();

my $genecounter=0;  
my $species = 'Human';
my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice' );
my $j=0;
for(my $v=1;$v<=24;$v++){
  if($v==23){$v='X';};
  if($v==24){$v='Y';};

  my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $v);
  my $genes = $slice->get_all_Genes();

  my $i =0;

  while ( my $gene = shift @{$genes} ) {
    $i++;
   #PARAMETER WHICH HAVE TO BE INPORTED TO THE LOCAL DATABASE

   my $gene_id = $gene->stable_id();
   my $gene_start=  $gene->start();
   my $gene_end =   $gene->end();
   my $gene_length = $gene_end-$gene_start;
   my $strand     = $gene->strand();
#     my @exons = @{ $gene->get_all_Exons() };
#     my $exon_amount = @exons;
#     my $total_exon_length=0;
#     my $chromosome = $v;
#     my @transcripts =@{$gene->get_all_Transcripts()};
 # $testhash{ $gene_id } = 1;  
     #my $geneisknown = $gene->is_known();

# if($geneisknown){
#   print "known\n";
#   $genecounter++;
# }else{
#   print "not known\n";
# }

#  foreach my $exon (@exons){
#    my $exon_start=  $exon->start();
#    my $exon_end=  $exon->end();
#    my $exon_length = $exon_end-$exon_start;
#    $total_exon_length = $total_exon_length+$exon_length;
#  }   

 #print "$gene_id: Position:$gene_start-$gene_end\tLength:$gene_length
\tStrand: $strand\tExons:$exon_amount \t Total Exon Length:
$total_exon_length\tChr:$chromosome\tSpecies:$species\n ";
 # TODO splicing variants of transcripts---------
#      foreach my $transcript (@transcripts){
#        my $transcript_id=  $transcript->stable_id();
#        my $transcript_start=  $transcript->start();
#        my $transcript_end=  $transcript->end();
#        my $transcript_length = $transcript_end-$transcript_start;
#        #print "Transcript: $transcript_id\t$transcript_length\n";
#      }   
# ----------------------------------------------
  }
  $j=$j+$i;
  print "Amount of genes on chr $v: $i\n";
  if($v eq 'X'){$v=23;};
  if($v eq 'Y'){$v=24;};
}
print "Amount of total genes: $j\n";
#print "size of hash:  " . keys( %testhash ) . ".\n";

#print "Total amount of known genes: $genecounter\n."

[/code]
-- 
View this message in context: http://old.nabble.com/Bioperl-Ensemble---getting-50.000-instead-of-20.000-genes-tp30698664p30698664.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.