[Bioperl-l] Bioperl/Ensemble - getting 50.000 instead of 20.000 genes

Fri Jan 21 15:35:03 UTC 2011

Daniel,

This question should be asked on the ensembl-dev mailing list, not here.

http://uswest.ensembl.org/info/about/contact/mailing.html

chris

On Jan 18, 2011, at 3:59 AM, Daniel_N wrote:

> 
> Hello community,
> 
> I am a student working on a project. For this project i need the locations
> of all genes in human and other genomes.
> I posted my script below and it seems to work fine and do the things which i
> need.
> I now get the following results for human:
> 
> [quote]
> ENSEMBLE DATA IMPORTER
> 
> Amount of genes on chr 1: 5232
> Amount of genes on chr 2: 3938
> Amount of genes on chr 3: 2979
> Amount of genes on chr 4: 2544
> Amount of genes on chr 5: 2829
> Amount of genes on chr 6: 2870
> Amount of genes on chr 7: 2895
> Amount of genes on chr 8: 2247
> Amount of genes on chr 9: 2372
> Amount of genes on chr 10: 2245
> Amount of genes on chr 11: 2149
> Amount of genes on chr 12: 1786
> Amount of genes on chr 13: 1207
> Amount of genes on chr 14: 1523
> Amount of genes on chr 15: 1329
> Amount of genes on chr 16: 1430
> Amount of genes on chr 17: 2109
> Amount of genes on chr 18: 581
> Amount of genes on chr 19: 2082
> Amount of genes on chr 20: 1288
> Amount of genes on chr 21: 705
> Amount of genes on chr 22: 1221
> Amount of genes on chr X: 2357
> Amount of genes on chr Y: 561
> Amount of total genes: 50479
> [/quote]
> 
> I am just wondering because as far as I know human has currently 21,077
> genes which you can see here:
> http://www.ensembl.org/Homo_sapiens/Info/StatsTable?db=core
> Basicaly for each chromosome i get too many genes, but the gene amount seems
> to be in proportion. (For instance chromosome 18 does not have many genes)
> 
> Basicaly this code sums up how i get the data:
> [code]
> my $species = 'Human';
> my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice' );
> my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $v);
> my $genes = $slice->get_all_Genes();
> [/code]
> I found an option in the ENSEMBLE Api
> (http://www.ensembl.org/info/docs/Pdoc/ensembl/index.html)
> which additionally filters the results: 
> [code]
> $geneisknown = $gene->is_known();
> [/code]
> With this option i still get more than 30000 genes. 
> All the genes i get have different IDs, so they obviously exist and there
> are no duplicates, i checked this by using a hash.
> So the big question is, why are there that many genes, are those really
> genes, and if not, how can i get those 21077 listed genes instead of my
> 50000+ genes
> 
> I would appreciate any help or answer.
> 
> Daniel
> 
> (The entire code of the script:)
> [code]
> #!/usr/bin/perl
> use strict;
> 
> use lib "src/ensembl/modules";
> use lib "src/bioperl-life";
> 
> 
> print "ENSEMBLE DATA IMPORTER\n\n";
> 
> #---------------------------------------------------------
> use Bio::EnsEMBL::Registry;
> 
> my $registry = 'Bio::EnsEMBL::Registry';
> 
> $registry->load_registry_from_db(
>    -host => 'ensembldb.ensembl.org',
>    -user => 'anonymous'
> );
> 
> my @db_adaptors = @{ $registry->get_all_DBAdaptors() };
> 
> #----------------------------------------------------------
> #get_all_Genes()
> #my %testhash = ();
> 
> my $genecounter=0;  
> my $species = 'Human';
> my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice' );
> my $j=0;
> for(my $v=1;$v<=24;$v++){
>  if($v==23){$v='X';};
>  if($v==24){$v='Y';};
> 
>  my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $v);
>  my $genes = $slice->get_all_Genes();
> 
>  my $i =0;
> 
>  while ( my $gene = shift @{$genes} ) {
>    $i++;
>   #PARAMETER WHICH HAVE TO BE INPORTED TO THE LOCAL DATABASE
> 
>   my $gene_id = $gene->stable_id();
>   my $gene_start=  $gene->start();
>   my $gene_end =   $gene->end();
>   my $gene_length = $gene_end-$gene_start;
>   my $strand     = $gene->strand();
> #     my @exons = @{ $gene->get_all_Exons() };
> #     my $exon_amount = @exons;
> #     my $total_exon_length=0;
> #     my $chromosome = $v;
> #     my @transcripts =@{$gene->get_all_Transcripts()};
> # $testhash{ $gene_id } = 1;  
>     #my $geneisknown = $gene->is_known();
> 
> # if($geneisknown){
> #   print "known\n";
> #   $genecounter++;
> # }else{
> #   print "not known\n";
> # }
> 
> #  foreach my $exon (@exons){
> #    my $exon_start=  $exon->start();
> #    my $exon_end=  $exon->end();
> #    my $exon_length = $exon_end-$exon_start;
> #    $total_exon_length = $total_exon_length+$exon_length;
> #  }   
> 
> 
> 
> #print "$gene_id: Position:$gene_start-$gene_end\tLength:$gene_length
> \tStrand: $strand\tExons:$exon_amount \t Total Exon Length:
> $total_exon_length\tChr:$chromosome\tSpecies:$species\n ";
> # TODO splicing variants of transcripts---------
> #      foreach my $transcript (@transcripts){
> #        my $transcript_id=  $transcript->stable_id();
> #        my $transcript_start=  $transcript->start();
> #        my $transcript_end=  $transcript->end();
> #        my $transcript_length = $transcript_end-$transcript_start;
> #        #print "Transcript: $transcript_id\t$transcript_length\n";
> #      }   
> # ----------------------------------------------
>  }
>  $j=$j+$i;
>  print "Amount of genes on chr $v: $i\n";
>  if($v eq 'X'){$v=23;};
>  if($v eq 'Y'){$v=24;};
> }
> print "Amount of total genes: $j\n";
> #print "size of hash:  " . keys( %testhash ) . ".\n";
> 
> 
> #print "Total amount of known genes: $genecounter\n."
> 
> 
> 
> [/code]
> -- 
> View this message in context: http://old.nabble.com/Bioperl-Ensemble---getting-50.000-instead-of-20.000-genes-tp30698664p30698664.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l