[Bioperl-l] Bioperl/Ensemble - getting 50.000 instead of 20.000 genes
Chris Fields
cjfields at illinois.edu
Fri Jan 21 15:35:03 UTC 2011
Daniel,
This question should be asked on the ensembl-dev mailing list, not here.
http://uswest.ensembl.org/info/about/contact/mailing.html
chris
On Jan 18, 2011, at 3:59 AM, Daniel_N wrote:
>
> Hello community,
>
> I am a student working on a project. For this project i need the locations
> of all genes in human and other genomes.
> I posted my script below and it seems to work fine and do the things which i
> need.
> I now get the following results for human:
>
> [quote]
> ENSEMBLE DATA IMPORTER
>
> Amount of genes on chr 1: 5232
> Amount of genes on chr 2: 3938
> Amount of genes on chr 3: 2979
> Amount of genes on chr 4: 2544
> Amount of genes on chr 5: 2829
> Amount of genes on chr 6: 2870
> Amount of genes on chr 7: 2895
> Amount of genes on chr 8: 2247
> Amount of genes on chr 9: 2372
> Amount of genes on chr 10: 2245
> Amount of genes on chr 11: 2149
> Amount of genes on chr 12: 1786
> Amount of genes on chr 13: 1207
> Amount of genes on chr 14: 1523
> Amount of genes on chr 15: 1329
> Amount of genes on chr 16: 1430
> Amount of genes on chr 17: 2109
> Amount of genes on chr 18: 581
> Amount of genes on chr 19: 2082
> Amount of genes on chr 20: 1288
> Amount of genes on chr 21: 705
> Amount of genes on chr 22: 1221
> Amount of genes on chr X: 2357
> Amount of genes on chr Y: 561
> Amount of total genes: 50479
> [/quote]
>
> I am just wondering because as far as I know human has currently 21,077
> genes which you can see here:
> http://www.ensembl.org/Homo_sapiens/Info/StatsTable?db=core
> Basicaly for each chromosome i get too many genes, but the gene amount seems
> to be in proportion. (For instance chromosome 18 does not have many genes)
>
> Basicaly this code sums up how i get the data:
> [code]
> my $species = 'Human';
> my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice' );
> my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $v);
> my $genes = $slice->get_all_Genes();
> [/code]
> I found an option in the ENSEMBLE Api
> (http://www.ensembl.org/info/docs/Pdoc/ensembl/index.html)
> which additionally filters the results:
> [code]
> $geneisknown = $gene->is_known();
> [/code]
> With this option i still get more than 30000 genes.
> All the genes i get have different IDs, so they obviously exist and there
> are no duplicates, i checked this by using a hash.
> So the big question is, why are there that many genes, are those really
> genes, and if not, how can i get those 21077 listed genes instead of my
> 50000+ genes
>
> I would appreciate any help or answer.
>
> Daniel
>
> (The entire code of the script:)
> [code]
> #!/usr/bin/perl
> use strict;
>
> use lib "src/ensembl/modules";
> use lib "src/bioperl-life";
>
>
> print "ENSEMBLE DATA IMPORTER\n\n";
>
> #---------------------------------------------------------
> use Bio::EnsEMBL::Registry;
>
> my $registry = 'Bio::EnsEMBL::Registry';
>
> $registry->load_registry_from_db(
> -host => 'ensembldb.ensembl.org',
> -user => 'anonymous'
> );
>
> my @db_adaptors = @{ $registry->get_all_DBAdaptors() };
>
> #----------------------------------------------------------
> #get_all_Genes()
> #my %testhash = ();
>
> my $genecounter=0;
> my $species = 'Human';
> my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice' );
> my $j=0;
> for(my $v=1;$v<=24;$v++){
> if($v==23){$v='X';};
> if($v==24){$v='Y';};
>
> my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $v);
> my $genes = $slice->get_all_Genes();
>
> my $i =0;
>
> while ( my $gene = shift @{$genes} ) {
> $i++;
> #PARAMETER WHICH HAVE TO BE INPORTED TO THE LOCAL DATABASE
>
> my $gene_id = $gene->stable_id();
> my $gene_start= $gene->start();
> my $gene_end = $gene->end();
> my $gene_length = $gene_end-$gene_start;
> my $strand = $gene->strand();
> # my @exons = @{ $gene->get_all_Exons() };
> # my $exon_amount = @exons;
> # my $total_exon_length=0;
> # my $chromosome = $v;
> # my @transcripts =@{$gene->get_all_Transcripts()};
> # $testhash{ $gene_id } = 1;
> #my $geneisknown = $gene->is_known();
>
> # if($geneisknown){
> # print "known\n";
> # $genecounter++;
> # }else{
> # print "not known\n";
> # }
>
> # foreach my $exon (@exons){
> # my $exon_start= $exon->start();
> # my $exon_end= $exon->end();
> # my $exon_length = $exon_end-$exon_start;
> # $total_exon_length = $total_exon_length+$exon_length;
> # }
>
>
>
> #print "$gene_id: Position:$gene_start-$gene_end\tLength:$gene_length
> \tStrand: $strand\tExons:$exon_amount \t Total Exon Length:
> $total_exon_length\tChr:$chromosome\tSpecies:$species\n ";
> # TODO splicing variants of transcripts---------
> # foreach my $transcript (@transcripts){
> # my $transcript_id= $transcript->stable_id();
> # my $transcript_start= $transcript->start();
> # my $transcript_end= $transcript->end();
> # my $transcript_length = $transcript_end-$transcript_start;
> # #print "Transcript: $transcript_id\t$transcript_length\n";
> # }
> # ----------------------------------------------
> }
> $j=$j+$i;
> print "Amount of genes on chr $v: $i\n";
> if($v eq 'X'){$v=23;};
> if($v eq 'Y'){$v=24;};
> }
> print "Amount of total genes: $j\n";
> #print "size of hash: " . keys( %testhash ) . ".\n";
>
>
> #print "Total amount of known genes: $genecounter\n."
>
>
>
> [/code]
> --
> View this message in context: http://old.nabble.com/Bioperl-Ensemble---getting-50.000-instead-of-20.000-genes-tp30698664p30698664.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list