[Bioperl-l] Bioperl/Ensemble - getting 50.000 instead of 20.000 genes
Daniel_N
dnasseh at googlemail.com
Tue Jan 18 09:59:11 UTC 2011
Hello community,
I am a student working on a project. For this project i need the locations
of all genes in human and other genomes.
I posted my script below and it seems to work fine and do the things which i
need.
I now get the following results for human:
[quote]
ENSEMBLE DATA IMPORTER
Amount of genes on chr 1: 5232
Amount of genes on chr 2: 3938
Amount of genes on chr 3: 2979
Amount of genes on chr 4: 2544
Amount of genes on chr 5: 2829
Amount of genes on chr 6: 2870
Amount of genes on chr 7: 2895
Amount of genes on chr 8: 2247
Amount of genes on chr 9: 2372
Amount of genes on chr 10: 2245
Amount of genes on chr 11: 2149
Amount of genes on chr 12: 1786
Amount of genes on chr 13: 1207
Amount of genes on chr 14: 1523
Amount of genes on chr 15: 1329
Amount of genes on chr 16: 1430
Amount of genes on chr 17: 2109
Amount of genes on chr 18: 581
Amount of genes on chr 19: 2082
Amount of genes on chr 20: 1288
Amount of genes on chr 21: 705
Amount of genes on chr 22: 1221
Amount of genes on chr X: 2357
Amount of genes on chr Y: 561
Amount of total genes: 50479
[/quote]
I am just wondering because as far as I know human has currently 21,077
genes which you can see here:
http://www.ensembl.org/Homo_sapiens/Info/StatsTable?db=core
Basicaly for each chromosome i get too many genes, but the gene amount seems
to be in proportion. (For instance chromosome 18 does not have many genes)
Basicaly this code sums up how i get the data:
[code]
my $species = 'Human';
my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice' );
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $v);
my $genes = $slice->get_all_Genes();
[/code]
I found an option in the ENSEMBLE Api
(http://www.ensembl.org/info/docs/Pdoc/ensembl/index.html)
which additionally filters the results:
[code]
$geneisknown = $gene->is_known();
[/code]
With this option i still get more than 30000 genes.
All the genes i get have different IDs, so they obviously exist and there
are no duplicates, i checked this by using a hash.
So the big question is, why are there that many genes, are those really
genes, and if not, how can i get those 21077 listed genes instead of my
50000+ genes
I would appreciate any help or answer.
Daniel
(The entire code of the script:)
[code]
#!/usr/bin/perl
use strict;
use lib "src/ensembl/modules";
use lib "src/bioperl-life";
print "ENSEMBLE DATA IMPORTER\n\n";
#---------------------------------------------------------
use Bio::EnsEMBL::Registry;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org',
-user => 'anonymous'
);
my @db_adaptors = @{ $registry->get_all_DBAdaptors() };
#----------------------------------------------------------
#get_all_Genes()
#my %testhash = ();
my $genecounter=0;
my $species = 'Human';
my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice' );
my $j=0;
for(my $v=1;$v<=24;$v++){
if($v==23){$v='X';};
if($v==24){$v='Y';};
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $v);
my $genes = $slice->get_all_Genes();
my $i =0;
while ( my $gene = shift @{$genes} ) {
$i++;
#PARAMETER WHICH HAVE TO BE INPORTED TO THE LOCAL DATABASE
my $gene_id = $gene->stable_id();
my $gene_start= $gene->start();
my $gene_end = $gene->end();
my $gene_length = $gene_end-$gene_start;
my $strand = $gene->strand();
# my @exons = @{ $gene->get_all_Exons() };
# my $exon_amount = @exons;
# my $total_exon_length=0;
# my $chromosome = $v;
# my @transcripts =@{$gene->get_all_Transcripts()};
# $testhash{ $gene_id } = 1;
#my $geneisknown = $gene->is_known();
# if($geneisknown){
# print "known\n";
# $genecounter++;
# }else{
# print "not known\n";
# }
# foreach my $exon (@exons){
# my $exon_start= $exon->start();
# my $exon_end= $exon->end();
# my $exon_length = $exon_end-$exon_start;
# $total_exon_length = $total_exon_length+$exon_length;
# }
#print "$gene_id: Position:$gene_start-$gene_end\tLength:$gene_length
\tStrand: $strand\tExons:$exon_amount \t Total Exon Length:
$total_exon_length\tChr:$chromosome\tSpecies:$species\n ";
# TODO splicing variants of transcripts---------
# foreach my $transcript (@transcripts){
# my $transcript_id= $transcript->stable_id();
# my $transcript_start= $transcript->start();
# my $transcript_end= $transcript->end();
# my $transcript_length = $transcript_end-$transcript_start;
# #print "Transcript: $transcript_id\t$transcript_length\n";
# }
# ----------------------------------------------
}
$j=$j+$i;
print "Amount of genes on chr $v: $i\n";
if($v eq 'X'){$v=23;};
if($v eq 'Y'){$v=24;};
}
print "Amount of total genes: $j\n";
#print "size of hash: " . keys( %testhash ) . ".\n";
#print "Total amount of known genes: $genecounter\n."
[/code]
--
View this message in context: http://old.nabble.com/Bioperl-Ensemble---getting-50.000-instead-of-20.000-genes-tp30698664p30698664.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
More information about the Bioperl-l
mailing list