[Bioperl-l] Assistance with a BioPerl/Perl project

Sean Davis sdavis2 at mail.nih.gov
Thu Mar 24 18:10:08 EST 2005


If I understood you correctly, you are starting with a list of GenBank 
accession numbers?  If so, take CR407631 as an example:

Go to:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=&DB=unigene

and type in that accession.

You will see the resulting UniGene entry; one more click for details 
takes you to this page:

http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Hs&CID=2

There is a small "links" link just under the search bar.  Normally, you 
can link from there to Gene (but it appears to be broken at the 
moment).  In any case, with the gene2unigene file mentioned in my 
earlier message (quoted below), you can look up a UniGene id and get 
the matching Entrez Gene entry, if there is one.  The nice thing about 
using UniGene is that there is no blasting involved at all.  What you 
end up with is an Entrez Gene id (and, as a bonus, a UniGene id) 
associated with your accession.  That works most of the time, but some 
accessions will not be in UniGene for various reasons.  You can then 
mine Gene for whatever information you want to assign to the 
accessions.  For that, you will need either a Gene parser (from 
SourceForge) or just the tab-delimited text files from the Gene/DATA 
ftp site noted in my previous email.
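
For what it's worth, here is a rough sketch of that lookup in plain 
Perl.  It assumes you have already dumped accession/UniGene pairs (for 
example while looping over the clusters from Bio::ClusterIO) and that 
gene2unigene is two tab-delimited columns, GeneID then UniGene cluster. 
The sample rows and the helper names build_map and genes_for_accession 
are made up for illustration, so check them against the real files.

```perl
use strict;
use warnings;

# Build a hash of arrays from tab-delimited lines, skipping blank
# lines and '#' comment lines.  Columns are 0-based.
sub build_map {
    my ($lines, $key_col, $val_col) = @_;
    my %map;
    for my $line (@$lines) {
        chomp(my $copy = $line);
        next if $copy =~ /^\s*$/ or $copy =~ /^#/;
        my @f = split /\t/, $copy;
        push @{ $map{ $f[$key_col] } }, $f[$val_col];
    }
    return \%map;
}

# Illustrative rows only (NOT copied from the real files).
# accession <TAB> UniGene id:
my @acc_lines = ("CR407631\tHs.2\n", "XX000001\tHs.999999\n");
# Assumed gene2unigene layout: GeneID <TAB> UniGene cluster.
my @g2u_lines = ("#Format: GeneID UniGene_cluster\n", "10\tHs.2\n");

my $acc2ug  = build_map(\@acc_lines, 0, 1);   # accession -> UniGene ids
my $ug2gene = build_map(\@g2u_lines, 1, 0);   # UniGene id -> Gene ids

# Return the unique Gene IDs reachable from an accession via UniGene
# (an empty list if the accession or its cluster is unknown).
sub genes_for_accession {
    my ($acc) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { @{ $ug2gene->{$_} || [] } }
           @{ $acc2ug->{$acc} || [] };
}

for my $acc (qw(CR407631 XX000001)) {
    my @genes = genes_for_accession($acc);
    print "$acc\t", (@genes ? join(",", @genes) : "no Gene entry"), "\n";
}
```

Both hashes hold array references, so an accession that lands in more 
than one cluster, or a cluster tied to more than one Gene, does not 
silently lose entries.  In real use you would read the lines from your 
own files instead of the inline arrays.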


--------------------------------
Now, if you really want the easy way to do the above, go to:

http://genome-www5.stanford.edu/cgi-bin/source/sourceBatchSearch

Here, just paste in your accessions and get back whatever information 
you want; it is a very nice site for this.  (They still call it 
LocusLink ID, but that is the same as the Gene ID.)

Hope this helps.
Sean


On Mar 24, 2005, at 4:46 PM, Colin Erdman wrote:

> So in effect, this is just as good as taking the actual nucleotide
> sequences (derived using a GenBank lookup) from my static accession
> number list and running them through the 'member sequences' of my
> genes (clusters) of interest in order to see if any new gene products
> or information have been added for that sequence? And where would you
> suspect that BLASTN will then fit into the scheme? I apologize for the
> redundancy, there is just so much to take in!
>
> Thanks,
> Colin
>
> -----Original Message-----
> From: Sean Davis [mailto:sdavis2 at mail.nih.gov]
> Sent: Thursday, March 24, 2005 11:50 AM
> To: Colin Erdman
> Cc: bioperl-l at portal.open-bio.org
> Subject: Re: [Bioperl-l] Assistance with a BioPerl/Perl project
>
> If you are starting with Genbank Accession numbers and want to get to
> Entrez Gene, the "standard" way to do that is to use Unigene.  If you
> go to the Entrez website and choose the Unigene database, you can type
> in your accession and you will be taken to a unigene record.  If you
> click on the "links" section, you can then link to Entrez Gene.
>
> To do this in batch mode, I download Hs.data.gz from NCBI at:
>
> ftp://ftp.ncbi.nih.gov/repository/UniGene/
>
> Then, you can use Bio::ClusterIO to parse Unigene.  Grab the
> accession_number part of each sequence (there is an example of doing
> this in the POD documentation).  You can then make a hash like:
>
> push(@{$acc_hash{$acc}}, $in->unigene_id);
>
> which maps accessions to unigene ids.
>
> Make a second hash that maps unigene to gene using the file:
>
> ftp://ftp.ncbi.nih.gov/gene/DATA/gene2unigene
>
> which will map the unigene ids to gene.
>
> Then, you have the information you need to map from accession to gene
> via unigene.
>
> Just a note on Entrez Gene:  the Gene does not represent a sequence,
> but instead a set of sequences.  The sequences are Refseq sequences.
> So, you wouldn't be blasting against "Gene" per se, but against the
> one or several Refseq sequences (if there are any) that represent the
> Gene.
>
> Hope this helps.  Standard disclaimer:  as with perl AND
> bioinformatics, there is more than one way to do this.  And keep in
> mind that Entrez Gene is only one source of annotation; for chromosome
> 21, there may be other sites that have more information, specifically
> Ensembl.
>
> Sean
>
>
> On Mar 24, 2005, at 12:54 PM, Colin Erdman wrote:
>
>> Hello list,
>>
>>
>>
>> I am a 22 year old bioinformatics and molecular biology major at the
>> University of Denver. I just accepted a position with a researcher
>> here, and already have a first assignment. We are working on a
>> comprehensive chromosome 21 gene database and map and my first task
>> is to update a list of known (and curated) Human chromosome 21 genes.
>> I have become rapidly familiar with BioPerl however my adviser needs
>> me to use Entrez Gene to compare the currently known Chr 21 genes
>> (from query: '21[CHR] AND Homo sapiens[ORGN] AND NOT Pseudogene' )
>> with a list of genes that she has provided in xls and xml format.
>>
>> The idea is to take the accession numbers in the provided files, pull
>> the nucleotide sequence from them, and run those against the
>> sequences for records found with the Entrez Gene query in order to
>> find any newly annotated/(discovered/elucidated?) genes for that
>> sequence. I am familiar with the current problem of BioPerl not
>> directly being able to parse the EntrezGene object, but have played
>> with the Bio::SeqIO::Gene2accession (& geneinfo) and the egparser. My
>> programming skills are not completely up to par, so egparser is tough
>> for me to grasp. Bio::SeqIO::Gene2accession is more intuitive, however
>> I am having a terrible time figuring out how to convert my desired
>> entrezgene results into the legacy gene_info and gene2accession
>> formats? Any suggestions are greatly appreciated, I am very new at
>> this, so very simple coding examples and explanations help and are
>> the best way for me to learn.
>>
>>
>>
>> Thanks all!
>>
>> colin
>>
>
>


