[Bioperl-l] Very basic Perl/BioPerl Help

Thu Apr 14 11:41:10 EDT 2005

Sean and Stefan,

FANTASTIC! I did have my accessions loaded up as arrays, and this was EXACTLY
what I was looking for (the array comparison). The diff capability in
Unix/Linux will be nice when I get the final code going on the production
server; I'm just on win32 for now. Thanks very much!
Colin
-----Original Message-----
From: Sean Davis [mailto:sdavis2 at mail.nih.gov] 
Sent: Thursday, April 14, 2005 9:33 AM
To: Colin Erdman
Cc: bioperl-l at portal.open-bio.org
Subject: Re: [Bioperl-l] Very basic Perl/BioPerl Help

On Apr 14, 2005, at 11:03 AM, Colin Erdman wrote:

> Hello all,
>
> I certainly pounded away at this one last night, I thought this part would
> be easy, but after spending so much time getting my Entrez gene data parsed
> etc my brain was a bit rubbery.
>
> What I am trying to do is take either A) Two fasta files with refseq/genbank
> data OR B) Two text files with 1 accession# per line and compare them,
> outputting only those fasta seqs or accession #'s that are not present in
> both.
>
> So is it easier to just use perl somehow to compare the two raw acc# text
> files?
>
Colin,

If you load your text files as one array for each file, you can easily 
do what I think you are asking by looking here:

http://www.unix.org.ua/orelly/perl/cookbook/ch04_08.htm
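
A minimal sketch of that array comparison, in case it helps. It assumes each
file has already been read into an array with one accession per element. The
sub name only_in_one and the accession values are made up for illustration;
this is one common hash-based approach, not necessarily the exact recipe on
the page above.

```perl
use strict;
use warnings;

# Return the accessions that appear in exactly one of the two lists.
# only_in_one is a hypothetical helper name, not part of Perl or BioPerl.
sub only_in_one {
    my ($list_a, $list_b) = @_;

    # Hashes give constant-time membership tests and drop duplicates.
    my %in_a = map { $_ => 1 } @$list_a;
    my %in_b = map { $_ => 1 } @$list_b;

    my @only_a = grep { !$in_b{$_} } sort keys %in_a;
    my @only_b = grep { !$in_a{$_} } sort keys %in_b;
    return (\@only_a, \@only_b);
}

# Made-up accession lists standing in for the contents of the two files.
my @ours   = qw(NM_000001.1 NM_000002.1 NM_000003.1);
my @theirs = qw(NM_000002.1 NM_000003.1 NM_000004.1);

my ($only_ours, $only_theirs) = only_in_one(\@ours, \@theirs);
print "Only in first list:  @$only_ours\n";
print "Only in second list: @$only_theirs\n";
```

Since this is plain Perl, it should behave the same on win32 as on the
production server.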

> I just will need to match up those accession #'s NOT currently in our list
> with the appropriate Entrez Genes using gene2accession, but I am not sure
> how to do that either. I am assuming using a hash, but they have been steep
> for me in terms of learning curve, but I'd like to learn them now, I will
> just need some intuitive support.

Yep.  Hash will do it.  Read in your file grabbing the appropriate 
columns and putting them in a hash like:

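my %acc2genehash;
while (my $line=<INF>) {
	chomp $line;
	next if $line =~ /^#/;	# skip a header line, if there is one
	my @params=split(/\t/,$line);
	# key = protein accession (column 5), value = gene ID (column 1), 0-based
	$acc2genehash{$params[5]}=$params[1];
}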

Then you can do:

print $acc2genehash{'AAD12597.1'};

which will give you 1246500, the gene ID of that accession (from the first 
line of gene2accession).

I haven't tested the above code, and you still need to do file loading, 
etc., but I hope you get the point.
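
For the file-loading part, here is a rough, self-contained sketch. It
assumes gene2accession is tab-separated with the gene ID in the second
column and the protein accession in the sixth, and that '-' marks a missing
value. The sub name load_acc2gene is made up, and the sample row is a
stand-in: only the accession and gene ID come from this message, the other
fields are placeholders. With the real file you would open it by name
instead of using the in-memory string.

```perl
use strict;
use warnings;

# Build an accession -> gene ID hash from an open filehandle.
# load_acc2gene is a hypothetical helper name. Column positions are an
# assumption: gene ID at index 1, protein accession at index 5 (0-based).
sub load_acc2gene {
    my ($fh) = @_;
    my %acc2gene;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/ or $line !~ /\S/;    # skip header and blank lines
        my @params = split /\t/, $line;
        next if @params < 6 or $params[5] eq '-';  # no protein accession here
        $acc2gene{ $params[5] } = $params[1];
    }
    return \%acc2gene;
}

# Stand-in data: one placeholder row read through an in-memory filehandle.
my $sample = join("\t", '-', 1246500, '-', '-', '-', 'AAD12597.1') . "\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory data: $!";
my $acc2gene = load_acc2gene($fh);
close $fh;

# Look up a known accession and a made-up one that is not in the data.
for my $acc ('AAD12597.1', 'XX000000.1') {
    if (exists $acc2gene->{$acc}) {
        print "$acc\t$acc2gene->{$acc}\n";
    }
    else {
        print "$acc\tnot found\n";
    }
}
```

Using exists for the lookup avoids "uninitialized value" warnings for
accessions that are not in the file.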

Sean