[Bioperl-l] Fwd: Fwd: gene extraction script

Adam Witney awitney at sgul.ac.uk
Wed Jul 21 09:43:35 UTC 2010


forwarding to the original poster and the list...


Begin forwarded message:

> From: Laurent MANCHON <lmanchon at univ-montp2.fr>
> Date: 21 July 2010 10:05:16 GMT+01:00
> To: Adam Witney <awitney at sgul.ac.uk>
> Subject: Re: [Bioperl-l] Fwd: gene extraction script
> Reply-To: lmanchon at univ-montp2.fr
> 
> another program you can use.
> 
> 
> #!/usr/bin/perl
> # retrieve sequences from genbank
> # read list of GenBank access number from a file and write to output.txt file
> 
> use LWP::UserAgent;
> use Bio::SeqIO;
> use Bio::DB::GenBank;
> $cpt=0;
> $outfile = "outfile.txt";
> chomp($outfile);
> $gb = new Bio::DB::GenBank;
> while (<>) {
>     ($Fld1) = split(' ', $_, 9999);
> $out = Bio::SeqIO->new('-file' => ">>$outfile",'-format' => 'fasta');
> $acc = $Fld1;
> $cpt++;
> print $acc,"\t$cpt","  retrieved\n";
> $seqio = $gb->get_Stream_by_acc([$acc]);
> while( $seq = $seqio->next_seq() ) {
> open(LOG,">>$outfile");
> printf LOG '%s ', ">" . $acc;
> $out->write_seq($seq);
> close(LOG);
> }
> 
> 
> 
> 
> 
> Adam Witney a écrit :
>> 
>> Hi Nicholas,
>> 
>> If you just want to retrieve sequences from GenBank, try this:
>> 
>> http://github.com/bioperl/bioperl-live/blob/master/examples/db/getGenBank.pl
>> 
>> HTH
>> 
>> adam
>> 
>> 
>> On 20 Jul 2010, at 23:05, Florent Angly wrote:
>> 
>>   
>>> -------- Original Message --------
>>> Subject: 	gene extraction script
>>> Date: 	Tue, 20 Jul 2010 15:56:36 +0100
>>> From: 	Nicholas Brown <n.brown at nhm.ac.uk>
>>> To: 	<florent.angly at gmail.com>
>>> 
>>> 
>>> 
>>> Hi,
>>> 
>>> Sorry to distrub, I'm currently working at the natural history museum on some DNA sequencing techniques and I came accross a script you posted ( see below) which I think maybe helpful. The problem I have is that I have lots of sequnces of the same Mitochondrial genomes of different species aligned and I would like to extract the genes from each of the sequences easily, I wondered if there was an easy modificaiton to your script that I could perform that would allow me to do this? The sequences for the genes are already submitted to genbank, so i was just wondering if this would be a quicker simplier way than doing it manually?
>>> Thanks so much for your time in reading this.
>>> 
>>> Kind Regards,
>>> 
>>> Nicholas Brown
>>> 
>>> Script
>>> http://github.com/bioperl/bioperl-live/blob/master/examples/tools/extract_genes.pl<https://webmail.nhm.ac.uk/exchweb/bin/redir.asp?URL=http://github.com/bioperl/bioperl-live/blob/master/examples/tools/extract_genes.pl>
>>> #!/usr/bin/perl -w
>>> # $Id$
>>> =pod
>>> 
>>> 
>>> =head1 NAME
>>> 
>>> 
>>> extract_genes.pl - extract genomic sequences from NCBI files
>>> using BioPerl
>>> 
>>> 
>>> =head1 DESCRIPTION
>>> 
>>> 
>>> This script is a simple solution to the problem of
>>> extracting genomic regions corresponding to genes. There are other
>>> solutions, this particular approach uses genomic sequence
>>> files from NCBI and gene coordinates from Entrez Gene.
>>> 
>>> 
>>> The first time this script is run it will be slow as it will
>>> extract species-specific data from the gene2accession file and create
>>> a storable hash (retrieving the positional data from this hash is
>>> significantly faster than reading gene2accession each time the script
>>> runs). The subsequent runs should be fast.
>>> 
>>> 
>>> =head1 INSTALLATION
>>> 
>>> 
>>> =head2
>>> 
>>> 
>>> Install BioPerl, full instructions at http://bioperl.org.
>>> 
>>> 
>>> =head2 Download gene2accession.gz
>>> 
>>> 
>>> Download this file from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA into
>>> your working directory and gunzip it.
>>> 
>>> 
>>> =head2 Download sequence files
>>> 
>>> 
>>> Create one or more species directories in the working directory, the
>>> directory names do not have to match those at NCBI (e.g. "Sc", "Hs").
>>> 
>>> 
>>> Download the nucleotide fasta files for a given species from its CHR*
>>> directories at ftp://ftp.ncbi.nlm.nih.gov/genomes and put these files into a
>>> species directory. The sequence files will have the suffix ".fna" or
>>> "fa.gz", gunzip if necessary.
>>> 
>>> 
>>> =head2 Determine Taxon id
>>> 
>>> 
>>> Determine the taxon id for the given species. This id is the first column
>>> in the gene2accession file. Modify the %species hash in this script
>>> such that name of your species directory is a key and the taxon id is the
>>> value.
>>> 
>>> 
>>> =head2 Command-line options
>>> 
>>> 
>>>  -i   Gene id
>>>  -s   Name of species directory
>>>  -h   Help
>>> 
>>> 
>>> Example:
>>> 
>>> 
>>>  extract_genes.pl -i 850302 -s Sc
>>> 
>>> 
>>> =cut
>>> 
>>> 
>>> use strict;
>>> use Bio::DB::Fasta;
>>> use Getopt::Long;
>>> use Storable;
>>> 
>>> my %species = ( "Sc" =>  4932,  # Saccharomyces cerevisiae
>>> 				     "Ec" =>  83333, # Escherichia coli K12
>>> 					  "Hs" =>  9606   # H. sapiens
>>> 				   );
>>> 
>>> my ($help,$id,$name);
>>> 
>>> GetOptions( "s=s"  =>   \$name,
>>>            "i=i"  =>   \$id,
>>> 				"h"    =>   \$help );
>>> 
>>> usage() if ($help || !$id || !$name);
>>> 
>>> my $storedHash = $name . ".dump";
>>> 
>>> # create index for a directory of fasta files
>>> my $db = Bio::DB::Fasta->new($name, -makeid =>  \&make_my_id);
>>> 
>>> # extract species-specific data from gene2accession
>>> unless (-e $storedHash) {
>>> 	my $ref;
>>> 	# extract species-specific information from gene2accession
>>> 	open MYIN,"gene2accession" or die "No gene2accession file\n";
>>> 	while (<MYIN>) {
>>> 		my @arr = split "\t",$_;
>>> 		if ($arr[0] == $species{$name}&&  $arr[9] =~ /\d+/&&  $arr[10] =~ /\d+/) {
>>> 			($ref->{$arr[1]}->{"start"}, $ref->{$arr[1]}->{"end"},
>>> 			 $ref->{$arr[1]}->{"strand"}, $ref->{$arr[1]}->{"id"}) =	
>>> 				($arr[9], $arr[10], $arr[11], $arr[7]);
>>> 		}
>>> 	}
>>> 	# save species-specific information using Storable
>>> 	store $ref, $storedHash;
>>> }
>>> 
>>> # retrieve the species-specific data from a stored hash
>>> my $ref = retrieve($storedHash);
>>> 
>>> # retrieve sequence and sub-sequence
>>> if (defined $ref->{$id}) {
>>> 	my $chr = $db->get_Seq_by_id($ref->{$id}->{"id"});
>>> 	my $seq = $chr->trunc($ref->{$id}->{"start"},$ref->{$id}->{"end"});
>>> 	$seq = $seq->revcom if ($ref->{$id}->{"strand"} eq "-");
>>> 
>>> 	# Insert SeqIO options here...
>>> 	print $seq->seq,"\n";
>>> } else {
>>> 	print "Cannot find id: $id\n";
>>> }
>>> 
>>> sub make_my_id {
>>> 	my $line = shift;
>>> 	$line =~ /ref\|([^|]+)/;
>>> 	$1;
>>> }
>>> 
>>> sub usage {
>>> 	system "perldoc $0";
>>> 	exit;
>>> }
>>> 
>>> __END__
>>> 
>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>     
>> 
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>   
> 
> 





More information about the Bioperl-l mailing list