[Bioperl-l] extracting oriented subsequences

Heikki Lehvaslaiho heikki@ebi.ac.uk
Wed, 15 Aug 2001 10:43:58 +0100


This is a multi-part message in MIME format.
--------------A22A4B0F9B9436D7ED42CCBA
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit


Vladimir,

Attached is a file for extracting a subsequence from an accession number and
location (I use it for SNP+flanking sequence generation) from command line.
It should be easy enough to modify it to read the input from a file (open()
and a while loop over lines) in a slightly different format.

	-Heikki


V.Kuryshev@dkfz-heidelberg.de wrote:
> 
> Dear All,
> 
> I am new in this mail-list and hope that my request is in topic of interest.
> 
> I am looking for the program (script) which can help to extract from one or
> number of sequences (fasta) some subsequences using their coordinates from
> separate text file like:
> 
> ACnnnnn1  12001 13000
> ACnnnnn1  14209 15787
> ACnnnnn3  1009  15787
> ACnnnn11  429   5787
> .....
> 
> and collect them in new fasta-file.
> 
> Important thing is that orientation of subsequences should follow the rule :
> from small number to biger (if it's written ACnnnn 456 345 it means that
> corresponded subsequence will be written in new file as reverce complement to
> original sequence)
> 
> E.g. it can be useful if you make assembling from draft(HTGS) genomic sequences
> (contain unordered pieces).
> 
> I've just started to learn Perl for bioinformatics purposes and will be glad to
> hear any advices to improve this studing from Internet.
> 
> Thank you in advance.
> 
> Vladimir
> 
> Molecular Genome Analysis
> German Cancer Research Center
> Im Neuenheimer Feld 506
> Lab 056
> D-69120 Heidelberg
> Germany
> Phone:  +49 6221 42 4701
> Fax:    +49 6221 42 4704
> e-mail: V.Kuryshev@dkfz.de
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________
--------------A22A4B0F9B9436D7ED42CCBA
Content-Type: text/plain; charset=us-ascii;
 name="flanks"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="flanks"

#!/usr/bin/perl -- # -*-Perl-*-
#
# Heikki Lehvaslaiho <heikki@ebi.ac.uk>
# Finding flanking sequences for a SNP.
# 
# Requires BioPerl > 0.6.1, possibly 0.7 (latest at the moment)
#
#   v. 1     16 Mar 2001
#   v. 1.1   9 Aug 2001 interface change, more info in fasta header 
#


use Bio::PrimarySeq;
use Bio::SeqIO;
use Bio::DB::EMBL;
use strict;

my $flank;                    # Will default to 50
my ($file, $pos, $strand);
my $in_format = 'EMBL';       # format of the local file on disk
my $out_format = 'FASTA';     # output format, going into STDOUT

if (@ARGV != 2 and @ARGV != 3 and @ARGV != 4) {
    print "Usage: flanks (accession or filename) position [length] [strand]\n";
    exit;
}

$file = shift;
$pos = shift;
$flank = shift;
$strand = shift;

$strand ||= 1;
$flank ||= 50;          # defaults to 50

print STDERR "Are you sure you did not want length",
    " [$flank] to be assigned for stand" if $flank == 1 or $flank == -1;

my ($seq, $out_seq);

if (-e $file ) {
    my $in  = Bio::SeqIO->new('-file' => $file,
			      '-format' => $in_format);
    $seq = $in->next_seq();
} else {
    my $gb = new Bio::DB::EMBL;
    $seq = $gb->get_Seq_by_acc('$file');
}

die "ERROR: Position '$pos' not in sequence"
    if $pos < 1 or $pos > $seq->length;
my $out = Bio::SeqIO->new('-format' => $out_format);

my $five_start = $pos - $flank;
$five_start = 1 if $five_start < 1;

my $three_end = $pos + $flank;
$three_end = $seq->length if $pos + $flank > $seq->length;

my $five_prime = lc $seq->subseq($five_start , $pos - 1);
my $snp = uc $seq->subseq($pos, $pos);
my $three_prime = lc $seq->subseq($pos + 1, $three_end); 
my $locpos = length($five_prime) + 1;

my $fastaid = "$file oripos=$pos strand=$strand allelepos=$locpos";

if (defined $strand and $strand == -1) {
    my $five_prime_seq = new Bio::PrimarySeq(-seq=>$five_prime);
    my $snp_seq = new Bio::PrimarySeq(-seq=>$snp);
    my $three_prime_seq = new Bio::PrimarySeq(-seq=>$three_prime);

    my $str = $three_prime_seq->revcom->seq. " ".
	$snp_seq->revcom->seq. " ". $five_prime_seq->revcom->seq;
    #print STDERR  $str, "\n";
    $str =~ s/ //g;
    $out_seq = new Bio::PrimarySeq (-id => $fastaid,
				    -seq => $str );
} else {
    my $str = $five_prime. " ". $snp. " ". $three_prime;
    #print STDERR $str, "\n";
    $str =~ s/ //g;
    $out_seq = new Bio::PrimarySeq (-id => $fastaid,
				    -seq => $str );
}

$out->write_seq($out_seq);




--------------A22A4B0F9B9436D7ED42CCBA--