Bioperl: RE:manipulating long strings (genomes) in PERL
Ian Korf
ikorf@sapiens.wustl.edu
Tue, 30 Mar 1999 10:53:33 -0600 (CST)
----------
X-Sun-Data-Type: text
X-Sun-Data-Description: text
X-Sun-Data-Name: text
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 17
I've attached a script called chromoShatter that probably does what you're
looking for. It breaks up chromosome fasta files into smaller chunks with
overlapping edges. The default is 10 Kb chunks with 100 bp overlap, but it's
a command line option. It creates two files:
1) a fasta database named something.genome
2) a lookup file named something.lookup
The fasta database is pretty self explanatory I think. The lookup file
contains the name of each chunk and its offset in the something.genome file.
The sequence is all on one line, so a single read gets it. The lookup
file is sorted so you can search it with a binary search such as the UNIX
look command. Alternatively, you can just put the names and offsets in a
tied HASH (dbmopen) and access that way. A bit more overhead, but faster.
If you need help with the dbmopen, or possibely some other related stuff,
this link may help you: http://sapiens.wustl.edu/~ikorf/perly/index.html
-Ian
----------
X-Sun-Data-Type: default-app
X-Sun-Data-Description: default
X-Sun-Data-Name: chromoShatter
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 87
#!/usr/local/bin/perl -w
use strict;
use Getopt::Std;
use vars qw($opt_o $opt_s);
getopts('o:s:');
my $usage = '
CHROMOSHATTER - breaks a list of chromosome-sized fasta files into smaller
chunks and concatenates these into a fasta database. Progress is sent to
STDOUT, which should be sent to a .log file. Two files are created:
file.genome the concatenated fasta files (each chunk on one line)
file.lookup a list of each chunk and its offset in the .genome file
You can specify the size of each chunk and overlaps between chunks. Each entry
in the database specifies which chromosome and chunk number (and also the chunk
size and overlap). For example, chunk 57 on LG1 (with default chunksize and
overlap) would look like this:
>LG1#57 (10000:100)
usage: chromoShatter [options] <database name> <chrom1> <chrom2> <...>
options: default
-s size of each chunk 10000
-o overlap between each chunk 100
';
die $usage unless @ARGV > 1;
# Setup
my $size = $opt_s ? $opt_s : 10000;
my $overlap = $opt_o ? $opt_o : 100;
my ($DBNAME, @chrom) = @ARGV;
my @lookup; # store lookups for sorting later
foreach my $chrom (@chrom) {die "$chrom no found\n" unless -e $chrom}
open(GENOME, ">$DBNAME.genome") or die "cannot create $DBNAME.genome\n";
# Main loop
my $offset = 0;
foreach my $chrom (@chrom) {
open(CHROM, $chrom) or die "cannot open $chrom\n";
my $chunk = 0;
# let perl know how big I want this thing upfront to manage memory
my $seq = " " x ($size + $overlap + 1000);
$seq = "";
my $def = <CHROM>;
my $end_of_file = 0;
SEGMENT: {
my $line = <CHROM>;
unless (defined $line) {
$end_of_file = 1;
last SEGMENT;
}
chomp($line);
$seq .= $line;
last SEGMENT if (length($seq) > $size + $overlap);
redo SEGMENT;
}
# trim seq to proper size and save leftover
my $leftover = substr($seq, $size) unless $end_of_file;
$seq = substr($seq, 0, $size + $overlap);
# write def and seq to genome database
$def = ">$chrom#$chunk ($size:$overlap)\n";
print GENOME $def, $seq, "\n";
# write chrom + chunk to lookup
push @lookup, "$chrom#$chunk $offset\n";
# reset the offset and seq, increment chunk #
$offset += length($def) + length($seq) + 1;
$chunk++;
$seq = $leftover;
# get another segment or another chromosome
goto SEGMENT unless $end_of_file;
close CHROM;
}
close GENOME;
# sort lookup alphabetically
@lookup = sort {$a cmp $b} @lookup;
open(LOOKUP, ">$DBNAME.lookup") or die "cannot create $DBNAME.lookup\n";
print LOOKUP @lookup;
close LOOKUP;
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================