[Bioperl-l] Suggestion for a new script: bp_repeat_mask_sequence.pl

Tue Jul 6 23:14:15 UTC 2010

Cheers Brian!

Do you think this script will work if I allow sequence and feature
'-format's to be picked by the user among all those listed as valid
file formats in BioPerl?

http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats
http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/FeatureIO.pm#SUPPORTED_FORMATS

Or should I stick to Fasta/GFF (or some other implementation)?

Cheers,
Dan.

On 6 July 2010 15:50, Brian Osborne <bosborne11 at verizon.net> wrote:
> Dan,
>
> There are 2 different directories for scripts, examples/ and scripts/. The examples/ directory accepts any sort of script. The scripts/ directory scripts can be installed when Bioperl is installed, if the user wishes. The guidelines are that scripts/ directory scripts should accept command-line arguments (which yours does), should be named with the suffix 'PLS', and should have POD documentation. So, all you need is some POD. Here's some example POD:
>
> =head1 NAME
>
> bioflat_index.pl - index sequence files using Bio::DB::Flat
>
> =head1 DESCRIPTION
>
>  Create or update a biological sequence database indexed with the
>  Bio::DB::Flat indexing scheme.  The arguments are a list of flat files
>  containing the sequence information to be indexed.
>
> =head1 USAGE
>
>  bioflat_index.pl <options> file1 file2 file3...
>
>  Options:
>
>     --create              Create or reinitialize the index.  If not specified,
>                           the index must already exist.
>
>     --format   <format>   The format of the sequence files.  Must be one
>                           of "genbank", "swissprot", "embl" or "fasta".
>
>     --location <path>     Path to the directory in which the index files
>                           are stored.
>
>     --dbname <name>       The symbolic name of the database to be created.
>
>     --indextype <type>    Type of index to create.  Either "bdb" or "flat".
>                           "binarysearch" is the same as "flat".
>
> Options can be abbreviated.  For example, use -i for --indextype.
>
> The following environment variables will be used as defaults if the
> corresponding options are not provided:
>
>     OBDA_FORMAT      format of sequence file
>     OBDA_LOCATION    path to directory in which index files are stored
>     OBDA_DBNAME      name of database
>     OBDA_INDEX       type of index to create
>
> =cut
>
>
> On Jul 6, 2010, at 2:37 PM, Dan Bolser wrote:
>
>> Hello,
>>
>> I'd like to submit a script, 'bp_repeat_mask_sequence.pl'(?), to the
>> set of scripts in BioPerl. Below is what I have so far. The script
>> works by reading in a GFF of 'repeat_region's and a fasta file of
>> sequences. It outputs a fasta sequence file with the repeats replaced
>> by Xs.
>>
>> The script clearly needs to be more configurable, but I thought I'd
>> send it along now to see if I'm working along the right lines, or if I
>> should be using a different approach.
>>
>> Comments?
>>
>>
>> Cheers,
>> Dan.
>>
>>
>>
>> #!/usr/bin/perl -w
>>
>> use strict;
>> use Getopt::Long;
>>
>> use Bio::SeqIO;
>> use Bio::FeatureIO;
>>
>>
>>
>> ## Set options
>>
>> my $verbose = 0;
>> my $seq_file;
>> my $gff_file;
>>
>> GetOptions
>>  (
>>   "verbose" => \$verbose,
>>   "seq=s" => \$seq_file,
>>   "gff=s" => \$gff_file,
>>  )
>>  or die "failed to parse command line options\n";
>>
>> die "fail $gff_file : $!\n"
>>  unless -s $gff_file;
>>
>>
>>
>> ## Set up the BioPerl objects
>>
>> my $seq_reader =
>>  Bio::SeqIO->new( -file => $seq_file,
>>                  -format => 'fasta'
>>                );
>>
>> my $seq_writer =
>>  Bio::SeqIO->new( -fh => \*STDOUT,
>>                  -format => 'fasta',
>>                  -width => 80
>>                );
>>
>> my $gff_reader =
>>  Bio::FeatureIO->new( -file => $gff_file,
>>                      -format => 'GFF',
>>                    );
>>
>> #warn $seq_reader->width, "\n"; exit;
>>
>>
>>
>> ## Run
>>
>> my (%repeats, $c);
>>
>> while ( my $feature = $gff_reader->next_feature() ) {
>>  if($verbose>1){
>>    print
>>      join("\t", #$feature,
>>          $feature->seq_id,
>>          $feature->type->name,
>>          $feature->start,
>>          $feature->end,
>>         ), "\n";
>>  }
>>
>>  if($feature->type->name eq 'repeat_region'){
>>    $c++;
>>    push @{$repeats{ $feature->seq_id }},
>>      [$feature->start,
>>       $feature->end];
>>  }
>>
>>  # Debugging
>>  #last if $c > 100;
>> }
>>
>> warn "read $c repeat_region features for ",
>>  scalar keys(%repeats), " sequences\n";
>>
>>
>>
>> ##
>>
>> while(my $seq = $seq_reader->next_seq){
>>  my $id = $seq->id;
>>  my $sequence = $seq->seq;
>>
>>  print $id, "\n"
>>    if $verbose > 0;
>>
>>  print length($sequence), "\n"
>>    if $verbose > 0;
>>
>>  for my $region (@{$repeats{ $id }}){
>>    my ($start, $end) = @$region;
>>    print "$start\t$end\n"
>>      if $verbose > 1;
>>
>>    substr($sequence, $start, $end - $start, 'X' x ($end - $start));
>>  }
>>
>>  print length($sequence), "\n"
>>    if $verbose > 0;
>>
>>  $seq->seq($sequence);
>>
>>  $seq_writer->write_seq($seq);
>>
>>  # Debugging;
>>  #last;
>> }
>>
>> warn "OK\n";
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>