[Bioperl-l] New to BioPerl - A little presentation... and a question about GenBank and Bioperl

Olivier BUHARD Olivier.Buhard at inserm.fr
Thu Mar 6 11:05:56 UTC 2014


Hello,


I'm new to BioPerl and would like to ask you for some advice about 
using it.

I am a molecular biologist and I frequently use Perl to write scripts to 
prepare or analyse files I get from various databases, so I'm familiar 
enough with Perl.
In my lab we work on a particular type of tumorigenic process called 
MSI, for MicroSatellite Instability. I won't go through the whole story, 
but a hallmark of the associated cancers is that the length of the 
repeated DNA sequences spread throughout their genome is altered.
Until now, we got a list of those sequences from collaborators who 
could produce it for us. But the list we have is now old, so we have to 
get this information by our own means, and naturally I started looking 
at BioPerl.
Before I start learning everything I need (which, I guess, will take 
some time), I would really appreciate it if someone could tell me 
whether BioPerl can help from start to end.

In summary, I plan to search for all the short repetitive sequences 
(I'm only interested in the human genome at the moment) that I can find 
in the GenBank flat files provided on the NCBI FTP site. The idea is to 
create a BioSQL database (I have already installed one using the MySQL 
schema) that I could query using an appropriate algorithm.
I saw that BioPerl can read those multi-entry files, so building the 
BioSQL database should not be a problem. My first question is about how 
to crawl through the genomic sequences to detect short tandem repeats 
of defined size and pattern (some are mononucleotide repeats, like 
(A)27, others are dinucleotide repeats, like (CA)12, etc.). BLAST is 
not designed for such a job... Are there tools already available in 
BioPerl to deal with low-complexity DNA in general and short tandem 
repeats in particular, something like RepeatMasker or WindowMasker but 
with a different kind of output? I'm interested in retrieving some of 
the features provided with the GenBank format (find repeats in coding 
or non-coding regions, get their position in the genes or the 
transcripts with respect to exon positions, intron-exon proximity...).
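
To make the kind of scan described above concrete, here is a minimal 
sketch in plain Perl. It is only an illustration under stated 
assumptions: find_tandem_repeats is a hypothetical helper, not a BioPerl 
function; it reports only perfect (uninterrupted) repeats; and it takes 
the sequence as a plain string, which in a real script might come from 
$seq_obj->seq.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): find perfect tandem repeats
# of a unit of $unit_len bases repeated at least $min_copies times.
sub find_tandem_repeats {
    my ($seq, $unit_len, $min_copies) = @_;
    my $extra = $min_copies - 1;    # copies required after the first unit
    my @hits;
    while ($seq =~ /([ACGT]{$unit_len})\1{$extra,}/gi) {
        # Save the match offsets before running any other regex
        my ($from, $to) = ($-[0], $+[0]);
        my $unit = uc $1;
        # Skip units that are themselves a shorter repeat, e.g. "AA" when
        # scanning for dinucleotides (that is really a mononucleotide run)
        next if $unit_len > 1 && $unit =~ /^(.+)\1+$/;
        push @hits, {
            unit   => $unit,
            start  => $from + 1,    # 1-based start, as in GenBank locations
            end    => $to,          # 1-based inclusive end
            copies => ($to - $from) / $unit_len,
        };
    }
    return @hits;
}

# Toy sequence containing (A)27 and (CA)12
my $seq = "GGT" . ("A" x 27) . "CCG" . ("CA" x 12) . "TTG";
for my $hit (find_tandem_repeats($seq, 1, 27), find_tandem_repeats($seq, 2, 12)) {
    printf "(%s)%d at %d..%d\n", @{$hit}{qw(unit copies start end)};
}
```

The coordinates are 1-based so that they can be compared directly with 
feature locations (exons, CDS) read from the same GenBank entry. 
Interrupted or imperfect repeats would need a dedicated tool.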

I also have a more direct and "practical" question. I tried a few of 
the code samples provided in the beginners' tutorials on the BioPerl 
site. I ran the following on the gbpri1.seq file provided on the NCBI 
FTP site, but I got errors and warnings for many (but not all) sequences.

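#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

# GenBank flat file to read (gbpri1.seq from the NCBI FTP site)
my $seq_file = 'gbpri1.seq';

my $seqio_obj = Bio::SeqIO->new(-file => "<$seq_file", -format => 'genbank');
# Print the identifier of every entry in the file
while (my $seq_obj = $seqio_obj->next_seq()) {
    print $seq_obj->display_id, "\n";
}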

This is what I get for AB000095 locus:

Replacement list is longer than search list at C:/Perl/site/lib/Bio/Range.pm line 251.
UNIVERSAL->import is deprecated and will be removed in a future perl at C:/Perl/site/lib/Bio/Tree/TreeFunctionsI.pm line 94
Subroutine new redefined at C:/Perl/site/lib/Bio\Location\Simple.pm line 93, <GEN0> line 41.
Subroutine start redefined at C:/Perl/site/lib/Bio\Location\Simple.pm line 115, <GEN0> line 41.
Subroutine end redefined at C:/Perl/site/lib/Bio\Location\Simple.pm line 144, <GEN0> line 41.
Subroutine length redefined at C:/Perl/site/lib/Bio\Location\Simple.pm line 190, <GEN0> line 41.
Subroutine location_type redefined at C:/Perl/site/lib/Bio\Location\Simple.pm line 281, <GEN0> line 41.
Subroutine to_FTstring redefined at C:/Perl/site/lib/Bio\Location\Simple.pm line 328, <GEN0> line 41.
Subroutine trunc redefined at C:/Perl/site/lib/Bio\Location\Simple.pm line 370, <GEN0> line 41.
AB000095

But when I remove the -w option from the shebang line, the warnings 
disappear. (I use ActivePerl 5.14.2 on a Windows XP computer. I had the 
idea that the shebang line was not used under Windows, but it seems 
that's wrong here...)
Is that due to some problem with my Perl installation, or is it related 
to the Bio::SeqIO code?
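
For context on the -w observation, a small sketch of the two ways of 
enabling warnings. As documented in perlrun, perl itself scans the #! 
line for switches on every platform, and -w sets the global $^W 
variable, which can also enable warnings inside loaded modules. The 
"use warnings" pragma is lexically scoped and only covers the file that 
declares it. The helper name global_warnings_state is made up for this 
illustration, and whether this fully accounts for the messages above is 
an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;    # lexical: only affects the code in this file

# $^W reflects the global -w switch; it is true only when perl was
# started with -w (on the command line or on the #! line).
sub global_warnings_state {
    return $^W ? "on" : "off";
}

print "Global -w warnings are ", global_warnings_state(), "\n";
```

Running the same file with and without -w on the first line shows 
whether the switch is being picked up from the shebang.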

Thanks in advance for any answer.

Kind regards

-- 

--------------------

BUHARD Olivier

"Instabilité de microsatellites et cancer"
Centre de Recherche Saint Antoine
équipe 11/INSERM UMRS 938
Bâtiment Kourilsky,
Hôpital Saint Antoine
34 rue Crozatier
75012 PARIS




