[Bioperl-l] Extracting patterns

Mariana Mondragon mmondrag@ea.oac.uci.edu
Mon, 20 Aug 2001 19:03:25 -0700 (PDT)


Hi everyone,

I have a list of amino acid sequences in FASTA format. With the script
pasted below, I would like to obtain a list of the sequence IDs and lengths
of every sequence, as well as the sum of all the sequence lengths, like this:

f11m1513: 572
F24D78: 967
T12P1811: 1032
TOTAL LENGTH = 2571

However, I am getting something like this:
:2671
TOTAL LENGTH= 2671

This is part of the exercises I am using to learn Perl. To fix the
problem I have tried changing the counters and the order of the variable
declarations, but this does not seem to work. I copied the original
script from chapter 17, "Using Perl to facilitate biological analysis",
of the book Bioinformatics by Baxevanis A. and Ouellette B.F.

I hope one of you can shed some light on this. Thanks in advance.

M. Mondragon

**********************************************************************
THE SCRIPT:

#!/usr/bin/perl

$id='';                          #holds sequence ID of current sequence
$length=0;                       #holds length of current sequence
$total_length=0;                 #tallies aggregate length of all seqs
while (<>)
{chomp;
if (/^>(\S+)$/)                  #found a new description line
 {print "$id:$length\n" if $length>0;
 $1=$id;
 $length=0;}

else
 {$length +=length;
  $total_length +=length;}
 }
print "$id:$length\n" if $length>0;   #last entry
print "TOTAL LENGTH= $total_length\n";
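
For comparison, here is a minimal sketch of how the script above might be
written so that it produces the desired listing. It rests on two assumptions.
First, the line `$1=$id;` looks like a reversed assignment: the captured ID
lives in $1 and has to be copied into $id (current versions of Perl refuse to
assign to $1 at all). Second, the reported total of 2671 is exactly 100 more
than the expected 2571, which is what one would see if the description lines
never matched the pattern and every line carried one extra trailing character,
for instance a carriage return from a DOS-formatted file. That second point is
only a guess from the numbers. The helper name fasta_lengths is hypothetical
and does not come from the book.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes FASTA lines, returns a reference to a list
# of [id, length] pairs and the summed length of all sequences.
sub fasta_lengths {
    my @lines = @_;
    my (@records, $id);
    my ($length, $total) = (0, 0);

    for my $line (@lines) {
        $line =~ s/\s+$//;          # drop newline, carriage return, trailing blanks
        next if $line eq '';        # skip empty separator lines

        if ($line =~ /^>(\S+)/) {   # found a new description line
            push @records, [$id, $length] if defined $id;
            $id     = $1;           # copy the captured ID INTO $id
            $length = 0;
        }
        else {                      # sequence line; a trailing '*' is counted too
            $length += length $line;
            $total  += length $line;
        }
    }
    push @records, [$id, $length] if defined $id;   # last entry
    return (\@records, $total);
}

# Runs only when a file name is given on the command line,
# e.g.  perl seqlen.pl proteins.fasta  (both names are made up).
if (@ARGV) {
    my ($records, $total) = fasta_lengths(<>);
    print "$_->[0]: $_->[1]\n" for @$records;
    print "TOTAL LENGTH = $total\n";
}
```

The sketch counts the trailing `*` as part of each sequence, because the
expected lengths listed above appear to include it. Stripping trailing
whitespace before matching is a design choice that keeps the script
indifferent to Unix or DOS line endings.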

******************************************************************************
THE DATA

>f11m1513
MSTDELLTFDHVDIRFPIELNKQGSCSLNLTNKTDNYV
AFKAQTTKPKMYCVKPSVGVVLPRSSCEVLVVMQALKE
APADRQCKDKLLFQCKVVEPGTMDKEVTSEMFSKEAGH
RVEETIFKIIYVAPPQPQSPVQEGLEDGSSPSASVSDK
GNASEVFVGPSVGIVDLIRMSDELLIIDPVDVQFPIEL
NKKVSCSLNLTNKTENYVAFKAKTTNAKKYYVRPNVGV
VLPRSSCEVLVIMQALKEAPADMQCRDKLLFQCKVVEP
ETTAKDVTSEMFSKEAGHPAEETRLKVMYVTPPQPPSP
VQEGTEEGSSPRASVSDNGNASEAFVDMLRSLLVPLFS
NAASSTDDHGITLPQYQVFINFRGDELRNSFVGFLVKA
MRLEKINVFTDEVELRGTNLNYLFRRIEESRVAVAIFS
ERYTESCWCLDELVKMKEQMEQGKLVVVPVFYRLNATA
CKRFMGAFGDNLRNLEWEYRSEPERIQKWKEALSSVFS
NIGLTSDIRRYNLINKNMDHTSEFLYIVLILNFFSEIS
DMTGLTTSYQFLLMMKSNLISYDIYIYPTKFCVNVFIG
V*

>F24D78
MASSSSSPRTWRYRVFTSFHGPDVRKTVLSHLRKQFIC
NGITMFDDQRIERGQTISPELTRGIRESRISIVVLSKN
YASSSWCLDELLEILKCKEDIGQIVMTVFYGVDPSDVR
KQTGEFGIRFSETWARKTEEEKQKWSQALNDVGNIAGE
HFLNWDKESKMVETIARDVSNKLNTTISKDFEDMVGIE
AHLQKMQSLLHLDNEDEAMIVGICGPSGIGKTTIARAL
HSRLSSSFQLTCFMENLKGSYNSGLDEYGLKLCLQQQL
LSKILNQNDLRIFHLGAIPERLCDQNVLIILDGVDDLQ
QLEALTNETSWFGPGSRIIVTTEDQELLEQHDINNTYH
VDFPTIKEARKIFCRSAFRQSSAPYGFEKLVERVLKLC
SNLPLGLRVMGSSLRRKKEDDWESILHRQENSLDRKIE
GVLRVGYDNLHKNDQFLFLLIAFFFNYQDNDHVKAMLG
DSKLDVRYGLKTLAYKSLIQISIKGDIVMHKLLQQVGK
EAVQRQDHGKRQILIDSDEICDVLENDSGNRNVMGISF
DISTLLNDVYISAEAFKRIRNLRFLSIYKTRLDTNVRL
HLSEDMVFPPQLRLLHWEVYPGKSLPHTFRPEYLVELN
LRDNQLEKLWEGIQPLTNLKKMELLRSSNLKVLPNLSD
ATNLEVLNLALCESLVEIPPSIGNLHKLEKLIMDFCRK
LKVVPTHFNLASLESLGMMGCWQLKNIPDISTNITTLK
ITDTMLEDLPQSIRLWSGLQVLDIYGSVNIYHAPAEIY
LEGRGADIKKIPDCIKDLDGLKELHIYGCPKIVSLPEL
PSSLKRLIVDTCESLETLVHFPFESAIEDLYFSNCFKL
GQEARRVITKQSRDAWLPGRNVPAEFHYRAVGNSLTIP
TDTYECRICVVISPKQKMVEFFDLLCRQRKNGFSTGQK
RLQLLPKVQAEHLFIGHFTLSDKLDSGVLLEFSTSSKD
IDIIECGIQIFHGHYR*

>T12P1811
MSLMDSPSSISSCNYRFNVFSSFHGPNVRKTLLSHMRK
QFNFNGITMFDDQGIERSEEIVPSLKKAIKESRISIVI
LSKKYALSRWCLDELVEILKCKEVMGHIVMTIFYGVEP
SDVRKQTGEFGFHFNETCAHRTDEDKQNWSKALKDVGN
IAGEDFLRWDNEAKMIEKIARDVSDKLNATPSRDFNGM
VGLEAHLTEMESLLDLDYDGVKMVGISGPAGIGKTTIA
RALQSRLSNKFQLTCFVDNLKESFLNSLDELRLQEQFL
AKVLNHDGIRICHSGVIEERLCKQRVLIILDDVNHIMQ
LEALANETTWFGSGSRIVVTTENKEILQQHGINDLYHV
GFPSDEQAFEILCRYAFRKTTLSHGFEKLARRVTKLCG
NLPLGLRVLGSSLRGKNEEEWEEVIRRLETILDHQDIE
EVLRVGYGSLHENEQSLFLHIAVFFNYTDGDLVKAMFT
DNNLDIKHGLKILADKSLINISNNREIVIHKLLQQFGR
QAVHKEEPWKHKILIHAPEICDVLEYATGTKAMSGISF
DISGVDEVVISGKSFKRIPNLRFLKVFKSRDDGNDRVH
IPEETEFPRRLRLLHWEAYPCKSLPPTFQPQYLVELYM
PSSQLEKLWEGTQRLTHLKKMNLFASRHLKELPDLSNA
TNLERMDLSYCESLVEIPSSFSHLHKLEWLEMNNCINL
QVIPAHMNLASLETVNMRGCSRLRNIPVMSTNITQLYV
SRTAVEGMPPSIRFCSRLERLSISSSGKLKGITHLPIS
LKQLDLIDSDIETIPECIKSLHLLYILNLSGCRRLASL
PELPSSLRFLMADDCESLETVFCPLNTPKAELNFTNCF
KLGQQAQRAIVQRSLLLGTTLLPGRELPAEFDHQGKGN
TLTIRPGTGFVVCIVISPNLASQITEYRLPQLLCRRRI
GQGDLDPIEKVFNVRTLLNFQTEHLFVFIIHPHLPFID
PSEVSREIVFEFSSKFNHFDVIDCGAKFLTDGSIKGSY
DSGLEQVFEDNTKHGDHADCWNWLFHCFDLPHFVKNVR
SFVSV*

@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>@>
Mariana Mondragon-Palomino
321 Steinhaus Hall
Department of Ecology and Evolutionary Biology
University of California
Irvine, CA 92697-2525

mmondrag@uci.edu
Office 418
Ph.# 824-7703