[Bioperl-l] how to remove identical sequences from a dataset

Tue Aug 5 15:19:54 UTC 2008

Here are two links that go into detail (the second is a specific
implementation):

http://en.wikipedia.org/wiki/Sequence_clustering
http://www.bioinformatics.org/cd-hit/
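
For a dataset of this size, the hash approach Bernd mentions below
(sequences as hash keys) is probably enough. Here is a minimal sketch in
core Perl. The helper name unique_by_sequence and the example records are
made up for illustration, and it assumes that sequences differing only in
case or whitespace should count as identical. In practice the records
could be read with Bio::SeqIO rather than hard-coded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep the first record seen for each distinct sequence.
# Takes a list of [id, sequence] pairs and returns the pairs that survive.
sub unique_by_sequence {
    my @records = @_;
    my %seen;      # normalized sequence => number of times seen so far
    my @unique;
    for my $rec (@records) {
        my ($id, $seq) = @$rec;
        (my $key = uc $seq) =~ s/\s+//g;   # assumption: ignore case and whitespace
        next if $seen{$key}++;             # an identical sequence was already kept
        push @unique, [$id, $seq];
    }
    return @unique;
}

# Made-up example data; real input would come from a FASTA file.
my @records = (
    [ seq1 => 'ATGCCGTA' ],
    [ seq2 => 'atgccgta' ],   # same as seq1 apart from case
    [ seq3 => 'ATGCCGTT' ],
);

# Print the survivors in FASTA format.
printf ">%s\n%s\n", @$_ for unique_by_sequence(@records);
```

Only the first ID seen for each sequence is kept, so seq2 is dropped
here. For very long sequences, keying the hash on a digest of the
sequence instead of the sequence itself would keep memory use down, which
is presumably why bp_nrdb requires Digest::MD5.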

chris

On Aug 5, 2008, at 5:28 AM, Diego Mauricio Riano Pachon wrote:

> Hi all,
>
> Or you might try a non-BioPerl solution that works pretty well:
>
> http://blast.wustl.edu/pub/nrdb/executables/nrdb.linux-x86
>
> Best,
>
> Diego
>
> Bernd Web wrote:
>> Hi,
>> There is a BioPerl Utility script doing this.
>> See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities  
>> header.
>> " scripts/utilities/bp_nrdb.PLS
>>    Make a non-redundant database based on sequence, not id. Requires
>> Digest::MD5."
>> Alternatively, you can make a hash using the sequences as keys.
>> Regards,
>> Bernd
>> On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao at gmail.com>  
>> wrote:
>>> Hi there,
>>>
>>> I have a dataset of about 200 sequences, some of which are
>>> identical. Is there a BioPerl module that can remove the
>>> identical sequences?
>>>
>>> thanks a lot.
>>> yours,
>>> shaohua
>>> ----- Original Message -----
>>> From: "Benbo" <btemperton at googlemail.com>
>>> To: <Bioperl-l at lists.open-bio.org>
>>> Sent: Sunday, August 03, 2008 4:05 AM
>>> Subject: [Bioperl-l] Finding possible primers regex
>>>
>>>
>>>> Hi there,
>>>> I'm trying to write a Perl script to scan an aligned multi-entry FASTA
>>>> file and find possible primers. So far I've produced a string that
>>>> contains the base wherever all sequences match and * wherever they
>>>> don't, e.g.
>>>> 1) TTAGCCTAA
>>>> 2) TTAGCAGAA
>>>> 3) TTACCCTAA
>>>>
>>>> would give TTA*C**AA.
>>>>
>>>> I want to parse this string and pull out all substrings that are 18-21 bp
>>>> long and contain no more than four * characters.
>>>>
>>>> So far, I've got this:
>>>>
>>>> while ($fragment_match =~ /([GTAC*]{18,21})/g) {
>>>>     print "$1\n";
>>>> }
>>>>
>>>> hoping to match all fragments 18-21 characters in length. However, even
>>>> that doesn't work: it essentially chunks the string into 21-character
>>>> blocks, rather than what I hoped for:
>>>> 0-18
>>>> 0-19
>>>> 0-20
>>>> 0-21
>>>> 1-19
>>>> 1-20
>>>> 1-21
>>>> 1-22
>>>>
>>>> etc.
>>>>
>>>> Can anyone let me know if this is already possible in BioPerl, or
>>>> how one would go about it with a regex? Sadly, I'm fairly new to
>>>> Perl and still getting to
>>>> grips with BioPerl, so please treat me gently :).
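
On the quoted primer question: with /g, each new search resumes where the
previous match ended, so the matches can never overlap, and the greedy
{18,21} takes 21 characters every time. One common way around this is to
put the capture inside a zero-width lookahead and loop over the window
lengths. The sketch below is only an illustration in core Perl. The
helper name candidate_windows and its parameters are made up, and the
demo uses the short example string from the question with small window
sizes; the real call would pass 18, 21 and 4.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every window of $min..$max characters, at every
# start position, that contains at most $max_stars '*' characters.
sub candidate_windows {
    my ($consensus, $min, $max, $max_stars) = @_;
    my @hits;
    for my $len ($min .. $max) {
        # The capture sits inside a zero-width lookahead, so nothing is
        # consumed and the next attempt starts one character further on.
        while ($consensus =~ /(?=([GTAC*]{$len}))/g) {
            my $window = $1;
            my $stars  = ($window =~ tr/*//);   # counts '*' without changing $window
            push @hits, $window if $stars <= $max_stars;
        }
    }
    return @hits;
}

# Demo on the example consensus from the question, with small sizes so
# that the short string yields some windows.
print "$_\n" for candidate_windows('TTA*C**AA', 3, 4, 1);
```

Windows are returned grouped by length, shortest first, and in order of
start position within each length.
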
>>>>
>>>> Many thanks,
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html
>>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
> -- 
> ___________________________________
> Diego Mauricio Riaño Pachón
> Biologist - PhD student
> AG Mueller-Roeber
> Institute for Biochemistry and Biology
> University of Potsdam
>
> Address: Karl-Liebknecht-Str. 24-25
> 	 Haus 20
> 	 14476 Golm
> 	 Germany
>
> Tel:	 +49 331 977 2809
> Fax:	 +49 331 977 2512
>
> web:	http://www.geocities.com/dmrp.geo

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign