[Bioperl-l] how to remove identical sequences from a dataset

Diego Mauricio Riano Pachon diriano at uni-potsdam.de
Tue Aug 5 10:28:58 UTC 2008


Hi all,

Alternatively, you could try a non-BioPerl solution that works quite well. See:

http://blast.wustl.edu/pub/nrdb/executables/nrdb.linux-x86

Best,

Diego

Bernd Web wrote:
> Hi,
> 
> There is a BioPerl Utility script doing this.
> See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities header.
> 
> "scripts/utilities/bp_nrdb.PLS
>     Make a non-redundant database based on sequence, not id.
>     Requires Digest::MD5."
> 
> Alternatively, you can make a hash using the sequences as keys.
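
A minimal sketch of the hash approach mentioned above, in plain Perl with no BioPerl dependency. The helper names (read_fasta, unique_by_sequence) and the sample records are made up for illustration, and the decision to compare sequences case-insensitively is an assumption; drop the uc() call if case matters in your data. For real files, reading the records with Bio::SeqIO instead of the toy parser would be the more robust choice.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first record seen for each distinct sequence.
# Takes a list of [id, sequence] pairs and returns the non-redundant pairs
# in their original order.
sub unique_by_sequence {
    my @records = @_;
    my %seen;      # upper-cased sequence => number of times encountered
    my @unique;
    for my $rec (@records) {
        my ($id, $seq) = @$rec;
        my $key = uc $seq;        # assumption: "acgt" and "ACGT" are identical
        next if $seen{$key}++;    # skip later copies of the same sequence
        push @unique, $rec;
    }
    return @unique;
}

# Hypothetical helper: parse simple FASTA text from a filehandle into
# [id, sequence] pairs, joining multi-line sequences.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                   # tolerate Windows line endings
        if ($line =~ /^>(.*)/) {
            push @records, [$1, ''];        # header line starts a new record
        }
        elsif (@records) {
            $line =~ s/\s+//g;
            $records[-1][1] .= $line;       # append to the current sequence
        }
    }
    return @records;
}

# Made-up example data: seq3 is a lower-case copy of seq1.
my $fasta = <<'END';
>seq1
ACGTACGT
>seq2
TTGACC
>seq3
acgtacgt
END

open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my @unique = unique_by_sequence(read_fasta($fh));
close $fh;

# Prints the seq1 and seq2 records only.
print ">$_->[0]\n$_->[1]\n" for @unique;
```

To run it on a real file, open that file instead of the in-memory string. With about 200 sequences the hash easily fits in memory; for very large datasets, hashing a digest of each sequence (as bp_nrdb does with Digest::MD5) keeps the keys short.
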
> 
> 
> Regards,
> Bernd
> 
> On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao at gmail.com> wrote:
>> Hi there,
>>
>> I have a dataset of about 200 sequences, some of which are identical. Is there a BioPerl module that can remove the identical sequences?
>>
>> Thanks a lot.
>> yours,
>> shaohua
>> ----- Original Message -----
>> From: "Benbo" <btemperton at googlemail.com>
>> To: <Bioperl-l at lists.open-bio.org>
>> Sent: Sunday, August 03, 2008 4:05 AM
>> Subject: [Bioperl-l] Finding possible primers regex
>>
>>
>>> Hi there,
>>> I'm trying to write a perl script to scan an aligned multiple entry fasta
>>> file and find possible primers. So far I've produced a string which contains
>>> bases which match all sequences and * where they don't match e.g.
>>> 1) TTAGCCTAA
>>> 2) TTAGCAGAA
>>> 3) TTACCCTAA
>>>
>>> would give TTA*C**AA.
>>>
>>> I want to parse this string and pull out all sequences which are 18-21 bp in
>>> length and have no more than 4 * in them.
>>>
>>> So far, I've got this:
>>>
>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>> print "$1\n";
>>> }
>>>
>>> hoping to match all fragments 18-21 characters in length. However, even that
>>> doesn't work: it essentially chunks the string into 21-character blocks,
>>> rather than the overlapping windows I hoped for:
>>> 0-18
>>> 0-19
>>> 0-20
>>> 0-21
>>> 1-19
>>> 1-20
>>> 1-21
>>> 1-22
>>>
>>> etc.
>>>
>>> Can anyone let me know whether this is already possible in BioPerl, or how
>>> one would go about it with a regex? Sadly, I'm fairly new to Perl and still
>>> getting to grips with BioPerl, so please treat me gently :).
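
The chunking happens because a /g match resumes where the previous match ended, and the greedy {18,21} quantifier takes 21 characters each time. One common way to get overlapping windows is to put the capture inside a zero-width lookahead and try each length separately. The sketch below is only that, a sketch: the helper name candidate_windows is made up, and the short window sizes in the demonstration are chosen so the toy string from the question produces readable output.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, fragment] pairs for every window of
# $min..$max characters that contains at most $max_stars '*' characters.
sub candidate_windows {
    my ($consensus, $min, $max, $max_stars) = @_;
    my @hits;
    for my $len ($min .. $max) {
        # The lookahead consumes nothing, so the next attempt starts one
        # character further on and every overlapping window is visited.
        while ($consensus =~ /(?=([GTAC*]{$len}))/g) {
            my $frag  = $1;
            my $start = $-[0];                # 0-based offset of this window
            my $stars = ($frag =~ tr/*//);    # count the mismatch columns
            push @hits, [$start, $frag] if $stars <= $max_stars;
        }
    }
    return @hits;
}

# Toy run on the consensus from the question: windows of 4-5 characters
# with at most one '*'.
for my $hit (candidate_windows('TTA*C**AA', 4, 5, 1)) {
    print "$hit->[0]\t$hit->[1]\n";
}
# Prints:
# 0	TTA*
# 1	TA*C
# 0	TTA*C
```

For the case described in the question, the call would be candidate_windows($fragment_match, 18, 21, 4). A plain substr() loop over start positions and lengths does the same job without a regex, if that reads more clearly.
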
>>>
>>> Many thanks,
>>>
>>> Ben
>>>
>>>
>>>
>>> --
>>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html
>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
> 


-- 
___________________________________
Diego Mauricio Riaño Pachón
Biologist - PhD student
AG Mueller-Roeber
Institute for Biochemistry and Biology
University of Potsdam

Address: Karl-Liebknecht-Str. 24-25
	 Haus 20
	 14476 Golm
	 Germany

Tel:	 +49 331 977 2809
Fax:	 +49 331 977 2512

web:	http://www.geocities.com/dmrp.geo


