[Biojava-l] Biojava-l Digest, Vol 104, Issue 6

Mon Sep 19 16:35:07 UTC 2011

Hi,

Take a look at http://en.wikipedia.org/wiki/Levenshtein_distance
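
In case it helps, here is a minimal sketch of the Levenshtein (edit) distance in plain Java. It does not use any BioJava classes; the class name Levenshtein and the method name distance are hypothetical, chosen only for this illustration. It works on any strings, so the alphabet is arbitrary.

```java
// Hypothetical helper, not part of BioJava: classic dynamic-programming
// edit distance using two rolling rows.
final class Levenshtein {

    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b
        // takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so prev always holds the row just completed.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(Levenshtein.distance("ABCE", "ADE"));
    }
}
```

Note that this gives a pairwise distance only, not a multiple alignment.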

Regards,

khalil

On 19 Sep 2011, at 18:00, biojava-l-request at lists.open-bio.org wrote:

> Send Biojava-l mailing list submissions to
> 	biojava-l at lists.open-bio.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://lists.open-bio.org/mailman/listinfo/biojava-l
> or, via email, send a message with subject or body 'help' to
> 	biojava-l-request at lists.open-bio.org
> 
> You can reach the person managing the list at
> 	biojava-l-owner at lists.open-bio.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Biojava-l digest..."
> 
> 
> Today's Topics:
> 
>   1. Re: [Biojava-dev] A question about multiple alignment
>      (Andreas Prlic)
>   2. UniprotParser (Saif Ur-Rehman)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Sun, 18 Sep 2011 16:50:27 -0700
> From: Andreas Prlic <andreas at sdsc.edu>
> Subject: Re: [Biojava-l] [Biojava-dev] A question about multiple
> 	alignment
> To: Shahab Kamali <skamali at cs.uwaterloo.ca>
> Cc: biojava-l at biojava.org
> Message-ID:
> 	<CALthepxeBhoVSpzC3Yvu1_+15OurcEyeZsYAuX8qm1MNh-dXzQ at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi Shahab,
> 
> Sounds like you want to use an identity matrix for the alignment.
> 
> Andreas
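
As a rough illustration of what identity scoring means, the sketch below computes a global (Needleman-Wunsch) alignment score over arbitrary characters in plain Java. It is not the BioJava API: the class IdentityAlignment, the method score, and the parameters match, mismatch, and gap are all assumptions made for this example. The gap value is added once per gap position, so a penalty is passed as a negative number.

```java
// Hypothetical sketch, not BioJava code: global alignment score where
// equal symbols score `match` and different symbols score `mismatch`.
final class IdentityAlignment {

    static int score(String a, String b, int match, int mismatch, int gap) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];

        // Aligning a prefix against nothing costs one gap per symbol.
        for (int i = 1; i <= a.length(); i++) {
            dp[i][0] = dp[i - 1][0] + gap;
        }
        for (int j = 1; j <= b.length(); j++) {
            dp[0][j] = dp[0][j - 1] + gap;
        }

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                boolean same = a.charAt(i - 1) == b.charAt(j - 1);
                int diagonal = dp[i - 1][j - 1] + (same ? match : mismatch);
                int gapInB = dp[i - 1][j] + gap; // symbol of a against a gap
                int gapInA = dp[i][j - 1] + gap; // symbol of b against a gap
                dp[i][j] = Math.max(diagonal, Math.max(gapInB, gapInA));
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(IdentityAlignment.score("ABCE", "ADE", 1, 0, 0));
    }
}
```

With match = 1, mismatch = 0, and gap = 0, pairing B with D scores the same as gapping both, so the two layouts tie; a negative mismatch value would be one way to make the gapped layout score strictly higher.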
> 
> On Sat, Sep 17, 2011 at 3:28 PM, Shahab Kamali <skamali at cs.uwaterloo.ca> wrote:
>> Thanks Andreas,
>> I want two components that have different names to have an alignment score of 0.
>> My application is not about bio-compounds, so I can use something other
>> than ProteinSequence and AminoAcidCompound. I just need to align sequences
>> over arbitrary alphabets. Could you suggest a solution, please?
>> Thanks a lot,
>> Shahab
>> 
>> Quoting Andreas Prlic <andreas at sdsc.edu>:
>> 
>>> Hi Shahab,
>>> 
>>> Did you check whether the substitution matrix scores your sequences
>>> the way you expect? It looks like in your theoretical example the
>>> alignment of B and D is favorable, i.e. it has a positive alignment
>>> score.
>>> 
>>> Andreas
>>> 
>>> 
>>> On Fri, Sep 16, 2011 at 10:56 AM, Shahab Kamali <skamali at cs.uwaterloo.ca>
>>> wrote:
>>>> 
>>>> Hi,
>>>> I am using BioJava in a pattern mining project. I want to align a set of
>>>> relatively short sequences, for example {"ABCE", "ABCE", "ADE", "ADE"}.
>>>> 
>>>> This is a part of my code:
>>>> 
>>>> SubstitutionMatrix<AminoAcidCompound> matrix =
>>>>     new SimpleSubstitutionMatrix<AminoAcidCompound>();
>>>> GuideTree<ProteinSequence, AminoAcidCompound> gt =
>>>>     new GuideTree<ProteinSequence, AminoAcidCompound>(lst,
>>>>         Alignments.getAllPairsScorers(lst,
>>>>             Alignments.PairwiseSequenceScorerType.GLOBAL,
>>>>             new SimpleGapPenalty((short) 0, (short) 0), matrix));
>>>> Profile<ProteinSequence, AminoAcidCompound> profile =
>>>>     Alignments.getProgressiveAlignment(gt,
>>>>         Alignments.ProfileProfileAlignerType.GLOBAL,
>>>>         new SimpleGapPenalty((short) 0, (short) 0), matrix);
>>>> 
>>>> The result of the above code is:
>>>> ABCE
>>>> ABCE
>>>> AD-E
>>>> AD-E
>>>> 
>>>> But what I need is
>>>> A-BCE
>>>> A-BCE
>>>> AD--E
>>>> AD--E
>>>> or
>>>> ABC-E
>>>> ABC-E
>>>> A--DE
>>>> A--DE
>>>> 
>>>> Do you have any suggestion?
>>>> Thanks,
>>>> Shahab
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 19 Sep 2011 11:09:46 +0100
> From: Saif Ur-Rehman <su24 at st-andrews.ac.uk>
> Subject: [Biojava-l] UniprotParser
> To: biojava-l at biojava.org
> Message-ID:
> 	<CABpZy=wUXJM42NVjmSetwX463hT+B5RLjwc2KP0R00rDiTYD-Q at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Dear all,
> 
> I am having issues with the BioJava UniProt parser as detailed below:
> 
> Code:
> 
> BufferedReader br = new BufferedReader(new FileReader(files[index]));
> Namespace ns = RichObjectFactory.getDefaultNamespace();
> RichSequenceIterator iterator = RichSequence.IOTools.readUniProt(br, ns);
> while (iterator.hasNext())
> {
>     try
>     {
>         RichSequence rs = iterator.nextRichSequence();
>     }
>     catch (NoSuchElementException e)
>     {
>         // ignored
>     }
>     catch (BioException e)
>     {
>         e.printStackTrace();
>     }
> }
> 
> 
> 
> 
> The file I am using is downloaded from the link:
> 
> ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_sprot_fungi.dat.gz
> 
> 
> The problem is that the parser works for a subset of the IDs within the file
> and on others throws an exception.
> 
> Sample Exception stack trace:
> 
> *** Start of trace *************************
> 
> at
> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> at uniprot.mp.main(mp.java:161)
> Caused by: org.biojava.bio.seq.io.ParseException:
> 
> A Exception Has Occurred During Parsing.
> Please submit the details that follow to biojava-l at biojava.org or post a bug
> report to http://bugzilla.open-bio.org/
> 
> Format_object=org.biojavax.bio.seq.io.UniProtFormat
> Accession=P53031
> Id=
> Comments=
> Parse_block=RN   [1]RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].RC   STRAIN=NCYC
> 2512;RX   MEDLINE=97082501; PubMed=8923737;
> DOI=10.1002/(SICI)1097-0061(199610)12:13<1321::AID-YEA27>3.0.CO;2-6;RA
> Rodriguez P.L., Ali R., Serrano R.;RT   "CtCdc55p and CtHa13p: two putative
> regulatory proteins from Candida
> tropicalis with long acidic domains.";RL   Yeast 12:1321-1329(1996).
> Stack trace follows ....
> 
> 
> at
> org.biojavax.bio.seq.io.UniProtFormat.readRichSequence(UniProtFormat.java:615)
> at
> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
> ... 1 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
> at
> org.biojavax.bio.seq.io.UniProtFormat.readRichSequence(UniProtFormat.java:486)
> ... 2 more
> org.biojava.bio.BioException: Could not read sequence
> at
> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
> at uniprot.mp.main(mp.java:161)
> Caused by: org.biojava.bio.seq.io.ParseException: Name has not been supplied
> 
> ********End of trace**********************************
> 
> An example of an Id that worked is:
> 
> ZYM1_SCHPO
> 
> while an ID that didn't work is:
> 
> ZUO1_YEAST
> 
> Thanks a lot in advance.
> 
> Cheers,
> Saif
> 
> 
> -- 
> Saif Ur-Rehman
> 
> Centre for Evolution, Genes and Genomics
> Harold Mitchell Building
> University of St Andrews
> St Andrews
> Fife
> KY16 9TH
> UK
> 
> Tel: +44 131 5572556
> Fax: +44 1334 463366
> 
> 
> ------------------------------
> 
> 
> End of Biojava-l Digest, Vol 104, Issue 6
> *****************************************