[Biojava-l] Biojava-l Digest, Vol 104, Issue 7

JAX jayunit100 at gmail.com
Tue Sep 20 16:09:29 UTC 2011


Pairwise similarity works better than Levenshtein distance for short sequences. Just count the total number of matching letter pairs and divide by the length of the longer of the two strings. There is a great article about this online called "How to Strike a Match".

We used it for the sequence mining here, and were able to find important homologs and reproduce known results:
http://jb.asm.org/cgi/content/short/JB.00018-11v1
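
The pair-counting measure described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not code from the article or from the paper: "letter pairs" are taken to mean adjacent two-letter substrings, each pair may be matched at most once, and the class and method names (LetterPairSimilarity, letterPairs, similarity) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the letter-pair similarity described above (hypothetical names, not BioJava code). */
class LetterPairSimilarity {

    /** Returns all adjacent two-letter pairs, e.g. "ABCE" gives [AB, BC, CE]. */
    static List<String> letterPairs(String s) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            pairs.add(s.substring(i, i + 2));
        }
        return pairs;
    }

    /**
     * Counts the letter pairs shared by both strings (each pair is used at
     * most once) and divides by the length of the longer string.
     */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 0.0; // assumption: two empty strings score 0
        }
        List<String> remaining = letterPairs(b);
        int matches = 0;
        for (String pair : letterPairs(a)) {
            if (remaining.remove(pair)) { // remove(Object) drops one occurrence
                matches++;
            }
        }
        return (double) matches / longest;
    }

    public static void main(String[] args) {
        System.out.println(similarity("ABCE", "ABCE")); // 3 shared pairs / length 4
        System.out.println(similarity("ABCE", "ADE"));  // no shared pairs
    }
}
```

Note that with this normalization two identical strings of length n score (n-1)/n rather than 1. Dividing twice the shared count by the total number of pairs in both strings would avoid that.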

Jay Vyas 
MMSB
UCHC

On Sep 20, 2011, at 12:00 PM, biojava-l-request at lists.open-bio.org wrote:

> Send Biojava-l mailing list submissions to
>    biojava-l at lists.open-bio.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>    http://lists.open-bio.org/mailman/listinfo/biojava-l
> or, via email, send a message with subject or body 'help' to
>    biojava-l-request at lists.open-bio.org
> 
> You can reach the person managing the list at
>    biojava-l-owner at lists.open-bio.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Biojava-l digest..."
> 
> 
> Today's Topics:
> 
>   1. Re: Biojava-l Digest, Vol 104, Issue 6 (Khalil El Mazouari)
>   2. why can't biojava fold RNA? (quan zou)
>   3. Re: why can't biojava fold RNA? (Andreas Prlic)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 19 Sep 2011 18:35:07 +0200
> From: Khalil El Mazouari <khalil.elmazouari at gmail.com>
> Subject: Re: [Biojava-l] Biojava-l Digest, Vol 104, Issue 6
> To: biojava-l at lists.open-bio.org
> Message-ID: <B79797BF-D30D-450B-9606-B44F70EFF5BA at gmail.com>
> Content-Type: text/plain; charset=us-ascii
> 
> Hi 
> 
> take a look at http://en.wikipedia.org/wiki/Levenshtein_distance
> 
> Regards,
> 
> khalil
> 
> 
> 
> On 19 Sep 2011, at 18:00, biojava-l-request at lists.open-bio.org wrote:
> 
>> 
>> Today's Topics:
>> 
>>  1. Re: [Biojava-dev] A question about multiple alignment
>>     (Andreas Prlic)
>>  2. UniprotParser (Saif Ur-Rehman)
>> 
>> 
>> ----------------------------------------------------------------------
>> 
>> Message: 1
>> Date: Sun, 18 Sep 2011 16:50:27 -0700
>> From: Andreas Prlic <andreas at sdsc.edu>
>> Subject: Re: [Biojava-l] [Biojava-dev] A question about multiple
>>    alignment
>> To: Shahab Kamali <skamali at cs.uwaterloo.ca>
>> Cc: biojava-l at biojava.org
>> Message-ID:
>>    <CALthepxeBhoVSpzC3Yvu1_+15OurcEyeZsYAuX8qm1MNh-dXzQ at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>> 
>> Hi Shahab,
>> 
>> Sounds like you want to use an identity matrix for the alignment.
>> 
>> Andreas
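
To illustrate what identity scoring does, here is a minimal pairwise sketch in plain Java. It does not use the BioJava API and it is not the progressive multiple alignment from the thread. The scores are assumptions chosen to match the discussion: 1 for identical symbols, 0 for different symbols, and free gaps as in the quoted SimpleGapPenalty((short)0,(short)0). The class and method names are hypothetical.

```java
/** Sketch: global pairwise alignment with identity scoring (plain Java, no BioJava). */
class IdentityAlignment {

    static final int MATCH = 1;    // identical symbols score 1
    static final int MISMATCH = 0; // different symbols score 0
    static final int GAP = 0;      // free gaps, as in the gap penalty used in the thread

    /** Fills the Needleman-Wunsch score table; table[i][j] covers a[0..i) and b[0..j). */
    static int[][] scoreTable(String a, String b) {
        int[][] table = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            table[i][0] = table[i - 1][0] + GAP;
        }
        for (int j = 1; j <= b.length(); j++) {
            table[0][j] = table[0][j - 1] + GAP;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int pair = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                int best = Math.max(table[i - 1][j] + GAP, table[i][j - 1] + GAP);
                table[i][j] = Math.max(best, table[i - 1][j - 1] + pair);
            }
        }
        return table;
    }

    /**
     * Traces back one optimal alignment, pairing only identical symbols.
     * The traceback relies on MISMATCH and GAP both being 0, so a mismatched
     * column can always be replaced by two gap columns at no cost.
     */
    static String[] align(String a, String b) {
        int[][] table = scoreTable(a, b);
        StringBuilder top = new StringBuilder();
        StringBuilder bottom = new StringBuilder();
        int i = a.length();
        int j = b.length();
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && a.charAt(i - 1) == b.charAt(j - 1)
                    && table[i][j] == table[i - 1][j - 1] + MATCH) {
                top.append(a.charAt(--i));
                bottom.append(b.charAt(--j));
            } else if (i > 0 && table[i][j] == table[i - 1][j] + GAP) {
                top.append(a.charAt(--i)); // symbol of a against a gap
                bottom.append('-');
            } else {
                top.append('-');           // symbol of b against a gap
                bottom.append(b.charAt(--j));
            }
        }
        return new String[] {top.reverse().toString(), bottom.reverse().toString()};
    }

    public static void main(String[] args) {
        String[] aligned = align("ABCE", "ADE");
        System.out.println(aligned[0]);
        System.out.println(aligned[1]);
    }
}
```

For "ABCE" and "ADE" this prints A-BCE over AD--E: because a mismatch earns nothing, B and D are never placed in the same column.
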
>> 
>> On Sat, Sep 17, 2011 at 3:28 PM, Shahab Kamali <skamali at cs.uwaterloo.ca> wrote:
>>> Thanks Andreas,
>>> I want two components that have different names to have an alignment score of 0.
>>> My application is not about bio-compounds, so I can use something other
>>> than ProteinSequence and AminoAcidCompound. I just need to align sequences
>>> of arbitrary alphabets. Could you suggest a solution, please?
>>> Thanks a lot,
>>> Shahab
>>> 
>>> Quoting Andreas Prlic <andreas at sdsc.edu>:
>>> 
>>>> Hi Shahab,
>>>> 
>>>> did you take a look at the substitution matrix to check whether it is
>>>> scoring your sequences according to your expectation? Looks like in your
>>>> theoretical example the alignment of B and D is favorable, i.e. it has
>>>> a positive alignment score.
>>>> 
>>>> Andreas
>>>> 
>>>> 
>>>> On Fri, Sep 16, 2011 at 10:56 AM, Shahab Kamali <skamali at cs.uwaterloo.ca>
>>>> wrote:
>>>>> 
>>>>> Hi,
>>>>> I am using BioJava in a pattern mining project. I want to align a set of
>>>>> relatively short sequences. For example, to align {"ABCE", "ABCE", "ADE",
>>>>> "ADE"}.
>>>>> 
>>>>> This is a part of my code:
>>>>> 
>>>>> SubstitutionMatrix<AminoAcidCompound> matrix =
>>>>>         new SimpleSubstitutionMatrix<AminoAcidCompound>();
>>>>> GuideTree<ProteinSequence, AminoAcidCompound> gt =
>>>>>         new GuideTree<ProteinSequence, AminoAcidCompound>(lst,
>>>>>                 Alignments.getAllPairsScorers(lst,
>>>>>                         Alignments.PairwiseSequenceScorerType.GLOBAL,
>>>>>                         new SimpleGapPenalty((short) 0, (short) 0), matrix));
>>>>> Profile<ProteinSequence, AminoAcidCompound> profile =
>>>>>         Alignments.getProgressiveAlignment(gt,
>>>>>                 Alignments.ProfileProfileAlignerType.GLOBAL,
>>>>>                 new SimpleGapPenalty((short) 0, (short) 0), matrix);
>>>>> 
>>>>> The result of the above code is:
>>>>> ABCE
>>>>> ABCE
>>>>> AD-E
>>>>> AD-E
>>>>> 
>>>>> But what I need is
>>>>> A-BCE
>>>>> A-BCE
>>>>> AD--E
>>>>> AD--E
>>>>> or
>>>>> ABC-E
>>>>> ABC-E
>>>>> A--DE
>>>>> A--DE
>>>>> 
>>>>> Do you have any suggestion?
>>>>> Thanks,
>>>>> Shahab
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> biojava-dev mailing list
>>>>> biojava-dev at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> ------------------------------
>> 
>> Message: 2
>> Date: Mon, 19 Sep 2011 11:09:46 +0100
>> From: Saif Ur-Rehman <su24 at st-andrews.ac.uk>
>> Subject: [Biojava-l] UniprotParser
>> To: biojava-l at biojava.org
>> Message-ID:
>>    <CABpZy=wUXJM42NVjmSetwX463hT+B5RLjwc2KP0R00rDiTYD-Q at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>> 
>> Dear all,
>> 
>> I am having issues with the BioJava UniProt parser as detailed below:
>> 
>> Code:
>> 
>> BufferedReader br = new BufferedReader(new FileReader(files[index]));
>> Namespace ns = RichObjectFactory.getDefaultNamespace();
>> RichSequenceIterator iterator = RichSequence.IOTools.readUniProt(br, ns);
>> while (iterator.hasNext())
>> {
>>     try
>>     {
>>         RichSequence rs = iterator.nextRichSequence();
>>     }
>>     catch (NoSuchElementException e)
>>     {
>>         // ignored
>>     }
>>     catch (BioException e)
>>     {
>>         e.printStackTrace();
>>     }
>> }
>> 
>> 
>> 
>> 
>> The file I am using is downloaded from the link:
>> 
>> ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_sprot_fungi.dat.gz
>> 
>> 
>> The problem is that the parser works for a subset of the IDs within the file
>> and on others throws an exception.
>> 
>> Sample Exception stack trace:
>> 
>> *** Start of trace *************************
>> 
>> at
>> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>> at uniprot.mp.main(mp.java:161)
>> Caused by: org.biojava.bio.seq.io.ParseException:
>> 
>> A Exception Has Occurred During Parsing.
>> Please submit the details that follow to biojava-l at biojava.org or post a bug
>> report to http://bugzilla.open-bio.org/
>> 
>> Format_object=org.biojavax.bio.seq.io.UniProtFormat
>> Accession=P53031
>> Id=
>> Comments=
>> Parse_block=RN   [1]RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].RC   STRAIN=NCYC
>> 2512;RX   MEDLINE=97082501; PubMed=8923737;
>> DOI=10.1002/(SICI)1097-0061(199610)12:13<1321::AID-YEA27>3.0.CO;2-6;RA
>> Rodriguez P.L., Ali R., Serrano R.;RT   "CtCdc55p and CtHa13p: two putative
>> regulatory proteins from Candida
>> tropicalis with long acidic domains.";RL   Yeast 12:1321-1329(1996).
>> Stack trace follows ....
>> 
>> 
>> at
>> org.biojavax.bio.seq.io.UniProtFormat.readRichSequence(UniProtFormat.java:615)
>> at
>> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>> ... 1 more
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>> at
>> org.biojavax.bio.seq.io.UniProtFormat.readRichSequence(UniProtFormat.java:486)
>> ... 2 more
>> org.biojava.bio.BioException: Could not read sequence
>> at
>> org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>> at uniprot.mp.main(mp.java:161)
>> Caused by: org.biojava.bio.seq.io.ParseException: Name has not been supplied
>> 
>> ********End of trace**********************************
>> 
>> An example of an Id that worked is:
>> 
>> ZYM1_SCHPO
>> 
>> while an ID that didn't work is:
>> 
>> ZUO1_YEAST
>> 
>> Thanks a lot in advance.
>> 
>> Cheers,
>> Saif
>> 
>> 
>> -- 
>> Saif Ur-Rehman
>> 
>> Centre for Evolution, Genes and Genomics
>> Harold Mitchell Building
>> University of St Andrews
>> St Andrews
>> Fife
>> KY16 9TH
>> UK
>> 
>> Tel: +44 131 5572556
>> Fax: +44 1334 463366
>> 
>> 
>> ------------------------------
>> 
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> 
>> 
>> End of Biojava-l Digest, Vol 104, Issue 6
>> *****************************************
> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Tue, 20 Sep 2011 12:18:55 +0800
> From: quan zou <guoer713108 at gmail.com>
> Subject: [Biojava-l] why can't biojava fold RNA?
> To: biojava-l at lists.open-bio.org
> Message-ID:
>    <CAOq1OFQaPuGvLwxgP8ZF2RM2EvHXmXEWO5z-6yLz5++9QpRMew at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Dear all,
> 
>        Is there any Java program or jar that can fold an RNA sequence into a
> secondary structure, such as RNAfold?
> 
>       Why has RNAfold / the Vienna Package not been included in BioJava?
> 
>                 Quan
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Tue, 20 Sep 2011 08:11:58 -0700
> From: Andreas Prlic <andreas at sdsc.edu>
> Subject: Re: [Biojava-l] why can't biojava fold RNA?
> To: quan zou <guoer713108 at gmail.com>
> Cc: biojava-l at biojava.org
> Message-ID:
>    <CALthepzVkLuwgt4mEc_=7NnZu7tQDws8PbOV-YdUCHdD8oS7wg at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> If all your code is in Java and you have binaries for some external
> software you can easily wrap it from Java and trigger the execution.
> 
> Andreas
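
Wrapping an external binary from Java, as suggested above, could look roughly like this. It is a sketch under assumptions: it needs Java 9 or later, it assumes an RNAfold executable is installed and on the PATH, and it assumes RNAfold reads a sequence from standard input and prints the predicted structure to standard output. Depending on the version, RNAfold may also write a PostScript drawing into the working directory. The ExternalTool class and its run method are hypothetical helpers, not part of BioJava.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Sketch: run an external program from Java, feeding it text on stdin. */
class ExternalTool {

    /**
     * Starts the command, writes the input to its stdin and returns everything
     * it prints. Suitable for small inputs only: all input is written before
     * any output is read.
     */
    static String run(List<String> command, String input)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();
        try (OutputStream stdin = process.getOutputStream()) {
            stdin.write(input.getBytes(StandardCharsets.UTF_8));
        } // closing stdin signals end of input to the child process
        String output = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + output);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: an RNAfold binary is installed and on the PATH.
        System.out.println(run(List.of("RNAfold"), "GGGAAAUCC\n"));
    }
}
```

The returned text still has to be parsed by the caller; the exact output layout depends on the tool and its version.
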
> 
> On Tue, Sep 20, 2011 at 2:09 AM, quan zou <guoer713108 at gmail.com> wrote:
>> Thanks. However, there is no Java code, so it cannot be imported into my Java
>> project.
>> 
>> 2011/9/20 Andreas Prlic <andreas at sdsc.edu>
>>> 
>>> Hi Quan,
>>> 
>>> the Vienna RNA package is available as open source. Did you take a look
>>> at it?
>>> 
>>> Andreas
>>> 
>>> 
>>> On Mon, Sep 19, 2011 at 9:18 PM, quan zou <guoer713108 at gmail.com> wrote:
>>>> Dear all,
>>>> 
>>>>        Is there any Java program or jar that can fold an RNA sequence into a
>>>> secondary structure, such as RNAfold?
>>>> 
>>>>       Why has RNAfold / the Vienna Package not been included in BioJava?
>>>> 
>>>>                 Quan
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> 
>> 
>> 
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> 
> End of Biojava-l Digest, Vol 104, Issue 7
> *****************************************



