[Biojava-dev] How to read a protein seq alignment file (FASTA)

Andreas Prlic andreas at sdsc.edu
Thu Jul 5 15:35:45 UTC 2012


Hi Sula,


> Are there methods/libraries in BioJava to read files from the PDB
> database and the PISA DB
> (http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html)?
>


BioJava has many features for working with 3D protein structures, and PDB
parsing has been around for quite some time.

http://biojava.org/wiki/BioJava:CookBook#Protein_Structure

Regarding PISA, it depends on what you need. If your goal is to re-create the
biological assembly, BioJava can help. However, it does not use
the original PISA files, only what is archived in the PDB/mmCIF files.
In fact, I am currently working on support for biological assemblies, and it
will be announced shortly. If you need lower-level access to PISA files,
there is currently no parser for them; it would be an interesting addition,
though, and we would accept patches for it.

Andreas




>
> thx
>
> SR
>
> On Mon, Jul 2, 2012 at 7:47 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> > Thanks, Don. This looks good to me and I committed this new constructor to
> > SVN. If you want to also send over the rest of the code to build up the
> > profile from an aligned FASTA, I'll be happy to patch that too...
> >
> > Andreas
> >
> >
> >
> > On Mon, Jul 2, 2012 at 4:22 PM, Don Naki <dnaki1 at cox.net> wrote:
> >
> >> Hi Andreas, essentially, you are right. I *think* it's possible to create
> >> a profile containing many sequences as long as the biojava API is used to
> >> construct the profile. The issue is constructing a profile from previously
> >> aligned sequences, i.e. using a pre-existing alignment file.
> >>
> >> It would be really nice if there was a reader class that allowed one to
> >> read a protein Fasta alignment file and create a Profile directly from the
> >> already aligned sequences.
> >>
> >> There doesn't appear to be such a reader (unless I've not found it).
> >> However, there is a fasta reader that will read the aligned sequences in
> >> the fasta alignment file and create ProteinSequence objects. OK, so I
> >> figure now all I have to do is convert these ProteinSequence objects to
> >> AlignedSequence objects and use the AlignedSequences to populate a Profile.
> >> So I convert the ProteinSequence objects to String before manually creating
> >> AlignedSequence objects, (inelegant, but there doesn't seem to be another
> >> way unless I'm missing something). Now the problem is that there is no way
> >> to construct a Profile from these aligned sequences if you have more than
> >> two of them.
> >>
> >> Looking at the source code for SimpleProfile, there's no inherent
> >> limitation on the number of aligned sequence members; it's just that there
> >> are no constructors or mutators that accept a collection of
> >> AlignedSequences.
> >> I took a stab at such a constructor; it seems to work fine, but I haven't
> >> tested it with biojava classes that interact with SimpleProfile. Any chance
> >> someone could evaluate this and consider adding it to SimpleProfile?
> >> Perhaps then that reader class would be the next step ;-)
> >>
> >> Many thanks,
> >> Don
> >>
> >>         /**
> >>          * Creates a profile for the already aligned sequences.
> >>          * @param alignedSequences the already aligned sequences
> >>          * @throws IllegalArgumentException if aligned sequences differ in
> >>          * length or collection is empty.
> >>          */
> >>         public SimpleProfile(Collection<AlignedSequence<S,C>> alignedSequences) {
> >>             list = new ArrayList<AlignedSequence<S,C>>();
> >>             originals = new ArrayList<S>();
> >>
> >>             Iterator<AlignedSequence<S,C>> itr = alignedSequences.iterator();
> >>             if(!itr.hasNext()) {
> >>                 throw new IllegalArgumentException("alignedSequences must not be empty");
> >>             }
> >>
> >>             // The first aligned sequence fixes the profile length.
> >>             AlignedSequence<S, C> curAlignedSeq = itr.next();
> >>             length = curAlignedSeq.getLength();
> >>             list.add(curAlignedSeq);
> >>             originals.add((S) curAlignedSeq.getOriginalSequence());
> >>
> >>             // Every remaining sequence must have the same aligned length.
> >>             while (itr.hasNext()) {
> >>                 curAlignedSeq = itr.next();
> >>                 if (curAlignedSeq.getLength() != length) {
> >>                     throw new IllegalArgumentException("Aligned sequences differ in size");
> >>                 }
> >>                 list.add(curAlignedSeq);
> >>                 originals.add((S) curAlignedSeq.getOriginalSequence());
> >>             }
> >>             // Freeze both lists once they are populated.
> >>             list = Collections.unmodifiableList(list);
> >>             originals = Collections.unmodifiableList(originals);
> >>         }
> >>
> >> On Jul 2, 2012, at 1:17 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> >>
> >> Is the problem that the SimpleProfile class makes it difficult to
> >> re-create an instance with custom data, because there are no
> >> set-methods?
> >>
> >> Andreas
> >>
> >>
> >> On Mon, Jul 2, 2012 at 9:28 AM, Spencer Bliven <sbliven at ucsd.edu> wrote:
> >>
> >> Don–
> >>
> >> I was trying to do this a while ago and got stuck in the same place. I
> >> assumed that someone intended to implement a multiple alignment Profile,
> >> but never got around to it. I didn't have the time to implement it properly
> >> so I ended up just working with lists of ProteinSequences. It's possible
> >> that this is implemented as a subclass of one of the multiple alignment
> >> algorithms or something. If not, this is definitely a hole in BioJava that
> >> should be filled.
> >>
> >> -Spencer
> >>
> >>
> >> On Fri, Jun 29, 2012 at 11:22 AM, <dnaki1 at cox.net> wrote:
> >>
> >> Hi,
> >>
> >> I would like to use biojava 3 to read a protein multiple sequence
> >> alignment file in FASTA format containing 5 sequences.
> >>
> >> Is this possible? It appears Profile<S,C> is the alignment interface, but
> >> I can't find an implementation that allows me to add more than 2 aligned
> >> sequences.
> >>
> >> Any help appreciated. Thanks
> >>
> >> Don Naki
> >>
> >> _______________________________________________
> >>
> >> biojava-dev mailing list
> >>
> >> biojava-dev at lists.open-bio.org
> >>
> >> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> >>
> >
>
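
For anyone reading this thread later: the aligned-FASTA step discussed in the
quoted messages (read the rows, keep the gap characters, and reject rows of
different length, as the proposed constructor does) can be sketched in plain
Java without any BioJava classes. This is only a minimal sketch under stated
assumptions: AlignedFastaSketch and parse are hypothetical names, not BioJava
API, and converting the rows into AlignedSequence objects for the new
SimpleProfile constructor is deliberately left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of BioJava: turns the lines of an aligned
// FASTA file into header -> gapped row, rejecting rows of unequal length.
final class AlignedFastaSketch {

    static Map<String, String> parse(List<String> lines) {
        Map<String, StringBuilder> rows = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines between records
            }
            if (line.startsWith(">")) {
                String header = line.substring(1).trim();
                if (rows.containsKey(header)) {
                    throw new IllegalArgumentException("Duplicate header: " + header);
                }
                current = new StringBuilder();
                rows.put(header, current);
            } else if (current == null) {
                throw new IllegalArgumentException("Sequence data before first header");
            } else {
                current.append(line); // a row may span several lines
            }
        }
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("Alignment must not be empty");
        }

        // Same check as the proposed constructor: all rows share one length.
        Map<String, String> result = new LinkedHashMap<>();
        int length = -1;
        for (Map.Entry<String, StringBuilder> e : rows.entrySet()) {
            String row = e.getValue().toString();
            if (length < 0) {
                length = row.length();
            } else if (row.length() != length) {
                throw new IllegalArgumentException("Aligned sequences differ in size");
            }
            result.put(e.getKey(), row);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Reads the file given as first argument, else a small inline sample.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList(">seq1", "MKT-AYIA", ">seq2", "MK-GAYLA", ">seq3", "MKTGA-IA");
        for (Map.Entry<String, String> e : parse(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The rows come back in file order with their gap characters intact, so each
one could then be wrapped in an AlignedSequence and the whole collection
passed to the constructor from the quoted message.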
