[Biojava-dev] List vs LinkedHashMap

Andy Yates ayates at ebi.ac.uk
Sat Feb 20 10:47:40 UTC 2010


Hey guys,

Everything Scooter has done here is fantastic, and the thought behind
the loader abstractions is really good. I especially like the idea of
being able to pull sequences from external resources. One thing that
has always annoyed me is that, when I want to do some coding on a
peptide sequence, I have to download it from, say, UniProt/UniParc,
save it into a file, read it & then do something. A store backed by
the large sequence repositories is a great win. I think the ones to
target first are UniProt, eFetch & dbfetch. The last 2 are very
important because they give us access to a huge number of databases
through a single interface. One thing to remember when writing these
classes is that Java's built-in HttpURLConnection code was always
buggy and would leak sockets under some circumstances if you did not
always read both the response and the error streams. But I'm sure the
IO utility code we've got can be modified to ensure we don't leak
them :).
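
As a rough illustration of the stream-draining point, here is a minimal
sketch. HttpFetch and its methods are hypothetical names, not existing
BioJava IO utilities; the only library calls used are the standard
HttpURLConnection.getInputStream() and getErrorStream().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of BioJava): fetches a URL and always drains both streams. */
final class HttpFetch {

    /** Reads a stream to the end and closes it; a null stream yields an empty array. */
    static byte[] drain(InputStream in) throws IOException {
        if (in == null) {
            return new byte[0];
        }
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = stream.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Returns the response body as text; on failure, drains the error stream and rethrows. */
    static String fetch(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            return new String(drain(conn.getInputStream()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // getErrorStream() may return null; drain() tolerates that.
            drain(conn.getErrorStream());
            throw e;
        }
    }
}
```

The key design choice is that the error path reads and closes the
error stream before rethrowing, so neither stream is left unread.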

As for what I've been doing: I've pushed everything into some
lower-level classes, so in order to go from DNA to RNA you instantiate
a class which can handle nucleotides. If you're in a DNASequence then
there are already methods on there to go to RNA & from RNA to Protein.
All the code which does this is held in other classes, so it's all
design by composition rather than inheritance.
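
To make the composition idea concrete, here is a small sketch. Both
class names are hypothetical and plain Strings stand in for the real
compound-based sequences; it only shows the delegation pattern, not
the actual BioJava classes.

```java
/** Hypothetical low-level class that knows how to turn DNA into RNA. */
final class DnaToRnaTranscriber {
    /** Replaces thymine with uracil, preserving case. */
    String transcribe(String dna) {
        return dna.replace('T', 'U').replace('t', 'u');
    }
}

/** Hypothetical sequence class: it holds a transcriber instead of inheriting the logic. */
final class SimpleDnaSequence {
    private final String bases;
    private final DnaToRnaTranscriber transcriber;

    SimpleDnaSequence(String bases, DnaToRnaTranscriber transcriber) {
        this.bases = bases;
        this.transcriber = transcriber;
    }

    /** Convenience method on the sequence; the real work is delegated. */
    String getRnaSequence() {
        return transcriber.transcribe(bases);
    }
}
```

Because the sequence only delegates, the transcriber can be swapped or
reused elsewhere without touching the sequence class hierarchy.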

The way I currently imagine moving from one sequence to another is by
registering type-specific features and then offering these
sub-structures using the SequenceView code. So if we had a Gene, the
transcript could be defined by TranscriptSequence.class & when you
request it we can send back a SequenceView with ExonSequence.class
objects registered to give the exons & well, I'm sure you can all see
where it's going. One thing I can't handle ATM is phases, so the code
assumes everything starts in phase 1. For what we've got ATM that's
fine, but later on this needs to be addressed.

I'm also feeling guilty, but it's quite hard finding the time to get
the code down, let alone the documentation. So long as we make sure
there are test cases available, we can see how to use the code when we
get round to documentation.

Andy

On 19 Feb 2010, at 20:53, Scooter Willis wrote:

> Richard
>
> For stream parsing I have abstracted that down to a proxy data
> structure that looks just like ArrayListSequenceBackingStore but
> can keep an offset token into the file stream, where this makes
> sense, for loading very large files without actually keeping
> everything in memory. You pay the price once to get the header
> information, the offset into the stream/file of the start of the
> sequence, and the length. Then if the user makes a call to the
> Sequence to get either the actual sequence data or a subsequence, the
> required sequence is loaded from the stream/file. This doesn't make
> sense for slow I/O-bound streams, where the load penalty would be
> high, but it does work well for file IO via RandomAccessFile seek,
> which is how it is currently implemented. If you have a fasta file
> with 1GB of data but only plan on selecting 10 sequences, and don't
> know what those 10 sequences are at load time, then this works well.
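
The index-once, seek-later idea described above can be sketched
roughly as follows. FastaOffsetIndex is a hypothetical class written
for illustration, not the proxy loader in the BioJava code; it uses
RandomAccessFile.readLine() for simplicity, which is unbuffered and
would be slow on a genuinely large file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: indexes a FASTA file once, then loads sequences on demand. */
final class FastaOffsetIndex {
    private final File file;
    // Header text (without '>') mapped to {start offset, byte length} of its sequence block.
    private final Map<String, long[]> offsets = new LinkedHashMap<>();

    FastaOffsetIndex(File file) throws IOException {
        this.file = file;
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            String header = null;
            long start = 0;      // first byte of the current record's sequence data
            long lineStart = 0;  // first byte of the line about to be read
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (header != null) {
                        offsets.put(header, new long[] {start, lineStart - start});
                    }
                    header = line.substring(1).trim();
                    start = raf.getFilePointer();
                }
                lineStart = raf.getFilePointer();
            }
            if (header != null) {
                offsets.put(header, new long[] {start, raf.length() - start});
            }
        }
    }

    /** Headers in file order. */
    List<String> headers() {
        return new ArrayList<>(offsets.keySet());
    }

    /** Seeks to the recorded offset and reads only that record's sequence. */
    String load(String header) throws IOException {
        long[] span = offsets.get(header);
        if (span == null) {
            throw new IllegalArgumentException("Unknown header: " + header);
        }
        byte[] raw = new byte[(int) span[1]];
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(span[0]);
            raf.readFully(raw);
        }
        // Remove line breaks so the caller gets one contiguous sequence string.
        return new String(raw, StandardCharsets.US_ASCII).replaceAll("\\s+", "");
    }
}
```

Only the headers and two numbers per record stay in memory; the
sequence bytes are read when load() is called.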
>
> This also allows you to load a large genome or genome scaffold file
> and, by implementing the details in SequenceFileProxyLoader, access
> sequence data without loading the entire genome into memory. Here
> are two approaches to loading the same file, found in
> FastaReader.java. The first FastaReader passes in a
> ProteinSequenceCreator that handles the creation of the actual
> protein sequence and its storage. The second test case uses
> FileProxyProteinSequenceCreator, where you need to pass in a
> reference to the File; as the file is parsed once, it simply keeps
> track of the locations. The actual ProteinSequence that gets created
> is a ProteinSequence whose store is a SequenceArrayListProxyLoader
> instead of an ArrayListSequenceBackingStore. I have put together, but
> haven't checked in, a FastaReaderHelper class with static methods to
> hide this detail from someone who simply wants to load a Fasta file.
>
>    // Approach 1: parse the stream and keep every sequence in memory.
>    String inputFile = "src/test/resources/PF00104_small.fasta";
>    FileInputStream is = new FileInputStream(inputFile);
>
>    FastaReader<ProteinSequence> fastaReader =
>        new FastaReader<ProteinSequence>(
>            is,
>            new GenericFastaHeaderParser(),
>            new ProteinSequenceCreator(
>                AminoAcidCompoundSet.getAminoAcidCompoundSet()));
>    Collection<ProteinSequence> proteinSequences = fastaReader.process();
>    is.close();
>
>    System.out.println(proteinSequences);
>
>    // Approach 2: record file locations only; data is loaded on request.
>    File file = new File(inputFile);
>    FastaReader<ProteinSequence> fastaProxyReader =
>        new FastaReader<ProteinSequence>(
>            file,
>            new GenericFastaHeaderParser(),
>            new FileProxyProteinSequenceCreator(
>                file, AminoAcidCompoundSet.getAminoAcidCompoundSet()));
>    Collection<ProteinSequence> proteinProxySequences =
>        fastaProxyReader.process();
>
>    System.out.println(proteinProxySequences);
>
>
> So in the current approach I would always be able to return a  
> collection with knowledge of the header and a sequence that will  
> either have the sequence data or know how to get it. This same  
> concept works for being able to create a ProteinSequence where you  
> can have a UniprotProxyProteinSequenceLoader or  
> NCBIProxyProteinSequenceLoader where you only need to pass in the  
> sequence unique id. The loader can get the sequence detail at the  
> time the ProteinSequence is created or do it lazily when a request  
> is made. This then extends back to genome views of DNASequence data  
> where you don't need to even have the genome local but the  
> appropriate genome sequence proxy loader would do a web services/ 
> REST call to the external server to retrieve the actual sequence or  
> subsequence that is being requested.
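
The lazy variant of such a loader might look like the sketch below.
LazySequenceLoader is a hypothetical name, and the fetcher function
stands in for whatever web services/REST client would actually
retrieve the sequence; neither is an existing BioJava API.

```java
import java.util.function.Function;

/** Hypothetical proxy loader: holds only an id until the sequence is first requested. */
final class LazySequenceLoader {
    private final String accession;
    // Assumed to wrap a remote lookup (e.g. a REST call); supplied by the caller.
    private final Function<String, String> fetcher;
    private String cached;

    LazySequenceLoader(String accession, Function<String, String> fetcher) {
        this.accession = accession;
        this.fetcher = fetcher;
    }

    String getAccession() {
        return accession;
    }

    /** Fetches on the first call only; later calls return the cached value. */
    synchronized String getSequence() {
        if (cached == null) {
            cached = fetcher.apply(accession);
        }
        return cached;
    }
}
```

An eager loader would simply call the fetcher in the constructor
instead, which is the other option mentioned above.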
>
> If you look at the AccessionID class, I keep track of the type of
> accession id, either by recognizing the Fasta file header type or by
> allowing the user to set it. That will make working with features
> very powerful. If you know the accession id and the type of id, then
> making a request to a DAS server, genome annotation server or
> UniProt service to retrieve features is easy. I haven't done that
> code yet, but it is next on my list for a project I am working on.
>
> We also worked on building the proper biological relationships into
> the sequence classes, such that if you start with a DNA sequence
> and apply the various exon/intron features you can have a
> TranscriptSequence that can return a ProteinSequence. In the reverse
> direction you should be able to take a ProteinSequence with a valid
> accession id of a known type and retrieve the parent DNA sequence,
> if that linkage information is available via the appropriate web
> services/REST call. That is part of the design, but going in reverse
> is not implemented yet. You can start with a ChromosomeSequence and
> work your way down by adding introns and exons. Andy has worked hard
> on this code, which will make it really easy to use for programmers
> who don't know all the details.
>
> It has been a month since the BioJava Hackathon and I am feeling
> guilty that I haven't taken the time to write any of this up. Writing
> code is the easy part; doing the documentation is always tough! I
> will see if this email generates a larger discussion on the list and,
> based on how everything shakes out, will turn the discussion into a
> wiki page giving a sequence design overview and code for testing and
> implementing other proxy loaders.
>
> Thanks
>
> Scooter Willis
>
>
>
>
>
>
>
> On Feb 19, 2010, at 2:51 PM, Richard Holland wrote:
>
>> Depends on whether or not you want to parse-at-once or stream- 
>> parse. If the parser is set up to load the whole lot at once, then  
>> a map is fine, otherwise not.
>>
>> On 20 Feb 2010, at 08:30, Scooter Willis wrote:
>>
>>>
>>> I am starting to use the new FastaReader in a project and the
>>> default implementation I set up returns a List<ProteinSequence>.
>>> The very next thing I needed to do was convert it to a
>>> LinkedHashMap<String,ProteinSequence> so I could query the
>>> sequence of interest. It would seem that this is probably a fairly
>>> standard use case. If I returned a
>>> LinkedHashMap<String,ProteinSequence> as the default container
>>> then we would take a slight memory hit from keeping a hash of the
>>> accession ID and a linked list for preserving order.
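
For reference, the List to LinkedHashMap conversion discussed above
can be sketched generically. SequenceMaps and byKey are hypothetical
names; for a List<ProteinSequence> the key function would be whatever
returns the accession ID as a String.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.Function;

/** Hypothetical utility: indexes parsed records by a String key, keeping file order. */
final class SequenceMaps {
    /** Builds an insertion-ordered map; duplicate keys are rejected rather than overwritten. */
    static <T> LinkedHashMap<String, T> byKey(List<T> items, Function<T, String> keyOf) {
        LinkedHashMap<String, T> map = new LinkedHashMap<>();
        for (T item : items) {
            String key = keyOf.apply(item);
            if (map.putIfAbsent(key, item) != null) {
                throw new IllegalArgumentException("Duplicate key: " + key);
            }
        }
        return map;
    }
}
```

Iterating the result still yields the sequences in the order they were
read, while get(key) gives direct lookup by accession.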
>>>
>>> Does anyone have objections to returning the sequences read from a  
>>> Fasta file as a LinkedHashMap?
>>>
>>> Thanks
>>>
>>> Scooter
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>> --
>> Richard Holland, BSc MBCS
>> Operations and Delivery Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>>
>
>

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
