[Biojava-dev] List vs LinkedHashMap
Scooter Willis
HWillis at scripps.edu
Fri Feb 19 20:53:24 UTC 2010
Richard
For stream parsing, I have abstracted that down to a proxy data structure that looks just like ArrayListSequenceBackingStore but keeps an offset into the file stream. This makes sense for loading very large files without keeping everything in memory. You pay the price once to read the header information, the offset into the stream/file where each sequence starts, and its length. Then, if the user asks the Sequence for either the full sequence data or a subsequence, the required data is loaded from the stream/file. This doesn't make sense for slow, IO-bound streams where the load penalty would be high, but it works well for file IO via RandomAccessFile seek, which is how it is currently implemented. If you have a FASTA file with 1GB of data and only plan on selecting 10 sequences, but don't know which 10 at load time, this works well.
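To make the offset idea concrete, here is a minimal sketch of a one-pass index followed by on-demand loading with RandomAccessFile.seek. The class name FastaOffsetIndex and its methods are made up for illustration and are not the BioJava classes; it also assumes an ASCII file and uses the full header line as the key.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of offset-based lazy loading; not the BioJava implementation. */
class FastaOffsetIndex {

    /** Byte range of one record's sequence lines. */
    static final class Entry {
        final long offset; // byte offset of the first sequence line
        final int length;  // bytes up to the next header or end of file (assumes < 2GB per record)
        Entry(long offset, int length) { this.offset = offset; this.length = length; }
    }

    private final Path file;
    private final Map<String, Entry> entries = new LinkedHashMap<>(); // keeps file order

    /** Single pass over the file: remember each header and where its sequence lives. */
    FastaOffsetIndex(Path file) throws IOException {
        this.file = file;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            String header = null;
            long start = 0;                         // start of the current record's sequence
            long lineStart = raf.getFilePointer();  // start of the line about to be read
            String line;
            // readLine() is unbuffered and slow; acceptable for a sketch.
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (header != null) {
                        entries.put(header, new Entry(start, (int) (lineStart - start)));
                    }
                    header = line.substring(1).trim();
                    start = raf.getFilePointer();
                }
                lineStart = raf.getFilePointer();
            }
            if (header != null) {
                entries.put(header, new Entry(start, (int) (raf.length() - start)));
            }
        }
    }

    /** Headers in file order; no sequence data has been kept in memory. */
    Set<String> headers() { return entries.keySet(); }

    /** Seek to the stored offset and read only this record's sequence. */
    String load(String header) throws IOException {
        Entry e = entries.get(header);
        if (e == null) throw new IllegalArgumentException("Unknown header: " + header);
        byte[] buf = new byte[e.length];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(e.offset);
            raf.readFully(buf);
        }
        // Remove line breaks so the caller sees one contiguous sequence string.
        return new String(buf, StandardCharsets.US_ASCII).replaceAll("\\s+", "");
    }
}
```

Only the headers and two numbers per record stay in memory after the first pass. Reading a subsequence directly from disk would also need to account for line breaks inside the record, which this sketch sidesteps by loading the whole record.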
This also allows you to load a large genome or genome scaffold file and, by implementing the details in SequenceFileProxyLoader, access sequence data without loading the entire genome into memory. Here are two approaches to loading the same file, found in FastaReader.java. The first FastaReader is passed a ProteinSequenceCreator that handles the creation of the actual protein sequence and its storage. The second test case uses FileProxyProteinSequenceCreator, where you need to pass in a reference to the File; the file is parsed once and the creator simply keeps track of the locations. What gets created is still a ProteinSequence, but its store is a SequenceArrayListProxyLoader instead of an ArrayListSequenceBackingStore. I have put together, but haven't checked in, a FastReaderHelper class with static methods to hide this detail from someone who simply wants to load a Fasta file.
// Approach 1: parse the stream and hold every sequence in memory.
String inputFile = "src/test/resources/PF00104_small.fasta";
FileInputStream is = new FileInputStream(inputFile);
FastaReader<ProteinSequence> fastaReader = new FastaReader<ProteinSequence>(is, new GenericFastaHeaderParser(), new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()));
Collection<ProteinSequence> proteinSequences = fastaReader.process();
is.close();
System.out.println(proteinSequences);

// Approach 2: parse the file once, keep only headers and offsets,
// and load sequence data from the File when it is requested.
File file = new File(inputFile);
FastaReader<ProteinSequence> fastaProxyReader = new FastaReader<ProteinSequence>(file, new GenericFastaHeaderParser(), new FileProxyProteinSequenceCreator(file, AminoAcidCompoundSet.getAminoAcidCompoundSet()));
Collection<ProteinSequence> proteinProxySequences = fastaProxyReader.process();
System.out.println(proteinProxySequences);
So in the current approach I would always be able to return a collection where each entry knows its header and has a sequence that either holds the sequence data or knows how to get it. The same concept lets you create a ProteinSequence backed by a UniprotProxyProteinSequenceLoader or NCBIProxyProteinSequenceLoader, where you only need to pass in the unique id of the sequence. The loader can fetch the sequence detail when the ProteinSequence is created or do it lazily when a request is made. This extends to genome views of DNASequence data, where you don't even need to have the genome locally: the appropriate genome sequence proxy loader would make a web services/REST call to the external server to retrieve the sequence or subsequence being requested.
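The lazy variant of such a proxy loader can be sketched as follows. This is only an illustration of the pattern under stated assumptions: LazySequence and the fetcher function are hypothetical names standing in for whatever file, Uniprot, or NCBI loader is plugged in, and the 1-based inclusive coordinates in getSubSequence are an assumption of the sketch.

```java
import java.util.function.Function;

/** Hypothetical sketch of a sequence that fetches its residues on first use. */
class LazySequence {
    private final String accession;
    // Stand-in for a real loader (file offset, REST call, ...); assumed, not a BioJava API.
    private final Function<String, String> fetcher;
    private String residues; // stays null until somebody asks for sequence data

    LazySequence(String accession, Function<String, String> fetcher) {
        this.accession = accession;
        this.fetcher = fetcher;
    }

    /** Available immediately; no fetch is triggered. */
    String getAccession() {
        return accession;
    }

    /** Fetches once on first call, then serves the cached copy. */
    synchronized String getSequenceAsString() {
        if (residues == null) {
            residues = fetcher.apply(accession);
        }
        return residues;
    }

    /** Subsequence using 1-based, inclusive start and end positions. */
    String getSubSequence(int start, int end) {
        return getSequenceAsString().substring(start - 1, end);
    }
}
```

An eager loader would call the fetcher in the constructor instead. A remote genome loader would more likely push the start and end positions through to the server rather than fetch the full sequence first, which this sketch does not attempt.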
If you look at the AccessionID class, I keep track of the type of accession id, either by recognizing the Fasta file header type or by allowing the user to set it. That will make working with features very powerful. If you know the accession id and the type of id, then making a request to a DAS server, genome annotation server or Uniprot service to retrieve features is easy. I haven't done that code yet, but it is next on my list for a project I am working on.
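As a rough illustration of recognizing the id type from the header, the sketch below handles two common pipe-delimited conventions ("sp|" or "tr|" for UniProt, "gi|" for NCBI) and falls back to the first token otherwise. AccessionSketch and its Source enum are hypothetical and do not reflect the actual AccessionID class or the header parser in the code base.

```java
/** Hypothetical sketch: derive an accession id and its source from a Fasta header. */
class AccessionSketch {

    enum Source { UNIPROT, NCBI, UNKNOWN }

    final String id;
    final Source source;

    AccessionSketch(String id, Source source) {
        this.id = id;
        this.source = source;
    }

    /** The header is the text after '>' on the description line. */
    static AccessionSketch fromHeader(String header) {
        String[] fields = header.split("\\|");
        if (fields.length >= 2) {
            // UniProt style: sp|P12345|NAME_SPECIES ... (tr| for TrEMBL entries)
            if (fields[0].equals("sp") || fields[0].equals("tr")) {
                return new AccessionSketch(fields[1], Source.UNIPROT);
            }
            // NCBI style: gi|129295|ref|... (second field is the GI number)
            if (fields[0].equals("gi")) {
                return new AccessionSketch(fields[1], Source.NCBI);
            }
        }
        // Unrecognized layout: use the first whitespace-delimited token as the id.
        return new AccessionSketch(header.trim().split("\\s+")[0], Source.UNKNOWN);
    }
}
```

With the source known, a feature request can be routed to the matching service; allowing the user to override the detected type would just mean calling the constructor directly.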
We also worked on building the proper biological relationships into the sequence classes, such that if you start with a DNA sequence and apply the various exon/intron features you can have a TranscriptSequence that can return a ProteinSequence. In the reverse direction, you should be able to take a ProteinSequence with a valid accession id of a known type and retrieve the parent DNA sequence, if that linkage information is available via the appropriate web services/REST call. Going in reverse is part of the design but is not implemented yet. You can start with a ChromosomeSequence and work your way down by adding introns and exons. Andy has worked hard on this code, which will make it really easy to use for programmers who don't know all the details.
It has been a month since the BioJava Hackathon and I am feeling guilty that I haven't taken the time to write any of this up. Writing code is the easy part; doing the documentation is always tough! I will see if this email generates a larger discussion on the list and, based on how everything shakes out, will turn the discussion into a wiki page giving a sequence design overview plus code for testing and implementing other proxy loaders.
Thanks
Scooter Willis
On Feb 19, 2010, at 2:51 PM, Richard Holland wrote:
> Depends on whether or not you want to parse-at-once or stream-parse. If the parser is set up to load the whole lot at once, then a map is fine, otherwise not.
>
> On 20 Feb 2010, at 08:30, Scooter Willis wrote:
>
>>
>> I am starting to use the new FastaReader in a project and the default implementation I setup returns a List<ProteinSequence>. The very next thing I needed to do was convert to a LinkedHashMap<String,ProteinSequence> so I could query the sequence of interest. It would seem that this is probably a fairly standard use case. If I returned a LinkedHashMap<String,ProteinSequence> as the default container then we have a slight memory hit on keeping a hash of the accession ID and a linked list for preserving order.
>>
>> Does anyone have objections to returning the sequences read from a Fasta file as a LinkedHashMap?
>>
>> Thanks
>>
>> Scooter
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>
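For reference, the List to LinkedHashMap conversion raised in the original question can be sketched generically as below. SequenceMaps and indexByAccession are hypothetical names, and the accessor function is an assumption standing in for however the accession id is read from a sequence object.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: index parsed sequences by accession while keeping file order. */
class SequenceMaps {

    static <S> LinkedHashMap<String, S> indexByAccession(List<S> sequences,
                                                         Function<S, String> accessionOf) {
        LinkedHashMap<String, S> map = new LinkedHashMap<>();
        for (S sequence : sequences) {
            String key = accessionOf.apply(sequence);
            // A plain put() would silently drop the earlier record; fail loudly instead.
            if (map.containsKey(key)) {
                throw new IllegalArgumentException("Duplicate accession: " + key);
            }
            map.put(key, sequence);
        }
        return map;
    }
}
```

Iterating the result visits sequences in the order they were read, and get(accession) gives direct lookup. How duplicate ids should be handled is a design choice; rejecting them is only one option.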