Bioperl: NCBI Entrez queries and Perl file handling

Matthew Pocock mrp@sanger.ac.uk
Wed, 02 Jun 1999 16:04:50 +0100


Dear Simon,

Simon Twigger wrote:
> 
> Hi there,
> 
> <snip/>
> 
> When Perl reads in a file, say using the normal code such as:
> 
> open (FILE, "Hs.seq.uniq") or die "Can't open file: $!";
> 
> while (<FILE>) {
>         # deal with each line as it comes through
>         # for example, to look for a specific Unigene ID
>         if( /Hs\.12345/) {
>                 # deal with the unigene information
>         }
> }
> 
> close FILE;
> 
> does it keep the whole thing in memory as it reads through the file or
> does it just keep the current line (in $_) in memory? If it's the former
> then I'm not sure if reading in a 60 MB file is a good thing; if it's the
> latter, then file size shouldn't have too many adverse effects other than
> taking a while to go through the whole thing.
It should only keep $_ in memory. Of course, if you store this string
anywhere then it will be kept around. If you said @lines = <FILE> then
@lines would contain one string for each line in the file, so it would
hold the full 60 MB. That is why line-by-line file parsing is good.
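
To make the difference concrete, here is a minimal sketch of the
line-by-line approach. The helper name count_matches is made up for
illustration, and the pattern is passed in rather than assuming
anything about the layout of the Unigene file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): count the lines in a file
# that match a pattern, holding only one line in memory at a time.
sub count_matches {
    my ($path, $pattern) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {      # scalar context: one line per read
        $count++ if $line =~ $pattern;
    }
    close $fh;
    return $count;
}

# By contrast, reading in list context pulls every line into memory:
#   my @lines = <$fh>;
```

Called as count_matches("Hs.seq.uniq", qr/Hs\.12345/), memory use stays
roughly constant however large the file is, because each line is
discarded before the next one is read.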
> 
> I also thought of trying to grep out the sequence rather than going all
> the way through the file sequentially as this seems pretty fast from the
> command line.
You should try the excellent indexer modules. I don't know if there is a
Bio::Index::* module that will index your file type - there is one for
fasta - but I guess you can roll your own from Bio::Index::Abstract. The
fasta implementation is very cool.
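
The general idea behind indexing a flat file can be sketched in plain
Perl: scan the file once, remember the byte offset at which each record
starts, and later seek straight to the record you want. This is only an
illustration of the technique, not the Bio::Index code. The helper
names build_offset_index and fetch_record are invented here, the sketch
assumes fasta-style records that begin with a ">" header line, and it
keeps the index in memory rather than saving it to disk.

```perl
use strict;
use warnings;

# Hypothetical helper: map each record ID (the first word after '>')
# to the byte offset of its header line.
sub build_offset_index {
    my ($path) = @_;
    my %offset_of;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    my $pos = tell($fh);                # offset of the line about to be read
    while (my $line = <$fh>) {
        $offset_of{$1} = $pos if $line =~ /^>(\S+)/;
        $pos = tell($fh);
    }
    close $fh;
    return \%offset_of;
}

# Hypothetical helper: jump directly to one record and return its text,
# or nothing if the ID is not in the index.
sub fetch_record {
    my ($path, $index, $id) = @_;
    return unless exists $index->{$id};
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    seek($fh, $index->{$id}, 0) or die "Can't seek in $path: $!";
    my $record = <$fh>;                 # the header line
    while (my $line = <$fh>) {
        last if $line =~ /^>/;          # the next record starts here
        $record .= $line;
    }
    close $fh;
    return $record;
}
```

The one-off scan is still sequential, but every lookup after that reads
only the record requested instead of the whole file.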
> 
> Any suggestions on efficient ways to pull data out of large flat files
> like this?
> 
> Thanks for any help you can give me!
> 
> Simon.
> 
> --
> --------------------------------------------------
> Simon Twigger, Ph.D.
> Laboratory for Genetic Research,
> Cardiovascular Research Center,
> Medical College of Wisconsin,
> 8701 Watertown Plank Road,
> Milwaukee, WI, 53226
> 
> http://legba.ifrc.mcw.edu/~simont/
> 
> tel. 414-456-4409               fax. 414-456-6516
> --------------------------------------------------
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://bio.perl.org/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================