[Bioperl-l] Processing large fasta sequences through SeqIO
Jason Stajich
jason@chg.mc.duke.edu
Sat, 1 Sep 2001 16:41:09 -0400 (EDT)
Josep -
Tracked down the bug - it is in Bio::SeqIO::largefasta.pm
I wrote the attached test script to diagnose the problem, as it caused a
lovely infinite loop. It appears this loop is what is filling up your
/tmp directory, hence the 'Too many links' error.
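A quick way to check whether /tmp really is filling up with leftover
temporary directories is to count its subdirectories before and after a
run. The following is only a minimal sketch in core Perl; count_subdirs is
a name made up for this example, not a bioperl function, and /tmp is
assumed to be the temp location as in the error message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_subdirs is a hypothetical helper for this example only.
# It returns the number of immediate subdirectories of $dir,
# ignoring '.', '..' and symbolic links.
sub count_subdirs {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "cannot open $dir: $!\n";
    my @subdirs = grep {
        $_ ne '.' && $_ ne '..' && !-l "$dir/$_" && -d "$dir/$_"
    } readdir($dh);
    closedir($dh);
    return scalar @subdirs;
}

# Directory to inspect: first command-line argument, else /tmp (assumed).
my $dir = @ARGV ? $ARGV[0] : '/tmp';
print "$dir contains ", count_subdirs($dir), " subdirectories\n";
```

If the count keeps climbing while the script runs, the loop is the likely
culprit rather than a lack of disk space.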
You can work around this in your own code without upgrading your local
bioperl installation (the fix is so far only checked in to the bioperl CVS
repository). Where you have a loop getting all the sequences from the seqio
stream -
> while ( $seq = $seqio->next_seq )
change it to
> while ( ($seq = $seqio->next_seq) && $seq->length() > 0 )
This is of course a workaround, but should take care of things.
Please let us know if the suggestion helps.
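One detail worth spelling out for the guarded loop: in Perl, && binds more
tightly than =, so the assignment needs its own parentheses, otherwise $seq
receives the result of the whole && expression rather than the sequence
object. Below is a small sketch of the guarded loop that runs without
bioperl installed; MockSeq and MockSeqIO are hypothetical stand-ins
invented for this example, with MockSeqIO imitating a reader that keeps
handing back empty sequences at end of input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a sequence object (not a bioperl class).
package MockSeq;
sub new    { my ($class, $str) = @_; return bless { str => $str }, $class; }
sub length { my ($self) = @_; return CORE::length $self->{str}; }

# Hypothetical stand-in for the sequence stream (not a bioperl class).
# It returns the queued sequences, then empty sequence objects forever.
package MockSeqIO;
sub new { my ($class, @strs) = @_; return bless { queue => [@strs] }, $class; }
sub next_seq {
    my ($self) = @_;
    my $str = shift @{ $self->{queue} };
    return MockSeq->new(defined $str ? $str : '');
}

package main;

my $seqio = MockSeqIO->new('ACGT', 'GGCCTTAA');
my ($seq, @lengths);

# Parenthesize the assignment so $seq holds the object before
# its length is tested; the loop stops at the first empty sequence.
while ( ($seq = $seqio->next_seq) && $seq->length() > 0 ) {
    push @lengths, $seq->length();
}

print "read ", scalar(@lengths), " sequences: @lengths\n";
```

With the two mock sequences above this prints "read 2 sequences: 4 8" and
then stops instead of looping on the empty records.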
I have propagated this fix to branch-07 and main trunk. Thanks for your
patience and I hope this helps you accomplish your task.
Attached is the test script for those interested in playing around with
this more.
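For anyone adapting the script: it copies each sequence in 1000-base
windows using subseq($start, $start+999), which fits its 3000-base test
sequences exactly. For lengths that are not a multiple of the window size,
the last window probably needs clamping so the end coordinate stays in
range. A minimal sketch of that bookkeeping follows; window_ranges is a
hypothetical helper named for this example, and a plain string with substr
stands in for a large sequence object and its subseq call.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# window_ranges is a hypothetical helper for this example only.
# It returns [start, end] pairs (1-based, inclusive) covering
# positions 1..$length in windows of at most $size bases.
sub window_ranges {
    my ($length, $size) = @_;
    my @ranges;
    for (my $start = 1; $start <= $length; $start += $size) {
        my $end = $start + $size - 1;
        $end = $length if $end > $length;    # clamp the final window
        push @ranges, [$start, $end];
    }
    return @ranges;
}

# Stand-in for a large sequence; with bioperl each window would be
# fetched with $seq->subseq($start, $end) instead of substr.
my $sequence = ('ACGT' x 625) . 'TTT';       # 2503 bases
my $copy = '';
for my $range (window_ranges(length($sequence), 1000)) {
    my ($start, $end) = @$range;
    $copy .= substr($sequence, $start - 1, $end - $start + 1);
}

print "copied ", length($copy), " of ", length($sequence), " bases\n";
```

Using <= in the loop condition also picks up a final window that is a
single base long.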
-jason
--------------------------------------------------------------
On Fri, 31 Aug 2001, Josep Francesc Abril Ferrando wrote:
> Hi Jason,
>
> > > Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory
> > > /tmp/Z0gD8R0rlB: Too many links at
> > > /usr/lib/perl5/site_perl/5.005//Bio/Root/IO.pm line 457
> >
> > Is your tmp dir really full of files/directories, or does it not have
> > enough space for the collection of all the sequence data? This seems
> > like a system problem.
>
> Currently, "/tmp" is only ~150Mb and I have more than 1Gb of free hard disk space (on a PC box with
> 386Mb of RAM, Red Hat 6.2 with kernel version 2.2.14, and perl 5.6.1). Maybe it could be a
> permissions issue.
>
> > Do you have File::Temp installed? There is a known bug in 0.7 release
> > that if you do not have File::Temp installed the application will not
> > cleanup its tempdirs/tempfiles cleanly. Installing File::Temp will take
> > care of that.
>
> It is installed and it is version 0.12. Do I have to include the corresponding "use File::Temp;" in
> the script ?
> Maybe I have to tell our sysadmin to update both, File::Temp and BioPerl.
>
> > > If I look at the saved file, the sequence is OK (do not have more or
> > > less nucleotides than expected and they are in the correct ordering)
> > > but the file contains a lot of empty lines (or just having '>') after
> > > the finished sequence. Any idea of what should be wrong in the
> > > following script:
> >
> > Nothing obvious is jumping out right now by looking at your code -
> > How large are your files?
>
> At the moment I am working with sequences around 50Mbp long, but I would like to be able to scale up
> to 250Mbp.
>
> > > Is that the right way to use "Bio::SeqIO" for processing large fasta
> > > files. Do I have to include "Bio::Seq::LargeSeq" and, if yes, how can
> > > I do that ?
> >
> > you could add the line
> > use Bio::Seq::LargeSeq;
> > just below --> use Bio::SeqIO <--
> > if you wanted, but it is included by the largefasta modules so it is
> > optional.
>
> Well, I've run some tests, first including "use Bio::Seq::LargeSeq" and then also "use
> File::Temp", and I got the same results (the same error/warning, with only the name of the temporary
> directory that cannot be created changing, and the same trailing extra lines).
>
> Thanks again... Josep F.
>
> ________________________________________
>
> Josep Francesc ABRIL FERRANDO
>
> RESEARCH GROUP on BIOMEDICAL INFORMATICS
> GENOME INFORMATICS LAB
> IMIM - UPF
> C/ Dr. Aiguader 80
> 08003 - Barcelona (SPAIN)
>
> Ph: +34 93 2211009 ext 2016
> Fax: +34 93 2213237
>
> http://www1.imim.es/~jabril/
--------------------------------------------------------------
Attachment: josep_test.pl

#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempfile tempdir);
use File::Path;

use Bio::Root::IO;
use Bio::Seq::LargePrimarySeq;
use Bio::SeqIO;
my $DEBUG = 1;

my ($dir,$fh,$filename,$io, $seqio, $seq);


# test File::Temp
( $dir)  = tempdir(CLEANUP => 1);
if( ! $dir ) { die "error getting tempdir\n"; }
( $fh, $filename) = tempfile(DIR => $dir);
if( ! $fh || ! $filename ) { die "error getting tempfile\n"; }

print $fh "testing output\n";
if( $DEBUG ) {
    print "filename is $filename, dir is $dir\n";
}

File::Path::rmtree($dir);

$fh = undef;
$filename = undef;
$dir = undef;

# test Bio::Root::IO
$io = new Bio::Root::IO(-verbose => $DEBUG );

($dir) = $io->tempdir(CLEANUP => 1);
if( ! $dir ) { die "error getting Root::IO tempdir\n"; }

($fh, $filename) = $io->tempfile(DIR => $dir);
if( ! $fh || ! $filename ) { die "error getting Root::IO tempfile\n"; }

$io->_io_cleanup();
undef $io;

if( -e $filename ) {
    print STDERR "cleanup by Root::IO did not work\n";
}
File::Path::rmtree($dir);

if( -e $dir  ) {
    print STDERR "cleanup by rmtree did not work\n";
}

# test Bio::Seq::LargePrimarySeq

$seq = new Bio::Seq::LargePrimarySeq(-id => 'test1',
                                     -seq => 'cagt');
$seq->add_sequence_as_string('GATAGTGATAGT');

if( lc $seq->subseq(1, 10) ne 'cagtgatagt') {
    die("error with Bio::Seq::LargePrimarySeq implementation");
}

$seq = undef;

# test Bio::SeqIO::largefasta in manner that Josep is using it

my @bases = qw(C A G T);
($dir) = tempdir(CLEANUP => 1);
my @files;
foreach ( 1..10 ) {
    my $sequence = '';
    foreach ( 1..3000 ) { $sequence .= $bases[ int rand(4)];   }

    ( $fh, $filename) = tempfile(DIR => $dir);
    print "new tmpfile is $filename\n";
    push @files, $filename;
    $seqio = new Bio::SeqIO(-fh => $fh, -format => 'fasta');

    $seq = new Bio::Seq::LargePrimarySeq(-id => "test_$_",
                                         -seq => $sequence);
    $seqio->write_seq($seq);
    $seqio = undef;
    close($fh);
}

print "about to process aggregate files\n";

( $fh, $filename) = tempfile( DIR => $dir);
my $seqout = new Bio::SeqIO(-fh => $fh, -format => 'largefasta');
my $bigseq = new Bio::Seq::LargePrimarySeq(-id => 'bigseq');

foreach my $file ( @files ) {
    print "processing file: $file\n";

    $seqio = new Bio::SeqIO(-file => $file, -format => 'largefasta');
    while( defined ( $seq = $seqio->next_seq) ) {
        $seqout->write_seq($seq);

        # this is to build a giant aggregate sequence
        # not sure if it is what Josep is really doing
        # have to play these games because cannot call
        # seq->seq() if seq is sufficently large
        # because entire seq may not fit into memory

        my $start = 1;
        my $length = $seq->length();
        while( $start < $length ) {
            $bigseq->add_sequence_as_string($seq->subseq($start,$start+999));
            $start += 1000;
        }
    }
}