[EMBOSS] SUMMARY: size limits on extractseq?

Iain Drummond idrummon at receptor.mgh.harvard.edu
Mon May 31 12:18:50 UTC 2004


This turned out to be operator error. extractseq works fine on large files.

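The NCBI mouse chromosome files are actually a collection of contigs, so
each .fa file holds many FASTA records. extractseq looks only at the first
sequence in the file, which is why it offered a range of only about 19.6 Mb.
Both comments from Clinton and Peter are on the mark; the problem was the
sequence file.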
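
For anyone who wants to check whether a chromosome file really holds several
contigs, here is a minimal sketch (not part of EMBOSS) that lists the length
of every FASTA record in a stream. The function name fasta_record_lengths and
the tiny sample sequences are made up for illustration; it assumes plain
FASTA with ">" header lines.

```python
import io


def fasta_record_lengths(handle):
    """Yield (header, length) for each record in a FASTA stream."""
    header, length = None, 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            # Sequence line: count residues, ignoring stray whitespace.
            length += len(line.strip())
    # Emit the last record in the file.
    if header is not None:
        yield header, length


# Hypothetical two-contig input, standing in for a real chromosome file.
sample = io.StringIO(">contig_1\nACGT\nAC\n>contig_2\nGGG\n")
records = list(fasta_record_lengths(sample))
for name, n in records:
    print(name, n)
```

On a real file one would pass open("mm_chr1.fa") instead of the sample. If
more than one record is listed, and the first record's length matches the
upper bound extractseq offers, the file is a multi-contig file rather than a
single chromosome sequence.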
-- 

Iain Drummond, Ph.D.
Assistant Professor
Department of Medicine, Harvard Medical School and
Renal Unit, Massachusetts General Hospital

Mailing address:
Renal Unit / MGH 149-8000
149 13th St. 
Charlestown, MA 02129

Tel: 617 726 5647
Fax: 617 726 5669

idrummond at partners.org
idrummon at receptor.mgh.harvard.edu

Lab Home Page:
http://danio.mgh.harvard.edu



> From: Clinton Fernandes <cfernand at utm.utoronto.ca>
> Reply-To: clintonf at interchange.ubc.ca
> Date: Sun, 30 May 2004 21:37:30 -0700
> To: Iain Drummond <idrummon at receptor.mgh.harvard.edu>
> Subject: Re: [EMBOSS] size limits on extractseq?
> 
> It may be the source of the file.
> 
> I have experienced difficulty sometimes saving a file in a Windows environment
> and working with the file in a Linux environment. While I obviously don't know
> the specifics of your environment, this may be something you haven't
> considered.
> 
> I never had a problem with extractseq, but what I was experiencing seemed
> somewhat similar to your problem. A sequence I downloaded from the internet
> had invisible characters that dos2unix did not remove. This resulted in some
> very quirky behaviour in an extraction program that I had coded. I had to
> copy the sequence to the clipboard and paste it into a text editor in my
> Linux machine.
> 
> Again, what I have described may be completely left-field to what you are
> experiencing, but it may bear looking into if the situations are similar.
> 
> -- 
> Clinton Fernandes
> Bioinformatician
> UBC, Dept. of Microbiology
> Wesbrook Bldg, Room 224
> 6174 University Blvd,
> Vancouver, BC, Canada
> 
> (604) 827-5160
> e-mail: clintonf at interchange.ubc.ca
> 
> 
> Quoting Iain Drummond <idrummon at receptor.mgh.harvard.edu>:
> 
>> Is there an upper limit on the file size that extractseq can handle?
>> 
>> I run into a problem using extractseq to get segments out of mouse
>> chromosome files. It will only access the first 19.7 MB or so of a 185 MB
>> file. The files are fasta files of the mouse genome build 32 from NCBI. Here
>> is what the problem looks like: (real file sizes are ls'd below)
>> 
>> $ extractseq
>> Extract regions from a sequence
>> Input sequence: mm_chr1.fa
>> Regions to extract (eg: 4-57,78-94) [1-19589943]:
>>  
>> $ extractseq
>> Extract regions from a sequence
>> Input sequence: mm_chr2.fa
>> Regions to extract (eg: 4-57,78-94) [1-19704910]:
>>  
>> $ ls -l
>> total 3782918
>> -rw-r--r--   1 nobody   nobody   200460727 May 25 23:56 mm_chr1.fa
>> -rw-r--r--   1 nobody   nobody   135604265 May 26 15:16 mm_chr10.fa
>> -rw-r--r--   1 nobody   nobody   123799904 May 26 15:17 mm_chr11.fa
>> -rw-r--r--   1 nobody   nobody   120032577 May 26 15:19 mm_chr12.fa
>> -rw-r--r--   1 nobody   nobody   119608375 May 26 15:20 mm_chr13.fa
>> -rw-r--r--   1 nobody   nobody   185168052 May 26 14:28 mm_chr2.fa
>> 
> 




