[EMBOSS] FW: Reducing a FASTA repository, new user

Thu Feb 17 02:07:51 UTC 2011

Thanks to all for the suggestions.  A solution to the GeneBegin..GeneEnd
problem has been worked out, per the attachment, for those interested.

But for me the more important problem is making a FASTA repository
that is a subset of the gene files in a much larger repository.  This
is desirable both before and after using Usearch
(http://www.drive5.com/usearch/intro.html)
to select a minimally homologous gene set for a species.
RNA genes, cryptic viruses, and SINE/LINE genes are among
the undesirables to be eliminated.

Specifically, is there a command, using ENTRET or its relatives, that accepts a list like
637008924
637008927
640691430
640691431
637008928
637008954
637008980
for extraction and repacking into a single smaller Repository?

If not, could you recommend a software tool or suite for this type of job?
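
To make the request concrete, here is a minimal Python sketch of the kind of
subsetting meant: read a list of IDs, then copy only the matching records from
the large FASTA file into a smaller one. It is only an illustration, not an
EMBOSS command. It assumes each ID is the first whitespace-delimited token
after '>' in the header, which may not hold for every repository; the function
names and the file names in the usage comment are hypothetical.

```python
import sys


def read_ids(handle):
    """Collect the non-blank lines of an ID list into a set."""
    return {line.strip() for line in handle if line.strip()}


def filter_fasta(fasta_in, wanted, fasta_out):
    """Copy records whose ID is in `wanted`; return how many were kept."""
    kept = 0
    keep = False
    for line in fasta_in:
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited token
            # after '>'. Adjust this if the IDs sit elsewhere in the header.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
            if keep:
                kept += 1
        if keep:
            # Header and sequence lines of a wanted record are copied as-is.
            fasta_out.write(line)
    return kept


# Hypothetical usage: python subset_fasta.py ids.txt big.fasta small.fasta
if __name__ == "__main__" and len(sys.argv) == 4:
    with open(sys.argv[1]) as ids, open(sys.argv[2]) as src, \
            open(sys.argv[3], "w") as dst:
        count = filter_fasta(src, read_ids(ids), dst)
    print(f"kept {count} records", file=sys.stderr)
```

Because the file is streamed line by line, memory use depends on the size of
the ID list rather than on the size of the repository.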

MarvS

On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>
>>  This is elementary I’m sure, but I’ve been unable to work out the
>> syntax  from the documentation.
>> More minor issue.
>>
>> When using infoseq to extract all the fasta Headers from a sequence
>> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
>> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
>> for this?
>
> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
> part of the sequence ID in the FASTA file?
>
> You can use a delimiter between items for infoseq using:
>
>  -nocolumn
>
> on the command line.
>
> For import into a spreadsheet you can set the delimiter to be tab with:
>
>  -nocolumn -delimiter "\t"
>
> on the command line. That should then import nicely into a spreadsheet.
>
> Hope that helps
>
> Peter Rice
> EMBOSS Team
>
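
For the GeneBegin..GeneEnd question in the quoted message, one possible
approach (not necessarily the one in the attachment) is to pull the range out
of each header with a regular expression and emit it as two separate
tab-delimited fields. The sketch below assumes the range appears as
digits..digits somewhere in the header and that the ID is the first token;
the function names are hypothetical.

```python
import re

# Matches a coordinate pair such as 234466..234589
RANGE = re.compile(r"(\d+)\.\.(\d+)")


def split_range(header):
    """Return (begin, end) as ints from the first 'N..M' found, else None."""
    match = RANGE.search(header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def header_to_row(header):
    """Build a tab-separated row: ID, begin, end (blank if no range)."""
    fields = header.lstrip(">").split()
    seq_id = fields[0] if fields else ""
    span = split_range(header)
    begin, end = span if span is not None else ("", "")
    return "\t".join([seq_id, str(begin), str(end)])
```

Every row then has the same three columns, so the coordinates land in uniform
spreadsheet fields even when the rest of the header varies.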