[EMBOSS] FW: Reducing a FASTA repository, new user

Marvin Stodolsky marvin.stodolsky at gmail.com
Fri Feb 18 02:23:15 UTC 2011


Here is the attachment.
The whole cleanup process could be done with pm;y SED calls I'm sure,
but would be beyond my SED comfort level.


On Thu, Feb 17, 2011 at 12:06 PM, Tom Keller <kellert at ohsu.edu> wrote:
> HI Martin,
> I am interested i the solution. There was no attachment to the email I received. Would you mind sending it?
> thank you,
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
> On Feb 16, 2011, at 6:07 PM, Marvin Stodolsky wrote:
>> All thanks for the suggestions.  A solution to the GeneBegin..GeneEnd
>> problem has been worked out, per the Attachment, for those interested.
>> But for me the more important problem is making a FASTA repository,
>> which is a subset of the gene files in a much larger Repository.  This
>> is desirable before & after using Usearch -
>> http://www.drive5.com/usearch/intro.html
>> to select out a minimally homologous gene set of a species.
>> Elimination of RNA genes, cryptic viruses, SINE/LINE genes are among
>> the undesirables.
>> Specifically, is the command using ENTRET or relatives , to accept a list like
>> 637008924
>> 637008927
>> 640691430
>> 640691431
>> 637008928
>> 637008954
>> 637008980
>> for extraction and repacking into a single smaller Repository?
>> If not, could you recommend a software tool/suite for this type of job.
>> MarvS
>> On Tue, Feb 15, 2011 at 3:59 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>> On 14/02/2011 23:35, Marvin Stodolsky wrote:
>>>>  This is elementary I’m sure, but I’ve been unable to work out the
>>>> syntax  from the documentation.
>>>> More minor issue.
>>>> When using infoseq to extract all the fasta Headers from a sequence
>>>> Repository, the GeneBegin..GeneEnd (like   234466..234589) often fails to
>>>> come as a uniform field/fields in a resultant spreadsheet.  Is there a Fix
>>>> for this?
>>> I don't see the genebegin and geneend in EMBOSS infoseq output. Are they
>>> part of the sequence ID in the FASTA file?
>>> You can use a delimiter between items for infoseq using:
>>>  -nocolumn
>>> on the command line.
>>> For import into a spreadsheet you can set the delimiter to be tab with:
>>>  -nocolumn -delimiter "\t"
>>> on the command line. That should then import nicely into a spreadsheet.
>>> Hope that helps
>>> Peter Rice
>>> EMBOSS Team
>> _______________________________________________
>> EMBOSS mailing list
>> EMBOSS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss
-------------- next part --------------

With respect to using info in the FASTA description field, the intent and partial solution can now be explained.
The top level intent is to avoid overlapping genes, in a statiscal analysis being pl anned.
The 3rd & 4th lines below from an "infoseq -nocolumns" whole genome retreival. They report an overlap, i.e., 
the DNA gyrase A is overlapped by seryl-tRNA: serly_begin=7294 - 7322=gyrase_end < 0

DnaJ domain protein 1828..2760(+) [Mycoplasma genitalium G37]
DNA gyrase subunit B 2845..4797(+) [Mycoplasma genitalium G37]
DNA gyrase subunit A 4812..7322(+) [Mycoplasma genitalium G37]
seryl-tRNA synthetase 7294..8547(+) [Mycoplasma genitalium G37]
thymidylate kinase 8551..9183(+) [Mycoplasma genitalium G37]

In a few microbes I've checked, about a quarter of the genes have some putative overlap. These could contaminate the proteins/codon_usage statistical analysis being planned. Thus I wished an enmass way of recogizing the overlapping genes.
A non-elegant fix has been worked out.

Pulling the dataset into a spreadsheet, spaces in the description field were  next replaced with >< :
thymidylate 8551..9183(+)><[Mycoplasma><genitalium><G37]

Next ><[ is replace by "to be field seperator"  |[
and the file saved as:   Myc637000176m.csv 

to get rid of >< in the terminal common  [Mycoplasma><genitalium><G37], there was done 
$  cut -d"[" -f1 Myc637000176m.csv > Myc637000176m2.csv
resulting in :

internals are next mostly deleted with:
sed -e 's/<.*>//g'  Myc637000176m2.csv > Myc637000176m3.csv
resulting in:

The single remmaining >< is replaced with potential separator | 
sed -e 's/></|/g'  Myc637000176m3.csv > Myc637000176m4.csv
resulting in:
BASICALLY, the clever work is now done, and the rest is more routine manipulation.

A cleanup was done with:
sed -e 's/)|//g'  Myc637000176m4.csv > Myc637000176m5.csv
sed -e 's/(/|/g'  Myc637000176m5.csv > Myc637000176m6.csv
together changing the  (+)|  to   |+   ,that is a separated field

The replacement of the residual  ..  with potential separator | was easiest done as a within spreadsheet operation in its own field, because of too many residual "." in the whole file

After routine manipulations within the spread sheet, 
a view of the overlap detection section is:
 F       G      H                I               J  fields
Start 	End Begin-nextEnd  OR((H2<0),(H1<0))  Stable 0/1 Value, for SORTING on		
686	1828	0		FALSE		0		
1828	2760	85		FALSE		0		
2845	4797	15		TRUE		0		
4812	7322	-28		TRUE		1		
7294	8547	4		TRUE		1		
8551	9183	-27		TRUE		1
9156	9920	3		FALSE		0	
9923	11251	0		FALSE		0	

The overlapping genes have stable value 1,during  sorting, while field I FALSE/TRUE and not stable during SORTing

More information about the EMBOSS mailing list