Manual
Usage
Usage: motexCPU|motexOMP|motexMPI

Standard (Mandatory):
  -a, --alphabet <str>            `DNA' for nucleotide sequences or `PROT' for
                                  protein sequences. You may use `USR' for
                                  user-defined alphabet; edit the file
                                  motexdefs.h accordingly.
  -i, --input-file <str>          (Multi)FASTA input filename.
  -o, --output-file <str>         MoTeX output filename.
  -d, --distance <int>            The distance used for extracting the motifs.
                                  It can be either 0 (for Hamming distance) or
                                  1 (for edit distance).
  -k, --motifs-length <int>       The length for motifs.
  -e, --errors <int>              Limit the max number of errors to this value.
  -q, --quorum <int>              The quorum is the minimum percentage (%) of
                                  sequences in which a motif must occur.

Optional:
  -Q, --max-quorum <int>          The maximum percentage (%) of sequences in
                                  which a motif can occur (default: 100).
  -n, --num-of-occ <int>          The minimum number of occurrences of a
                                  reported motif in any of the sequences
                                  (default: 1).
  -N, --max-num-of-occ <int>      The maximum number of occurrences of a
                                  reported motif in any of the sequences
                                  (default: 10000).
  -s, --structured-motifs <str>   Input filename for the structure of the boxes
                                  in the case of structured motifs.
  -S, --SMILE-out-file <str>      SMILE-like output filename to be used by
                                  SMILE.
  -b, --background-in-file <str>  MoTeX background filename for statistical
                                  evaluation passed as input.
  -t, --threads <int>             Number of threads to be used by the OMP
                                  version (default: 4).
  -L, --long-sequences <int>      If the number of input sequences is less than
                                  the number of processors used by the MPI
                                  version, this should be set to 1 (default: 0);
                                  useful for a few (or one) very long
                                  sequence(s), e.g. a chromosome.
  -u, --un-out-file <str>         Output filename for foreground motifs not
                                  matched exactly with any background motif in
                                  the file passed with the `-b' option.
  -I, --un-in-file <str>          Input filename of the aforementioned file
                                  with the unmatched motifs. These motifs will
                                  be approximately searched as motifs in the
                                  file passed with the `-i' option.
  -U, --SMILE-un-out-file <str>   SMILE-like output filename for foreground
                                  motifs not matched exactly with any background
                                  motif in the file passed with the `-b' option.
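
The rule for the `-L' option compares the number of input sequences with the number of MPI processes. The following is a minimal sketch of that check, assuming the input is a (Multi)FASTA file in which every sequence starts with a `>' header line. The helper names count_fasta_sequences and long_sequences_flag are hypothetical and not part of MoTeX.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with MoTeX).

# Count the records of a (Multi)FASTA file: one '>' header line per sequence.
count_fasta_sequences() {
    grep -c '^>' "$1"
}

# Print the value to pass to -L:
#   1 if the file holds fewer sequences than MPI processes, 0 otherwise.
# $1 = (Multi)FASTA file, $2 = number of MPI processes
long_sequences_flag() {
    nseq=$(count_fasta_sequences "$1")
    if [ "$nseq" -lt "$2" ]; then
        echo 1
    else
        echo 0
    fi
}

# Usage (file name and process count are placeholders):
#   long_sequences_flag chromosome.fasta 16
```

The printed value can then be passed to `-L' when launching the MPI version.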
Output
Each line of the output describes one valid motif as space-separated fields, in the following order:
1. the motif
2. the number of sequences in which the motif occurs
3. the total number of input sequences
4. the ratio of 2. to 3.
5. the total number of occurrences of the motif
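
Because the fields are space-separated, the output can be post-processed with standard tools. The sketch below keeps only the motifs that occur in at least a given percentage of the sequences. It recomputes the percentage from fields 2 and 3 rather than reading field 4, since the exact formatting of the ratio is not specified here. The function name filter_by_quorum is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with MoTeX).
# Print motifs that occur in at least $2 percent of the input sequences.
# $1 = MoTeX output file, $2 = minimum percentage
# Output fields: motif, sequences with the motif, total sequences,
#                recomputed percentage, total occurrences
filter_by_quorum() {
    awk -v min="$2" 'NF >= 5 && $3 > 0 {
        pct = 100 * $2 / $3            # percentage from fields 2 and 3
        if (pct >= min)
            printf "%s %d %d %.1f %d\n", $1, $2, $3, pct, $5
    }' "$1"
}

# Usage (file name is a placeholder):
#   filter_by_quorum single.motex 50
```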
Examples
Example 1.
In order to reproduce the results on the accuracy of MoTeX presented in
Solon P. Pissis, Alexandros Stamatakis, and Pavlos Pavlidis. MoTeX: A
word-based HPC tool for MoTif eXtraction. In Proceedings of the Fourth
ACM International Conference on Bioinformatics and Computational Biology
(ACM-BCB 2013), pp. 13-22, 2013,
change to the directory `data' and follow the instructions in the file README.
Example 2.
Here is a series of steps to extract single motifs and assess their statistical
significance:
1. Uncompress the input file `dnc_subtilis_330-30.seq.bz2'
bunzip2 ./data/dnc_subtilis_330-30.seq.bz2
2. Extract single motifs with 8 threads
./motexOMP -a DNA -i ./data/dnc_subtilis_330-30.seq -o single.motex -d 0 -k 6 -e 1 -q 12 -S single.smile -t 8
3. Assess their statistical significance
cd SMILE
./smile ../data/dnc_subtilis_330-30.seq ../single.smile single.smile.output 100 2
Example 3.
Here is a series of steps to extract structured motifs and assess their statistical
significance:
1. Uncompress the input file `dnc_subtilis_330-30.seq.bz2'
bunzip2 ./data/dnc_subtilis_330-30.seq.bz2
2. Extract structured motifs with 8 threads
./motexOMP -a DNA -i ./data/dnc_subtilis_330-30.seq -o struct.motex -d 0 -k 6 -e 1 -q 1 -s ./data/boxes.txt -S struct.smile -t 8
The first line of the file `./data/boxes.txt' is an integer giving the total number of spacers of the structured motif. Each subsequent group of four lines contains four integers, one per line: the minimum and the maximum length of the corresponding spacer, the length of the box that follows it, and the maximum number of errors allowed in that box.
3. Assess their statistical significance
cd SMILE
./smile ../data/dnc_subtilis_330-30.seq ../struct.smile struct.smile.output 100 2
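
To make the layout of the boxes file from step 2 concrete, here is a minimal sketch that writes a file for a single spacer and checks that a file has the expected shape (one count line followed by four integer lines per spacer). The function names and the numeric values in the usage comment are illustrative assumptions and are not taken from the shipped `./data/boxes.txt'.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with MoTeX).

# Write a boxes file describing exactly one spacer, one integer per line.
# $1 = output file, $2 = min spacer length, $3 = max spacer length,
# $4 = length of the box after the spacer, $5 = max errors in that box
write_single_spacer_boxes() {
    printf '%s\n' 1 "$2" "$3" "$4" "$5" > "$1"
}

# Succeed only if every line is a non-negative integer and the file has
# 1 + 4 * n lines, where n is the spacer count on the first line.
check_boxes() {
    awk '
        !/^[0-9]+$/ { bad = 1 }
        NR == 1     { n = $1 }
        END         { exit (bad || NR != 1 + 4 * n) ? 1 : 0 }
    ' "$1"
}

# Usage (illustrative values: spacer of 10 to 20 letters, box of length 6,
# at most 1 error):
#   write_single_spacer_boxes my_boxes.txt 10 20 6 1
#   check_boxes my_boxes.txt && echo "layout looks consistent"
```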
Example 4.
Here is a series of steps which could potentially be used as a biological pipeline:
1. Run MoTeX on a background (bg) input dataset, and write the bg motifs to `bg.motex'.
./motexOMP -a DNA -i bg.dataset -o bg.motex -d 0 -q 1 -k 10 -e 1 -t 4
2. Run a foreground (fg) input dataset, and output the fg motifs to `fg1.motex'; output the fg motifs that are not matched with any bg motif to `un.motex'.
./motexOMP -a DNA -i fg.dataset -b bg.motex -o fg1.motex -u un.motex -d 0 -q 1 -k 10 -e 1 -t 8
3. Check whether the fg motifs in `un.motex' occur as motifs in the bg dataset, and, if so, output them to `fg2.motex'.
./motexOMP -a DNA -i bg.dataset -I un.motex -o fg2.motex -d 0 -q 1 -k 10 -e 1 -t 8
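
The idea of "fg motifs not matched exactly with any bg motif" in step 2 can be illustrated outside MoTeX as a set difference on the first field of two output files. This is only a sketch of the concept, assuming both files use the space-separated output format described above; it is not MoTeX's own implementation, and the function name unmatched_motifs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with MoTeX).
# Print every motif (field 1) of the fg file that does not appear as a
# motif (field 1) in the bg file.
# $1 = fg MoTeX output file, $2 = bg MoTeX output file
unmatched_motifs() {
    awk -v bgfile="$2" '
        BEGIN {
            # Load all bg motifs into a lookup table.
            while ((getline line < bgfile) > 0) {
                split(line, f, " ")
                bg[f[1]] = 1
            }
        }
        NF > 0 && !($1 in bg) { print $1 }
    ' "$1"
}

# Usage:
#   unmatched_motifs fg1.motex bg.motex
```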