Pattern matching (including string matching, sequence matching...) is an important class of algorithms in Big Data and HPDA (High-Performance Data Analytics). For example, in bioinformatics, such algorithm applies to DNA sequence matching. Another domain of application is data mining. One example of pattern matching is exact string matching that is provided by tools like GNU grep. In addition, the target application of this project includes another parameters: the notion of approximation. Indeed, the notion of approximate matching allows a sequence to be found in a large database with some differences including insertion and deletion. Thus, the notion of distance is used to check if a match occurs or not between the searched pattern and the current processed text.
This code is based on the computation of the Levenshtein Distance, and more precisely on the C implementation available on Wikibooks here:
#define MIN3(a, b, c) ((a) < (b) ? ((a) < (c) ? (a) : (c)) : ((b) < (c) ? (b) : (c))) int levenshtein(char *s1, char *s2) { unsigned int s1len, s2len, x, y, lastdiag, olddiag; s1len = strlen(s1); s2len = strlen(s2); unsigned int column[s1len+1]; for (y = 1; y <= s1len; y++) { column[y] = y; } for (x = 1; x <= s2len; x++) { column[0] = x; for (y = 1, lastdiag = x-1; y <= s1len; y++) { olddiag = column[y]; column[y] = MIN3(column[y] + 1, column[y-1] + 1, lastdiag + (s1[y-1] == s2[x-1] ? 0 : 1)); lastdiag = olddiag; } } return(column[s1len]); }
To adapt this notion of distance to our problem, we will consider the pattern as the first string s1
and the dna database as s2
.
The previous function checks the distance of string s1
with string s2
.
In our case, it is only necessary to check the pattern size to check if there is a match starting at the beginning of the dna database.
Then, by repeating the same algorithm for each possible match position (i.e., each character of the database), we can count the total number of matches depending on the target distance.
In addition to adapt the algorithm, we made the following changes:
column
array outside of the levenshtein
functionsrc/
: directory containing the source code apm.c
obj/
: directory that will contain the object files generated during the compilation process
dna/
: directory containing small dna sequences for testing
Makefile
: File containing rules to compile the code
To compile the application, run the make
command to generate the apm
binary.
By launching the executable without argument, some help is printed:
$ ./apm Usage: ./apm approximation_factor dna_database pattern1 pattern2 ...The first argument
approximation_factor
is an integer that represents the distance allowed between the patterns and the match.
The second argument dna_database
is a text file containing the big database in which the patterns are searched for.
Finally, the rest of the command line is dedicated to case-sensitive patterns.
To test the application, you can use the two small files located in the dna
directory.
To evaluate performance on larger database, you can download genome parts on this website.
For example, human chromosoms are available here.
Download the FASTA format which is compatible with the application.
Be careful about the size of each file and the time to read it from the NFS filesystem!
The parallelism available in this code can be based on the matching of one pattern (by splitting the dna sequence) and/or on the matching of multiple patterns at the same time. Of course both parallelism can be exploited at the same time with the parallel programming models like MPI, OpenMP or CUDA.