Approximate Pattern Matching

This topic focuses on a Big Data algorithm that relies on pattern matching with the notion of approximation to allow matches with small differences (e.g., character insertion).

Description

Pattern matching (including string matching, sequence matching...) is an important class of algorithms in Big Data and HPDA (High-Performance Data Analytics). For example, in bioinformatics, such algorithm applies to DNA sequence matching. Another domain of application is data mining. One example of pattern matching is exact string matching that is provided by tools like GNU grep. In addition, the target application of this project includes another parameters: the notion of approximation. Indeed, the notion of approximate matching allows a sequence to be found in a large database with some differences including insertion and deletion. Thus, the notion of distance is used to check if a match occurs or not between the searched pattern and the current processed text.

This code is based on the computation of the Levenshtein Distance, and more precisely on the C implementation available on Wikibooks here:

#define MIN3(a, b, c) ((a) < (b) ? ((a) < (c) ? (a) : (c)) : ((b) < (c) ? (b) : (c)))

int levenshtein(char *s1, char *s2) {
    unsigned int s1len, s2len, x, y, lastdiag, olddiag;

    s1len = strlen(s1);
    s2len = strlen(s2);

    unsigned int column[s1len+1];

    for (y = 1; y <= s1len; y++) {
        column[y] = y;
    }

    for (x = 1; x <= s2len; x++) {
        column[0] = x;

        for (y = 1, lastdiag = x-1; y <= s1len; y++) {
            olddiag = column[y];
            column[y] = MIN3(column[y] + 1, column[y-1] + 1, lastdiag + (s1[y-1] == s2[x-1] ? 0 : 1));
            lastdiag = olddiag;
        }
    }
    return(column[s1len]);
}

To adapt this notion of distance to our problem, we will consider the pattern as the first string s1 and the dna database as s2. The previous function checks the distance of string s1 with string s2. In our case, it is only necessary to check the pattern size to check if there is a match starting at the beginning of the dna database. Then, by repeating the same algorithm for each possible match position (i.e., each character of the database), we can count the total number of matches depending on the target distance.

In addition to adapt the algorithm, we made the following changes:

Allow matching of multiple patterns

Allow the user to choose the maxmimum distance for matches

Allocate the column array outside of the levenshtein function

Source Code

The source code for this project is available here: apm.tgz. This archive contains:

src/: directory containing the source code apm.c
obj/: directory that will contain the object files generated during the compilation process
dna/: directory containing small dna sequences for testing
Makefile: File containing rules to compile the code

To compile the application, run the make command to generate the apm binary. By launching the executable without argument, some help is printed:

$ ./apm 
Usage: ./apm approximation_factor dna_database pattern1 pattern2 ...

The first argument approximation_factor is an integer that represents the distance allowed between the patterns and the match. The second argument dna_database is a text file containing the big database in which the patterns are searched for. Finally, the rest of the command line is dedicated to case-sensitive patterns.

To test the application, you can use the two small files located in the dna directory. To evaluate performance on larger database, you can download genome parts on this website. For example, human chromosoms are available here. Download the FASTA format which is compatible with the application. Be careful about the size of each file and the time to read it from the NFS filesystem!

Parallelism Hint

The parallelism available in this code can be based on the matching of one pattern (by splitting the dna sequence) and/or on the matching of multiple patterns at the same time. Of course both parallelism can be exploited at the same time with the parallel programming models like MPI, OpenMP or CUDA.