Saturday, 15 February 2020

How To Speed-up Python Edit Distance Algorithm 26 Times Using Cython

I had an idea for a Python project, but I needed some speed, and I found out that Cython is the best option: all the code can be compiled in the normal way, with just some modules rewritten in Cython. Here we go - let's take a real function (one I actually need, not a dummy example), edit distance, in pure Python:
def edit_dist_dp(str1, str2, m, n):
    store = [[0 for x in range(n + 1)] for x in range(m + 1)]

    for i in range(m + 1):
        for j in range(n + 1):

            if i == 0:
                store[i][j] = j

            elif j == 0:
                store[i][j] = i

            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]

            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                      store[i - 1][j - 1])

    return store[m][n]
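A quick sanity check of the function (repeated here so the snippet runs standalone): the classic "kitten"/"sitting" pair, which needs exactly three edits:

```python
# Edit distance via dynamic programming (same algorithm as above),
# repeated so this check runs on its own.
def edit_dist_dp(str1, str2, m, n):
    store = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                store[i][j] = j
            elif j == 0:
                store[i][j] = i
            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]
            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                      store[i - 1][j - 1])
    return store[m][n]

# "kitten" -> "sitting": two substitutions plus one insertion
print(edit_dist_dp("kitten", "sitting", 6, 7))  # 3
```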

Just for a start: make a .pyx extension file, compile it with Cython, and of course write a test file (credits to mister sentdex on YouTube):
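For reference, a minimal build script might look like this (the module file name `edit_distance_cy.pyx` is an assumption, chosen to match the test file below):

```python
# setup.py - minimal Cython build script; the .pyx file name is an assumption
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("edit_distance_cy.pyx"))
```

Compile in place with `python setup.py build_ext --inplace`.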

import timeit

py = timeit.timeit('edit_distance.edit_dist_dp("asdsdhter", "dsfladrte", 9, 9)', 
  setup='import edit_distance', number=10000)

cy = timeit.timeit('edit_distance_cy.edit_dist_dp("asdsdhter", "dsfladrte", 9, 9)', 
  setup='import edit_distance_cy', number=10000)

print(f"Python: {py}, Cython: {cy}")
print(f"Cython is {py/cy} x faster than Python")

Just recompiling the pure Python code makes it about three times faster - not bad for a start. Let's get some more C into the Python and add type declarations, the essence of Cython performance:

def edit_dist_dp(str1, str2, m, n):
    store = [[0 for x in range(n + 1)] for x in range(m + 1)]
    cdef int i, j

    for i in range(m + 1):
        for j in range(n + 1):

            if i == 0:
                store[i][j] = j

            elif j == 0:
                store[i][j] = i

            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]

            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                      store[i - 1][j - 1])

    return store[m][n]

I gave Cython knowledge of the types of i and j, which are used heavily in the algorithm. The outcome of the tests:

Python: 0.7150410660033231, Cython: 0.1197679779943428
Cython is 5.970219068381515 x faster than Python

Not that bad, nearly six times faster. Let's give it more: the parameters get types, and so does the return value:

cpdef int edit_dist_dp(str str1, str str2, int m, int n):
    store = [[0 for x in range(n + 1)] for x in range(m + 1)]
    cdef int i, j

    for i in range(m + 1):
        for j in range(n + 1):

            if i == 0:
                store[i][j] = j

            elif j == 0:
                store[i][j] = i

            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]

            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                      store[i - 1][j - 1])

    return store[m][n]

Getting better, nine times faster:

(project_venv) $ python tests.py 
Python: 0.7169225399993593, Cython: 0.07916731700242963
Cython is 9.055789271946113 x faster than Python

Now get rid of the Python list and its comprehension and use a plain C array. The downside is that I've assumed a word is shorter than fifty characters - that is to be figured out later, though :). Finally, for now:

cpdef int edit_dist_dp(str str1, str str2, int m, int n):

    cdef int i, j, x, y
    cdef int store[50][50]

    for x in range(n + 1):
        for y in range(m + 1):
             store[x][y] = 0

    for i in range(m + 1):
        for j in range(n + 1):

            if i == 0:
                store[i][j] = j

            elif j == 0:
                store[i][j] = i

            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]

            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                      store[i - 1][j - 1])

    return store[m][n]

The tests:

(project_venv) $ python tests.py 
Python: 0.7155497479980113, Cython: 0.027351742995961104
Cython is 26.16102922960606 x faster than Python

That's not bad. We could still get rid of the Python string and use a C++ string, or maybe a vector, instead (Cython gives more than that, even direct inclusion of C++ modules in the code), but I think that, for such an easy refactoring, a 26x speedup is really nice.
I'm not putting the whole code on GitHub as I usually do, because the final snippet is what is really needed; the rest you can figure out from the sentdex tutorial, or any other - it's really easy, believe me. Thanks for reading.

Monday, 28 October 2019

Propositional Calculus Evaluation

Hi there, nothing new today: I've gathered, cleaned up a little, and made more usable some old code for parsing and evaluating logical expressions. I used an old tool, Reverse Polish Notation, in which tokenizing, parsing, and evaluation are very simple. The whole file is about 195 lines of code. Unfortunately, the time complexity of the algorithm is exponential (as with any other algorithm for this problem, though the resolution method does better on many types of formulas). How the tokenize, parse, and eval functions work is described in great detail in many places; what I add is the fill_values function, which creates the whole table of logical values for the algorithm to check:

import re
from itertools import product

def fill_values(formula):
    """yields the formula with all fillings of 0s and 1s"""
    letters = "".join(set(re.findall("[A-Za-z]", formula)))
    for digits in product("10", repeat=len(letters)):
        table = str.maketrans(letters, "".join(digits))
        yield formula.translate(table)


The function uses some nice Python features (that's why I like the language - not for its efficiency, but for the number of libraries that let you get things done quickly). The whole thing works something like this (asciinema screencast):

https://asciinema.org/a/TarY4YQQTbm3To5ZpBXTci3Y4?t=8
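Assuming fill_values as defined above (repeated here with its imports so the snippet runs on its own), a quick check on a two-variable formula yields the four rows of a truth table:

```python
import re
from itertools import product

def fill_values(formula):
    """yields the formula with all fillings of 0s and 1s"""
    letters = "".join(set(re.findall("[A-Za-z]", formula)))
    for digits in product("10", repeat=len(letters)):
        table = str.maketrans(letters, "".join(digits))
        yield formula.translate(table)

# two variables -> four substituted formulas (order may vary, so sort)
print(sorted(fill_values("p & q")))  # ['0 & 0', '0 & 1', '1 & 0', '1 & 1']
```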

It still lacks error handling - a check whether a formula is syntactically correct. For example, it happily accepts this:

$ ./eval.py 
Propositional Logic Parser, press help, /h, -h or -usage for help
> q ~ a
Formula is satisfiable


I would encourage anybody doing logic exercises to try it out, or just play with it for fun - it certainly will help :).

Code on GitHub. Runnable out of the box here:
https://repl.it/@lion137/propositionalcalculluseval

Thanks for your patience, bye for now :)

Thursday, 10 October 2019

Skew Heap In Java

Hello, I've just finished my Java implementation of the Skew Heap data structure, and also a heap sort using this type of heap. It's very simple: under the hood the structure is a binary tree, and heap order is maintained via a special merge function. Here is the basic structure:
 
public class SkewHeap <V extends Comparable<V>> {

    SkewNode root;
    public SkewHeap() {
        this.root = null;
    }

    class SkewNode {
        SkewNode leftChild, rightChild;
        V data;

        public SkewNode(V _data, SkewNode _left, SkewNode _right) {
            this.data = _data;
            this.leftChild = _left;
            this.rightChild = _right;
        }
    }
}
 
And here the crucial merge:

SkewNode merge(SkewNode left, SkewNode right) {
        if (null == left)
            return right;
        if (null == right)
            return left;
        if (left.data.compareTo(right.data) <= 0) {
            return new SkewNode(left.data, merge(right, left.rightChild), left.leftChild);
        } else
            return new SkewNode(right.data, merge(left, right.rightChild), right.leftChild);
    }


It does exactly what Wikipedia says: it maintains the structure and keeps the minimum in the root. The other operations, pop and insert, use this procedure. Sorting is done via three functions:


SkewHeap listToHeap(List list) {
        SkewHeap o = new SkewHeap();
        for (int i = 0; i < list.size(); i++)
            o.insert((Comparable) list.get(i));
        return o;
    }

List heapToList(SkewHeap tree) {
        List ll = new ArrayList();
        try {
            while (true) {
                ll.add(tree.pop());
            }
        }
        catch (IndexOutOfBoundsException e) {
            return ll;
        }
    }

List sort(List list) {
        return heapToList(listToHeap(list));
    }

The first, as can be seen, packs the list parameter into the heap; the second unloads the whole heap into a list by popping minimums, so the resulting list comes out sorted; and finally sort composes the two. Sorting performance is O(n log(n)), though it is slower than the standard Java sort. Code on GitHub. Thank you.
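For comparison, here is a minimal Python sketch of the same idea - merge swaps children as it descends and keeps the minimum at the root, and sorting pops that minimum repeatedly (a loose translation, not the code from the repository):

```python
# Skew heap sketch: a binary tree whose merge swaps children as it goes.
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def merge(a, b):
    if a is None:
        return b
    if b is None:
        return a
    if b.data < a.data:        # keep the smaller root in `a`
        a, b = b, a
    # merge `b` into a's right child; the children swap sides
    return Node(a.data, merge(b, a.right), a.left)

def skew_sort(xs):
    root = None
    for x in xs:               # insert = merge with a one-node heap
        root = merge(root, Node(x))
    out = []
    while root is not None:    # pop = take the root, merge its children
        out.append(root.data)
        root = merge(root.left, root.right)
    return out

print(skew_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```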
Sources: Wikipedia, What is Good About Haskell

Thursday, 18 July 2019

AVL Tree in Java

I have not blogged for some time, but there is something new: an AVL Tree in Java. Code on GitHub: https://github.com/lion137/Java .
Thanks, enjoy! 

Monday, 4 March 2019

Fundamental Algorithms: Parse to RPN

Another one: parsing to Reverse Polish Notation. Why is it so important? For example, on the first calculators one had to type expressions in RPN! And it's very easy to find the value of an expression in RPN. Dead simple - having one, like:
3 2 + 4 3 * +
it's enough to take an empty LIFO stack, go through the expression and:
1. When we see a number, push it onto the stack.
2. When we see an operator, pop the stack twice, perform the operation, and push the result.
3. When done, pop the stack - this is the result!
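The steps above can be sketched as a tiny Python evaluator (the operator set here is an assumption - just the four arithmetic operators):

```python
def eval_rpn(expr):
    """Evaluate a space-separated RPN expression, e.g. '3 2 + 4 3 * +'."""
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    stack = []
    for token in expr.split():
        if token in ops:
            b = stack.pop()    # pop twice: right operand comes off first
            a = stack.pop()
            stack.append(ops[token](a, b))
        else:
            stack.append(float(token))
    return stack.pop()         # when done, the top of the stack is the result

print(eval_rpn("3 2 + 4 3 * +"))  # 17.0
```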

Parsing is not as easy, but it is explained in detail, for example, here.
Code on github.
Whole Series.

Thank you!