Tuesday 11 August 2020

Comparing Cython to Java

 Hi, do you remember this one?

I've just made comparison of this code to Java, and Cython does, imo, well. The same edit distance loop is only, on average 25 percent faster on JVM.

Thanks.

Saturday 15 February 2020

How To Speed-up Python Edit Distance Algorithm 26 Times Using Cython

I had an idea for a Python project, but I need some speed and I've found out, that Cython is the best option; all the code could be compiled in a normal way, just some modules will be rewritten in Cython.  How we go, let's take some real function (what I need, not a dummy example), edit distance, pure Python code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
def edit_dist_dp(str1, str2, m, n):
    store = [[0 for x in range(n + 1)] for x in range(m + 1)]

    for i in range(m + 1):
        for j in range(n + 1):

            if i == 0:
                store[i][j] = j

            elif j == 0:
                store[i][j] = i

            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]

            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                      store[i - 1][j - 1])

    return store[m][n]

Just for a start, making  pyx  extension file, recompiling it through Cython, and of course a test file(credits to mister sentdex, youtube):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
import timeit

py = timeit.timeit('edit_distance.edit_dist_dp("asdsdhter", "dsfladrte", 9, 9)', 
  setup='import edit_distance', number=10000)

cy = timeit.timeit('edit_distance_cy.edit_dist_dp("asdsdhter", "dsfladrte", 9, 9)', 
  setup='import edit_distance_cy', number=10000)

print(f"Python and Cython times: {py}, Cython: {cy}")
print(f"Cython is {py/cy} x faster than Python")

Just recompiling pure Python code gives about three times faster code, not bad for a start, let's get some more C in Python, use more types hints, the essence of performance:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
def edit_dist_dp(str1, str2, m, n):
    store = [[0 for x in range(n + 1)] for x in range(m + 1)]
    cdef int i, j

    for i in range(m + 1):
        for j in range(n + 1):

            if i == 0:
                store[i][j] = j

            elif j == 0:
                store[i][j] = i

            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]

            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                      store[i - 1][j - 1])

    return store[m][n]

I gave to Cython a knowledge about a type of i and j, they are used a lot in the algorithm, the outcome of tests:

1
2
Python: 0.7150410660033231, Cython: 0.1197679779943428
Cython is 5.970219068381515 x faster than Python

Not that bad, nearly six times faster, let's give more, parameters have types, also returning value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
cpdef int  edit_dist_dp(str str1, str str2, int m, int n):
    store = [[0 for x in range(n + 1)] for x in range(m + 1)]
    cdef int i, j

    for i in range(m + 1):
        for j in range(n + 1):

            if i == 0:
                store[i][j] = j

            elif j == 0:
                store[i][j] = i

            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]

            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                      store[i - 1][j - 1])

    return store[m][n]

Getting better, nine times faster:

1
2
3
(project_venv) $ python tests.py 
Python: 0.7169225399993593, Cython: 0.07916731700242963
Cython is 9.055789271946113 x faster than Python

Get rid of Python array, list comprehension, the downside is, I've assumed, that the word is shorter than fifty characters, this to be figured out later, though:), finally, for now:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
cpdef int  edit_dist_dp(str str1, str str2, int m, int n):

    cdef int i, j, x, y
    cdef int store[50][50]

    for x in range(n + 1):
        for y in range(m + 1):
             store[x][y] = 0

    for i in range(m + 1):
        for j in range(n + 1):

            if i == 0:
                store[i][j] = j

            elif j == 0:
                store[i][j] = i

            elif str1[i - 1] == str2[j - 1]:
                store[i][j] = store[i - 1][j - 1]

            else:
                store[i][j] = 1 + min(store[i][j - 1], store[i - 1][j],
                                      store[i - 1][j - 1])

    return store[m][n]

The tests:

1
2
3
(project_venv) $ python tests.py 
Python: 0.7155497479980113, Cython: 0.027351742995961104
Cython is 26.16102922960606 x faster than Python

That's not bad, still, we can get rid of Python string, use C++ instead, or maybe vector, (Cython gives more than that, even including direct C++ modules to code) but, I think, as for a real easy refactoring, 26 times speedup is really nice. 
I'm not putting whole code to GitHub, as usually do, because the final snippet is what is really needed, the rest you can figure out from sentdex tutorial, or any other, it's really easy, believe me. Thanks for reading.