Monday, 23 January 2017

Text Segmentation (Maximum Matching) in Python

Today another algorithm in the set Algorithms in Python, part one here - maximum matching - it's a text segmentation algorithm - separates word in a  text, with laguages with no clear word separator, like Chinesse. It's simple, that's why works only for short words texts, again, an example is  Chinesse. This is a code:  

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
D = ['danny', 'condo', 'a', 'the','to','has', 'been', 'unable', 'go', 'at']


def max_match(sentence, dictionary):
    if not sentence:
        return ""
    for i in range(len(sentence), -1, -1):
        first_word = sentence[:i]
        remainder = sentence[i:]
        if first_word in dictionary:
            return first_word + " " + max_match(remainder, dictionary)
    first_word = sentence[0]
    remainder = sentence[1:]
    return first_word + max_match(remainder, dictionary)

print(max_match('atcondogo', D))

Let's go through it, dictionary is a simple Python list, algorithm starts at the beginning of a input string and try to match, with the dictionary, the longest word - line 8 (in that case this is a whole string and the reminder is empty), if it succeed returns first word with added space bar separator and recursive call from the remainder; if not, pointer goes one char backwards and  so on in this loop. If no word matches, pointer advances one character (creates one character word), and is recursively applied to the rest of the string - lines 12 - 14.
Line 16 corectrly prints:  

1
"at condo go" 
 But if we add the word 'atcon' to the dictionary, algorithm incorrectly segments to:

1
"atcon dogo"

Not very usefull to English though. That's it, code of course goes to github, together with: Extended Euclicid, modular exponeneation and the longest palindrome finder. Enjoy!

No comments: