Friday 3 February 2017

Python Inverted Index Algorithm

As a continuation of interesting algorithms, today another text processing tool: Inverted Index.  
How it works, briefly. Let's say, that we have a list of documents, maybe large: thousands books, articles or so and we want effectively query this list, looking for information. It's done in the way, that the corpora is saved on disk (or we can use database) and we create a special data structure to do search and then retrieve matched documents (the database part is not the part of this article, I hope I will implement it soon in this project).  
I used Python's dicts and sets, the best would be  go through this:  
I have 2 tests documents:


1
2
3
4
5
d1 = ['i', 'did', 'enact', 'julius', 'caesar', 'i', 'was', 'kill' , 'i', 'the', 'capitol',  'brutus','me']
d1 = set(d1)
d2 = ['i', 'so', 'let', 'it', 'be', 'with' , 'caesar', 'the', 'noble', 'brutus', 'hath', 'told', 'you', 
      'caesar', 'was', 'ambitious']
d2 = set(d2)

After tokenization, we may have:

d1 = ['i', 'did', 'enact', 'julius', 'caesar', 'i', 'was', 'kill' , 'i', 'the', 'capitol',  'brutus','me']
d1 = set(d1)
d2 = ['i', 'so', 'let', 'it', 'be', 'with' , 'caesar', 'the', 'noble', 'brutus', 'hath', 'told', 'you', 
      'caesar', 'was', 'ambitious']
d2 = set(d2)

Now create a, to be searched, documents list.


1
2
3
doc_list = []
doc_list.append(d1)
doc_list.append(d2)

For really big corporas, this (doc_list) should be a function-  fetched and prepared document one by one to create the inverted list structure.


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
tokens = d1.union(d2)
inv_index = {}
for x in tokens:
    tmp, tmp1 = find(x, doc_list)
    inv_index[(x, tmp1)] = tmp 

def find(elem, d_list):
    l_return = []
    i = 0
    for x in d_list:
        if elem in x:
            l_return.append(i)
        i += 1
    return (l_return, len(l_return))

def find_key(x, d):
    for y in d.keys():
        if x in y:
            return y
    return "word not in a texts"

Tokens is a set of the all unique words in the all documents, as a keys in the inv_index dict there are tuples: token, frequency: in how many documents the word is present - this is the length of an actual value of the key. Values: as 've said, for every key word it is  a list of documents which contains this word, it's implemented here as a list of integers - zero for the first document, one for the next and so on. The functions find and find_key are just helping to create the dict. Searches, in the documents, are now boolean queries on our dict values, for example:

1
('cat' AND 'lion') AND NOT 'mouse'

In fact I'll do it in slightly different manner, I will parse it and use own, modified Propositional Logic Parser. Here are boolean functions:  


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def NOT(l1):
    indexes = l1
    ret_list = []
    for x in range(len(doc_list)):
        if x not in indexes:
            ret_list.append(x)
    return ret_list

def OR(l1, l2):
    tmp0 = set(l1)
    tmp1 = set(l2)
    indexes = tmp0.union(tmp1)
    ret_list = []
    for x in indexes:
        ret_list.append(x)
    return ret_list

def AND(l1, l2):
    tmp0 = set(l1)
    tmp1 = set(l2)
    indexes = tmp0.intersection(tmp1)
    ret_list = []
    for x in indexes:
        ret_list.append(x)
    return ret_list

This is the parser part:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
from porpositional_logic_eval import *
import re
def query_tree_evaluate(tree):
    opers = {'||': OR, '&&': AND, '~': NOT} 
    leftT = tree.getLeftChild()
    rightT = tree.getRightChild()
    #pdb.set_trace()
    if leftT and not rightT:
        fn = opers[tree.getRootVal()]
        return fn(query_tree_evaluate(leftT))
    elif leftT and rightT:
        fn = opers[tree.getRootVal()]
        return fn(query_tree_evaluate(leftT), query_tree_evaluate(rightT))
    else:
        return tree.getRootVal()
    
def build_query_parse_tree(exp):
    exp_list = exp.replace('(', ' ( ').replace(')', ' ) ').replace('~', ' ~ ').split()
    e_tree = BinaryTree('')
    current_tree = e_tree
    for token in exp_list:
        if token == '(':
                current_tree.insertLeft('')
                current_tree = current_tree.getLeftChild()
        elif token in ['||','&&', '->', '==', 'XR']:
            if current_tree.getRootVal() == '~':
                current_tree.getParent().setRootVal(token)
                current_tree.insertRight('')
                current_tree = current_tree.getRightChild()
            else:
                current_tree.setRootVal(token)
                current_tree.insertRight('')
                current_tree = current_tree.getRightChild()
        elif token == '~':
            current_tree.setRootVal('~')
            current_tree.insertLeft('')
            current_tree = current_tree.getLeftChild()
        elif token == ')':
            current_tree = current_tree.getParent()
        elif re.search('[a-zA-z]', token):
            current_tree.setRootVal(inv_index[find_key(token, inv_index)])
            current_tree = current_tree.getParent()
            if current_tree.getRootVal() == '~':
                current_tree = current_tree.getParent()
        else:
            raise ValueError
    return e_tree

Finally, checking how it works, query is:

1
exp = "((i && caesar) && ~julius)"

Making a tree:


1
2
3
4
5
6
7
8
9
tr = build_query_parse_tree(exp)
inorder_traversal(tr)
# output:
[0, 1]
&&
[0, 1]
&&
[0]
~

As seen parser does the job, and asking it:


1
2
3
query_tree_evaluate(tr)
# output:
[1]

Gives as what we expected: the second document (contains 'caesar' and 'i', but not 'julius').     Coplexity, as seen is linear, so it should be pretty quick.  
Hopefully updates here soon! Code, as usually on github. Thank you!


No comments: