Implementing a Search Engine with Ranking in Python (Part 1)
Looking through my news feed, I stumbled upon a recommendation on Typical Programmer of the article "Implementing a Search Engine with Ranking in Python" by Aakash Japi. It interested me, since there is not much material on this topic on the Runet, and I decided to translate it. As it is quite long, I will split it into 2-3 parts. With that, I end my introduction and move on to the translation.
Every time I use Quora, I eventually come across at least one question like this one: someone asks how Google works and how they could beat it at search. Most of the questions are not as bold and misguided as that one, but they all express a similar sentiment, and in doing so they betray a significant misunderstanding of how search engines work.
But while Google is incredibly complex, the basic concept of a search engine, which looks for matches and scores (ranks) the results relative to the search query, is not particularly difficult, and anyone with basic programming experience can understand it. I don't think it is currently possible to beat Google at search, but building a search engine is an achievable goal, and in fact it is a very instructive exercise that I recommend trying.
That is what I will describe in this article: how to build a search engine for local text files that can handle standard queries (at least one of the words in the query appears in the document) and phrase queries (the entire phrase appears in the text), and that can rank results using a basic TF-IDF scheme.
There are two main stages in building a search engine: building the index, and then using the index to answer queries. On top of that we can add result ranking (TF-IDF, PageRank, etc.), query/document classification, and perhaps some machine learning that tracks the user's recent queries and selects results based on them to improve the engine's quality.
So, without further ado, let's begin!
Building the index
So, the first step in building a text search engine is assembling an inverted index. Let me explain what that is. An inverted index is a data structure that maps tokens to the documents in which they appear. In this context we can think of a token simply as a word, so an inverted index, at its core, is something that takes a word and returns the list of documents in which it occurs.
First, however, we need to parse and tokenize (split into words) our corpus of documents. We will do it as follows: for each document we want to add to our index, we strip all punctuation, split the text on whitespace, and build a temporary hash table that maps document names to lists of tokens. We will keep transforming that hash table until we finally arrive at the inverted index described above (with one slight complication, which I will explain later). Here is the code that performs the initial filtering of the text:
import re

def process_files(self):
    file_to_terms = {}
    for file in self.filenames:
        # Replace each run of non-word characters (punctuation, underscores) with a space.
        pattern = re.compile(r'[\W_]+')
        file_to_terms[file] = open(file, 'r').read().lower()
        file_to_terms[file] = pattern.sub(' ', file_to_terms[file])
        # Split on whitespace to obtain the list of tokens.
        file_to_terms[file] = file_to_terms[file].split()
    return file_to_terms
There are two things I did not do here that I nevertheless recommend: remove all stop words ("as", "and", "to", and so on, words that add nothing to a document's relevance) and stem all words (so that "running" and "runs" both become "run") using an external library (although this will slow down indexing).
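As a rough sketch of those two steps, here is how they might look with the NLTK library (the helper name normalize_terms is mine, not part of the original code; note also that filtering tokens shifts their positions, which affects the phrase queries described below):

from nltk.corpus import stopwords       # requires a prior nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer

def normalize_terms(terms):
    # Hypothetical helper, not from the original article.
    stop = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    # Drop stop words, then reduce each remaining word to its stem.
    return [stemmer.stem(term) for term in terms if term not in stop]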
Now, I said above that the inverted index maps a word to document names, but we also want to support phrase queries: queries not just for words, but for words in a specific order. For this we need to know where in the document each word appears, so that we can check the order of the words. So in each word's entry I store, for every document, the positions at which the word occurs in that document, and our final inverted index will look like this:
{word: {documentID: [pos1, pos2, ...], ...}, ...}
instead of this:
{word: [documentID, ...], ...}
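For instance, in a made-up two-document corpus the positional index might contain an entry like

{'cat': {'doc1.txt': [1, 7], 'doc2.txt': [3]}}

while the simpler index would only record

{'cat': ['doc1.txt', 'doc2.txt']}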
Thus, our first task is to create a mapping of words to their positions for each document, and then to merge these mappings into our complete inverted index. Here is what that looks like:
#input = [word1, word2, ...]
#output = {word1: [pos1, pos2], word2: [pos2, pos434], ...}
def index_one_file(termlist):
    fileIndex = {}
    for index, word in enumerate(termlist):
        if word in fileIndex.keys():
            fileIndex[word].append(index)
        else:
            fileIndex[word] = [index]
    return fileIndex
This code is quite straightforward: it takes a document's list of terms (in which the words appear in their original order, as produced by splitting on whitespace) and adds each one to a hash table whose values are lists of the word's positions in the document. We build these lists up as we walk through the term list, until we have processed every word, leaving us with a table keyed by word, where each word maps to the list of its positions.
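To illustrate, a hypothetical run on an invented term list:

>>> index_one_file(['the', 'cat', 'sat', 'on', 'the', 'mat'])
{'the': [0, 4], 'cat': [1], 'sat': [2], 'on': [3], 'mat': [5]}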
Now we need to combine these hash tables. I started by creating an intermediate index of the format
{documentID: {word: [pos1, pos2, ...], ...}, ...}
which we then transform into our final index. That is done here:
#input = {filename: [word1, word2, ...], ...}
#res = {filename: {word: [pos1, pos2, ...], ...}, ...}
def make_indices(termlists):
    total = {}
    for filename in termlists.keys():
        total[filename] = index_one_file(termlists[filename])
    return total
This code is very simple: it just takes the file_to_terms result from above and creates a new hash table keyed by file name, whose values are the output of the previous function, giving us a nested hash table.
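A quick hypothetical run (the file name and terms are invented):

>>> make_indices({'doc1.txt': ['the', 'cat', 'sat']})
{'doc1.txt': {'the': [0], 'cat': [1], 'sat': [2]}}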
Then, we can actually build our inverted index. Here is the code:
#input = {filename: {word: [pos1, pos2, ...], ...}, ...}
#res = {word: {filename: [pos1, pos2, ...], ...}, ...}
def fullIndex(regdex):
    total_index = {}
    for filename in regdex.keys():
        for word in regdex[filename].keys():
            if word in total_index.keys():
                if filename in total_index[word].keys():
                    total_index[word][filename].extend(regdex[filename][word][:])
                else:
                    total_index[word][filename] = regdex[filename][word]
            else:
                total_index[word] = {filename: regdex[filename][word]}
    return total_index
So, let's walk through it. First we create a hash table (a Python dictionary), and we use two nested loops to iterate over every word in the input hash table. Then we check whether the word is already present as a key in the output hash table. If it is not, we add it, setting as its value another hash table that maps the document (identified, in this case, by the variable filename) to the word's list of positions.
If the word is already a key, we perform another check: whether the current document is present in that word's hash table (the one that maps file names to position lists). If it is, we extend the existing position list with the new positions (note that this case is left in only for completeness: it will never occur, because each word gets exactly one list of positions per file). If it is not, we set the position list for this file equal to the new list.
And now we have the index: we can put in a word and get back the list of documents in which it occurs. In the next article I will show how to query this index.
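As an end-to-end sketch (the two file names and their contents are invented for the example):

termlists = {'doc1.txt': ['search', 'engines', 'rank', 'results'],
             'doc2.txt': ['engines', 'index', 'documents']}
index = fullIndex(make_indices(termlists))
print(index['engines'])   # {'doc1.txt': [1], 'doc2.txt': [0]}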
All the code used in every part of this series (the implementation) is available on GitHub.
P.S. This part ends here. I hope my translation is clear enough and, most importantly, correct. I welcome comments and advice regarding both the formatting and the translation.