On removing irrelevant parts of pages when indexing a site

The question of separating relevant, useful content from the rest of the eye candy comes up quite often for anyone who collects information on the Web.

I see no reason to dwell on the algorithm for parsing HTML into a tree, especially since writing such parsers is taught around the third or fourth year of university. A regular stack, a few tricks for handling attributes (keeping only the ones that will be needed later), and the output of the parse is a tree. The text is split into words during parsing, and the words go into a separate list where, in addition to general information, the position of each word in the document is stored. Of course, the words in that list are already reduced to the first (dictionary) normal form; I have already written about morphology, so the excerpt below is simply copied from the previous article.


Based on Zaliznyak's morphological dictionary we pick the most suitable stem, cut off the ending and substitute the first dictionary form. The whole thing is compiled into a tree for fast letter-by-letter matching; the leaves at the bottom contain the endings. We walk along the word, descending the tree in parallel on each matching letter, until we reach the deepest reachable leaf – there, based on the endings it stores, we substitute the normalized form.
If no normal form is found, stemming is applied: from the texts of books downloaded from lib.ru I built a table of ending frequencies, and the most frequent suitable ending is looked up (also via a tree) and replaced to produce a normal form. Stemming works well even for words that entered the language only 5-10 years ago – it easily handles words like "crawler" in their various inflected forms.
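
To make the ending-tree idea concrete, here is a heavily simplified sketch in Python. It is not the article's actual implementation: the ending tables are tiny illustrative stand-ins for the Zaliznyak dictionary and the lib.ru frequency table, and the names (DICTIONARY_ENDINGS, FREQ_ENDINGS, normalize) are mine.

```python
# A heavily simplified sketch of the ending-tree idea, not the real
# Zaliznyak-based normalizer. The ending tables below are tiny illustrative
# stand-ins; a real dictionary holds tens of thousands of stems and endings.

# known dictionary endings: ending -> replacement giving the 1st dictionary form
DICTIONARY_ENDINGS = {
    "ами": "а",   # e.g. "страницами" -> "страница"
    "ой": "а",    # e.g. "водой"      -> "вода"
}

# fallback table built from ending frequencies (the lib.ru statistics)
FREQ_ENDINGS = {
    "ером": "ер",  # e.g. "краулером" -> "краулер"
}

def build_tree(endings):
    """Compile endings into a letter tree, keyed from the last letter inward."""
    root = {}
    for ending, repl in endings.items():
        node = root
        for ch in reversed(ending):
            node = node.setdefault(ch, {})
        node["$"] = (len(ending), repl)    # leaf: how much to cut, what to add
    return root

def longest_match(tree, word):
    """Descend the tree letter by letter from the end of the word."""
    node, best = tree, None
    for ch in reversed(word):
        if ch not in node:
            break
        node = node[ch]
        if "$" in node:
            best = node["$"]               # deepest leaf reached so far
    return best

DICT_TREE = build_tree(DICTIONARY_ENDINGS)
STEM_TREE = build_tree(FREQ_ENDINGS)

def normalize(word):
    hit = longest_match(DICT_TREE, word)       # dictionary form first
    if hit is None:
        hit = longest_match(STEM_TREE, word)   # then frequency-based stemming
    if hit is None:
        return word                            # unknown word, leave as is
    cut, repl = hit
    return word[:len(word) - cut] + repl

# during parsing each extracted word goes into the word list with its position
text = "краулером обошли все страницы"         # illustrative parser output
words = [(normalize(w), pos) for pos, w in enumerate(text.split())]
```

The deepest reachable leaf wins, which corresponds to matching the longest known ending, as described above.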


After much experimenting with HTML parsing I noticed that identical blocks in the HTML correspond to identical subtrees – roughly speaking, if you take two pages and XOR their two trees, what is left is exactly what you need. Or, to put it more simply: intersecting many of these trees within a single site gives a probabilistic model – the more trees a block occurs in, the lower its importance. Everything that occurs on more than 20-30% of the pages gets thrown away; there is no point wasting time on duplicate content.

The solution follows naturally: learn to compute some kind of CRC over each subtree, and then counting the number of repetitions of every subtree becomes easy. On a second parse, dropping the tree nodes that occurred too often is trivial, and from the remaining tree you can always reassemble the text of the page (although in practice this is not needed anywhere).
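
A minimal sketch of such subtree fingerprinting, assuming lxml is available for the HTML parsing; the function name subtree_hash and the pages variable are illustrative assumptions, not part of the article.

```python
# A minimal sketch of per-subtree fingerprinting, assuming lxml is installed.
# The fingerprint of a node depends on its tag, its text and the fingerprints
# of its children, so identical blocks on different pages hash to the same value.
import zlib
from lxml import html

def subtree_hash(node, counter):
    """Return a CRC32 fingerprint of the subtree and record it in `counter`."""
    tag = node.tag if isinstance(node.tag, str) else "#special"  # comments etc.
    parts = [tag, (node.text or "").strip()]
    for child in node:
        parts.append(str(subtree_hash(child, counter)))
        parts.append((child.tail or "").strip())
    h = zlib.crc32("|".join(parts).encode("utf-8"))
    counter[h] = counter.get(h, 0) + 1
    return h

# first pass over all downloaded pages: collect subtree statistics
stats = {}                       # fingerprint -> number of occurrences
for raw in pages:                # `pages` is assumed: an iterable of raw HTML
    subtree_hash(html.fromstring(raw), stats)
```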

So, two passes over all the pages – first collect the statistics, then index – and the problem of site-wide templates is solved easily. On top of that we get a number of bonuses: constructs like <td></td>, <b></b> and other noise are thrown out first of all.
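
Continuing the sketch above (it reuses subtree_hash, stats and pages from there; the 25% threshold is just an illustrative value inside the 20-30% range mentioned above), the second pass might look like this:

```python
# Second pass: drop subtrees that occur on too many pages, index the rest.
THRESHOLD = 0.25                        # illustrative value in the 20-30% range

def prune(node, stats, limit):
    """Remove child subtrees whose fingerprint occurs more than `limit` times."""
    for child in list(node):
        if stats.get(subtree_hash(child, {}), 0) > limit:
            node.remove(child)          # whole block is duplicate/template content
        else:
            prune(child, stats, limit)

limit = THRESHOLD * len(pages)
for raw in pages:
    tree = html.fromstring(raw)
    prune(tree, stats, limit)
    # what remains is page-specific content, ready for word splitting
    # and normalization as described earlier
    page_text = " ".join(t.strip() for t in tree.itertext() if t.strip())
```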

The full table of contents and the list of my articles will be kept up to date here: http://habrahabr.ru/blogs/search_engines/123671/
Article based on information from habrahabr.ru
