Write a search plugin for Elasticsearch
Elasticsearch is a popular search server and NoSQL database. One of its interesting features is support for plugins, which can extend the built-in functionality and add business logic at the search level. In this article I want to talk about how to write such a plugin and tests for it.
I just want to mention that the task in this article has been greatly simplified so as not to clutter the code. For example, in one real application the document stores a full schedule with exceptions, and the script calculates the desired values based on it. But I would like to focus on the plugin, so the example is very simple.
I should also mention that I am not an Elasticsearch committer; the information presented here was obtained mainly by trial and error, and some of it may be wrong.
So, suppose we have an Event document with start and stop properties that store times as strings in the "HH:MM:SS" format. The task: for a given time, sort the events so that active events (start <= time <= stop) appear at the top of the results. An example of such a document:
{
"start": "09:00:00",
"stop": "18:30:00"
}
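Later in the article the mapping.json file is mentioned; presumably it declares start and stop as not_analyzed strings, roughly like this (my reconstruction, not the exact file from the repository):
{
  "event": {
    "properties": {
      "start": { "type": "string", "index": "not_analyzed" },
      "stop": { "type": "string", "index": "not_analyzed" }
    }
  }
}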
Plugin
As a basis I took an example from one of the Elasticsearch developers. The plugin consists of one or more scripts that need to be registered:
public class ExamplePlugin extends AbstractPlugin {
public void onModule(ScriptModule module) {
module.registerScript(EventInProgressScript.SCRIPT_NAME, EventInProgressScript.Factory.class);
}
}
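One detail not shown here: in Elasticsearch 1.x a JAR plugin is discovered through an es-plugin.properties file on the classpath that names the plugin class. It looks roughly like this (the package is just a placeholder):
# src/main/resources/es-plugin.properties
plugin=your.package.ExamplePlugin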
The script consists of two parts: a factory implementing NativeScriptFactory, and the script itself, which extends AbstractSearchScript. The factory creates the script (and validates its parameters). It is worth noting that the script is created only once per search request (on each shard), so parameter initialization and processing should be done at this stage.
The client application must pass the following parameters to the script:
- time — a string in "HH:MM:SS" format, the time we are interested in
- use_doc — determines which method is used to access the document (more on this later)
public static class Factory implements NativeScriptFactory {
@Override
public ExecutableScript newScript(@Nullable Map<String, Object> params) {
LocalTime time = params.containsKey(PARAM_TIME)
? new LocalTime(params.get(PARAM_TIME))
: null;
Boolean useDoc = params.containsKey(PARAM_USE_DOC)
? (Boolean) params.get(PARAM_USE_DOC)
: null;
if (time == null || useDoc == null) {
throw new ScriptException("Parameters \"time\" and \"use_doc\" are required");
}
return new EventInProgressScript(time, useDoc);
}
}
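For context, here is a minimal sketch of how the script class itself might look around its constructor; the field names and the EventParser helper are my guesses based on the run() method below, not the repository's exact code:
public class EventInProgressScript extends AbstractSearchScript {
    public static final String SCRIPT_NAME = "in_progress";

    private final LocalTime time;
    private final boolean useDoc;
    // hypothetical helper that turns document data into an Event (sketched further below)
    private final EventParser parser = new EventParser();

    public EventInProgressScript(LocalTime time, boolean useDoc) {
        this.time = time;
        this.useDoc = useDoc;
    }

    // the Factory shown above is a nested class of this script (EventInProgressScript.Factory)
    // run() is shown and discussed below
}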
So, the script is created and ready to go. The most important thing in the script is the run() method:
@Override
public Integer run() {
Event event = useDoc
? parser.getEvent(doc())
: parser.getEvent(source());
return event.isInProgress(time)
? 1
: 0;
}
This method is called for each document, so pay special attention to what happens inside it and how fast it is. This has a direct impact on the performance of the plugin.
In the general case, the algorithm is:
- Read the needed data from the document
- Compute the result
- Return it to Elasticsearch
To access the document's data, use one of the methods source(), fields(), or doc(). Looking ahead, I will say that doc() is much faster than source() and, if possible, you should use it.
In this example I create a model from the document's data for further work:
public class Event {
public static final String START = "start";
public static final String STOP = "stop";
private final LocalTime start;
private final LocalTime stop;
public Event(LocalTime start, LocalTime stop) {
this.start = start;
this.stop = stop;
}
public boolean isInProgress(LocalTime time) {
return (time.isAfter(start) || time.isEqual(start))
&& (time.isBefore(stop) || time.isEqual(stop));
}
}
(In trivial cases you can, of course, use the document's data directly and return the result right away, which would be faster.)
In our case the result is "1" for events currently in progress (start <= time <= stop) and "0" for all others. The result type is Integer, because Elasticsearch does not know how to sort by Boolean.
After the script has run, each document has a computed value by which Elasticsearch sorts them. Mission accomplished!
Integration tests
Besides the fact that tests are good in themselves, they are also a great entry point for debugging. It is very convenient to set a breakpoint and step through the evaluation. Without this, debugging the plugin would be very difficult.
The integration test scheme for the plugin looks something like this:
- Start a test cluster
- Create the index and mapping
- Add a document
- Ask the server to compute the script value for the given parameters and document
- Make sure the value is correct
To run the test servers, use the base class ElasticsearchIntegrationTest. You can configure the number of nodes, shards, and replicas. More info is on GitHub.
There are, perhaps, two ways to create test documents. The first is to build the document directly in the test; an example can be found in the repository. This option is quite good, and at first that is what I used. However, the document schema changes, and over time it may turn out that the structure built in the test no longer matches reality. Hence the second approach: store the mapping and data separately as resources. In addition, this method makes it possible, in case of unexpected results on a live server, to simply copy the problematic document into a resource and see how the test fails. In general, either method is fine; the choice is yours.
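A minimal sketch of the second approach, assuming the mapping is stored as a classpath resource (the resource path and constants here are illustrative):
// Create the test index with the mapping loaded from src/test/resources/mapping.json
String mapping = Streams.copyToStringFromClasspath("/mapping.json");
client().admin().indices()
    .prepareCreate(TEST_INDEX)
    .addMapping(TEST_TYPE, mapping)
    .execute().actionGet();
ensureGreen(TEST_INDEX);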
To query the result of the script calculation, use the standard Java client:
SearchResponse searchResponse = client()
.prepareSearch(TEST_INDEX).setTypes(TEST_TYPE)
.addScriptField(scriptName, "native", scriptName, scriptParams)
.execute()
.actionGet();
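The rest of the test might then index a document, refresh the index, and assert the value returned by the script; a sketch assuming the script field comes back as an integer (field values and names are illustrative):
// Index a single test event and make it visible to search
client().prepareIndex(TEST_INDEX, TEST_TYPE)
    .setSource("start", "09:00:00", "stop", "18:30:00")
    .execute().actionGet();
refresh();

// ...run the search shown above, then check the computed script field
SearchHit hit = searchResponse.getHits().getAt(0);
Integer inProgress = hit.field(scriptName).getValue();
assertEquals(1, inProgress.intValue());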
Integration with Travis CI
An optional part of the program is integration with the Travis continuous integration system. Add a .travis.yml file:
language: java
jdk:
- openjdk7
- oraclejdk7
script:
- mvn test
and the CI server will test your code after each change. A trifle, but nice.
Usage
So, the plugin is ready and tested. It's time to try it out.
Installation
Plugin installation is described in the official documentation. The built plugin ends up in ./target. To make local installation easier, I wrote a small script that builds the plugin and installs it:
mvn clean package
if [ $? -eq 0 ]; then
plugin -r plugin-example
plugin --install plugin-example --url file://`pwd`/`ls target/*.jar | head -n 1`
echo -e "\033[1;33mPlease restart Elasticsearch!\033[0m"
fi
The script is written for Mac/brew. On other systems you may have to correct the path to the plugin tool; on Ubuntu it is /usr/share/elasticsearch/bin/plugin. After installing the plugin, do not forget to restart Elasticsearch.
Test data
A simple generator of test documents is written in Ruby:
bundle install
./generate.rb
Test request
Ask Elasticsearch to sort all events according to the result of the script "in_progress":
curl -XGET "http://localhost:9200/demo/event/_search?pretty" -d'
{
"sort": [
{
"_script": {
"script": "in_progress",
"params": {
"time": "15:20:00",
"use_doc": true
},
"lang": "native",
"type": "number",
"order": "desc"
}
}
],
"size": 1
}'
The result:
{
"took" : 139,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"hits" : {
"total" : 86400,
"max_score" : null,
"hits" : [ {
"_index" : "demo",
"_type" : "event",
"_id" : "AUvf6fPPoRWAbGdNya4y",
"_score" : null,
"_source":{"start":"07:40:01","stop":"15:20:02"},
"sort" : [ 1.0 ]
} ]
}
}
You can see that the server computed the value for 86400 documents in 139 milliseconds. Of course, this cannot compete with a simple sort (2 ms), but it is still not bad for a laptop. In addition, the scripts run in parallel on different shards and therefore scale.
Methods source() and doc()
As I wrote at the beginning, a script has several ways to access the contents of a document: source(), fields(), and doc(). source() is the easy and slow way: on each request the whole document is loaded into a HashMap, but then everything in it is available. doc() gives access to the indexed data; it is much faster, but a bit harder to work with. First, it does not support the Nested type, which imposes restrictions on the document structure. Second, the indexed data may differ from what is in the document itself; this primarily concerns strings. As an experiment, you can try removing "index": "not_analyzed" from mapping.json and see how everything breaks. As for fields(), to be honest I have never tried it; judging by the documentation, it is only slightly better than source().
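To make the difference concrete, here is a minimal sketch of what the parser used in run() could look like for both access paths. This is my reconstruction, not the repository's code, and it assumes start and stop are indexed as not_analyzed strings:
public class EventParser {

    // doc(): read the values from the indexed fields (fast, uses field data)
    public Event getEvent(DocLookup doc) {
        ScriptDocValues.Strings start = (ScriptDocValues.Strings) doc.get(Event.START);
        ScriptDocValues.Strings stop = (ScriptDocValues.Strings) doc.get(Event.STOP);
        return new Event(new LocalTime(start.getValue()), new LocalTime(stop.getValue()));
    }

    // source(): read the values from the original JSON document (slower, loads _source)
    public Event getEvent(Map<String, Object> source) {
        return new Event(
            new LocalTime((String) source.get(Event.START)),
            new LocalTime((String) source.get(Event.STOP)));
    }
}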
Now let's try using source() by changing the use_doc parameter to false.
Request
curl -XGET "http://localhost:9200/demo/event/_search?pretty" -d'
{
"sort": [
{
"_script": {
"script": "in_progress",
"params": {
"time": "15:20:00",
"use_doc": false
},
"lang": "native",
"type": "number",
"order": "desc"
}
}
],
"size": 1
}'
And here "took" is already 587 milliseconds, i.e. four times slower. In a real application with large documents, the difference can be hundreds of times.
Other applications of the script
The plugin's script can be used not only for sorting but in general anywhere scripts are supported. For example, you can compute a value for the documents that were found. In this case, by the way, performance is not as important, since the calculation is performed on an already filtered and limited set of documents.
curl -XGET "http://localhost:9200/demo/event/_search" -d'
{
"script_fields": {
"in_progress": {
"script": "in_progress",
"params": {
"time": "00:00:01",
"use_doc": true
},
"lang": "native"
}
},
"partial_fields": {
"properties": {
"include": ["*"]
}
},
"size": 1
}'
Result
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 86400,
"max_score": 1,
"hits": [
{
"_index": "demo",
"_type": "event",
"_id": "AUvf6fO9oRWAbGdNyUJi",
"_score": 1,
"fields": {
"in_progress": [
1
],
"properties": [
{
"stop": "00:00:02",
"start": "00:00:01"
}
]
}
}
]
}
}
That's all, thanks for reading!
Source code on GitHub: github.com/s12v/elaticsearch-plugin-demo
P.S. By the way, we really need experienced programmers and system administrators to work on a major project based on AWS/Elasticsearch/Symfony2 in Berlin. If you are interested, write!