Everything you need to know about Painless contexts

Welcome to the fourth part of the workshop. As usual, to keep each article as compact as possible, I will shorten the queries to snippets. If you want to see the complete code, please consult my GitHub page for this workshop.

This part is different from the others: we need to look at Painless contexts, right in the second part of the scripting workshop. As soon as you write scripts (inline or stored, called by pipelines, search templates, or updates), you will struggle with contexts every time you read, create, or update fields. It therefore makes sense to get a clear picture right from the very start. With that said: let’s jump right in!


Be prepared for confusion!

A context defines which variables and fields, classes and methods are available to a script, and what kind of values it can return. In other words: your script may run perfectly in a pipeline but fail in a runtime field. Even worse: your stored script works flawlessly in your pipeline, but fails when you use it for an update. To get an idea of how many contexts exist, please visit the official documentation.

To make it short: a context sets the boundaries within which your script operates.

We will not cover all contexts; there are simply too many of them. But we will cover the ones we have used in the workshop so far:

  • ingest-processor (pipelines)
  • update
  • update_by_query
  • reindex
  • runtime fields
  • fields

Table of de-confusion

Please use the following table as a short reference.

Please consult the official documentation for the shared API, whose classes are available in all contexts (at least in the ones listed above), for the specialized ingest API for ingest processors, and for the specialized field API for runtime fields/runtime mappings.

The challenge

To show you how to handle the different contexts, we will solve the same challenge in all the contexts. This is our data:

PUT companies/_doc/1
{ "ticker_symbol" : "ESTC",
  "market_cap" : 8000000000,
  "share_price" : 82.5 }

We have the market cap and the share price. Now we will calculate the outstanding shares (market cap divided by share price) in each of the contexts.

ingest-processor-context

Pipelines can access the source of a document directly via the map variable “ctx”, using the “dot-notation”:

double outstanding = ctx.market_cap / ctx.share_price;
ctx.outstanding = (long)outstanding

The fields can also be addressed with “ctx['field-name']”, called the “bracket-notation”:

double outstanding = ctx['market_cap'] / ctx['share_price'];
ctx['outstanding'] = (long)outstanding

What’s the difference, you might ask? The “bracket-notation” allows greater flexibility: field names like ctx['a b'] are possible, while the “dot-notation” cannot express a call like “ctx.a b”. To be safe, use the “dot-notation” as much as possible.

If you want to use Java classes, please check the ingest API for the package java.lang and the shared API for supported classes. Painless supports quite an extensive number of Java classes.
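To see the ingest context at work, the snippet can be wrapped in a pipeline and tried out with the simulate API. This is only a minimal sketch: the pipeline name “outstanding_shares” and the null/zero checks are my own additions for illustration and are not part of the workshop code.

```console
PUT _ingest/pipeline/outstanding_shares
{
  "description": "Calculates the outstanding shares (pipeline name is made up for this sketch)",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // ingest context: the document fields live directly on ctx
          // the checks below are an assumption of this sketch, to avoid errors on incomplete documents
          if (ctx['market_cap'] != null && ctx['share_price'] != null && ctx['share_price'] != 0) {
            double outstanding = ctx['market_cap'] / ctx['share_price'];
            ctx['outstanding'] = (long) outstanding;
          }
        """
      }
    }
  ]
}

POST _ingest/pipeline/outstanding_shares/_simulate
{
  "docs": [
    { "_source": { "ticker_symbol": "ESTC", "market_cap": 8000000000, "share_price": 82.5 } }
  ]
}
```

For the sample document, the simulated result should contain "outstanding": 96969696, because 8000000000 / 82.5 is roughly 96969696.97 and the cast to long cuts off the fraction.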

Consult the official documentation for more information and which methods and APIs are supported.
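As a small illustration of using a Java class inside an ingest script, the following sketch calls Math.round from java.lang instead of casting. It uses the simulate API with an inline pipeline, so nothing is stored in the cluster. Choosing Math.round is my own example, not something the workshop code relies on.

```console
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "lang": "painless",
          "source": """
            // typed local variable, so Math.round receives a double
            double ratio = ctx.market_cap / ctx.share_price;
            // Math.round rounds to the nearest long instead of truncating
            ctx.outstanding = Math.round(ratio);
          """
        }
      }
    ]
  },
  "docs": [
    { "_source": { "ticker_symbol": "ESTC", "market_cap": 8000000000, "share_price": 82.5 } }
  ]
}
```

Note that rounding and casting are not the same thing: for the sample document, Math.round yields 96969697, while the cast to long used elsewhere in this article yields 96969696.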

update- and update_by_query-context

The update-, update_by_query- and reindex-contexts use the map “ctx._source” to access the document fields:

double outstanding = ctx._source.market_cap / ctx._source.share_price;
ctx._source.outstanding = (long)outstanding

The update-context also provides the read-only variable “ctx._now” with the current timestamp in milliseconds. update_by_query and reindex do not provide this variable.

The update-, update_by_query- and reindex-contexts provide the special variable “ctx.op”, which lets you delete the document if needed:
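Wrapped into a complete request, the calculation in the update-context could look like the following sketch. It assumes the document with the ID 1 from the challenge above exists in the index “companies”.

```console
POST companies/_update/1
{
  "script": {
    "lang": "painless",
    "source": """
      // update context: the document fields live under ctx._source, not directly on ctx
      double outstanding = ctx._source.market_cap / ctx._source.share_price;
      ctx._source.outstanding = (long) outstanding;
    """
  }
}

GET companies/_doc/1
```

The second request is only there to check that the new field “outstanding” was written to the document.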

"script" : {
  "source": """
  ctx.op = 'delete'      
  """
}

A practical use case for an update_by_query is cleaning up data: for example, test for documents with empty company fields and delete them with “ctx.op = 'delete'”.

Please check the examples for update and update_by_query on the GitHub page for this workshop.
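Such a clean-up could be sketched as follows. Which field counts as “empty” depends on your data; treating a missing or empty “ticker_symbol” as the delete criterion is purely an assumption for this example.

```console
POST companies/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": """
      // hypothetical rule: a document without a ticker symbol counts as empty
      def ticker = ctx._source.ticker_symbol;
      if (ticker == null || ticker.isEmpty()) {
        // remove the document from the index
        ctx.op = 'delete';
      } else {
        // leave all other documents untouched
        ctx.op = 'noop';
      }
    """
  }
}
```

Setting “ctx.op” to 'noop' for the remaining documents avoids rewriting documents that did not change.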

reindex-context

The reindex-context does not provide any variables or methods beyond those of the update- and update_by_query-contexts. Here is just an example of a script that accesses the “ctx._source”-map with the “dot-notation”:

POST _reindex
{ "source": {
    "index": "companies" },
  "dest": {
    "index": "companies_new" },
  "script": {
    "source": 
  """
  double outstanding = ctx._source.market_cap / ctx._source.share_price;
  ctx._source.outstanding_reindexed = (long)outstanding 
  """
} }

runtime_field-context

The runtime_field-context uses the “doc”-map for accessing document fields. This map is read-only.

  "runtime_mappings": {
    "outstanding": {
      "type": "long",
      "script": {
        "lang": "painless", 
        "source": 
"""
long result;
double outstanding = doc.market_cap.value / doc['share_price'].value;
result = (long)outstanding; 
emit(result);
"""
      }
    }
  },
  "fields": ["outstanding"]

Of the contexts covered here, the runtime_field-context is the only one that uses the emit-method for returning results. emit can’t take null-values, therefore test the values before you emit them. A document for which the script emits nothing simply has no value for that field.
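Testing the values before emitting them could look like the following sketch. The size() checks and the zero check are my own additions to the workshop script; they assume that some documents may lack one of the two fields.

```console
GET companies/_search
{
  "runtime_mappings": {
    "outstanding": {
      "type": "long",
      "script": {
        "lang": "painless",
        "source": """
          // emit only when both fields have a value and the divisor is not zero
          if (doc['market_cap'].size() != 0 && doc['share_price'].size() != 0
              && doc['share_price'].value != 0) {
            double outstanding = doc['market_cap'].value / doc['share_price'].value;
            emit((long) outstanding);
          }
        """
      }
    }
  },
  "fields": ["outstanding"]
}
```

Checking size() first matters because reading “.value” on a field that is missing in a document raises an error instead of returning null.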

fields-context

Scripted fields (the fields-context) are very similar to the runtime_field-context. However, grok and dissect patterns are not available here, while runtime fields do provide these methods.

GET companies/_search
{
  "script_fields": {
    "free_float": {
      "script": {
        "source": 
"""
long result;
double outstanding = doc.market_cap.value / doc['share_price'].value;
result = (long)outstanding; 
return (result)
"""
} } } }

If the script consists of a single expression, the return statement is not needed. If the script contains more than one statement, the value of the last expression is returned. To prevent confusion, use an explicit return.

Scripted fields also don’t need the fields parameter to display the returned objects – unlike runtime-mappings.


Conclusion

If you made it here: Congratulations! You should now be able to navigate through the different contexts. Don’t hesitate to read the official documentation if your script fails; the characteristics of the different contexts can be confusing!

If Google brought you here, you might also check the start of the series, or the whole series.

And as always: the full queries and code are available on the GitHub page for this workshop.
