This is the day 18 entry of the Classi Advent Calendar 2020. Thanks for reading.

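I'm @s_nakamura, a server-side engineer at Classi. I didn't get to touch Elasticsearch much this year, so I'd like to get back to working with it regularly. This time I'll introduce some APIs that are worth trying when you're stuck.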

Explain API

Have you ever wondered, "No matter how I search, nothing hits. Why?" or been asked, "The string XXX shows up in search results, so why doesn't YYY?" I'm sure you have. In those situations, try the Explain API.

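For example, suppose a query defines a minimum score. After you modify the query, data that used to hit no longer shows up. In that case, calling the Explain API as shown below tells you whether the document actually matches and, if it does, how Elasticsearch computed its score.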

curl -X GET "localhost:9200/book_index/_explain/210?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match" : { "title" : "星人" }
  }
}
'

{
    "_index": "book_index",
    "_type": "_doc",
    "_id": "210",
    "matched": true,
    "explanation": {
        "value": 2.6731732,
        "description": "sum of:",
        "details": [
            {
                "value": 1.3365866,
                "description": "weight(title:星 in 13) [PerFieldSimilarity], result of:",
                "details": [
                    {
                        "value": 1.3365866,
                        "description": "score(freq=1.0), computed as boost * idf * tf from:",
                        "details": [
                            {
                                "value": 2.2,
                                "description": "boost",
                                "details": []
                            },
                            {
                                "value": 1.3862944,
                                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                "details": [
                                    {
                                        "value": 2,
                                        "description": "n, number of documents containing term",
                                        "details": []
                                    },
                                    {
                                        "value": 9,
                                        "description": "N, total number of documents with field",
                                        "details": []
                                    }
                                ]
                            },
                            {
                                "value": 0.43824703,
                                "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                "details": [
                                    {
                                        "value": 1.0,
                                        "description": "freq, occurrences of term within document",
                                        "details": []
                                    },
                                    {
                                        "value": 1.2,
                                        "description": "k1, term saturation parameter",
                                        "details": []
                                    },
                                    {
                                        "value": 0.75,
                                        "description": "b, length normalization parameter",
                                        "details": []
                                    },
                                    {
                                        "value": 4.0,
                                        "description": "dl, length of field",
                                        "details": []
                                    },
                                    {
                                        "value": 3.6666667,
                                        "description": "avgdl, average length of field",
                                        "details": []
                                    }
                                ]
                            }
                        ]
                    }
                ]
            },
            {
                "value": 1.3365866,
                "description": "weight(title:人 in 13) [PerFieldSimilarity], result of:",
                "details": [
                    {
                        "value": 1.3365866,
                        "description": "score(freq=1.0), computed as boost * idf * tf from:",
                        "details": [
                            {
                                "value": 2.2,
                                "description": "boost",
                                "details": []
                            },
                            {
                                "value": 1.3862944,
                                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                "details": [
                                    {
                                        "value": 2,
                                        "description": "n, number of documents containing term",
                                        "details": []
                                    },
                                    {
                                        "value": 9,
                                        "description": "N, total number of documents with field",
                                        "details": []
                                    }
                                ]
                            },
                            {
                                "value": 0.43824703,
                                "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                "details": [
                                    {
                                        "value": 1.0,
                                        "description": "freq, occurrences of term within document",
                                        "details": []
                                    },
                                    {
                                        "value": 1.2,
                                        "description": "k1, term saturation parameter",
                                        "details": []
                                    },
                                    {
                                        "value": 0.75,
                                        "description": "b, length normalization parameter",
                                        "details": []
                                    },
                                    {
                                        "value": 4.0,
                                        "description": "dl, length of field",
                                        "details": []
                                    },
                                    {
                                        "value": 3.6666667,
                                        "description": "avgdl, average length of field",
                                        "details": []
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
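
The response is long, so it can help to print it as a short outline first. The following is a minimal sketch, not part of the original setup: `outline` and `sample` are hypothetical names, and `sample` is a hand-trimmed copy of the "explanation" object above with the leaf nodes omitted.

```python
from typing import Iterator


def outline(explanation: dict, depth: int = 0, max_depth: int = 2) -> Iterator[str]:
    """Yield one indented line per node of an Explain API "explanation" tree."""
    yield f"{'  ' * depth}{explanation['value']}  {explanation['description']}"
    if depth < max_depth:
        for child in explanation.get("details", []):
            yield from outline(child, depth + 1, max_depth)


# Hand-trimmed copy of the "explanation" object above (leaf nodes omitted).
sample = {
    "value": 2.6731732,
    "description": "sum of:",
    "details": [
        {
            "value": 1.3365866,
            "description": "weight(title:星 in 13) [PerFieldSimilarity], result of:",
            "details": [
                {
                    "value": 1.3365866,
                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                    "details": [],
                }
            ],
        },
        {
            "value": 1.3365866,
            "description": "weight(title:人 in 13) [PerFieldSimilarity], result of:",
            "details": [],
        },
    ],
}

lines = list(outline(sample))
print("\n".join(lines))
```

Raising `max_depth` shows more of the tree; the default stops at the per-term score line, which is usually enough to see which clause contributed what.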

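In the Explain response above, you can see that several steps go into computing the score. As the line "description": "sum of:" indicates, this query adds up the scores computed for each clause and uses the total as the document's score. As in the scenario described earlier, when a query specifies a minimum score, a document you expected to hit may simply be scoring lower than you assumed. When a document doesn't appear in the search results, try checking it with the Explain API.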
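
To make the arithmetic concrete, the numbers in the response can be plugged back into the formulas printed in its "description" fields. This is only a sketch for checking the reported values by hand: the function names are made up for illustration, and the `min_score` value of 3.0 is a hypothetical threshold, not one taken from the query above. Small differences in the last digits are expected because of floating-point precision.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Values copied from the Explain response above.
boost = 2.2
idf = bm25_idf(n=2, N=9)
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=4.0, avgdl=3.6666667)

term_score = boost * idf * tf   # score of one term clause
doc_score = term_score * 2      # "sum of:" two clauses with identical inputs

# Hypothetical threshold: documents scoring below min_score are excluded.
min_score = 3.0
would_be_returned = doc_score >= min_score

print(f"idf={idf:.7f} tf={tf:.7f} term={term_score:.7f} doc={doc_score:.7f}")
print("returned with min_score=3.0:", would_be_returned)
```

With these inputs the recomputed per-term score agrees with the reported 1.3365866 to about six decimal places, and the document total of roughly 2.67 would fall below a threshold of 3.0.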

Validate API

Will this query run without problems? When you want to check that, the Validate API will check the query for you.

curl -X GET "localhost:9200/test-index/_doc/_validate/query?explain=true&pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "bool" : {
      "must" : {
        "query_string" : {
          "querys" : "title:1"
        }
      }
    }
  }
}
'

{
  "valid" : false,
  "error" : "ParsingException[Failed to parse]; nested: XContentParseException[[7:22] [bool] failed to parse field [must]]; nested: ParsingException[[query_string] query does not support [querys]];; org.elasticsearch.common.xcontent.XContentParseException: [7:22] [bool] failed to parse field [must]"
}

If the query is wrong, an error is shown as above. If the query is correct, a response like the following is returned. Adding explain=true as a query parameter outputs the items under explanations.

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test-index",
      "valid" : true,
      "explanation" : "+(+title:1) #DocValuesFieldExistsQuery [field=_primary_term]"
    }
  ]
}

When a search doesn't behave the way you expect, or when you want to check that the query you're running is valid, this API may give you a lead toward the fix.
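
The "error" string in the invalid case is one long chain of nested causes, and the last cause is usually the one that names the actual mistake. The following sketch pulls the chain apart. It assumes the error text always follows the "; nested: " and ";;" layout seen in the sample response above, which is an observation from that one sample rather than a documented format, and `error_causes` is a hypothetical helper name.

```python
from typing import List


def error_causes(error: str) -> List[str]:
    """Split a Validate API error string into its chain of causes.

    Assumes causes are joined by "; nested: " and that everything after
    the first ";;" is a repeated stack-trace style summary.
    """
    head = error.split(";;")[0]
    return [part.strip() for part in head.split("; nested: ")]


# Error string copied from the invalid response above.
error = (
    "ParsingException[Failed to parse]; nested: "
    "XContentParseException[[7:22] [bool] failed to parse field [must]]; nested: "
    "ParsingException[[query_string] query does not support [querys]];; "
    "org.elasticsearch.common.xcontent.XContentParseException: "
    "[7:22] [bool] failed to parse field [must]"
)

causes = error_causes(error)
for depth, cause in enumerate(causes):
    print("  " * depth + cause)
```

Here the innermost cause points directly at the misspelled `querys` key.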

Profile API

How well does the query you ran perform? To find out, try the Profile API.

curl -X GET "localhost:9200/albums/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "profile": true,
  "query" : {
    "match" : { "title" : "星人" }
  }
}
'

The request is a normal Search API call with "profile": true added to the body.

{
  "took" : 1223,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.6731732,
    "hits" : [
      {
        "_index" : "albums",
        "_type" : "_doc",
        "_id" : "210",
        "_score" : 2.6731732,
        "_source" : {
          "id" : 210,
          "title" : "ホゲホゲ星人の冒険",
          "user_id" : 62,
          "created_at" : "2020-12-13T02:52:03.000Z",
          "updated_at" : "2020-12-13T02:52:03.000Z",
          "photos" : [
            {
              "id" : 89,
              "image" : "#<Rack::Test::UploadedFile:0x0000560c33207048>",
              "description" : "冒険の記録1",
              "user_id" : 62,
              "group_id" : 0,
              "album_id" : 210,
              "photo_geo_id" : null,
              "good_point" : 0,
              "created_at" : "2020-12-13T02:52:03.000Z",
              "updated_at" : "2020-12-13T02:52:03.000Z",
              "description2" : null
            }
          ],
          "title2" : "ホゲホゲ星人の冒険",
          "tags" : [
            {
              "id" : 62,
              "label_name" : "犬",
              "album_id" : 210,
              "group_id" : 0,
              "created_at" : "2020-12-13T02:52:03.000Z",
              "updated_at" : "2020-12-13T02:52:03.000Z"
            }
          ],
          "total_point" : 0
        }
      },
      {
        "_index" : "albums",
        "_type" : "_doc",
        "_id" : "203",
        "_score" : 2.0209837,
        "_source" : {
          "id" : 203,
          "title" : "映画仮面ライダーとホゲホゲ星人の戦い",
          "user_id" : 61,
          "created_at" : "2020-12-13T02:52:03.000Z",
          "updated_at" : "2020-12-13T02:52:03.000Z",
          "photos" : [
            {
              "id" : 87,
              "image" : "#<Rack::Test::UploadedFile:0x0000560c3330b3e0>",
              "description" : "ホゲホゲ星人との場面1",
              "user_id" : 61,
              "group_id" : 0,
              "album_id" : 203,
              "photo_geo_id" : null,
              "good_point" : 0,
              "created_at" : "2020-12-13T02:52:03.000Z",
              "updated_at" : "2020-12-13T02:52:03.000Z",
              "description2" : null
            }
          ],
          "title2" : "映画仮面ライダーとホゲホゲ星人の戦い",
          "tags" : [
            {
              "id" : 61,
              "label_name" : "犬",
              "album_id" : 203,
              "group_id" : 0,
              "created_at" : "2020-12-13T02:52:03.000Z",
              "updated_at" : "2020-12-13T02:52:03.000Z"
            }
          ],
          "total_point" : 0
        }
      }
    ]
  },
  "profile" : {
    "shards" : [
      {
        "id" : "[g6sJGk0mTB6vV5yAPPIfMw][albums][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "title:星 title:人",
                "time_in_nanos" : 109041800,
                "breakdown" : {
                  "set_min_competitive_score_count" : 0,
                  "match_count" : 2,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 0,
                  "next_doc" : 151300,
                  "match" : 31300,
                  "next_doc_count" : 2,
                  "score_count" : 2,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 288300,
                  "advance_count" : 1,
                  "score" : 206300,
                  "build_scorer_count" : 2,
                  "create_weight" : 20768900,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 87595700
                },
                "children" : [
                  {
                    "type" : "TermQuery",
                    "description" : "title:星",
                    "time_in_nanos" : 7838400,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 3,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 2,
                      "compute_max_score_count" : 3,
                      "compute_max_score" : 2323200,
                      "advance" : 42600,
                      "advance_count" : 3,
                      "score" : 59500,
                      "build_scorer_count" : 3,
                      "create_weight" : 464000,
                      "shallow_advance" : 156300,
                      "create_weight_count" : 1,
                      "build_scorer" : 4792800
                    }
                  },
                  {
                    "type" : "TermQuery",
                    "description" : "title:人",
                    "time_in_nanos" : 1237500,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 3,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 2,
                      "compute_max_score_count" : 3,
                      "compute_max_score" : 64700,
                      "advance" : 54200,
                      "advance_count" : 3,
                      "score" : 33600,
                      "build_scorer_count" : 3,
                      "create_weight" : 872000,
                      "shallow_advance" : 58600,
                      "create_weight_count" : 1,
                      "build_scorer" : 154400
                    }
                  }
                ]
              }
            ],
            "rewrite_time" : 143100,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 2089300
              }
            ]
          }
        ],
        "aggregations" : [ ]
      }
    ]
  }
}

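The "profile" section after the search hits holds the profiling result for this query. Within the query section, time_in_nanos is the time spent on that query, and breakdown gives the details. The children section holds the results for each sub-query. The items in breakdown are low-level Lucene measurements. If you want to know in detail which steps of query execution take how long, or which Lucene classes are used, the Profile API looks like a good tool to try.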
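
Since the timings are in nanoseconds and the breakdown mixes timers with counters, a small helper can make the output easier to scan. This is a rough sketch under stated assumptions: `flatten_profile`, `slowest_steps`, and `sample_query` are hypothetical names, and `sample_query` is a hand-trimmed copy of the first entry under "query" in the response above (the children's breakdowns are left out).

```python
from typing import Iterator, List, Tuple


def flatten_profile(node: dict, depth: int = 0) -> Iterator[Tuple[int, str, str, float]]:
    """Yield (depth, type, description, milliseconds) for a query node and its children."""
    yield depth, node["type"], node["description"], node["time_in_nanos"] / 1_000_000
    for child in node.get("children", []):
        yield from flatten_profile(child, depth + 1)


def slowest_steps(breakdown: dict, top: int = 3) -> List[Tuple[str, int]]:
    """Return the largest timing entries, ignoring the *_count counters."""
    timings = {k: v for k, v in breakdown.items() if not k.endswith("_count")}
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hand-trimmed copy of the profiled query above.
sample_query = {
    "type": "BooleanQuery",
    "description": "title:星 title:人",
    "time_in_nanos": 109041800,
    "breakdown": {
        "match_count": 2,
        "next_doc": 151300,
        "match": 31300,
        "next_doc_count": 2,
        "score_count": 2,
        "advance": 288300,
        "advance_count": 1,
        "score": 206300,
        "build_scorer_count": 2,
        "create_weight": 20768900,
        "create_weight_count": 1,
        "build_scorer": 87595700,
    },
    "children": [
        {"type": "TermQuery", "description": "title:星", "time_in_nanos": 7838400},
        {"type": "TermQuery", "description": "title:人", "time_in_nanos": 1237500},
    ],
}

rows = list(flatten_profile(sample_query))
for depth, qtype, description, millis in rows:
    print(f"{'  ' * depth}{qtype} [{description}] {millis:.2f} ms")

slowest = slowest_steps(sample_query["breakdown"])
print(slowest)
```

For this sample, `build_scorer` and `create_weight` account for almost all of the roughly 109 ms spent in the BooleanQuery, while the per-document steps (`advance`, `score`, `next_doc`) are well under a millisecond each.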

That's all. Elasticsearch is feature-rich and offers many APIs and capabilities. Personally, I think async search also looks interesting, and I'd like to try it next.

Tomorrow's entry is by hxrxchang.

