A look at the filters of 「Sudachi」, the Japanese plugin for Elasticsearch

What is Sudachi?

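Two plugins provide Japanese morphological analysis for Elasticsearch, 「kuromoji」 and 「Sudachi」, and both can also be used on Amazon OpenSearch Service.
Sudachi has been available as a plugin on OpenSearch since 2023/10/17.
The installation steps for Sudachi are covered on the AWS blog.
In this article I look at the filters that Sudachi provides.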

What are filters?

Filters are components that ship with Sudachi and apply various kinds of processing to the words (tokens) produced by Japanese morphological analysis.

sudachi_part_of_speech

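Removes words (tokens) that have the specified parts of speech.
The parts of speech that can be specified are listed in Sudachi's stoptags.txt.
Analyzing the sentence 「吾輩は猫である。名前はまだ無い。」 with Sudachi gave the following parts of speech.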

吾輩 ('代名詞', '*', '*', '*', '*', '*')
は ('助詞', '係助詞', '*', '*', '*', '*')
猫 ('名詞', '普通名詞', '一般', '*', '*', '*')
で ('助動詞', '*', '*', '*', '助動詞-ダ', '連用形-一般')
ある ('動詞', '非自立可能', '*', '*', '五段-ラ行', '終止形-一般')
。 ('補助記号', '句点', '*', '*', '*', '*')
名前 ('名詞', '普通名詞', '一般', '*', '*', '*')
は ('助詞', '係助詞', '*', '*', '*', '*')
まだ ('副詞', '*', '*', '*', '*', '*')
無い ('形容詞', '非自立可能', '*', '*', '形容詞', '終止形-一般')
。 ('補助記号', '句点', '*', '*', '*', '*')

Using sudachi_part_of_speech with 助詞 (particle) and 動詞 (verb) specified as stoptags,
I check how OpenSearch analyzes the sentence 「吾輩は猫である。名前はまだ無い。」.

・Index definition
PUT /test_sudachi_part_of_speech
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["my_posfilter"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_posfilter":{
            "type":"sudachi_part_of_speech",
            "stoptags":[
              "助詞",
              "動詞"
            ]
          }
        }
      }
    }
  }
}
・Run analysis
POST test_sudachi_part_of_speech/_analyze
{
  "analyzer": "sudachi_analyzer",
  "text": "吾輩は猫である。名前はまだ無い。"
}
・Analysis result
{
  "tokens": [
    {
      "token": "吾輩",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "猫",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "で",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 3
    },
    {
      "token": "名前",
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 5
    },
    {
      "token": "まだ",
      "start_offset": 11,
      "end_offset": 13,
      "type": "word",
      "position": 7
    },
    {
      "token": "無い",
      "start_offset": 13,
      "end_offset": 15,
      "type": "word",
      "position": 8
    }
  ]
}

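The particle 「は」 and the verb 「ある」 have been removed; the gaps in position (1, 4 and 6) mark where they were. The auxiliary verb 「で」 remains because 助動詞 is not in stoptags. The 「。」 tokens are also absent even though 補助記号 was not specified, presumably because punctuation is discarded before the filter runs.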
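The result above can be reproduced offline with a small Python sketch. It assumes that each stoptag is matched as a prefix of the comma-separated part-of-speech fields (which is how I read the Sudachi README) and that punctuation (補助記号) is dropped by the tokenizer before the filter runs. The names matches and pos_filter are hypothetical helpers for this illustration, not part of any Sudachi API.

```python
# Morphemes and part-of-speech tuples copied from the listing earlier in this section.
MORPHEMES = [
    ("吾輩", ("代名詞", "*", "*", "*", "*", "*")),
    ("は", ("助詞", "係助詞", "*", "*", "*", "*")),
    ("猫", ("名詞", "普通名詞", "一般", "*", "*", "*")),
    ("で", ("助動詞", "*", "*", "*", "助動詞-ダ", "連用形-一般")),
    ("ある", ("動詞", "非自立可能", "*", "*", "五段-ラ行", "終止形-一般")),
    ("。", ("補助記号", "句点", "*", "*", "*", "*")),
    ("名前", ("名詞", "普通名詞", "一般", "*", "*", "*")),
    ("は", ("助詞", "係助詞", "*", "*", "*", "*")),
    ("まだ", ("副詞", "*", "*", "*", "*", "*")),
    ("無い", ("形容詞", "非自立可能", "*", "*", "形容詞", "終止形-一般")),
    ("。", ("補助記号", "句点", "*", "*", "*", "*")),
]


def matches(pos, stoptag):
    """Return True if the comma-separated stoptag is a prefix of the POS tuple."""
    fields = tuple(stoptag.split(","))
    return pos[:len(fields)] == fields


def pos_filter(morphemes, stoptags):
    """Drop morphemes whose POS matches any stoptag, keeping original positions."""
    # Assumption: the tokenizer discards punctuation before the filter sees it.
    words = [m for m in morphemes if m[1][0] != "補助記号"]
    kept = []
    for position, (surface, pos) in enumerate(words):
        if not any(matches(pos, tag) for tag in stoptags):
            kept.append((position, surface))
    return kept


for position, surface in pos_filter(MORPHEMES, ["助詞", "動詞"]):
    print(position, surface)
```

Under these assumptions the sketch keeps the same six tokens at the same positions as the OpenSearch result. A narrower tag such as "助詞,係助詞" would match only that subcategory of particles.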

sudachi_ja_stop

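Removes the specified stop words.
Using sudachi_ja_stop with 「は」, 「で」, 「ある」 and 「まだ」 specified (alongside the _japanese_ entry in the definition below, which as far as I understand refers to the bundled default stop word list),
I check how OpenSearch analyzes the sentence 「吾輩は猫である。名前はまだ無い。」.

・Index definition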
PUT /test_sudachi_ja_stop
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["my_stopfilter"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_stopfilter":{
            "type":"sudachi_ja_stop",
            "stopwords":[
              "_japanese_",
              "は",
              "で",
              "ある",
              "まだ"
            ]
          }
        }
      }
    }
  }
}
・Run analysis
POST test_sudachi_ja_stop/_analyze
{
  "analyzer": "sudachi_analyzer",
  "text": "吾輩は猫である。名前はまだ無い。"
}
・Analysis result
{
  "tokens": [
    {
      "token": "吾輩",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "猫",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "名前",
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 5
    },
    {
      "token": "無い",
      "start_offset": 13,
      "end_offset": 15,
      "type": "word",
      "position": 8
    }
  ]
}

The result shows that the specified stop words have been removed.
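The index definitions in this article differ only in their filter settings, so the request bodies can be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library. BASE_URL is a placeholder for a local, unauthenticated endpoint (a managed domain will normally need authentication, which is not shown), and build_settings and json_request are hypothetical helper names. The requests are only prepared, not sent.

```python
import json
import urllib.request

# Assumption: a local, unauthenticated endpoint. Replace with your own URL.
BASE_URL = "http://localhost:9200"


def build_settings(filter_names, filter_defs=None):
    """Build an index body with one custom analyzer on top of sudachi_tokenizer."""
    analysis = {
        "tokenizer": {"sudachi_tokenizer": {"type": "sudachi_tokenizer"}},
        "analyzer": {
            "sudachi_analyzer": {
                "type": "custom",
                "tokenizer": "sudachi_tokenizer",
                "filter": list(filter_names),
            }
        },
    }
    # Built-in filter names (e.g. sudachi_baseform) need no extra definition.
    if filter_defs:
        analysis["filter"] = filter_defs
    return {"settings": {"index": {"analysis": analysis}}}


def json_request(method, path, body):
    """Prepare (but do not send) a JSON request for the given path."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


stop_filter = {
    "my_stopfilter": {
        "type": "sudachi_ja_stop",
        "stopwords": ["_japanese_", "は", "で", "ある", "まだ"],
    }
}
create_index = json_request(
    "PUT", "/test_sudachi_ja_stop", build_settings(["my_stopfilter"], stop_filter)
)
analyze = json_request(
    "POST",
    "/test_sudachi_ja_stop/_analyze",
    {"analyzer": "sudachi_analyzer", "text": "吾輩は猫である。名前はまだ無い。"},
)
print(create_index.get_method(), create_index.full_url)
print(analyze.get_method(), analyze.full_url)
# To actually send a request: urllib.request.urlopen(create_index)
```

The same build_settings call with a different filter list, for example ["sudachi_baseform"] and no filter definitions, yields the body used in the next section.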

sudachi_baseform

Converts verbs and adjectives to their dictionary form (終止形).
The following example checks the OpenSearch analysis result.

・Index definition
PUT /test_sudachi_baseform
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["sudachi_baseform"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
・Run analysis
POST test_sudachi_baseform/_analyze
{
  "analyzer": "sudachi_analyzer",
  "text": "飲み　泳ぎ　高く"
}
・Analysis result
{
  "tokens": [
    {
      "token": "飲む",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "泳ぐ",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "高い",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    }
  ]
}

The result shows that the tokens were converted to their dictionary form: 「飲み」→「飲む」, 「泳ぎ」→「泳ぐ」, 「高く」→「高い」.
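Because the filter rewrites the token text but leaves start_offset and end_offset pointing at the original input, the two can be paired up to see each conversion. The sketch below does this for a trimmed copy of the response above; surface_to_token is a hypothetical helper name.

```python
import json

TEXT = "飲み　泳ぎ　高く"

# Trimmed copy of the _analyze response shown above (only the fields used here).
RESPONSE = json.loads("""
{
  "tokens": [
    {"token": "飲む", "start_offset": 0, "end_offset": 2},
    {"token": "泳ぐ", "start_offset": 3, "end_offset": 5},
    {"token": "高い", "start_offset": 6, "end_offset": 8}
  ]
}
""")


def surface_to_token(text, response):
    """Pair each token with the slice of the input text it was produced from."""
    # Offsets count UTF-16 code units. For this text they coincide with Python
    # string indices because every character is in the Basic Multilingual Plane.
    return [
        (text[t["start_offset"]:t["end_offset"]], t["token"])
        for t in response["tokens"]
    ]


for surface, token in surface_to_token(TEXT, RESPONSE):
    print(f"{surface} -> {token}")
```

The same pairing works for the responses of the other filters in this article, as long as the input contains no characters outside the Basic Multilingual Plane.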

sudachi_normalizedform

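Converts tokens to Sudachi's normalized form.
The normalized form appears to cover cases such as the following.
・Okurigana differences
「終る」→「終わる」
「変る」→「変わる」
・Script type
「けむり」, 「ケムリ」→「煙」
・Variant characters (old and new character forms)
「附属」→「付属」
「障碍者」→「障害者」
「檢査」→「検査」
・Misuse
「アッピール」 → 「アピール」
「コミュニティ」 → 「コミュニティー」
「コンピュータ」 → 「コンピューター」

The following example checks the OpenSearch analysis result.

・Index definition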
PUT /test_sudachi_normalizedform
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["sudachi_normalizedform"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
・Run analysis
POST test_sudachi_normalizedform/_analyze
{
  "analyzer": "sudachi_analyzer",
  "text": "変る　終る　けむり　ケムリ　附属　障碍者　檢査　アッピール　コミュニティ　コンピュータ"
}
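・Analysis result (offsets and positions start at 3 and 1 rather than 0, so the request behind this output seems to have had one more two-character word before 「変る」; the tokens themselves match the text above)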
{
  "tokens": [
    {
      "token": "変わる",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "終わる",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "煙",
      "start_offset": 9,
      "end_offset": 12,
      "type": "word",
      "position": 3
    },
    {
      "token": "煙",
      "start_offset": 13,
      "end_offset": 16,
      "type": "word",
      "position": 4
    },
    {
      "token": "付属",
      "start_offset": 17,
      "end_offset": 19,
      "type": "word",
      "position": 5
    },
    {
      "token": "障害者",
      "start_offset": 20,
      "end_offset": 23,
      "type": "word",
      "position": 6
    },
    {
      "token": "検査",
      "start_offset": 24,
      "end_offset": 26,
      "type": "word",
      "position": 7
    },
    {
      "token": "アピール",
      "start_offset": 27,
      "end_offset": 32,
      "type": "word",
      "position": 8
    },
    {
      "token": "コミュニティー",
      "start_offset": 33,
      "end_offset": 39,
      "type": "word",
      "position": 9
    },
    {
      "token": "コンピューター",
      "start_offset": 40,
      "end_offset": 46,
      "type": "word",
      "position": 10
    }
  ]
}
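One practical effect of normalization is that different spellings collapse into the same indexed term. The sketch below groups the input words from the example by the token they produced. The data is copied from the request and response above, and group_by_normalized is a hypothetical helper name.

```python
from collections import defaultdict

# Input words and resulting tokens, copied in order from the example above.
SURFACES = "変る 終る けむり ケムリ 附属 障碍者 檢査 アッピール コミュニティ コンピュータ".split()
NORMALIZED = [
    "変わる", "終わる", "煙", "煙", "付属",
    "障害者", "検査", "アピール", "コミュニティー", "コンピューター",
]


def group_by_normalized(surfaces, normalized):
    """Map each normalized term to the list of input spellings that produced it."""
    groups = defaultdict(list)
    for surface, term in zip(surfaces, normalized):
        groups[term].append(surface)
    return dict(groups)


groups = group_by_normalized(SURFACES, NORMALIZED)
for term, spellings in groups.items():
    print(term, spellings)
```

Here 「けむり」 and 「ケムリ」 both end up under 「煙」, so a document containing either spelling would be indexed under the same term, assuming the same analyzer is also applied to the search query.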

sudachi_readingform

Converts tokens to their reading in katakana or romaji.
The following example checks the OpenSearch analysis result.

・Index definition
PUT /test_sudachi_readingform
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "romaji_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": false
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": ["romaji_readingform"]
          },
          "katakana_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": ["katakana_readingform"]
          }
        }
      }
    }
  }
}
・Run analysis (katakana reading)
POST test_sudachi_readingform/_analyze
{
  "analyzer": "katakana_analyzer",
  "text": "寿司　おおきく　不動産　美津濃"
}
・Analysis result (katakana reading)
{
  "tokens": [
    {
      "token": "スシ",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "オオキク",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "フドウサン",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "ミズノ",
      "start_offset": 12,
      "end_offset": 15,
      "type": "word",
      "position": 3
    }
  ]
}
・Run analysis (romaji reading)
POST test_sudachi_readingform/_analyze
{
  "analyzer": "romaji_analyzer",
  "text": "寿司　おおきく　不動産　美津濃"
}
・Analysis result (romaji reading)
{
  "tokens": [
    {
      "token": "susi",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "ookiku",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "hudousan",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "mizuno",
      "start_offset": 12,
      "end_offset": 15,
      "type": "word",
      "position": 3
    }
  ]
}
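The romaji output uses si for シ and hu for フ, which looks like Kunrei-style spelling rather than Hepburn (sushi, fudousan). The toy sketch below maps the katakana readings from the first result to the romaji in the second. The table covers only the kana in this example and is my own illustration, not Sudachi's actual conversion table.

```python
# Toy Kunrei-style table covering only the kana that appear in this example.
# This is an illustration, not Sudachi's actual conversion table.
KUNREI = {
    "ス": "su", "シ": "si", "オ": "o", "キ": "ki", "ク": "ku",
    "フ": "hu", "ド": "do", "ウ": "u", "サ": "sa", "ン": "n",
    "ミ": "mi", "ズ": "zu", "ノ": "no",
}


def to_romaji(katakana):
    """Convert a katakana reading to romaji one character at a time."""
    return "".join(KUNREI[ch] for ch in katakana)


for reading in ["スシ", "オオキク", "フドウサン", "ミズノ"]:
    print(reading, "->", to_romaji(reading))
```

If the romaji tokens are meant to match what users type, it is worth checking which spelling the queries will use, since a Hepburn-style query such as "sushi" is a different string from "susi".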

Conclusion

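I investigated Sudachi's filters by analyzing a variety of sentences and words and checking the results.
Multiple filters can be used together, but as the README in Sudachi's GitHub repository also notes, sudachi_baseform, sudachi_normalizedform and sudachi_readingform overwrite one another, so in practice you would pick just one of them.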
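The overwrite relationship can be pictured with a simplified model in which each of the three filters replaces the token's term with a value taken from the underlying morpheme, ignoring whatever the previous filter wrote. This is my own simplification for illustration, not Sudachi's implementation. The base form of 「飲み」 comes from the example earlier in this article, while its normalized form and reading are assumed values.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str        # the string that would be indexed
    base: str        # dictionary form of the morpheme
    normalized: str  # normalized form of the morpheme
    reading: str     # katakana reading of the morpheme


def baseform(token):
    token.term = token.base
    return token


def normalizedform(token):
    token.term = token.normalized
    return token


def readingform(token):
    token.term = token.reading
    return token


def run_chain(token, filters):
    """Apply the filters in order and return the term that would be indexed."""
    for apply_filter in filters:
        token = apply_filter(token)
    return token.term


def nomi():
    # normalized and reading are assumed values, not taken from a Sudachi run.
    return Token(term="飲み", base="飲む", normalized="飲む", reading="ノミ")


print(run_chain(nomi(), [baseform, readingform]))
print(run_chain(nomi(), [readingform, baseform]))
```

In this model only the last of the three filters in the chain determines the indexed term, which is why combining them adds nothing.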
