
Commit 104f46a

Merge branch 'develop' into release/6.4
# Conflicts:
#   pom.xml

2 parents 563d371 + f0c9d2f

File tree

16 files changed

+1503
-79
lines changed


README.md

Lines changed: 125 additions & 48 deletions
````diff
@@ -1,13 +1,21 @@
 # analysis-sudachi
-analysis-sudachi is Elasticsearch plugin based on Sudachi the Japanese morphological analyzer.
+
+analysis-sudachi is an Elasticsearch plugin for tokenization of Japanese text using Sudachi, the Japanese morphological analyzer.
 
 [![Build Status](https://travis-ci.org/WorksApplications/elasticsearch-sudachi.svg?branch=develop)](https://travis-ci.org/WorksApplications/elasticsearch-sudachi)
-[![Bugs](https://sonarcloud.io/api/badges/measure?key=com.worksap.nlp%3Aanalysis-sudachi&metric=bugs)](https://sonarcloud.io/project/issues?id=com.worksap.nlp%3Aanalysis-sudachi&resolved=false&types=BUG)
-[![Debt](https://sonarcloud.io/api/badges/measure?key=com.worksap.nlp%3Aanalysis-sudachi&metric=sqale_debt_ratio)](https://sonarcloud.io/component_measures/domain/Maintainability?id=com.worksap.nlp%3Aanalysis-sudachi)
-[![Coverage](https://sonarcloud.io/api/badges/measure?key=com.worksap.nlp%3Aanalysis-sudachi&metric=coverage)](https://sonarcloud.io/component_measures/metric/coverage/list?id=com.worksap.nlp%3Aanalysis-sudachi)
+[![Quality Gate](https://sonarcloud.io/api/project_badges/measure?project=com.worksap.nlp%3Aanalysis-sudachi&metric=alert_status)](https://sonarcloud.io/dashboard/index/com.worksap.nlp%3Aanalysis-sudachi)
 
 # What's new?
-- version 1.1.0: `part-of-speech forward matching` is available on `stoptags`; see [sudachi_part_of_speech](#sudachi_part_of_speech)
+
+- version 1.2.0
+  - Upgraded the Sudachi morphological analyzer to 1.2.0-SNAPSHOT
+  - Added the new filter `sudachi_normalizedform`; see [sudachi_normalizedform](#sudachi_normalizedform)
+  - Changed the default normalization behavior; neither the baseform filter nor the normalizedform filter is applied by default
+  - Changed the `sudachi_readingform` filter to use new romaji mappings based on MS-IME
+
+- version 1.1.0
+  - `part-of-speech forward matching` is available on `stoptags`; see [sudachi_part_of_speech](#sudachi_part_of_speech)
 
 # Build
 
````
````diff
@@ -17,27 +25,30 @@ analysis-sudachi is Elasticsearch plugin based on Sudachi the Japanese morpholog
 ```
 
 # Installation
+
 Follow the steps below to install.
+
 1. Change the current directory to "/usr/share/elasticsearch".
 2. Place the zip file created in "Build" in that directory.
 3. Run "sudo bin/elasticsearch-plugin install file:///usr/share/elasticsearch/<zipfile-name>".
 4. Place the dictionary file (system_core.dic or system_full.dic) under ES_HOME/sudachi.
 
 # Configuration
+
 - tokenizer: Select the tokenizer. (sudachi) (string)
 - mode: Select the mode. (normal, search, or extended) (string, default: search)
-    - normal: Regular segmentataion.
-      Ex) 関西国際空港 / アバラカダブラ
-    - search: Use a heuristic to do additional segmentation useful for search.
-      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ
-    - extended: Similar to search mode, but also unigram unknown words. (experimental)
-      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ
+    - normal: Regular segmentation. (Uses Sudachi's C mode)
+      Ex) 関西国際空港 / アバラカダブラ
+    - search: Additional segmentation useful for search. (Uses the C and A modes)
+      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ
+    - extended: Similar to search mode, but also unigrams unknown words.
+      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ
 - discard\_punctuation: Whether to discard punctuation. (bool, default: true)
 - settings\_path: Sudachi setting file path. The path may be absolute or relative; relative paths are resolved with respect to ES\_HOME. (string, default: null)
 - resources_path: Sudachi dictionary path. The path may be absolute or relative; relative paths are resolved with respect to ES\_HOME. (string, default: null)
 
-**Example**
-```
+## Example
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -52,8 +63,7 @@ Follow the steps below to install.
         },
         "analyzer": {
           "sudachi_analyzer": {
-            "filter": [
-            ],
+            "filter": [],
             "tokenizer": "sudachi_tokenizer",
             "type": "custom"
           }
````
````diff
@@ -64,19 +74,21 @@ Follow the steps below to install.
 }
 ```
 
-# Corresponding Filter
+# Filters
+
 ## sudachi\_part\_of\_speech
+
 The sudachi\_part\_of\_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:
-- stoptags
 
-The `stopatgs` is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-sudachi.jar.
-Sudachi POS information is a csv list, consisting 6 items;
+`stoptags` is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-sudachi.jar.
+
+Sudachi POS information is a CSV list consisting of 6 items:
 
 - 1-4 `part-of-speech hierarchy (品詞階層)`
 - 5 `inflectional type (活用型)`
 - 6 `inflectional form (活用形)`
 
-With the `stoptags`, you can filter out the result in any of these forward matching forms;
+With `stoptags`, you can filter out the result in any of these forward matching forms:
 
 - 1 - e.g., `名詞`
 - 1,2 - e.g., `名詞,固有名詞`
````
````diff
@@ -86,8 +98,8 @@ The sudachi\_part\_of\_speech token filter removes tokens that match a set of pa
 - 6 - e.g., `終止形-一般`
 - 5,6 - e.g., `五段-カ行,終止形-一般`
 
-**PUT sudachi_sample**
-```
+### PUT sudachi_sample
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -124,16 +136,16 @@ The sudachi\_part\_of\_speech token filter removes tokens that match a set of pa
 }
 ```
 
-**POST sudachi_sample**
-```
+### POST sudachi_sample
+```json
 {
   "analyzer":"sudachi_analyzer",
   "text":"寿司がおいしいね"
 }
 ```
 
-**Which responds with:**
-```
+### Which responds with:
+```json
 {
   "tokens": [
     {
````
````diff
@@ -153,11 +165,13 @@ The sudachi\_part\_of\_speech token filter removes tokens that match a set of pa
   ]
 }
 ```
+
 ## sudachi\_ja\_stop
+
 The sudachi\_ja\_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the stop token filter instead.
 
-**PUT sudachi_sample**
-```
+### PUT sudachi_sample
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -193,16 +207,16 @@ The sudachi\_ja\_stop token filter filters out Japanese stopwords (_japanese_),
 }
 ```
 
-**POST sudachi_sample**
-```
+### POST sudachi_sample
+```json
 {
   "analyzer":"sudachi_analyzer",
   "text":"私は宇宙人です。"
 }
 ```
 
-**Which responds with:**
-```
+### Which responds with:
+```json
 {
   "tokens": [
     {
````
````diff
@@ -231,10 +245,11 @@ The sudachi\_ja\_stop token filter filters out Japanese stopwords (_japanese_),
 ```
 
 ## sudachi\_baseform
+
 The sudachi\_baseform token filter replaces terms with their SudachiBaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.
 
-**PUT sudachi_sample**
-```
+### PUT sudachi_sample
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -260,16 +275,72 @@ The sudachi\_baseform token filter replaces terms with their SudachiBaseFormAttr
 }
 ```
 
-**POST sudachi_sample**
-```
+### POST sudachi_sample
+```json
 {
   "analyzer": "sudachi_analyzer",
   "text": "飲み"
 }
 ```
 
-**Which responds with:**
+### Which responds with:
+```json
+{
+  "tokens": [
+    {
+      "token": "飲む",
+      "start_offset": 0,
+      "end_offset": 2,
+      "type": "word",
+      "position": 0
+    }
+  ]
+}
 ```
+
+## sudachi\_normalizedform
+
+The sudachi\_normalizedform token filter replaces terms with their SudachiNormalizedFormAttribute. This acts as a normalizer for spelling variants.
+
+This filter lemmatizes verbs and adjectives too. You don't need to use the sudachi\_baseform filter together with this filter.
+
+### PUT sudachi_sample
+```json
+{
+  "settings": {
+    "index": {
+      "analysis": {
+        "tokenizer": {
+          "sudachi_tokenizer": {
+            "type": "sudachi_tokenizer",
+            "resources_path": "/etc/elasticsearch/sudachi"
+          }
+        },
+        "analyzer": {
+          "sudachi_analyzer": {
+            "filter": [
+              "sudachi_normalizedform"
+            ],
+            "tokenizer": "sudachi_tokenizer",
+            "type": "custom"
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+### POST sudachi_sample
+```json
+{
+  "analyzer": "sudachi_analyzer",
+  "text": "呑み"
+}
+```
+
+### Which responds with:
+```json
 {
   "tokens": [
     {
````
````diff
@@ -284,15 +355,18 @@ The sudachi\_baseform token filter replaces terms with their SudachiBaseFormAttr
 ```
 
 ## sudachi\_readingform
+
 Convert to katakana or romaji reading.
 The sudachi\_readingform token filter replaces the token with its reading form in either katakana or romaji. It accepts the following setting:
 
-- use_romaji
-  Whether romaji reading form should be output instead of katakana. Defaults to false.
-  When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:
+### use_romaji
 
-**PUT sudachi_sample**
-```
+Whether the romaji reading form should be output instead of katakana. Defaults to false.
+
+When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:
+
+### PUT sudachi_sample
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -333,21 +407,23 @@ The sudachi\_readingform token filter replaces the token with its reading form i
 }
 ```
 
-**POST sudachi_sample**
-```
+### POST sudachi_sample
+
+```json
 {
   "analyzer": "katakana_analyzer",
-  "text": "寿司" ・・・[1]
+  "text": "寿司"
 }
 ```
+Returns `スシ`.
+
 ```
 {
   "analyzer": "romaji_analyzer",
-  "text": "寿司" ・・・[2]
+  "text": "寿司"
 }
 ```
-[1] Returns スシ.
-[2] Returns sushi.
+Returns `susi`.
 
 # Releases
 
````
````diff
@@ -367,6 +443,7 @@ The sudachi\_readingform token filter replaces the token with its reading form i
 - first release
 
 # License
+
 Copyright (c) 2017 Works Applications Co., Ltd.
 Originally under elasticsearch, https://www.elastic.co/jp/products/elasticsearch
 Originally under lucene, https://lucene.apache.org/
````
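
The `stoptags` forward matching described in the README above can be sketched in a few lines. This is a toy illustration of my reading of the rules (a stoptag matches either a prefix of the 4-level part-of-speech hierarchy, or a prefix of the inflection pair in positions 5-6); the real implementation lives in the plugin, and the example POS entries here are hypothetical:

```python
# Toy sketch of `stoptags` forward matching. A Sudachi POS entry is a
# 6-item list: 4 part-of-speech levels, an inflectional type, and an
# inflectional form.

def matches_stoptag(pos, stoptag):
    """Return True if the comma-separated stoptag forward-matches either
    the part-of-speech hierarchy (items 1-4) or the inflection pair
    (items 5-6) of the 6-item POS entry."""
    wanted = stoptag.split(",")
    return pos[:len(wanted)] == wanted or pos[4:4 + len(wanted)] == wanted

# Hypothetical POS entry for a proper noun (illustration only).
noun = ["名詞", "固有名詞", "地名", "一般", "*", "*"]

print(matches_stoptag(noun, "名詞"))           # True: level-1 forward match
print(matches_stoptag(noun, "名詞,固有名詞"))  # True: levels 1-2 match
print(matches_stoptag(noun, "動詞"))           # False
```

A token whose POS entry matches any configured stoptag would be removed by the filter.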

pom.xml

Lines changed: 2 additions & 2 deletions
```diff
@@ -4,7 +4,7 @@
 
   <groupId>com.worksap.nlp</groupId>
   <artifactId>analysis-sudachi-elasticsearch6.4</artifactId>
-  <version>1.1.0</version>
+  <version>1.2.0-SNAPSHOT</version>
   <packaging>jar</packaging>
 
   <name>analysis-sudachi</name>
@@ -14,7 +14,7 @@
     <java.version>1.8</java.version>
     <elasticsearch.version>6.4.2</elasticsearch.version>
     <lucene.version>7.4.0</lucene.version>
-    <sudachi.version>0.1.1-SNAPSHOT</sudachi.version>
+    <sudachi.version>0.1.2-SNAPSHOT</sudachi.version>
     <jacoco.skip>true</jacoco.skip>
     <sonar.skip>true</sonar.skip>
     <sonar.host.url>https://sonarcloud.io</sonar.host.url>
```
SudachiNormalizedFormFilterFactory.java

Lines changed: 38 additions & 0 deletions
```java
/*
 * Copyright (c) 2018 Works Applications Co., Ltd.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.worksap.nlp.elasticsearch.sudachi.index;

import org.apache.lucene.analysis.TokenStream;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;

import com.worksap.nlp.lucene.sudachi.ja.SudachiNormalizedFormFilter;

public class SudachiNormalizedFormFilterFactory extends AbstractTokenFilterFactory {

    public SudachiNormalizedFormFilterFactory(IndexSettings indexSettings,
            Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new SudachiNormalizedFormFilter(tokenStream);
    }
}
```
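
Conceptually, the filter this factory creates rewrites each token to its Sudachi normalized form, which (per the README) both unifies spelling variants and lemmatizes verbs and adjectives. A toy Python sketch of that behavior, using a hand-written mapping assumed purely for illustration — in the real plugin the normalized form comes from the Sudachi dictionary via SudachiNormalizedFormFilter:

```python
# Toy illustration of normalized-form token filtering. The mapping below is
# an assumption for the demo, not the actual Sudachi dictionary data.
NORMALIZED = {
    "呑み": "飲む",  # spelling variant, normalized and lemmatized
    "飲み": "飲む",  # lemmatized
}

def normalized_form_filter(tokens):
    """Replace each token with its normalized form when one is known,
    leaving unknown tokens unchanged."""
    return [NORMALIZED.get(token, token) for token in tokens]

print(normalized_form_filter(["呑み", "寿司"]))  # ['飲む', '寿司']
```

This mirrors the README example, where analyzing 呑み with the sudachi_normalizedform filter yields the token 飲む.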
