- discard_punctuation: Whether to discard punctuation. (bool, default: true)
- settings_path: Sudachi settings file path. The path may be absolute or relative; relative paths are resolved with respect to ES_HOME. (string, default: null)
- resources_path: Sudachi dictionary path (see the layout sketch below). The path may be absolute or relative; relative paths are resolved with respect to ES_HOME. (string, default: null)
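
For reference, the examples below point both paths at /etc/elasticsearch/sudachi. A hypothetical layout of that directory (the file names are assumptions and depend on the Sudachi dictionary distribution you install):

```sh
ls /etc/elasticsearch/sudachi
# sudachi.json      <- Sudachi settings file, the settings_path target
# system_core.dic   <- Sudachi system dictionary, located via resources_path
```
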
## Example

A minimal analyzer definition follows; the dictionary paths are illustrative and should point at your Sudachi resources.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "discard_punctuation": true,
            "resources_path": "/etc/elasticsearch/sudachi",
            "settings_path": "/etc/elasticsearch/sudachi/sudachi.json"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```
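
If you are trying the example from a shell, the settings can be applied at index creation. A minimal sketch with curl, assuming a local node on localhost:9200 and the JSON above saved as sudachi_sample.json (both are assumptions; adjust to your environment):

```sh
# Create the sudachi_sample index with the analysis settings shown above.
curl -X PUT 'http://localhost:9200/sudachi_sample' \
  -H 'Content-Type: application/json' \
  -d @sudachi_sample.json
```
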
# Filters
## sudachi_part_of_speech
The sudachi_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:
- stoptags
`stoptags` is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file bundled in lucene-analysis-sudachi.jar.
Sudachi POS information is a CSV list consisting of 6 items:
- 1-4 `part-of-speech hierarchy (品詞階層)`
- 5 `inflectional type (活用型)`
- 6 `inflectional form (活用形)`
With `stoptags`, you can filter out results using any of these forward-matching forms:
- 1 - e.g., `名詞`
- 1,2 - e.g., `名詞,固有名詞`
- 1,2,3 - e.g., `名詞,固有名詞,地名`
- 1,2,3,4 - e.g., `名詞,固有名詞,地名,一般`
- 5 - e.g., `五段-カ行`
- 6 - e.g., `終止形-一般`
- 5,6 - e.g., `五段-カ行,終止形-一般`
### PUT sudachi_sample
```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_posfilter": {
            "type": "sudachi_part_of_speech",
            "stoptags": [
              "助詞",
              "助動詞",
              "補助記号,句点",
              "補助記号,読点"
            ]
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "my_posfilter"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```
### POST sudachi_sample
```json
{
  "analyzer": "sudachi_analyzer",
  "text": "寿司がおいしいね"
}
```
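
One way to issue this request, sketched with curl (assumes the sudachi_sample index defined above on a local node at localhost:9200):

```sh
# Run the analyzer against a sample sentence via the _analyze API.
curl -X POST 'http://localhost:9200/sudachi_sample/_analyze' \
  -H 'Content-Type: application/json' \
  -d '{ "analyzer": "sudachi_analyzer", "text": "寿司がおいしいね" }'
```
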
### Which responds with:
```json
{
  "tokens": [
    {
      "token": "寿司",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "おいしい",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}
```
## sudachi_ja_stop
The sudachi_ja_stop token filter removes Japanese stopwords (_japanese_) and any custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list; if you want to use a different predefined list, use the stop token filter instead.
### PUT sudachi_sample
```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_stopfilter": {
            "type": "sudachi_ja_stop",
            "stopwords": [
              "_japanese_",
              "は",
              "です"
            ]
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "my_stopfilter"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```
### POST sudachi_sample
```json
{
  "analyzer": "sudachi_analyzer",
  "text": "私は宇宙人です。"
}
```
### Which responds with:
```json
{
  "tokens": [
    {
      "token": "私",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "宇宙",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "人",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 3
    }
  ]
}
```
## sudachi_baseform
The sudachi_baseform token filter replaces terms with their SudachiBaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.
### PUT sudachi_sample
```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "sudachi_baseform"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```
### POST sudachi_sample
```json
{
  "analyzer": "sudachi_analyzer",
  "text": "飲み"
}
```
### Which responds with:
```json
{
  "tokens": [
    {
      "token": "飲む",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
```
## sudachi_normalizedform
The sudachi_normalizedform token filter replaces terms with their SudachiNormalizedFormAttribute. This acts as a normalizer for spelling variants.
This filter also lemmatizes verbs and adjectives, so you do not need to combine it with the sudachi_baseform filter.
### PUT sudachi_sample
```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "sudachi_normalizedform"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```
### POST sudachi_sample
```json
{
  "analyzer": "sudachi_analyzer",
  "text": "呑み"
}
```
### Which responds with:
```json
{
  "tokens": [
    {
      "token": "飲む",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
```
## sudachi_readingform
The sudachi_readingform token filter replaces each token with its reading form, in either katakana or romaji. It accepts the following setting:
### use_romaji
Whether the romaji reading form should be output instead of katakana. Defaults to false.
When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:
### PUT sudachi_sample
```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "romaji_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": false
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "romaji_readingform"
            ]
          },
          "katakana_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "katakana_readingform"
            ]
          }
        }
      }
    }
  }
}
```
### POST sudachi_sample
```json
{
  "analyzer": "katakana_analyzer",
  "text": "寿司"
}
```
Returns `スシ`.

```json
{
  "analyzer": "romaji_analyzer",
  "text": "寿司"
}
```
Returns `susi`.
# Releases
- first release
# License
Copyright (c) 2017 Works Applications Co., Ltd.
Originally under elasticsearch, https://www.elastic.co/jp/products/elasticsearch
Originally under lucene, https://lucene.apache.org/