
Commit 104f46a

Merge branch 'develop' into release/6.4
# Conflicts:
#   pom.xml

2 parents 563d371 + f0c9d2f

File tree

16 files changed

+1503
-79
lines changed


README.md

Lines changed: 125 additions & 48 deletions
````diff
@@ -1,13 +1,21 @@
 # analysis-sudachi
-analysis-sudachi is Elasticsearch plugin based on Sudachi the Japanese morphological analyzer.
+
+analysis-sudachi is an Elasticsearch plugin for tokenization of Japanese text using Sudachi, the Japanese morphological analyzer.
 
 [![Build Status](https://travis-ci.org/WorksApplications/elasticsearch-sudachi.svg?branch=develop)](https://travis-ci.org/WorksApplications/elasticsearch-sudachi)
-[![Bugs](https://sonarcloud.io/api/badges/measure?key=com.worksap.nlp%3Aanalysis-sudachi&metric=bugs)](https://sonarcloud.io/project/issues?id=com.worksap.nlp%3Aanalysis-sudachi&resolved=false&types=BUG)
-[![Debt](https://sonarcloud.io/api/badges/measure?key=com.worksap.nlp%3Aanalysis-sudachi&metric=sqale_debt_ratio)](https://sonarcloud.io/component_measures/domain/Maintainability?id=com.worksap.nlp%3Aanalysis-sudachi)
-[![Coverage](https://sonarcloud.io/api/badges/measure?key=com.worksap.nlp%3Aanalysis-sudachi&metric=coverage)](https://sonarcloud.io/component_measures/metric/coverage/list?id=com.worksap.nlp%3Aanalysis-sudachi)
+[![Quality Gate](https://sonarcloud.io/api/project_badges/measure?project=com.worksap.nlp%3Aanalysis-sudachi&metric=alert_status)](https://sonarcloud.io/dashboard/index/com.worksap.nlp%3Aanalysis-sudachi)
 
 # What's new?
-- version 1.1.0: `part-of-speech forward matching` is available on `stoptags`; see [sudachi_part_of_speech](#sudachi_part_of_speech)
+
+- version 1.2.0
+  - Upgraded the Sudachi morphological analyzer to 1.2.0-SNAPSHOT
+  - Added the new filter `sudachi_normalizedform`; see [sudachi_normalizedform](#sudachi_normalizedform)
+  - Changed the default normalization behavior; neither the baseform filter nor the normalizedform filter is applied by default
+  - Changed the `sudachi_readingform` filter to use new romaji mappings based on MS-IME
+
+- version 1.1.0
+  - `part-of-speech forward matching` is available on `stoptags`; see [sudachi_part_of_speech](#sudachi_part_of_speech)
 
 # Build
 
````
````diff
@@ -17,27 +25,30 @@ analysis-sudachi is Elasticsearch plugin based on Sudachi the Japanese morpholog
 ```
 
 # Installation
+
 Follow the steps below to install.
+
 1. Change the current directory to "/usr/share/elasticsearch".
 2. Place the zip file created in "Build" in that directory.
 3. Run "sudo bin/elasticsearch-plugin install file:///usr/share/elasticsearch/<zipfile-name>".
 4. Place the dictionary file (system_core.dic or system_full.dic) under ES_HOME/sudachi.
 
 # Configuration
+
 - tokenizer: Select the tokenizer. (sudachi) (string)
 - mode: Select the mode. (normal, search, or extended) (string, default: search)
-    - normal: Regular segmentataion.
-      Ex) 関西国際空港 / アバラカダブラ
-    - search: Use a heuristic to do additional segmentation useful for search.
-      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ
-    - extended: Similar to search mode, but also unigram unknown words. (experimental)
-      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ
+    - normal: Regular segmentation. (Uses Sudachi's C mode)
+      Ex) 関西国際空港 / アバラカダブラ
+    - search: Additional segmentation useful for search. (Uses the C and A modes)
+      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ
+    - extended: Similar to search mode, but also unigrams unknown words.
+      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ
 - discard\_punctuation: Whether to discard punctuation. (bool, default: true)
 - settings\_path: Sudachi setting file path. The path may be absolute or relative; relative paths are resolved with respect to ES\_HOME. (string, default: null)
 - resources_path: Sudachi dictionary path. The path may be absolute or relative; relative paths are resolved with respect to ES\_HOME. (string, default: null)
 
-**Example**
-```
+## Example
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -52,8 +63,7 @@ Follow the steps below to install.
         },
         "analyzer": {
           "sudachi_analyzer": {
-            "filter": [
-            ],
+            "filter": [],
             "tokenizer": "sudachi_tokenizer",
             "type": "custom"
           }
````
````diff
@@ -64,19 +74,21 @@ Follow the steps below to install.
 }
 ```
 
-# Corresponding Filter
+# Filters
+
 ## sudachi\_part\_of\_speech
+
 The sudachi\_part\_of\_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:
-- stoptags
 
-The `stopatgs` is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-sudachi.jar.
-Sudachi POS information is a csv list, consisting 6 items;
+`stoptags` is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-sudachi.jar.
+
+Sudachi POS information is a CSV list consisting of 6 items:
 
 - 1-4 `part-of-speech hierarchy (品詞階層)`
 - 5 `inflectional type (活用型)`
 - 6 `inflectional form (活用形)`
 
-With the `stoptags`, you can filter out the result in any of these forward matching forms;
+With `stoptags`, you can filter out the result in any of these forward matching forms:
 
 - 1 - e.g., `名詞`
 - 1,2 - e.g., `名詞,固有名詞`
````
````diff
@@ -86,8 +98,8 @@ The sudachi\_part\_of\_speech token filter removes tokens that match a set of pa
 - 6 - e.g., `終止形-一般`
 - 5,6 - e.g., `五段-カ行,終止形-一般`
 
-**PUT sudachi_sample**
-```
+### PUT sudachi_sample
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -124,16 +136,16 @@ The sudachi\_part\_of\_speech token filter removes tokens that match a set of pa
 }
 ```
 
-**POST sudachi_sample**
-```
+### POST sudachi_sample
+```json
 {
   "analyzer":"sudachi_analyzer",
   "text":"寿司がおいしいね"
 }
 ```
 
-**Which responds with:**
-```
+### Which responds with:
+```json
 {
   "tokens": [
     {
````
````diff
@@ -153,11 +165,13 @@ The sudachi\_part\_of\_speech token filter removes tokens that match a set of pa
   ]
 }
 ```
+
 ## sudachi\_ja\_stop
+
 The sudachi\_ja\_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the stop token filter instead.
 
-**PUT sudachi_sample**
-```
+### PUT sudachi_sample
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -193,16 +207,16 @@ The sudachi\_ja\_stop token filter filters out Japanese stopwords (_japanese_),
 }
 ```
 
-**POST sudachi_sample**
-```
+### POST sudachi_sample
+```json
 {
   "analyzer":"sudachi_analyzer",
   "text":"私は宇宙人です。"
 }
 ```
 
-**Which responds with:**
-```
+### Which responds with:
+```json
 {
   "tokens": [
     {
````
````diff
@@ -231,10 +245,11 @@ The sudachi\_ja\_stop token filter filters out Japanese stopwords (_japanese_),
 ```
 
 ## sudachi\_baseform
+
 The sudachi\_baseform token filter replaces terms with their SudachiBaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.
 
-**PUT sudachi_sample**
-```
+### PUT sudachi_sample
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -260,16 +275,72 @@ The sudachi\_baseform token filter replaces terms with their SudachiBaseFormAttr
 }
 ```
 
-**POST sudachi_sample**
-```
+### POST sudachi_sample
+```json
 {
   "analyzer": "sudachi_analyzer",
   "text": "飲み"
 }
 ```
 
-**Which responds with:**
+### Which responds with:
+```json
+{
+  "tokens": [
+    {
+      "token": "飲む",
+      "start_offset": 0,
+      "end_offset": 2,
+      "type": "word",
+      "position": 0
+    }
+  ]
+}
 ```
+
+## sudachi\_normalizedform
+
+The sudachi\_normalizedform token filter replaces terms with their SudachiNormalizedFormAttribute. This acts as a normalizer for spelling variants.
+
+This filter lemmatizes verbs and adjectives too. You don't need to use the sudachi\_baseform filter together with this filter.
+
+### PUT sudachi_sample
+```json
+{
+  "settings": {
+    "index": {
+      "analysis": {
+        "tokenizer": {
+          "sudachi_tokenizer": {
+            "type": "sudachi_tokenizer",
+            "resources_path": "/etc/elasticsearch/sudachi"
+          }
+        },
+        "analyzer": {
+          "sudachi_analyzer": {
+            "filter": [
+              "sudachi_normalizedform"
+            ],
+            "tokenizer": "sudachi_tokenizer",
+            "type": "custom"
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+### POST sudachi_sample
+```json
+{
+  "analyzer": "sudachi_analyzer",
+  "text": "呑み"
+}
+```
+
+### Which responds with:
+```json
 {
   "tokens": [
     {
````
````diff
@@ -284,15 +355,18 @@ The sudachi\_baseform token filter replaces terms with their SudachiBaseFormAttr
 ```
 
 ## sudachi\_readingform
+
 Convert to katakana or romaji reading.
 The sudachi\_readingform token filter replaces the token with its reading form in either katakana or romaji. It accepts the following setting:
 
-- use_romaji
-  Whether romaji reading form should be output instead of katakana. Defaults to false.
-  When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:
+### use_romaji
 
-**PUT sudachi_sample**
-```
+Whether the romaji reading form should be output instead of katakana. Defaults to false.
+
+When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:
+
+### PUT sudachi_sample
+```json
 {
   "settings": {
     "index": {
````
````diff
@@ -333,21 +407,23 @@ The sudachi\_readingform token filter replaces the token with its reading form i
 }
 ```
 
-**POST sudachi_sample**
-```
+### POST sudachi_sample
+
+```json
 {
   "analyzer": "katakana_analyzer",
-  "text": "寿司" ・・・[1]
+  "text": "寿司"
 }
 ```
+Returns `スシ`.
+
 ```
 {
   "analyzer": "romaji_analyzer",
-  "text": "寿司" ・・・[2]
+  "text": "寿司"
 }
 ```
-[1] Returns スシ.
-[2] Returns sushi.
+Returns `susi`.
 
 # Releases
 
````
````diff
@@ -367,6 +443,7 @@ The sudachi\_readingform token filter replaces the token with its reading form i
 - first release
 
 # License
+
 Copyright (c) 2017 Works Applications Co., Ltd.
 Originally under elasticsearch, https://www.elastic.co/jp/products/elasticsearch
 Originally under lucene, https://lucene.apache.org/
````
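
The `stoptags` forward matching described in the README above can be sketched in a few lines. This is a toy illustration of my reading of the rules (a stoptag matches either a prefix of the 4-level part-of-speech hierarchy, or a prefix of the inflection pair in positions 5-6); the real implementation lives in the plugin, and the example POS entries here are hypothetical:

```python
# Toy sketch of `stoptags` forward matching. A Sudachi POS entry is a
# 6-item list: 4 part-of-speech levels, an inflectional type, and an
# inflectional form.

def matches_stoptag(pos, stoptag):
    """Return True if the comma-separated stoptag forward-matches either
    the part-of-speech hierarchy (items 1-4) or the inflection pair
    (items 5-6) of the 6-item POS entry."""
    wanted = stoptag.split(",")
    return pos[:len(wanted)] == wanted or pos[4:4 + len(wanted)] == wanted

# Hypothetical POS entry for a proper noun (illustration only).
noun = ["名詞", "固有名詞", "地名", "一般", "*", "*"]

print(matches_stoptag(noun, "名詞"))           # True: level-1 forward match
print(matches_stoptag(noun, "名詞,固有名詞"))  # True: levels 1-2 match
print(matches_stoptag(noun, "動詞"))           # False
```

A token whose POS entry matches any configured stoptag would be removed by the filter.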

pom.xml

Lines changed: 2 additions & 2 deletions
```diff
@@ -4,7 +4,7 @@
 
   <groupId>com.worksap.nlp</groupId>
   <artifactId>analysis-sudachi-elasticsearch6.4</artifactId>
-  <version>1.1.0</version>
+  <version>1.2.0-SNAPSHOT</version>
   <packaging>jar</packaging>
 
   <name>analysis-sudachi</name>
@@ -14,7 +14,7 @@
     <java.version>1.8</java.version>
     <elasticsearch.version>6.4.2</elasticsearch.version>
     <lucene.version>7.4.0</lucene.version>
-    <sudachi.version>0.1.1-SNAPSHOT</sudachi.version>
+    <sudachi.version>0.1.2-SNAPSHOT</sudachi.version>
     <jacoco.skip>true</jacoco.skip>
     <sonar.skip>true</sonar.skip>
     <sonar.host.url>https://sonarcloud.io</sonar.host.url>
```
SudachiNormalizedFormFilterFactory.java

Lines changed: 38 additions & 0 deletions
```java
/*
 * Copyright (c) 2018 Works Applications Co., Ltd.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.worksap.nlp.elasticsearch.sudachi.index;

import org.apache.lucene.analysis.TokenStream;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;

import com.worksap.nlp.lucene.sudachi.ja.SudachiNormalizedFormFilter;

public class SudachiNormalizedFormFilterFactory extends AbstractTokenFilterFactory {

    public SudachiNormalizedFormFilterFactory(IndexSettings indexSettings,
            Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new SudachiNormalizedFormFilter(tokenStream);
    }
}
```
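
Conceptually, the filter this factory creates rewrites each token to its Sudachi normalized form, which (per the README) both unifies spelling variants and lemmatizes verbs and adjectives. A toy Python sketch of that behavior, using a hand-written mapping assumed purely for illustration — in the real plugin the normalized form comes from the Sudachi dictionary via SudachiNormalizedFormFilter:

```python
# Toy illustration of normalized-form token filtering. The mapping below is
# an assumption for the demo, not the actual Sudachi dictionary data.
NORMALIZED = {
    "呑み": "飲む",  # spelling variant, normalized and lemmatized
    "飲み": "飲む",  # lemmatized
}

def normalized_form_filter(tokens):
    """Replace each token with its normalized form when one is known,
    leaving unknown tokens unchanged."""
    return [NORMALIZED.get(token, token) for token in tokens]

print(normalized_form_filter(["呑み", "寿司"]))  # ['飲む', '寿司']
```

This mirrors the README example, where analyzing 呑み with the sudachi_normalizedform filter yields the token 飲む.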
