ElasticSearch - Managing Mappings

Using explicit mapping creation
Mapping base types
Mapping arrays
Mapping an object
Mapping a document
Using dynamic templates in document mapping
Managing nested objects
Managing a child document
Adding a field with multiple mappings
Mapping a geo point field
Mapping a geo shape field
Mapping an IP field
Mapping an attachment field
Adding metadata to a mapping
Specifying a different analyzer
Mapping a completion suggester

Introduction

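Mapping is an important concept in ElasticSearch: it defines how the search engine processes a document.

A search engine performs two main operations:

Indexing - receiving a document and storing, indexing, and processing it in an index.
Searching - retrieving data from the index.

These two operations are tightly coupled: an error made at indexing time can carry over to searching, producing unexpected results or missing important content.

ElasticSearch has an explicit mapping at the index/type level. At indexing time, if no mapping has been provided, a default one is created by guessing the structure from the fields that make up the document. The new mapping is then propagated automatically to all cluster nodes.

The default type mapping has sensible default values. To change its behavior or to customize other aspects of indexing (storing, ignoring, completion, and so on), you need to provide a new mapping definition.

The following sections show how to work with the different kinds of mapping.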

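Using explicit mapping creation

If an index is compared to a SQL database, a mapping is like the table definition or schema.

ElasticSearch can work out the structure of the documents being indexed and create the mapping definition automatically (explicit mapping creation).

To follow along, you need a working ElasticSearch cluster and basic knowledge of JSON.

Create an index

As in the earlier getting-started introduction, use cURL directly against the cluster: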

[kedy@es1 ~]$ curl -XPUT http://es1:9200/test

The result appears at the prompt:

{"acknowledged":true}

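Index created but shards not allocated

While working through this, I ran into a case where the shards were not allocated after the index was created, which put the cluster into an error state (red). When that happens, delete the index and confirm that the cluster status returns to green, then recreate the index. After that, check that the shards and replicas are allocated correctly and that the cluster status is still green.

Put a document

Use cURL against the cluster to put a document containing two fields, name and age, written as JSON:

In the cluster URL, the path segments after the host are, in order:

index
type
id

They are separated by slashes (/). The -d parameter then supplies the fields and their data.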

[kedy@es1 ~]$ curl -XPUT http://es1:9200/test/mytype/1 -d '{"name":"kedy", "age":"31"}'

The result appears at the prompt:

{"_index":"test","_type":"mytype","_id":"1","_version":2,"_shards":{"total":2,"successful":2,"failed":0},"created":true}

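The result shows the index, type, id, _version, and shard information for the operation, which tells you whether it succeeded. The final created field indicates whether the document was created (true) or updated (false).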
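
The response fields described above can also be checked from a script. The following is a minimal sketch in Python, not part of the original walkthrough: the helper name summarize_index_response is hypothetical, and the response string is the one shown above.

```python
import json

# Response body copied from the example above.
raw = ('{"_index":"test","_type":"mytype","_id":"1","_version":2,'
       '"_shards":{"total":2,"successful":2,"failed":0},"created":true}')

def summarize_index_response(body):
    """Hypothetical helper: reduce an index response to the facts we care about."""
    resp = json.loads(body)
    return {
        # Rebuild the index/type/id path the document was written to.
        "path": "/{}/{}/{}".format(resp["_index"], resp["_type"], resp["_id"]),
        # created is true for a new document and false for an update.
        "action": "created" if resp["created"] else "updated",
        # No shard reported a failure for this write.
        "no_shard_failures": resp["_shards"]["failed"] == 0,
    }

summary = summarize_index_response(raw)
print(summary)
```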

Show the mapping

To see the mapping of a type, query it with cURL using the command below. No mapping was specified earlier, so the field types are the ones ElasticSearch mapped automatically:

curl -XGET http://es1:9200/test/mytype/_mapping?pretty=true

Appending _mapping after host/index/type shows the result. The pretty parameter makes the output easier to read; with it, the result is a nested JSON object:

{
    "test" : {
        "mappings" : {
            "mytype" : {
                "properties" : {
                    "age" : {
                        "type" : "string"
                    },
                    "name" : {
                        "type" : "string"
                    }
                }
            }
        }
    }
}

Without pretty=true, or with pretty=false, the result looks like this:

{"test":{"mappings":{"mytype":{"properties":{"age":{"type":"string"},"name":{"type":"string"}}}}}}
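
Both forms carry the same data, so a script can read the compact form directly. As a rough sketch (the helper field_types is a hypothetical name, and Python is used here only for illustration), the following pulls each field's type out of the response above:

```python
import json

# Compact mapping response copied from the example above.
compact = ('{"test":{"mappings":{"mytype":{"properties":'
           '{"age":{"type":"string"},"name":{"type":"string"}}}}}}')

def field_types(mapping_body, index, doc_type):
    """Hypothetical helper: return {field name: type} for one type of one index."""
    mapping = json.loads(mapping_body)
    # Walk index -> "mappings" -> type -> "properties", as in the nested output.
    props = mapping[index]["mappings"][doc_type]["properties"]
    return {field: spec["type"] for field, spec in props.items()}

types = field_types(compact, "test", "mytype")
print(json.dumps(types, indent=4, sort_keys=True))
```

Here age is mapped as string, which matches the earlier document supplying it as the quoted value "31".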

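Summary

The steps above created an index, put a document, and displayed the mapping. During the document indexing phase, ElasticSearch checks whether the type exists; if it does not, ElasticSearch creates it dynamically, inferring a suitable type for each field.

ElasticSearch reads all the default properties for the fields of the mapping and starts processing them:

If the field is already present in the mapping and its value is valid (it matches the mapped type), ElasticSearch does not change the current mapping.
If the field is already present in the mapping but its value is of a different type, the type inference engine tries to upgrade the field type (for example, from integer to long). If the types are incompatible, an exception is thrown and the indexing process fails.
If the field is not present, ElasticSearch detects its type automatically and updates the mapping with the new field.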
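
The three rules above can be sketched as a toy model. This is only an illustration of the decision logic, not ElasticSearch's actual implementation: the names update_mapping, UPGRADES, and IncompatibleTypeError are hypothetical, and the only upgrade modeled is the integer to long case mentioned in the text.

```python
# Toy model of the three rules above; this is NOT ElasticSearch's real code.

class IncompatibleTypeError(Exception):
    """Raised when a value's type cannot be reconciled with the mapping."""

# Assumption: only the upgrade named in the text (integer -> long) is allowed.
UPGRADES = {("integer", "long")}

def update_mapping(mapping, field, value_type):
    """Apply one field of an incoming document to a {field: type} mapping."""
    current = mapping.get(field)
    if current is None:
        # Rule 3: unknown field, so add it with the detected type.
        mapping[field] = value_type
        return "added"
    if current == value_type:
        # Rule 1: the value matches the mapped type; nothing changes.
        return "unchanged"
    if (current, value_type) in UPGRADES:
        # Rule 2: a compatible wider type upgrades the field.
        mapping[field] = value_type
        return "upgraded"
    # Rule 2, failure case: incompatible types abort indexing.
    raise IncompatibleTypeError(
        "field %r is %s, got %s" % (field, current, value_type))

mapping = {"quantity": "integer"}
print(update_mapping(mapping, "quantity", "integer"))  # unchanged
print(update_mapping(mapping, "quantity", "long"))     # upgraded
print(update_mapping(mapping, "name", "string"))       # added
print(mapping)
```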

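Every indexed document is uniquely identified by a UID, which is stored in a special field of the document named _uid. Its value is computed automatically from _id. The _id value is provided at indexing time; if it is missing, ElasticSearch assigns one automatically.

When a mapping type is created or changed, ElasticSearch automatically propagates the change to all nodes in the cluster, so that every shard containing that type applies it.

Mapping base types

Explicit mapping creation lets you start inserting data quickly with a schemaless approach, without worrying about choosing suitable field types. To get better search results and performance after indexing, however, you need to define the mapping manually.

A fine-tuned mapping has the following advantages:

It reduces the size of the index on disk (by disabling functionality for custom fields)
It indexes only the fields of interest (a general boost to performance)
It precooks data for fast search or real-time analytics (such as aggregations)
It correctly defines whether a field must be analyzed into multiple tokens or treated as a single token

ElasticSearch also lets you use base fields with a wide range of configurations.

Mapping field example

A shop order is used here as the example for building field mappings: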

Name	Type	Description
id	Identifier	Unique order identifier
date	Date (time)	Order date
customer_id	Id reference	Customer ID
name	String	Item name
quantity	Integer	Number of items
vat	Double	VAT for the item
sent	Boolean	Whether the order has been sent
Each order record must be converted to an ElasticSearch mapping definition, which looks like this:

{
    "order" : {
        "properties" : {
            "id" : {"type" : "string", "store" : "yes" , "index":"not_analyzed"},
            "date" : {"type" : "date", "store" : "no", "index":"not_analyzed"},
            "customer_id" : {"type" : "string", "store" : "yes", "index":"not_analyzed"},
            "sent" : {"type" : "boolean", "index":"not_analyzed"},
            "name" : {"type" : "string", "index":"not_analyzed"},
            "quantity" : {"type" : "integer", "index":"not_analyzed"},
            "vat" : {"type" : "double", "index":"no"}
        }
    }
}
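
One way to prepare this definition for the cluster is sketched below in Python. It follows the field table above (date as a date, name as a string) and only builds the request without sending it. Several things are assumptions rather than part of the original text: the host es1:9200 and the index test are reused from the earlier examples, the PUT /{index}/_mapping/{type} path is assumed to match the ElasticSearch version in use, and build_put_mapping_request is a hypothetical helper name.

```python
import json
import urllib.request

# Assumptions: host and index are reused from the earlier cURL examples,
# and the PUT /{index}/_mapping/{type} path fits this ElasticSearch version.
ES_URL = "http://es1:9200"

order_mapping = {
    "order": {
        "properties": {
            "id": {"type": "string", "store": "yes", "index": "not_analyzed"},
            "date": {"type": "date", "store": "no", "index": "not_analyzed"},
            "customer_id": {"type": "string", "store": "yes", "index": "not_analyzed"},
            "sent": {"type": "boolean", "index": "not_analyzed"},
            "name": {"type": "string", "index": "not_analyzed"},
            "quantity": {"type": "integer", "index": "not_analyzed"},
            "vat": {"type": "double", "index": "no"},
        }
    }
}

def build_put_mapping_request(index, doc_type, mapping):
    """Hypothetical helper: build (but do not send) a put-mapping request."""
    url = "{}/{}/_mapping/{}".format(ES_URL, index, doc_type)
    body = json.dumps(mapping).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")

request = build_put_mapping_request("test", "order", order_mapping)
print(request.get_method(), request.full_url)
# To actually send it against a live cluster: urllib.request.urlopen(request)
```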

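The mapping is now ready to be put into the index. Next, let's look at what each part of the definition means.

In programming we need to know the data type of a variable. Each common type corresponds to an ElasticSearch type, as summarized in the table below: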

Type	ElasticSearch Type	Description
String, VarChar, Text	string	A text field, for example: kedy, abc123, UDN1234
Integer	integer	32-bit integer
Long	long	64-bit integer
Float	float	32-bit floating point number
Double	double	64-bit floating point number
Boolean	boolean	Boolean value (true, false)
Date/Datetime	date	A date or datetime, for example: 2015-12-30 or 2015-12-30T21:24:00
Bytes/Binary	binary	Holds binary files or byte streams
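
As an illustration of the table, the sketch below picks an ElasticSearch type name for a Python value. It is a rough aid for choosing types by hand, not how ElasticSearch itself detects types; the function name es_type_for is hypothetical, and Python floats are treated as double because they are 64-bit.

```python
import datetime

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def es_type_for(value):
    """Hypothetical helper: map a Python value to a type name from the table."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        if INT32_MIN <= value <= INT32_MAX:
            return "integer"
        if INT64_MIN <= value <= INT64_MAX:
            return "long"
        raise ValueError("integer does not fit in 64 bits")
    if isinstance(value, float):
        return "double"
    # datetime.datetime is a subclass of datetime.date, so this covers both.
    if isinstance(value, datetime.date):
        return "date"
    if isinstance(value, (bytes, bytearray)):
        return "binary"
    if isinstance(value, str):
        return "string"
    raise TypeError("no type in the table for %r" % type(value))

for sample in ["kedy", 31, 2**40, 0.05, True, datetime.date(2015, 12, 30)]:
    print(repr(sample), "->", es_type_for(sample))
```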

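Depending on the data type, you can choose a suitable explicit type mapping in ElasticSearch, which gives better control over how each field is processed. The commonly used options, set on each mapped field, are as follows:

Option	Description
store	Marks whether the field is stored in a separate index fragment for fast retrieval. Storing a field consumes disk space, but reduces computation when the field has to be extracted from a document (that is, in scripting and aggregations). The possible values are yes and no (the default is no). Stored fields are faster for faceted search.
index	Determines whether the field is indexed (the default is analyzed). There are three possible values: no - the field is not indexed, which is suitable if it will never be used in a search; analyzed - the field is analyzed with the configured analyzer, generally lowercased and tokenized, using the default ElasticSearch configuration; not_analyzed - the field is indexed but not processed by an analyzer. The default ElasticSearch configuration uses the KeywordAnalyzer field, which processes the field as a single token.
null_value	The default value used for the field when no value is provided.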
boost	This is used to change the importance of a field (the default value is 1.0).
index_analyzer	This defines an analyzer to be used in order to process a field. If it is not defined, the analyzer of the parent object is used (the default value is null).
search_analyzer	This defines an analyzer to be used during the search. If it is not defined, the analyzer of the parent object is used (the default value is null).
analyzer	This sets both the index_analyzer and search_analyzer field to the defined value (the default value is null).
include_in_all	This marks the current field to be indexed in the special _all field (a field that contains the concatenated text of all the fields). The default value is true.
index_name	This is the name of the field to be stored in the Index. This property allows you to rename the field at the time of indexing. It can be used to manage data migration over time without breaking the application layer due to changes.
norms	This controls the Lucene norms. This parameter is used to score queries better. If the field is used only for filtering, it's best practice to disable it in order to reduce resource usage (the default value is true for analyzed fields and false for the not_analyzed ones).

Managing Mappings