Elasticsearch on multiple fields with partial and full matches
Our Account model has first_name, last_name, and ssn (social security number).
I want partial matches on first_name and last_name, but an exact match on ssn. This is what I have so far:
    settings analysis: {
      filter: {
        substring: { type: "ngram", min_gram: 3, max_gram: 50 },
        ssn_string: { type: "ngram", min_gram: 9, max_gram: 9 },
      },
      analyzer: {
        index_ngram_analyzer: {
          type: "custom",
          tokenizer: "standard",
          filter: ["lowercase", "substring"]
        },
        search_ngram_analyzer: {
          type: "custom",
          tokenizer: "standard",
          filter: ["lowercase", "substring"]
        },
        ssn_ngram_analyzer: {
          type: "custom",
          tokenizer: "standard",
          filter: ["ssn_string"]
        },
      }
    } do
      mapping do
        [:first_name, :last_name].each do |attribute|
          indexes attribute, type: 'string',
                             index_analyzer: 'index_ngram_analyzer',
                             search_analyzer: 'search_ngram_analyzer'
        end
        indexes :ssn, type: 'string', index: 'not_analyzed'
      end
    end

My search is as follows:
    query: {
      multi_match: {
        fields: ["first_name", "last_name", "ssn"],
        query: query,
        type: "cross_fields",
        operator: "and"
      }
    }
So this works:

    Account.search("erik").records.to_a

and this (for Erik Smith):

    Account.search("erik smi").records.to_a

and the ssn on its own:

    Account.search("111112222").records.to_a

but this does not:

    Account.search("erik 111112222").records.to_a

Any idea whether I am indexing or querying wrong?
Thank you for any help!
Does it have to be done with a single query string? If not, I would do this:
    PUT /test_index
    {
      "settings": {
        "number_of_shards": 1,
        "analysis": {
          "filter": {
            "ngram_filter": {
              "type": "ngram",
              "min_gram": 2,
              "max_gram": 20
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [ "lowercase", "ngram_filter" ]
            }
          }
        }
      },
      "mappings": {
        "doc": {
          "_all": {
            "enabled": true,
            "index_analyzer": "ngram_analyzer",
            "search_analyzer": "standard"
          },
          "properties": {
            "first_name": { "type": "string", "include_in_all": true },
            "last_name": { "type": "string", "include_in_all": true },
            "ssn": { "type": "string", "index": "not_analyzed", "include_in_all": false }
          }
        }
      }
    }

Notice the use of the _all field. I included first_name and last_name in _all, but not ssn, and ssn is not analyzed at all since I want exact matches against it.
I indexed a couple of documents for illustration:

    POST /test_index/doc/_bulk
    {"index":{"_id":1}}
    {"first_name":"erik","last_name":"smith","ssn":"111112222"}
    {"index":{"_id":2}}
    {"first_name":"bob","last_name":"jones","ssn":"123456789"}

Then I can query for partial names, and filter by the exact ssn:
    POST /test_index/doc/_search
    {
      "query": {
        "filtered": {
          "query": {
            "match": {
              "_all": {
                "query": "eri smi",
                "operator": "and"
              }
            }
          },
          "filter": {
            "term": { "ssn": "111112222" }
          }
        }
      }
    }

And I get what I'm expecting:
    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.8838835,
        "hits": [
          {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.8838835,
            "_source": {
              "first_name": "erik",
              "last_name": "smith",
              "ssn": "111112222"
            }
          }
        ]
      }
    }

If you need to be able to search with a single query string (no filter), you could include ssn in the _all field as well, but with this setup it will then match on partial strings (like 111112), which may not be what you want.
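If a single search box is still required on the application side, one possible workaround is to split the input before it reaches Elasticsearch and build the filtered query from this answer in Ruby. The following is only a sketch under stated assumptions: `build_account_query` and `SSN_PATTERN` are hypothetical names, an ssn is assumed to be typed as exactly nine digits with no dashes, and the index is assumed to use the `_all` mapping shown above.

```ruby
# Hypothetical helper, not part of any Elasticsearch client library.
# Assumption: an ssn is entered as exactly nine digits with no dashes.
SSN_PATTERN = /\A\d{9}\z/

# Splits the raw input into name terms and an optional ssn, then builds
# a query hash shaped like the filtered query in this answer.
def build_account_query(raw)
  terms = raw.to_s.split
  # Use the first nine-digit token as the ssn; every nine-digit token is
  # kept out of the name query.
  ssn = terms.find { |t| t.match?(SSN_PATTERN) }
  name_terms = terms.reject { |t| t.match?(SSN_PATTERN) }

  name_query =
    if name_terms.empty?
      { match_all: {} }
    else
      { match: { _all: { query: name_terms.join(" "), operator: "and" } } }
    end

  # No ssn in the input: a plain name query is enough.
  return { query: name_query } if ssn.nil?

  # ssn present: partial match on names, exact term filter on ssn.
  { query: { filtered: { query: name_query, filter: { term: { ssn: ssn } } } } }
end
```

With this, `build_account_query("erik 111112222")` yields a partial-name match on "erik" filtered by the exact ssn, which could then be passed to the model's search method if that method accepts a query hash.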
If you only want to match prefixes (i.e., search terms that start at the beginning of words), you should use edge ngrams.
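To make the difference concrete, here is a small plain-Ruby imitation of the tokens each filter would emit for a single word. The helpers `ngrams` and `edge_ngrams` are illustrative names only and are not Elasticsearch code; they just mimic the min_gram/max_gram behaviour described above.

```ruby
# Illustrative only: mimics the tokens an ngram filter emits for one word,
# i.e. every substring whose length is between min and max.
def ngrams(word, min, max)
  (min..max).flat_map do |size|
    # The range is empty when size exceeds the word length, so nothing is emitted.
    (0..word.length - size).map { |start| word[start, size] }
  end
end

# Illustrative only: mimics an edge ngram filter, which keeps just the
# substrings anchored at the start of the word.
def edge_ngrams(word, min, max)
  (min..[max, word.length].min).map { |size| word[0, size] }
end

p ngrams("smith", 2, 20)
# => ["sm", "mi", "it", "th", "smi", "mit", "ith", "smit", "mith", "smith"]
p edge_ngrams("smith", 2, 20)
# => ["sm", "smi", "smit", "smith"]
```

With plain ngrams a search for "mit" would match "smith", while with edge ngrams only prefixes such as "smi" would. The same reasoning shows why an ssn indexed with plain ngrams would match a fragment like 111112.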
I wrote a blog post about using ngrams which might help you out a little: http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch
Here is the code I used for this answer. I tried a few different things, including the setup I posted here, and another one including ssn in _all, but with edge ngrams. Hope that helps:
http://sense.qbox.io/gist/b6a31c929945ef96779c72c468303ea3bc87320f