concepts

  • cluster: a collection of nodes
  • node: a single server that stores data and participates in the cluster
  • index: a collection of similar documents
  • type: a category or partition within an index (deprecated in recent versions)
  • document: a JSON-formatted unit of data
  • shard/replica: a shard is a portion of an index, whereas a replica is a copy of a shard

(figure: concepts visual)

roles of node

  • master
  • data
  • ingest: works somewhat like a simplified Logstash
  • ml (machine learning)

by default, a node has the d-i-m roles (data, ingest, master). for different cluster sizes, especially medium and large, we could have dedicated master nodes.

don’t change roles unless you fully understand what you are doing.

easy CRUD operations

<REST verb> /<index>/<type>/<id>

for example: GET _cat/nodes?v or GET _cat/indices?v

the requests above read information out of elasticsearch

or

PUT /sales/order/123
{
    "orderID": "123",
    "orderAmount": "500"
}

this creates a document

it’s best practice to create indices first and then add documents.

or

POST /sales/_update/123
{
    "doc": {
        "orderID": 1234
    }
}

this updates a document (a partial update via the doc field)

scripted update

POST /products/_update/123
{
    "script": {
        "source": "ctx._source.in_stock -= params.quantity",
        "params": {
            "quantity": 4
        }
    }
}

ctx is short for context

there are even more complicated scripts

you can insert if conditions to divert operations.

POST /products/_update/123
{
    "script": {
        "source" : """
            if (ctx._source.in_stock == 0) {
                ctx.op = 'noop';
            }

            ctx._source.in_stock--;
        """
    }
}

bonus information

reading documents – routing

how does elasticsearch know which shard to store documents on? how are documents found once they have been indexed?

elasticsearch uses a routing mechanism to locate documents. by default, _routing is the document’s _id.

shard_num = hash(_routing) % num_primary_shards

as you can see, the equation has a parameter called num_primary_shards, and remember that once an index is created, its number of primary shards cannot be changed. one has to use the shrink or split API to reindex, to make sure old documents stay searchable.
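
a minimal sketch of custom routing (index and field names are illustrative): the same routing value must be supplied when reading the document back.

PUT /sales/_doc/123?routing=user1
{
    "orderID": "123",
    "orderAmount": "500"
}

GET /sales/_doc/123?routing=user1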

writing documents – primary term and checkpoints

the primary term is a counter for how many times the primary shard of a replication group has changed.

a sequence number is a counter that is incremented for each write operation. note that the primary shard increases the sequence number.

as for checkpoints:

each replication group has a global checkpoint, and each replica shard has a local checkpoint.

the global checkpoint is the sequence number that all shards within the replication group have been applied at least up to.

a local checkpoint is the sequence number of the last write operation a given shard performed.

checkpoints are essentially sequence numbers.

given the above information, we can infer what happens when recovering from a network failure:

  • when a failed primary shard wants to rejoin, it compares against the global checkpoint and re-applies the operations later than the global checkpoint

  • when a failed replica shard wants to rejoin, it compares against its local checkpoint and only operations with higher sequence numbers are replayed

bulk api

when writing many documents at the same time, we want to use the /_bulk API

in terminal, we do

curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/<index>/<type>/_bulk?pretty --data-binary "@<json_file_name>"; echo

the data file (the “@<json_file_name>” argument) must end with a newline.

in Kibana

POST /_bulk
{ <action> }
{ <document> }
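
a minimal concrete sketch (index and field names are illustrative):

POST /_bulk
{ "index": { "_index": "sales", "_id": "1" } }
{ "orderID": "1", "orderAmount": "500" }
{ "update": { "_index": "sales", "_id": "1" } }
{ "doc": { "orderAmount": "600" } }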

data types

  • core data types: text, numeric, boolean, binary
  • complex: array, object
  • geo: geo_point, geo_shape
  • specialized: ip, token_count

meta fields

  • _index: name of the index
  • _id: the document’s id
  • _source: the original JSON object used when indexing a document
  • _field_names: names of every field that holds a non-null value
  • _routing: the custom routing value, if any
  • _version: the document’s version number
  • _meta: application-specific metadata in the mapping

mappings

for simple uses of elasticsearch, mappings are rarely needed. but explicit mappings let you control elasticsearch precisely.

or just trust that dynamic mapping is good enough.
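
a minimal sketch of an explicit mapping (index, field names, and types are assumptions for illustration):

PUT /sales
{
    "mappings": {
        "properties": {
            "orderID": { "type": "keyword" },
            "orderAmount": { "type": "double" }
        }
    }
}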

analyzers

once we index a document, elasticsearch runs the whole analysis process. when we later search, we are not searching the document; we are searching the inverted index.

an analyzer consists of (in sequential order):

  • character filters: manipulate text before tokenization, e.g. stripping markup such as html. the built-in ones are:
    • html strip character filter (html_strip)
    • mapping character filter (mapping)
    • pattern replace character filter (pattern_replace)
  • tokenizer: chops the text into terms (tokens). the 3 main categories are:
    • word oriented tokenizers
      • standard tokenizer (standard)
      • letter tokenizer (letter)
      • lowercase tokenizer (lowercase)
      • whitespace tokenizer (whitespace)
      • uax url email tokenizer (uax_url_email)
    • partial word tokenizers
      • N-Gram tokenizer (ngram): “red wine” -> [re, red, ed, wi, win, wine, in, ine, ne]
      • edge N-Gram tokenizer (edge_ngram)
    • structured text tokenizers
      • keyword tokenizer (keyword)
      • pattern tokenizer (pattern)
      • path tokenizer (path_hierarchy)
  • token filters: transform the tokens, e.g. lowercasing them, removing stop words, stemming
    • standard
    • lowercase
    • uppercase
    • nGram
    • edgeNGram
    • stop
    • word_delimiter
    • stemmer
    • keyword_marker
    • snowball
    • synonym

the standard analyzer doesn’t contain a character filter

  • built-in analyzers:

    • standard analyzer
    • simple analyzer
    • stop analyzer
    • language analyzer
    • keyword analyzer
    • pattern analyzer
    • whitespace analyzer
  • customizing analyzers

hierarchy: settings -> analysis -> analyzer/filter/......

we can also create our own analyzer. within the analyzer block, we need to give the tokenizer, char_filter, and token filter(s).
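
a minimal sketch (index and analyzer names are assumptions for illustration):

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"]
                }
            }
        }
    }
}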

previously, people removed stop words before analysis. but with the development of search algorithms, we don’t need to remove stop words anymore; modern relevance algorithms handle them well.

inverted index: an index data structure storing a mapping from content (e.g. words) to its locations (documents)

there are two types of inverted index: a record-level inverted index and a word-level inverted index.

the latter requires more processing power and space.

a concrete example

Suppose we want to search the texts “hello everyone”, “this article is based on inverted index”, and “which is hashmap like data structure”. If we index by (text number, word position within the text), the inverted index with locations is:

 hello                (1, 1)
 everyone             (1, 2)
 this                 (2, 1)
 article              (2, 2)
 is                   (2, 3); (3, 2)
 based                (2, 4)
 on                   (2, 5)
 inverted             (2, 6)
 index                (2, 7)
 which                (3, 1)
 hashmap              (3, 3)
 like                 (3, 4)
 data                 (3, 5)
 structure            (3, 6)

query/search

elasticsearch uses the query DSL, a query language based on JSON

query DSL

  • Leaf Query: searches for a value in a particular field
  • Compound Query: contains multiple leaf queries or other compound queries

the algorithm behind _score

elasticsearch previously used TF/IDF, but now it uses BM25. in general they are similar, with slight differences; because of the similarity, we will discuss TF/IDF anyway.

  • term frequency TF: how many times the term appears in a given DOCUMENT. the more often it appears, the more important it is
  • inverse document frequency IDF: how often the term appears across the INDEX (all documents). the more often it appears, the less important it is

the TF/IDF formula is w(i, j) = tf(i, j) x log(N/df(i)) where

tf(i, j) is the number of occurrences of term i in document j

df(i) is the number of documents containing term i

N is the total number of documents
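
for example (made-up numbers): if term i appears 3 times in document j, and appears in 10 out of N = 1000 documents, then w(i, j) = 3 x log(1000/10) = 3 x log(100) = 6, using log base 10.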

  • a third factor to consider: field length norm

meaning how long the field is. if the term appears in a field of length 50, it is more important than the same term appearing in a field of length 5000.

query context and filter context

  • query context will calculate a relevance score
  • filter context does a boolean evaluation: it’s either a match or not. no score is calculated, and filter results can be cached
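
a minimal sketch combining both contexts (field names from the kibana bank sample data):

GET bank/account/_search
{
    "query": {
        "bool": {
            "must": [
                { "match": { "address": "street" } }
            ],
            "filter": [
                { "range": { "age": { "gte": 30 } } }
            ]
        }
    }
}

here the match contributes to _score, while the range only filters.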

term level and full text query

a term level query doesn’t do any analysis; it searches the inverted index directly.

a full text query searches analyzed text fields, meaning the query string is analyzed first.

one condition

the match all query:

GET bank/account/_search

or

GET bank/account/_search
{
    "query": {
        "match": {
            "state": "CA"
        }
    }
}

remember that match will analyze the text first

multi conditions

GET bank/account/_search
{
    "query": {
        "bool": {
            "must/must_not": [
                {"match": {
                    //match condition 1
                }},
                {"match": {
                    //match condition 2
                }}
            ]
        }
    }
}

this is a bool query

bool queries are usually used to combine multiple queries
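
a concrete sketch of the placeholders above (field values assumed from the bank sample data):

GET bank/account/_search
{
    "query": {
        "bool": {
            "must": [
                {"match": {"state": "CA"}},
                {"match": {"gender": "F"}}
            ]
        }
    }
}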

boost

the boost keyword pushes a document toward the front; in other words, it gives the document higher priority.

boost works with should

example:

GET bank/account/_search
{
    "query": {
        "bool": {
            "should": [
                {"match": {"state": "CA"}},
                {"match": {"lastname": {"query": "Smith", "boost": 3}}}
            ]
        }
    }
}

the above query boosts matches on the lastname field; the boost will affect _score

term-level query

term only works with keyword and numeric values; it will not work with analyzed text fields

terms can query multiple inputs
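
a minimal sketch (field names from the bank sample data):

GET bank/account/_search
{
    "query": {
        "terms": {
            "state.keyword": ["CA", "WA"]
        }
    }
}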

range queries use gte and lte to search within a range

example:

GET bank/account/_search
{
    "query": {
        "range": {
            "account_number": {
                "gte": 516,
                "lte": 851,
                "boost": 2
            }
        }
    }
}

joining queries

in elasticsearch, or key-value pair storage in general, there are no primary keys and foreign keys the way a relational db has.

the idea of a ‘join’ is usually implemented at the application level.

nested inner hits

inner_hits is an option of nested queries; it gives insight into which nested objects the hits are coming from

join_field

defines parent-child relations, i.e. which side a given document belongs to: parent or child.

parent and child documents have to be in the same shard, so we have to provide a routing value.
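
a minimal sketch of a join field (index, field, and relation names are assumptions for illustration):

PUT /my_index
{
    "mappings": {
        "properties": {
            "my_join_field": {
                "type": "join",
                "relations": { "question": "answer" }
            }
        }
    }
}

PUT /my_index/_doc/2?routing=1
{
    "text": "an answer",
    "my_join_field": { "name": "answer", "parent": "1" }
}

the ?routing=1 routes the child to the same shard as its parent (document 1).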

has_parent, has_child

has_parent returns child documents whose parent matches the query; has_child returns parent documents whose children match.

analysis and tokenization

the _analyze endpoint will analyze the text and return the tokens.

example

GET bank/_analyze
{
    "tokenizer": "standard",
    "text": "The Moon is Made of Cheese someone say"
}

here the tokenizer works like a delimiter

there is also a keyword called analyzer, which directs es what to do with the text. for example: the standard analyzer will lowercase all text, whereas the english analyzer will also stem words (e.g. strip plural forms).
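
a minimal sketch using analyzer instead of tokenizer:

GET bank/_analyze
{
    "analyzer": "english",
    "text": "The Moon is Made of Cheese someone say"
}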

analyze data

aggregation

aggs is short for aggregations. an aggs can be nested inside another aggs

very complicated example:

GET bank/account/_search
{
    "size": 0,
    "aggs": {
        "states": {
            "terms": {
                "field": "state.keyword"
            },
            "aggs": {
                "avg_bal": {
                    "avg": { "field": "balance" }
                },
                "gender": {
                    "terms": { "field": "gender.keyword" },
                    "aggs": {
                        "gender_avg_bal": { "avg": { "field": "balance" } }
                    }
                }
            }
        }
    }
}

the stats aggregation is very useful as well.

filter aggregation

one way of doing it is to filter with a query before the aggs run.

GET bank/account/_search
{
    "size": 0,
    "query": {
        "match": {"state.keyword": "CA"}
    },
    "aggs": {
        "states": {
            "terms": {
                "field": "state.keyword"
            },
            "aggs": {
                "avg_bal": {
                    "avg": { "field": "balance" }
                },
                "gender": {
                    "terms": { "field": "gender.keyword" },
                    "aggs": {
                        "gender_avg_bal": { "avg": { "field": "balance" } }
                    }
                }
            }
        }
    }
}

another way is using the filter keyword

GET bank/account/_search
{
    "size": 0,
    "query": {
        "match": {"state.keyword": "CA"}
    },
    "aggs": {
        "over35": {
            "filter": {
                "range": {"age": {"gt": 35}}
            },
            "aggs": {"avg_bal": {"avg": {"field": "balance"}}}
        }
    }
}

using filter, we can filter at the aggregation level (inside the aggregation).

metrics aggregation

"aggs" : {
    "FIELD" : {
        "AGG_TYPE" : { "field": "xxxxxx" }
    }
}

AGG_TYPE could be

  • sum
  • avg
  • min
  • max
  • cardinality
  • value_count
  • stats

bucket aggregation

a bucket aggregation creates a bucket for each unique value according to the given criteria.

e.g. the terms aggregation

aggregations can be nested in aggregations

the outer aggregation runs in the context of the query, and an inner aggregation runs in the context of its parent (the aggregated/filtered context).

a global aggregation can only be placed at the top level of the aggregations.
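
a minimal sketch of a global aggregation, which ignores the query scope (field names from the bank sample data):

GET bank/account/_search
{
    "size": 0,
    "query": { "match": { "state.keyword": "CA" } },
    "aggs": {
        "all_accounts": {
            "global": {},
            "aggs": {
                "avg_bal": { "avg": { "field": "balance" } }
            }
        }
    }
}

here avg_bal is computed over all documents, not just the CA matches.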

nested aggregations can be used for fancier queries

improve searches

proximity

using slop, which specifies how far apart terms are allowed to be (an edit distance over term positions, regardless of term order)

proximity affects relevance scores
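
a minimal sketch of a match_phrase query with slop (field name from the bank sample data; the phrase is illustrative):

GET bank/account/_search
{
    "query": {
        "match_phrase": {
            "address": {
                "query": "madison street",
                "slop": 2
            }
        }
    }
}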

consider making must work together with should in a bool query

fuzziness

it uses Levenshtein distance behind the scenes

it operates at the term level

fuzziness also introduces transpositions (ab -> ba)
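
a minimal sketch (the misspelled value is illustrative):

GET bank/account/_search
{
    "query": {
        "match": {
            "firstname": {
                "query": "Ambre",
                "fuzziness": "AUTO"
            }
        }
    }
}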

synonyms

after changing synonyms, use POST /synonyms/_update_by_query (synonyms being the index name here) to reindex the existing documents; otherwise they wouldn’t be re-indexed with the new synonyms.
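
a minimal sketch of defining a synonym token filter (filter and analyzer names are assumptions):

PUT /synonyms
{
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms": ["big, large", "quick, fast"]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"]
                }
            }
        }
    }
}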

highlight

we can use highlight to highlight matches.
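
a minimal sketch (field name from the bank sample data):

GET bank/account/_search
{
    "query": { "match": { "address": "street" } },
    "highlight": {
        "fields": { "address": {} }
    }
}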

stemming

stemming reduces words to their root form (e.g. loved -> love), typically via a stemmer token filter.

percentiles and histogram

keywords: percentiles, hdr, percentile_ranks, histogram

they help us understand the distribution of data inside es.
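
a minimal sketch (field names from the bank sample data):

GET bank/account/_search
{
    "size": 0,
    "aggs": {
        "balance_percentiles": {
            "percentiles": { "field": "balance" }
        },
        "age_histogram": {
            "histogram": { "field": "age", "interval": 5 }
        }
    }
}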

visualization

create visualizations first and then add those ‘modules’ to a dashboard.