Play with Elasticsearch basic queries

Greetings

In a previous article, we set up our environment to play with Elasticsearch. Let's try basic queries in this article by building a movie search.

Unlike most traditional databases, Elasticsearch is built around full-text search. We will explore that power gradually, so treat this article as an introduction rather than a deep dive.

I use the top 250 movies from imdb-api in this exercise.

Movie database

Let's build a movie database using imdb-api and play with queries.

Save movies as documents
Movies contain text, numbers, arrays, and full text
Structured search using movie ratings, year
Full-text search (title, actors)
Phrase search (plot, title)
Return highlighted search snippets

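Note: You can run the requests either in the Kibana UI or by calling Elasticsearch directly. Hence the examples in this article skip the host part of the URL. For reference, a direct call with curl looks like this: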

curl -XPUT -H "Content-Type:application/json" localhost:9200/movies/_doc/tt0068646 -d '
{
  "title": "The Godfather",
  "year": 1972,
  "rating": 9.2
}'
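
If you prefer direct calls, the console-style requests in this article can be translated mechanically. The helper below is a minimal sketch of that translation; the function name `console_to_curl` and the default host `localhost:9200` are assumptions made for illustration, not part of Elasticsearch or Kibana.

```python
import shlex


def console_to_curl(request_line, body=None, host="localhost:9200"):
    """Turn a console line such as 'PUT /movies' into an equivalent curl command."""
    # A console request line is "<METHOD> <path>".
    method, path = request_line.split(maxsplit=1)
    parts = ["curl", "-X", method.upper(), host + path]
    if body is not None:
        # Requests with a JSON body need the content type header.
        parts += ["-H", "Content-Type: application/json", "-d", body]
    # shlex.quote adds shell quoting only where it is needed.
    return " ".join(shlex.quote(part) for part in parts)


print(console_to_curl("GET /_cat/indices"))
# curl -X GET localhost:9200/_cat/indices
```

Passing a JSON string as `body` adds the header and the `-d` argument, which mirrors the curl call above.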

Indexing movies

In Elasticsearch, the act of storing data is called indexing. It provides an Index API to manage data.

Create the index

For this exercise, let's keep things simple and use dynamic mappings.

PUT /movies

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "movies"
}

Verify that the index exists using the query below.

GET /_cat/indices

Save a movie

In Elasticsearch, we can provide our own id or let it create an id for us. First, we will use an Elasticsearch-created id. Hence we use the POST API.

POST /movies/_doc
{
  "id": "tt0068646",
  "title": "The Godfather",
  "year": 1972,
  "rating": 9.2,
  "genres": ["Crime", "Drama"],
  "actors": [
    "Francis Ford Coppola",
    "Marlon Brando",
    "Al Pacino",
    "James Caan",
    "Diane Keaton"
  ],
  "plot": "The aging patriarch of an organized crime dynasty in postwar New York City transfers control of his clandestine empire to his reluctant youngest son."
}

{
  "_index": "movies",
  "_id": "5pqy1YYBcKgBsB5lhM3d",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

Or, we can provide our own id.

PUT /movies/_doc/tt0068646
{
  "id": "tt0068646",
  "title": "The Godfather",
  "year": 1972,
  "rating": 9.2,
  "genres": ["Crime", "Drama"],
  "actors": [
    "Francis Ford Coppola",
    "Marlon Brando",
    "Al Pacino",
    "James Caan",
    "Diane Keaton"
  ],
  "plot": "The aging patriarch of an organized crime dynasty in postwar New York City transfers control of his clandestine empire to his reluctant youngest son."
}
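
The choice between POST and PUT above can be summed up in a few lines. This is only a sketch of the rule described in this section; `index_request` is a hypothetical helper name, not an Elasticsearch client function.

```python
def index_request(index, doc_id=None):
    """Return (method, path) for indexing a single document.

    Without an id we POST and let Elasticsearch generate one;
    with an id we PUT to that exact document path.
    """
    if doc_id is None:
        return "POST", f"/{index}/_doc"
    return "PUT", f"/{index}/_doc/{doc_id}"


print(index_request("movies"))               # ('POST', '/movies/_doc')
print(index_request("movies", "tt0068646"))  # ('PUT', '/movies/_doc/tt0068646')
```

Keep in mind that a PUT to an id that already exists replaces the stored document and increases its "_version".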

Get a movie

Use the GET endpoint to retrieve the movie by id.

GET /movies/_doc/5pqy1YYBcKgBsB5lhM3d

Or, with our own id,

GET /movies/_doc/tt0068646

Delete a movie

In the same way, we can use DELETE to remove a movie by id.

DELETE /movies/_doc/5pqy1YYBcKgBsB5lhM3d

Search

To practice search queries, we need a larger data set. Hence I have extracted the top 250 movies from imdb-api. You can find the dataset in my GitHub repo. We can use the bulk API to insert all of them at once. Note that the bulk API expects a special format called NDJSON (newline-delimited JSON): one JSON object per line, with an action line followed by the document itself.

Dataset: Top 250 Movies

curl -s --header "Content-Type:application/json" -X POST localhost:9200/movies/_bulk --data-binary @movies-bulk.json

Or, with the latest version, using the NDJSON content type,

curl -s -XPOST localhost:9200/movies/_bulk -H "Content-Type:application/x-ndjson" --data-binary @movies-bulk.json
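
If you want to build such a bulk file from your own list of movies, the sketch below shows one way to produce the NDJSON body. It assumes each movie is a dict carrying its id in an "id" field, as in the documents above; the helper name `to_bulk_ndjson` is made up for this example, and the layout of the file in my repo may differ.

```python
import json


def to_bulk_ndjson(movies, id_field="id"):
    """Build a bulk body: one action line plus one source line per movie."""
    lines = []
    for movie in movies:
        # Action line: index this document under the given id.
        lines.append(json.dumps({"index": {"_id": movie[id_field]}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(movie))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


movies = [{"id": "tt0068646", "title": "The Godfather", "year": 1972, "rating": 9.2}]
print(to_bulk_ndjson(movies), end="")
```

The index name is not repeated on each action line because the curl calls above already post to /movies/_bulk.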

Now try a search query. The response contains all the information we need to display the search results to the user. By default, it returns only the first 10 records.

GET /movies/_search
{
  "query": {
    "match_all": {}
  }
}
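
To give a feel for the response, here is a small sketch that pulls the useful parts out of a search result. The sample response is hand-written and trimmed for illustration (it is not real output from my cluster), and `summarize_hits` is a hypothetical helper name.

```python
def summarize_hits(response):
    """Return the total hit count and (id, score, title) for each returned hit."""
    hits = response["hits"]
    total = hits["total"]["value"]
    rows = [(h["_id"], h["_score"], h["_source"]["title"]) for h in hits["hits"]]
    return total, rows


# Hand-written, trimmed sample in the shape of a search response.
sample_response = {
    "hits": {
        "total": {"value": 250, "relation": "eq"},
        "hits": [
            {"_id": "tt0068646", "_score": 1.0, "_source": {"title": "The Godfather"}},
        ],
    }
}

total, rows = summarize_hits(sample_response)
print(total, rows)
```

The total can be much larger than the number of returned rows, because only the first page of hits is included.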

Elasticsearch provides a rich, flexible, query language called the query DSL which allows us to build more complicated, robust queries.

GET /movies/_search
{
  "query": {
    "match": {
      "title": "The lord of the ring"
    }
  }
}

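However, the above also returns movies whose titles are not this exact text. The match query analyzes the search text into individual terms and, by default, returns every document that contains at least one of them, with the best matches ranked first. That is how and why we use Elasticsearch: the search is not exact but relevant.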
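
To see why the match query above also brings back unrelated titles, here is a toy model of what happens. It is a rough stand-in for the default analyzer (lowercase, split on non-alphanumeric characters), not the real implementation, and it ignores scoring completely; `analyze` and `match_any` are names invented for this sketch.

```python
import re


def analyze(text):
    """Rough stand-in for analysis: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_any(query, field_value):
    """Toy version of match: true if at least one query term is in the field."""
    return bool(set(analyze(query)) & set(analyze(field_value)))


query = "The lord of the ring"
print(analyze(query))                        # ['the', 'lord', 'of', 'the', 'ring']
print(match_any(query, "The Godfather"))     # True, because of "the"
print(match_any(query, "Seven Samurai"))     # False, no shared term
```

In the real engine, a title sharing only "the" would rank far below one sharing "lord". If you need every term to be present, the match query also accepts "operator": "and".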

Structured search

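To find exact values, we need to use a "term" query. It is ideal for numbers and dates but not for text fields, because text fields are analyzed (split into lowercase terms) when they are indexed, while the term query does not analyze its input. With dynamic mappings, the exact value of a string is also saved in a sub-field of type "keyword", such as "title.keyword". You can check that by issuing a mapping query.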

GET /movies/_mapping
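
The mapping response can be long, so a small sketch like the one below can list the fields that have an exact-value "keyword" sub-field. The sample mapping is hand-written in the shape that dynamic mapping typically produces for strings and numbers; your real response will contain more fields, and `keyword_subfields` is a hypothetical helper name.

```python
def keyword_subfields(mapping_response, index):
    """List '<field>.<subfield>' names whose sub-field type is keyword."""
    properties = mapping_response[index]["mappings"]["properties"]
    found = []
    for name, spec in properties.items():
        for sub_name, sub_spec in spec.get("fields", {}).items():
            if sub_spec.get("type") == "keyword":
                found.append(f"{name}.{sub_name}")
    return sorted(found)


# Hand-written sample in the shape of a GET /movies/_mapping response.
sample_mapping = {
    "movies": {
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                },
                "year": {"type": "long"},
                "rating": {"type": "float"},
            }
        }
    }
}

print(keyword_subfields(sample_mapping, "movies"))  # ['title.keyword']
```

Note the "ignore_above" setting: strings longer than that limit are not indexed in the keyword sub-field, which matters for long values such as plots.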

Hence we can use the query below to find a movie by its exact title.

GET /movies/_search
{
  "query": {
    "term": {
      "title.keyword": "The Lord of the Rings: The Fellowship of the Ring"
    }
  }
}

Term search

Text values are mostly used in full-text searches, which return results ranked by relevance. Term searches, in contrast, give us exact yes/no results.
Let's search for movies by year.

GET /movies/_search
{
  "query": {
    "term": {
      "year": {
        "value": 2020
      }
    }
  }
}

Range search

Let's use a range query to find movies with a rating greater than or equal to 9.

GET /movies/_search
{
  "query": {
    "range": {
      "rating": {
        "gte": 9
      }
    }
  }
}

We can also set a lower and an upper bound in the same query.

GET /movies/_search
{
  "query": {
    "range": {
      "rating": {
        "gte": 8.5,
        "lt": 9
      }
    }
  }
}
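
The difference between the inclusive and exclusive bounds is easy to get wrong, so here is a plain Python sketch of the same comparison rules. It only mirrors the meaning of "gt", "gte", "lt" and "lte" for numbers; `in_range` is a name invented for this example and says nothing about how Elasticsearch evaluates ranges internally.

```python
import operator

# The four bounds of a range query and their comparison meaning.
BOUNDS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def in_range(value, **bounds):
    """True if value satisfies every given bound, e.g. in_range(8.7, gte=8.5, lt=9)."""
    return all(BOUNDS[name](value, limit) for name, limit in bounds.items())


print(in_range(8.5, gte=8.5, lt=9))  # True: gte includes the bound
print(in_range(9.0, gte=8.5, lt=9))  # False: lt excludes the bound
```

So the second query above returns movies rated from 8.5 up to, but not including, 9.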

Bool query

We can use a "bool" query to combine multiple filter criteria. A document has to satisfy every clause listed under "filter".

GET /movies/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "rating": {
              "gte": 8.5
            }
          }
        },
        {
          "range": {
            "year": {
              "gte": 2000
            }
          }
        }
      ]
    }
  }
}

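We can use a "match" query with bool as well. Clauses under "must" contribute to the relevance score, while clauses under "filter" only include or exclude documents.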

GET /movies/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "rating": {
              "gte": 8
            }
          }
        }
      ],
      "must": [
        {
          "match": {
            "actors.keyword": "Russell Crowe"
          }
        }
      ]
    }
  }
}
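
When queries are generated by an application, it helps to assemble the bool body from its parts. The sketch below builds the same request body as above; `bool_query` is a hypothetical helper written for this article, not a function from an Elasticsearch client library.

```python
import json


def bool_query(must=None, filters=None):
    """Build a search body with scoring clauses (must) and yes/no clauses (filter)."""
    clause = {}
    if filters:
        clause["filter"] = list(filters)
    if must:
        clause["must"] = list(must)
    return {"query": {"bool": clause}}


body = bool_query(
    must=[{"match": {"actors.keyword": "Russell Crowe"}}],
    filters=[{"range": {"rating": {"gte": 8}}}],
)
print(json.dumps(body, indent=2))
```

Empty sections are left out, so calling it with only `filters` produces a pure filter query like the first bool example.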

Full-text search

This is where Elasticsearch shows its real power, as traditional databases struggle with this kind of search.

GET /movies/_search
{
  "query": {
    "match": {
      "title": "lord ring"
    }
  }
}

It would be more fun to search the plot.

GET /movies/_search
{
  "query": {
    "match": {
      "plot": "ring king"
    }
  }
}

By default Elasticsearch sorts matching results by their relevance score, that is, by how well each document matches the query.

This concept of relevance is important to Elasticsearch and is a concept that is completely foreign to traditional relational databases, in which a record either matches or it doesn’t.

If we want a different order, we can add "sort" to the request, for example to order the matches by rating instead of relevance.

GET /movies/_search
{
  "query": {
    "match": {
      "title": "lord ring"
    }
  },
  "sort": [
    {
      "rating": {
        "order": "desc"
      }
    }
  ]
}
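
Sorting by rating alone leaves the order of equally rated movies open. One option, sketched below, is to add "_score" as a second sort key so that ties fall back to relevance. The helper name `sorted_by` is an assumption for this example, and using "_score" as a tie-breaker is my suggestion rather than something the query above does.

```python
import copy


def sorted_by(body, field, order="desc"):
    """Return a copy of a search body sorted by field, then by relevance."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    result = copy.deepcopy(body)  # leave the caller's body untouched
    result["sort"] = [{field: {"order": order}}, "_score"]
    return result


search = {"query": {"match": {"title": "lord ring"}}}
print(sorted_by(search, "rating"))
```

The original `search` dict is not modified, so the same query can be reused with different sort orders.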

Phrase search

Sometimes we want to match exact sequences of words or phrases. Then we can use match_phrase.

GET /movies/_search
{
  "query": {
    "match_phrase": {
      "title": "the lord of the rings"
    }
  }
}
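
To make the difference from a plain match concrete, here is a toy check for phrase matching: the terms must appear next to each other and in the same order. It uses a rough lowercase-and-split tokenizer as a stand-in for the real analyzer, and both function names are invented for this sketch.

```python
import re


def analyze(text):
    """Rough stand-in for analysis: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(phrase, field_value):
    """True if the phrase terms occur consecutively, in order, in the field."""
    wanted, tokens = analyze(phrase), analyze(field_value)
    size = len(wanted)
    return size > 0 and any(
        tokens[i:i + size] == wanted for i in range(len(tokens) - size + 1)
    )


title = "The Lord of the Rings: The Two Towers"
print(phrase_matches("the lord of the rings", title))  # True
print(phrase_matches("lord ring", title))              # False
print(phrase_matches("rings the lord", title))         # False, wrong order
```

A plain match on "lord ring" would still find this title through the shared term "lord", but the phrase version requires the whole sequence.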

Highlighting our searches

Sometimes our application needs to highlight the matching snippets. In that scenario, we can use "highlight" to get the matching text back in a highlighted format.

GET /movies/_search
{
  "query": {
    "match": {
      "plot": "ring king"
    }
  },
  "highlight": {
    "fields": {
      "plot": {}
    }
  }
}
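
Each hit in the response then carries a "highlight" section with fragments of the field, where the matched terms are wrapped in <em> tags by default. The sketch below extracts the highlighted terms from one hit. The sample hit, including its id and plot text, is hand-written for illustration and is not taken from the dataset; `highlighted_terms` is a hypothetical helper name.

```python
import re


def highlighted_terms(hit, field):
    """Collect the terms wrapped in <em>...</em> in a hit's highlight fragments."""
    terms = []
    for fragment in hit.get("highlight", {}).get(field, []):
        terms += re.findall(r"<em>(.*?)</em>", fragment)
    return terms


# Hand-written sample hit; the id and text are made up for this example.
sample_hit = {
    "_id": "example-1",
    "highlight": {
        "plot": ["A hobbit must destroy the <em>ring</em> before the <em>king</em> returns."]
    },
}

print(highlighted_terms(sample_hit, "plot"))  # ['ring', 'king']
```

In a web page you would usually render the fragments as they are and style the <em> tags.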

We have come a long way, but there is still a lot to learn. Let's round off the search examples by looking for the top war movies.

GET /movies/_search
{
  "query": {
    "match": {
      "genres.keyword": "War"
    }
  },
  "sort": [
    {
      "rating": {
        "order": "desc"
      }
    }
  ]
}

Pagination

As mentioned above, Elasticsearch returns only the first 10 records by default. To go through all the data, we can use the "from" and "size" properties. Pay attention to the "from" field, as it is not the page number but the number of records to skip.

GET /movies/_search
{
  "from": 10,
  "size": 5,
  "query": {
    "match": {
      "genres.keyword": "War"
    }
  },
  "sort": [
    {
      "rating": {
        "order": "desc"
      }
    }
  ]
}
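
Since applications usually think in page numbers, a tiny conversion helps avoid off-by-one mistakes. This is a minimal sketch assuming pages are numbered from 1; `page_params` is a name made up for this example.

```python
def page_params(page, size=10):
    """Convert a 1-based page number into the from/size pair of a search body."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be at least 1")
    # "from" is the number of records to skip, not the page number.
    return {"from": (page - 1) * size, "size": size}


print(page_params(1, 5))  # {'from': 0, 'size': 5}
print(page_params(3, 5))  # {'from': 10, 'size': 5}
```

So the request above, with "from": 10 and "size": 5, asks for the third page when each page holds five movies.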

Filter vs Query

Note that I did not try to explain the concepts deeply in this article. However, the difference between a filter and a query is worth understanding.

Filter - returns yes/no results (a document either matches or it does not)
Query - returns data ranked in terms of relevance

Filters skip relevance scoring and frequently used ones are cached, hence they will be faster in most situations. Also, analyzed text fields are not ideal as filters. Therefore, use "match" instead of "term" for text fields, or run the "term" query against the "keyword" sub-field.

Conclusion

In this article, we played with a few Elasticsearch queries to learn its power. We understand that there is a lot to learn. Let's continue this in yet another article.

References

Term Level Queries
Term Query
Full-Text Queries
Bool Query
Bulk API

https://imdb-api.com/

Manju