Greetings
In a previous article, we set up our environment to play with Elasticsearch. Let's try basic queries in this article by building a movie search.
Unlike other databases, Elasticsearch gives us full search power. Let's dig into that power slowly; this is not a deep dive.
I use the top 250 movies from imdb-api in this exercise.
Elasticsearch provides a rich, flexible query language called the Query DSL, which allows us to build more complicated, robust queries. The results of a full-text search are not exact but relevant, and that is how and why we use Elasticsearch.
Movie database
Let's build a movie database using imdb-api and play with queries.
- Save movies as documents
- Movies to contain text, numbers, arrays, full text
- Structured search using movie ratings, year
- Full-text search (title, actors)
- Phrase search (plot, title)
- Return highlighted search snippets
Note: You can use either the Kibana UI or a direct call to Elasticsearch, so I omit the host from the requests. A direct call with curl looks like this:
curl -XPUT -H "Content-Type:application/json" localhost:9200/movies/_doc/tt0068646 -d '
{
"title": "The Godfather",
"year": 1972,
"rating": 9.2
}'
Indexing movies
In Elasticsearch, the act of storing data is called indexing. It provides an Index API to manage data.
Create the index
For this exercise, let's keep things simple and use dynamic mappings.
PUT /movies
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "movies"
}
Verify the existence of the index using the below query.
GET /_cat/indices
Save a movie
In Elasticsearch, we can provide our own id or let it create an id for us. Let's first use an Elasticsearch-created id, which means using the POST API.
POST /movies/_doc
{
"id": "tt0068646",
"title": "The Godfather",
"year": 1972,
"rating": 9.2,
"genres": ["Crime", "Drama"],
"actors": [
"Francis Ford Coppola",
"Marlon Brando",
"Al Pacino",
"James Caan",
"Diane Keaton"
],
"plot": "The aging patriarch of an organized crime dynasty in postwar New York City transfers control of his clandestine empire to his reluctant youngest son."
}
{
"_index": "movies",
"_id": "5pqy1YYBcKgBsB5lhM3d",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 1
}
Or, we can provide our own id.
PUT /movies/_doc/tt0068646
{
"id": "tt0068646",
"title": "The Godfather",
"year": 1972,
"rating": 9.2,
"genres": ["Crime", "Drama"],
"actors": [
"Francis Ford Coppola",
"Marlon Brando",
"Al Pacino",
"James Caan",
"Diane Keaton"
],
"plot": "The aging patriarch of an organized crime dynasty in postwar New York City transfers control of his clandestine empire to his reluctant youngest son."
}
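The choice between the two calls above can be sketched in Python. This is only an illustration: `index_request` is a hypothetical helper, not part of any Elasticsearch client, and the base URL assumes the local instance used in the curl example. It builds the request without sending it, so it runs without a cluster.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local single-node cluster


def index_request(index, document, doc_id=None):
    """Build (but do not send) the HTTP request that indexes one document."""
    if doc_id is None:
        # No id supplied: POST lets Elasticsearch generate one.
        method, path = "POST", f"/{index}/_doc"
    else:
        # Own id supplied: PUT stores the document under that id.
        method, path = "PUT", f"/{index}/_doc/{doc_id}"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


movie = {"title": "The Godfather", "year": 1972, "rating": 9.2}
auto = index_request("movies", movie)
own = index_request("movies", movie, doc_id="tt0068646")
print(auto.get_method(), auto.full_url)
print(own.get_method(), own.full_url)
```

Passing either request to `urllib.request.urlopen` would send it, provided Elasticsearch is running at that address.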
Get a movie
Use the GET endpoint to retrieve the movie by id.
GET /movies/_doc/5pqy1YYBcKgBsB5lhM3d
Or, with our own id,
GET /movies/_doc/tt0068646
Delete a movie
Same as above, we can use DELETE to delete a movie.
DELETE /movies/_doc/5pqy1YYBcKgBsB5lhM3d
Search
To practice search queries, we need a larger data set, so I have extracted the top 250 movies from imdb-api. You can find the dataset in my GitHub repo. We can use the bulk API to insert them all at once. Note that this special JSON format is NDJSON (newline-delimited JSON).
Dataset: Top 250 Movies
curl -s --header "Content-Type:application/json" -X POST localhost:9200/movies/_bulk --data-binary @movies-bulk.json
Or, with the latest version,
curl -s -XPOST localhost:9200/movies/_bulk -H "Content-Type:application/x-ndjson" --data-binary @movies-bulk.json
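To make the NDJSON format concrete, here is a minimal Python sketch that turns a list of movies into a bulk request body. The helper name `to_bulk_ndjson` is hypothetical, and using each movie's `id` field as the document `_id` is an assumption; the actual dataset file may be laid out differently.

```python
import json


def to_bulk_ndjson(movies):
    """Serialize movies as bulk-API NDJSON: an action line, then the document."""
    lines = []
    for movie in movies:
        # Action line: index this document under its own id.
        lines.append(json.dumps({"index": {"_id": movie["id"]}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(movie))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Small sample; the real file would hold all 250 movies.
sample = [
    {"id": "tt0068646", "title": "The Godfather", "year": 1972, "rating": 9.2},
    {"id": "tt0111161", "title": "The Shawshank Redemption", "year": 1994},
]
body = to_bulk_ndjson(sample)
print(body, end="")
```

Each document takes two lines, and the index name is omitted from the action line because it is already part of the `/movies/_bulk` path.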
Now try the search query again. Elasticsearch returns all the information we need to display the search results to the user. By default, it returns only the first 10 records.
GET /movies/_search
{
"query": {
"match_all": {}
}
}
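Next, let's search by title using a match query. Note that this returns movies other than the exact title; the search is not exact but relevant.
GET /movies/_search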
{
"query": {
"match": {
"title": "The lord of the ring"
}
}
}
Structured search
We need to use a "term query" to find exact values. This is ideal for numbers and dates but not for text fields, because text fields are analyzed. With dynamic mappings, the exact value of a text field is saved in a "keyword" sub-field. You can check that by issuing a mapping query.
GET /movies/_mapping
Hence we can use the below query to find the exact movie by title.
GET /movies/_search
{
"query": {
"term": {
"title.keyword": "The Lord of the Rings: The Fellowship of the Ring"
}
}
}
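Why the plain text field does not suit a term query can be illustrated with a toy analyzer in Python. This is a deliberately rough stand-in (lowercase, then split into words), not Elasticsearch's actual analysis chain, and `toy_analyze` is a hypothetical name.

```python
import re


def toy_analyze(text):
    """Very rough stand-in for text analysis: lowercase, split into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


title = "The Lord of the Rings: The Fellowship of the Ring"
tokens = toy_analyze(title)
print(tokens)

# A term query compares the search value against the stored terms as-is.
print(title in tokens)    # the full title is not one of the tokens
print("rings" in tokens)  # a single lowercase word is
```

The "keyword" sub-field skips this step and stores the whole string as one value, which is why the exact title matches there.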
Term search
However, text values are mostly used in full-text searches, which give relevant results. Term searches give us exact yes/no results.
Let's search for movies by the year.
GET /movies/_search
{
"query": {
"term": {
"year": {
"value": 2020
}
}
}
}
Range search
Let's use a range query to find movies with a rating greater than or equal to 9.
GET /movies/_search
{
"query": {
"range": {
"rating": {
"gte": 9
}
}
}
}
We can also combine bounds, for example a rating of at least 8.5 but below 9.
GET /movies/_search
{
"query": {
"range": {
"rating": {
"gte": 8.5,
"lt": 9
}
}
}
}
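The difference between the inclusive "gte" and the exclusive "lt" bound can be checked with a small local sketch in Python. It only mimics the comparison on a handful of made-up ratings; `in_range` is a hypothetical helper and nothing here talks to Elasticsearch.

```python
def in_range(value, gte=None, lt=None):
    """Mimic the two bounds used above: gte is inclusive, lt is exclusive."""
    if gte is not None and value < gte:
        return False
    if lt is not None and value >= lt:
        return False
    return True


# Hypothetical ratings, chosen to sit on and around the bounds.
ratings = [8.4, 8.5, 8.9, 9.0, 9.2]

top = [r for r in ratings if in_range(r, gte=9)]
middle = [r for r in ratings if in_range(r, gte=8.5, lt=9)]
print(top)
print(middle)
```

A rating of exactly 9 lands in the first group and is excluded from the second, so the two queries never return the same movie.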
Bool query
We can use "bool" query to combine multiple filter criteria.
GET /movies/_search
{
"query": {
"bool": {
"filter": [
{
"range": {
"rating": {
"gte": 8.5
}
}
},
{
"range": {
"year": {
"gte": 2000
}
}
}
]
}
}
}
We can use "match" query with bool as well.
GET /movies/_search
{
"query": {
"bool": {
"filter": [
{
"range": {
"rating": {
"gte": 8
}
}
}
],
"must": [
{
"match": {
"actors.keyword": "Russell Crowe"
}
}
]
}
}
}
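When such queries are assembled in application code, the bool structure can be built from its parts. The following Python sketch uses a hypothetical `bool_query` helper and simply reproduces the request body above as a dictionary.

```python
import json


def bool_query(must=None, filters=None):
    """Assemble a bool query body from scoring and non-scoring clauses."""
    bool_body = {}
    if must:
        # "must" clauses have to match and contribute to the score.
        bool_body["must"] = list(must)
    if filters:
        # "filter" clauses have to match but do not affect the score.
        bool_body["filter"] = list(filters)
    return {"query": {"bool": bool_body}}


body = bool_query(
    must=[{"match": {"actors.keyword": "Russell Crowe"}}],
    filters=[{"range": {"rating": {"gte": 8}}}],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into Kibana after `GET /movies/_search`.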
Full-text search
This shows the real power of Elasticsearch, as traditional databases would really struggle with this.
GET /movies/_search
{
"query": {
"match": {
"title": "lord ring"
}
}
}
It would be more fun to search the plot.
GET /movies/_search
{
"query": {
"match": {
"plot": "ring king"
}
}
}
By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document matches the query. This concept of relevance is important to Elasticsearch and is completely foreign to traditional relational databases, in which a record either matches or it doesn’t.
We can also override the default order, for example by sorting on rating.
GET /movies/_search
{
"query": {
"match": {
"title": "lord ring"
}
},
"sort": [
{
"rating": {
"order": "desc"
}
}
]
}
Phrase search
Sometimes we want to match exact sequences of words or phrases. Then we can use match_phrase.
GET /movies/_search
{
"query": {
"match_phrase": {
"title": "the lord of the rings"
}
}
}
Highlighting our searches
Sometimes our application needs to highlight the matching snippets. In that scenario, we can use "highlight" to get the matching text in highlighted format.
GET /movies/_search
{
"query": {
"match": {
"plot": "ring king"
}
},
"highlight": {
"fields": {
"plot": {}
}
}
}
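On the application side, the highlighted fragments arrive in a "highlight" object on each hit. The Python sketch below shows one way to collect them. The response is a hand-written, trimmed example (the id and fragment text are made up), and it assumes the default `<em>` tags; `highlighted_snippets` is a hypothetical helper.

```python
def highlighted_snippets(response, field):
    """Collect (document id, fragment) pairs for one highlighted field."""
    snippets = []
    for hit in response["hits"]["hits"]:
        # Hits without a highlight for this field contribute nothing.
        for fragment in hit.get("highlight", {}).get(field, []):
            snippets.append((hit["_id"], fragment))
    return snippets


# Hand-written sample in the shape of a search response (not real output).
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "hypothetical-id-1",
                "highlight": {"plot": ["to destroy the <em>Ring</em>"]},
            },
            {"_id": "hypothetical-id-2"},
        ]
    }
}

snippets = highlighted_snippets(sample_response, "plot")
for doc_id, fragment in snippets:
    print(doc_id, fragment)
```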
We have come a long way, but there is still a lot to learn. Let's finish this article by searching for top war movies.
GET /movies/_search
{
"query": {
"match": {
"genres.keyword": "War"
}
},
"sort": [
{
"rating": {
"order": "desc"
}
}
]
}
Pagination
As mentioned above, Elasticsearch limits the number of results by default. To go through all the data, we can use the "from" and "size" properties. Pay attention to the "from" field, as it is not the page number but the number of records to skip.
GET /movies/_search
{
"from": 10,
"size": 5,
"query": {
"match": {
"genres.keyword": "War"
}
},
"sort": [
{
"rating": {
"order": "desc"
}
}
]
}
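Since "from" is an offset rather than a page number, application code usually converts one into the other. A minimal Python sketch, assuming 1-based page numbers and a hypothetical `page_body` helper:

```python
def page_body(query, page, size):
    """Build a search body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be at least 1")
    # "from" is the number of records to skip, not the page number.
    return {"from": (page - 1) * size, "size": size, "query": query}


war_movies = {"match": {"genres.keyword": "War"}}
body = page_body(war_movies, page=3, size=5)
print(body["from"], body["size"])
```

Page 3 with a size of 5 skips the first 10 records, which matches the request above.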
Filter vs Query
Note that I did not try to explain the concepts deeply. However, it is worth understanding this distinction.
- Filter - return yes/no results
- Query - return data in terms of relevance
Filters are cached and hence are faster in most situations. Also, analyzed text fields are not ideal for term filters. Therefore, use "match" instead of "term" for text fields, or filter on the "keyword" field.
Conclusion
In this article, we played with a few Elasticsearch queries to learn its power. We understand that there is a lot to learn. Let's continue this in yet another article.
References
Term Level Queries
Term Query
Full-Text Queries
Bool Query
Bulk API