2 – Documents

In Elasticsearch, the term document has a specific meaning. It refers to the top-level, or root, object that is serialized into JSON and stored in Elasticsearch under a unique ID.

A document doesn’t consist only of its data. It also has metadata—information about the document. The three required metadata elements are as follows:

_index: Where the document lives
_type: The class of object that the document represents
_id: The unique identifier for the document

In Elasticsearch, our data is stored and indexed in shards, while an index is just a logical namespace that groups together one or more shards. However, this is an internal detail; our application shouldn’t care about shards at all. As far as our application is concerned, our documents live in an index. Elasticsearch takes care of the details. All we have to do is choose an index name. This name must be lowercase, cannot begin with an underscore, and cannot contain commas. Let’s use website as our index name.
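The naming rules above are easy to check before we ever talk to the cluster. The following is a minimal sketch in POSIX shell; is_valid_index_name is a hypothetical helper, not part of Elasticsearch, and it checks only the three rules stated here (rejecting an empty name is our own assumption).

```shell
# Hypothetical helper: returns 0 if the name passes the three rules
# from the text, 1 otherwise.
is_valid_index_name() {
  case "$1" in
    "")              return 1 ;;  # assumption: an empty name is rejected
    _*)              return 1 ;;  # cannot begin with an underscore
    *,*)             return 1 ;;  # cannot contain commas
    *[[:upper:]]*)   return 1 ;;  # must be lowercase
  esac
  return 0
}

# Try a few candidate names.
for name in website Website _hidden "a,b"; do
  if is_valid_index_name "$name"; then
    echo "$name: ok"
  else
    echo "$name: rejected"
  fi
done
```

Of the four candidates, only website passes. Elasticsearch itself performs the authoritative check when the index is created, so a helper like this is useful mainly for failing early in scripts.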

Data may be grouped loosely together in an index, but often there are sub-partitions inside that data that are useful to define explicitly. For example, all your products may go inside a single index, even though they fall into different categories, such as “electronics”, “kitchen” and “lawn-care”.

The documents all share an identical (or very similar) schema: they have a title, description, product code, and price. They just happen to belong to sub-categories under the umbrella of “Products”.

Elasticsearch exposes a feature called types which allows you to logically partition data inside of an index. Documents in different types may have different fields, but it is best if they are highly similar. We’ll talk more about the restrictions and applications of types in Types and Mappings.

A _type name can be lowercase or uppercase, but shouldn’t begin with an underscore or period. It also may not contain commas, and is limited to a length of 256 characters. We will use blog for our type name.
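The type-name rules can be sketched the same way. Again, is_valid_type_name is a hypothetical helper that encodes only the rules listed above; note that it accepts uppercase letters, unlike an index name.

```shell
# Hypothetical helper: returns 0 if the type name passes the rules
# from the text, non-zero otherwise.
is_valid_type_name() {
  case "$1" in
    "")    return 1 ;;  # assumption: an empty name is rejected
    _*)    return 1 ;;  # shouldn't begin with an underscore
    .*)    return 1 ;;  # shouldn't begin with a period
    *,*)   return 1 ;;  # may not contain commas
  esac
  # Length limit from the text. ${#1} counts characters in bash;
  # some other shells count bytes, which matters for non-ASCII names.
  [ "${#1}" -le 256 ]
}

for name in blog Blog _blog .blog; do
  if is_valid_type_name "$name"; then
    echo "$name: ok"
  else
    echo "$name: rejected"
  fi
done
```

Here blog and Blog are accepted, while _blog and .blog are rejected.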

The ID is a string that, when combined with the _index and _type, uniquely identifies a document in Elasticsearch. When creating a new document, you can either provide your own _id or let Elasticsearch generate one for you.

curl -XPUT 'localhost:9200/website/blog/123?pretty' -H 'Content-Type: application/json' -d '
{
  "title": "My first blog entry",
  "text": "Just trying this out…",
  "date": "2014/01/01"
}
'
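The other option mentioned above, letting Elasticsearch generate the _id, uses the same API with two differences: the URL stops at the type, and the verb is POST rather than PUT. The sketch below wraps both cases in two hypothetical helpers, doc_url and index_doc; the host localhost:9200 is assumed from the example above.

```shell
ES_HOST="localhost:9200"   # assumed host, as in the example above

# Hypothetical helper: build the URL for the index API.
# $1 = index, $2 = type, $3 = optional document ID
doc_url() {
  if [ -n "${3:-}" ]; then
    printf '%s/%s/%s/%s\n' "$ES_HOST" "$1" "$2" "$3"
  else
    printf '%s/%s/%s/\n' "$ES_HOST" "$1" "$2"
  fi
}

# Hypothetical helper: index a JSON document.
# $1 = index, $2 = type, $3 = JSON body, $4 = optional document ID
# With an ID we PUT to the document URL; without one we POST to the
# type and let Elasticsearch choose the ID.
index_doc() {
  if [ -n "${4:-}" ]; then verb=PUT; else verb=POST; fi
  curl -s -X"$verb" "$(doc_url "$1" "$2" "${4:-}")?pretty" \
       -H 'Content-Type: application/json' -d "$3"
}

doc_url website blog 123   # prints localhost:9200/website/blog/123
doc_url website blog       # prints localhost:9200/website/blog/
```

Against a running node, index_doc website blog '{"title": "My second blog entry"}' would send the POST form, and the generated ID would come back in the _id field of the response.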

To get the document out of Elasticsearch, we use the same _index, _type, and _id, but the HTTP verb changes to GET:

curl -XGET 'localhost:9200/website/blog/123?pretty'

By default, a GET request will return the whole document, as stored in the _source field. If you only want the title and text fields, individual fields can be requested by using the _source parameter. Multiple fields can be specified in a comma-separated list:

curl -XGET 'localhost:9200/website/blog/123?_source=title,text&pretty'

Or if you want just the _source field without any metadata, you can use the _source endpoint:

curl -XGET 'localhost:9200/website/blog/123/_source?pretty'

If all you want to do is to check whether a document exists—you’re not interested in the content at all—then use the HEAD method instead of the GET method. HEAD requests don’t return a body, just HTTP headers:

curl -I 'http://localhost:9200/website/blog/123'
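In a script, we usually want the answer as an exit code rather than as printed headers. Elasticsearch answers a HEAD request with 200 OK if the document exists and 404 Not Found if it doesn't, so a minimal sketch could look like this. Both function names are hypothetical, and treating every other status as an error is our own design choice.

```shell
# Hypothetical helper: map an HTTP status code to a shell exit code.
status_means_exists() {
  case "$1" in
    200) return 0 ;;  # document exists
    404) return 1 ;;  # document does not exist
    *)   return 2 ;;  # anything else (5xx, or 000 when curl cannot connect)
  esac
}

# Hypothetical helper: check whether the document at URL $1 exists.
# -I sends a HEAD request, -o /dev/null discards the headers, and
# -w '%{http_code}' prints only the status code.
doc_exists() {
  status=$(curl -s -o /dev/null -I -w '%{http_code}' "$1")
  status_means_exists "$status"
}
```

With this in place, a line such as doc_exists 'http://localhost:9200/website/blog/123' && echo "found" works in a script. Keeping "missing" (1) separate from "error" (2) avoids mistaking an unreachable cluster for an absent document.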

Documents in Elasticsearch are immutable; we cannot change them. Instead, if we need to update an existing document, we reindex or replace it, which we can do using the same index API:

curl -XPUT 'localhost:9200/website/blog/123?pretty' -H 'Content-Type: application/json' -d '
{
  "title": "My first blog entry",
  "text": "I am starting to get the hang of this…",
  "date": "2014/01/02"
}
'
