2-Acquire Content

The first step is to acquire content. This process, typically handled by a crawler or spider, gathers and scopes the content that needs to be indexed. Acquisition may be trivial, for example if you're indexing a set of XML files that reside in a specific directory in the file system, or if all your content lives in a well-organized database. Alternatively, it may be horribly complex and messy if the content is scattered across all sorts of places (file systems, content management systems, Microsoft Exchange, Lotus Domino, various websites, databases, local XML files, CGI scripts running on intranet servers, and so forth).
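For the trivial case, a directory of files, acquisition can be as simple as a recursive walk. The sketch below is plain Java with a hypothetical /data/content root; it collects the XML files and hands each path off to whatever does the indexing:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class XmlFileCrawler {
    public static void main(String[] args) throws IOException {
        Path contentRoot = Paths.get("/data/content"); // hypothetical content root
        try (Stream<Path> paths = Files.walk(contentRoot)) {
            paths.filter(p -> p.toString().endsWith(".xml"))
                 // In a real pipeline this would queue the file for indexing.
                 .forEach(p -> System.out.println("Acquired: " + p));
        }
    }
}
```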
Entitlements (allowing only specific authenticated users to see certain documents) can complicate content acquisition, because the crawler may need "superuser" access to retrieve the content in the first place. Furthermore, the access rights or access control lists (ACLs) must be acquired along with each document's content and added to the document as additional fields, which are then used at search time to enforce the entitlements.
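As a rough illustration of how the ACLs travel with the content, here is a minimal sketch using the Lucene document API; the field names ("body", "acl") and the group-per-field layout are assumptions for this example, not a fixed convention:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class AclDocumentBuilder {
    /** Wraps acquired content and its ACL entries in one indexable document. */
    public static Document build(String body, String[] allowedGroups) {
        Document doc = new Document();
        doc.add(new TextField("body", body, Field.Store.NO));
        // One keyword-style "acl" field per group; a search-time filter can
        // then require that at least one of the user's groups matches.
        for (String group : allowedGroups) {
            doc.add(new StringField("acl", group, Field.Store.NO));
        }
        return doc;
    }
}
```

At query time, the searching user's groups would be OR'ed together in a filter clause against the acl field, so a document matches only if the user is entitled to see it.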
For large content sets, it's important that this component be efficiently incremental, so that it visits only the documents that have changed since it was last run. It may also be "live," meaning it runs as a continuous service that waits for new or changed content and loads it the moment it becomes available.
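For the "live" variant on a plain file system, java.nio's WatchService is one way to sketch it. The content root below is hypothetical, and a real service would also persist its position (e.g., a last-run timestamp) so the incremental guarantee survives restarts:

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class LiveContentWatcher {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path root = Paths.get("/data/content"); // hypothetical content root
        WatchService watcher = FileSystems.getDefault().newWatchService();
        root.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take(); // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                // Hand only the new or changed file to the indexing pipeline.
                System.out.println(event.kind() + ": " + event.context());
            }
            key.reset(); // re-arm the key so further events are delivered
        }
    }
}
```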

