3.3

This version of the manual refers to an earlier version of the software.

Caching

Cantaloupe offers a sophisticated and customizable caching subsystem that is capable of meeting a variety of needs while remaining easy to use. Three tiers of cache are available:

Client-side caches, which it has no control over but can provide hints to;
A source cache, which caches source images locally on-demand (if they are not already local) for faster reading;
A derivative cache, which caches processed images and source image metadata such as dimensions.

Client-Side Caching

Cantaloupe can provide caching hints to clients using a Cache-Control response header, which is configurable via the cache.client.* keys in the configuration file. To enable this header, set the cache.client.enabled key to true.

The default settings look something like this:

cache.client.max_age = 2592000
cache.client.shared_max_age =
cache.client.public = true
cache.client.private = false
cache.client.no_cache = false
cache.client.no_store = false
cache.client.must_revalidate = false
cache.client.proxy_revalidate = false
cache.client.no_transform = true

These are reasonable defaults that tell clients they can keep cached images for 30 days (2592000 seconds).

Note: the Cache-Control header must have a particular structure; not just any combination of the above will work. For example, public and private are mutually exclusive.

The Cache-Control header behaves as follows:

When cache.client.enabled = true:

HTTP 2xx responses: The Cache-Control header is returned according to the cache.client.* keys in the configuration.
HTTP 3xx responses: No Cache-Control header.
HTTP 4xx & 5xx responses: Cache-Control: must-revalidate,no-cache,no-store

When cache.client.enabled = false:

HTTP 2xx & 3xx responses: No Cache-Control header.
HTTP 4xx & 5xx responses: Cache-Control: must-revalidate,no-cache,no-store
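The rules above can be sketched in a few lines of Python. This is only an illustration, not Cantaloupe's implementation: the function name cache_control_header, the dictionary form of the configuration, and the order in which directives are emitted are all assumptions made for the example.

```python
from typing import Optional

# The defaults shown above, plus cache.client.enabled.
# Holding them in a dict is an assumption made for this sketch.
DEFAULTS = {
    "cache.client.enabled": True,
    "cache.client.max_age": "2592000",
    "cache.client.shared_max_age": "",
    "cache.client.public": True,
    "cache.client.private": False,
    "cache.client.no_cache": False,
    "cache.client.no_store": False,
    "cache.client.must_revalidate": False,
    "cache.client.proxy_revalidate": False,
    "cache.client.no_transform": True,
}

# (config key suffix, HTTP directive) pairs for the boolean keys.
FLAGS = [
    ("public", "public"),
    ("private", "private"),
    ("no_cache", "no-cache"),
    ("no_store", "no-store"),
    ("must_revalidate", "must-revalidate"),
    ("proxy_revalidate", "proxy-revalidate"),
    ("no_transform", "no-transform"),
]


def cache_control_header(config: dict, status: int) -> Optional[str]:
    """Return the Cache-Control value for a response, or None to omit it."""
    if status >= 400:
        # Error responses always get this fixed header.
        return "must-revalidate,no-cache,no-store"
    if not config.get("cache.client.enabled") or not 200 <= status < 300:
        return None
    directives = []
    # Keys that carry a number of seconds; blank means "not set".
    for key, name in (("max_age", "max-age"), ("shared_max_age", "s-maxage")):
        value = str(config.get("cache.client." + key, "")).strip()
        if value:
            directives.append(f"{name}={value}")
    for key, name in FLAGS:
        if config.get("cache.client." + key):
            directives.append(name)
    return ", ".join(directives) or None


print(cache_control_header(DEFAULTS, 200))
```

With the default settings, a 2xx response in this sketch carries max-age, public, and no-transform; a 3xx response carries no header at all.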

Server-Side Caching

Source Cache

In a typical image server configuration, source images will be served from a local filesystem using FilesystemResolver. There, they are already as local as they can be, so there would be no point in caching them (although a derivative cache could still be of great benefit).

As explained in the Resolvers section, though, images do not have to be served from a local filesystem; they can also be served from a remote web server, cloud storage, or what have you. The source cache can be beneficial when one of these non-filesystem sources performs worse than ideal. Setting cache.source to FilesystemCache will cause all source images from non-FilesystemResolvers to be automatically downloaded and stored in the source cache.

Another reason for a source cache is to work around the incompatibility between certain processors and resolvers. Some processors are only capable of reading source images located on the filesystem. When StreamProcessor.retrieval_strategy is set to CacheStrategy and FilesystemCache is configured, the source cache is used to handle incompatible processor/resolver combinations by automatically pre-downloading source images. This makes it possible to use something like OpenJpegProcessor with AmazonS3Resolver.

Ideally, all cloud services and so on would offer faster-than-light-latency seekable-stream access, all image readers would be able to read from them as efficiently as from the local filesystem, and there would be no need to deal with the added complexity of a source cache. But that is not the reality. Cantaloupe tries to keep things simple by integrating the source cache into the larger caching architecture, so all of the information about modes of operation and maintenance is applicable to both the source and derivative caches.

Note that unlike the derivative cache, there is only one available source cache implementation—FilesystemCache—and it will be used independently of the derivative cache.

Derivative Cache

The derivative cache caches post-processed images in order to spare the computational expense of processing the same image request over and over again. Derivative caches are pluggable, in order to enable different cache stores.

Derivative caching is recommended in production, as it will greatly reduce load on the server and improve response times accordingly. There are other ways of caching derivatives, such as by using a caching reverse proxy, but the built-in derivative cache is custom-tailored for this application and easy enough to set up.

Derivative caching is disabled by default. To enable it, set cache.derivative to the name of a cache, such as FilesystemCache.

Bypassing

The derivative cache can be bypassed on a per-request basis by supplying a cache=false query parameter in the URL. When this parameter is present, the derivative cache will not be read from, nor written to, whether or not it is enabled. The Cache-Control header will also be omitted from responses.
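As an illustration, a client could append the parameter with a small helper like the one below. The helper name bypass_derivative_cache, the host, the port, and the image identifier are hypothetical; only the cache=false parameter comes from the text above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def bypass_derivative_cache(url: str) -> str:
    """Return the URL with cache=false set, replacing any existing cache value."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "cache"
    ]
    query.append(("cache", "false"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical image request URL.
print(bypass_derivative_cache(
    "http://localhost:8182/iiif/2/cat.jpg/full/full/0/default.jpg"
))
```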

Notes

Requests for full-sized, unaltered source images are not cached, and are instead streamed through with no processing.
Entire IIIF information response representations are not cached—only image metadata, which is the only expensive part to generate. This means it is possible to change other configuration options that would affect the contents of information responses without invalidating the cache.
When derivative caching is enabled, "miss" responses are streamed to the client and cache simultaneously. If the cache I/O is slower than the connection to the client, response times may be adversely affected.
The derivative cache is shared across endpoints. Requests for the same image from different endpoints will return the same cached image.
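The note about "miss" responses can be made concrete with a simplified sketch. It assumes, for illustration only, that each chunk is written to the client and then to the cache before the next chunk is read; the function name stream_miss is hypothetical and the real streaming code may be organized differently.

```python
import io


def stream_miss(processed, client, cache, chunk_size=8192):
    """Copy a freshly processed image to the client and the cache in one pass."""
    total = 0
    while True:
        chunk = processed.read(chunk_size)
        if not chunk:
            break
        client.write(chunk)  # send to the HTTP response
        cache.write(chunk)   # store the same bytes in the cache
        total += len(chunk)
    return total


# In-memory stand-ins for the image, the response body, and the cache entry.
image = io.BytesIO(b"\x89PNG" + b"\x00" * 20000)
response_body, cache_entry = io.BytesIO(), io.BytesIO()
print(stream_miss(image, response_body, cache_entry))
```

In a loop like this one, a slow cache.write() delays the next client.write(), which is why slow cache I/O can hold back the response.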

Modes of Operation

The source and derivative caches can be configured to operate in one of two ways:

Conservative (cache.server.resolve_first = true): Source images are looked up and verified to exist before cached representations are returned. This precludes returning a cached representation when the underlying resource no longer exists, but also impairs response times by a variable amount, depending on the resolver.
Aggressive (cache.server.resolve_first = false): Cached representations are returned immediately, if available. This is faster, but inconsistency can develop between the cache and the underlying source image storage, if the latter is not static.
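The difference between the two modes comes down to whether the source is checked before the cache is consulted. A minimal sketch, in which the function, the exception, and the dict-based cache are all hypothetical stand-ins rather than Cantaloupe's own classes:

```python
from typing import Callable, Dict, Optional


class SourceNotFound(Exception):
    """Raised when the resolver cannot find the source image."""


def cached_representation(
    identifier: str,
    cache: Dict[str, bytes],
    source_exists: Callable[[str], bool],
    resolve_first: bool,
) -> Optional[bytes]:
    """Return cached bytes, or None on a cache miss."""
    if resolve_first and not source_exists(identifier):
        # Conservative mode: never serve a cached copy of a missing source.
        raise SourceNotFound(identifier)
    # Aggressive mode reaches this line without touching the resolver.
    return cache.get(identifier)


# A cached image whose source has since been deleted.
cache = {"old.jpg": b"cached bytes"}
print(cached_representation("old.jpg", cache, lambda _id: False, resolve_first=False))
```

In the aggressive call above, the cached bytes are returned even though the source is gone; the same call with resolve_first=True raises SourceNotFound.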

Maintenance

Because cached content is not automatically deleted after expiring, there is likely to be a certain amount of expired content taking up space in the cache at any given time. Without periodic maintenance, the amount can only grow. If this is a problem, it can be dealt with manually or automatically.

Manual

To purge all expired content, launch with the -Dcantaloupe.cache.purge_expired option.

To purge all content, expired or not, launch with the -Dcantaloupe.cache.purge option.

To purge all content related to a given identifier, expired or not, there are two options:

Launch with the -Dcantaloupe.cache.purge=identifier option.
Use the REST API.

(Both of these were added in 3.3.)

Caches are careful not to leave miscellaneous detritus (like temp files) lying around. In case anything slips through, the above commands will take care of it. To only clean the cache while leaving all content alone, expired or not, launch with the -Dcantaloupe.cache.clean option.

When Cantaloupe is launched with any of these arguments, it will run in a special mode in which the web server will not be started, and exit when done. Thus, any of these tasks can be run in a separate process, on the live cache store, while the main server instance remains running.
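For scheduled jobs, it can be convenient to build these command lines in a script. The sketch below assembles one; the -Dcantaloupe.cache.* options are those described above, while the -Dcantaloupe.config option, the configuration path, and the WAR filename are assumptions that should be adjusted to match the actual installation. Note that JVM -D options must appear before -jar.

```python
import shlex
from typing import List, Optional

TASKS = ("purge", "purge_expired", "clean")


def maintenance_command(
    task: str,
    config_path: str,
    war_path: str,
    identifier: Optional[str] = None,
) -> List[str]:
    """Build an argument list for a one-off cache maintenance run."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    if identifier is not None and task != "purge":
        raise ValueError("an identifier can only be combined with purge")
    option = f"-Dcantaloupe.cache.{task}"
    if identifier is not None:
        option += f"={identifier}"
    # System properties go before -jar, or the JVM treats them as program args.
    return ["java", f"-Dcantaloupe.config={config_path}", option, "-jar", war_path]


# Hypothetical paths; pass the list to subprocess.run() to execute it.
cmd = maintenance_command(
    "purge_expired", "/etc/cantaloupe.properties", "Cantaloupe-3.3.war"
)
print(shlex.join(cmd))
```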

Automatic

Since version 2.2, a "cache worker" is available that will periodically clean and purge expired items from the cache automatically. (See the cache.server.worker.* configuration options.)

Limiting

Depending on the amount of source content served, the varieties of derivatives generated, the time-to-live setting, and how often maintenance is performed, the cache may grow very large. The image server does not track its size, as this would be either expensive, or, for some cache implementations, impossible. Managing the cache size is therefore the responsibility of the administrator, and it can be accomplished by any combination of:

Performing maintenance more often;
Reducing the time-to-live (using the cache.server.ttl_seconds configuration key);
Increasing the threshold by allocating more storage to the cache.
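Since the server does not track cache size, an administrator using a filesystem-based cache might monitor it with a small script such as the following. This is a generic directory walk, not a Cantaloupe feature; the function name and the example path are hypothetical.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files below root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # The file was purged or moved while we were scanning.
                pass
    return total


# Example (hypothetical cache location):
# print(directory_size_bytes("/var/cache/cantaloupe"))
```

On a large cache, the walk itself can take a while, which illustrates why the server does not do this on every request.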

Implementations

FilesystemCache

FilesystemCache caches content in a filesystem tree. The tree structure looks like:

FilesystemCache.pathname/
- source/ ⁽¹⁾
  - Intermediate subdirectories ⁽²⁾
    - {hashed identifier} ⁽³⁾
- image/
  - Intermediate subdirectories ⁽²⁾
    - {hashed identifier}{operation list string representation}.{output format extension} ⁽³⁾
- info/
  - Intermediate subdirectories ⁽²⁾
    - {hashed identifier}.json ⁽³⁾

⁽¹⁾ Empty unless source caching is enabled.
⁽²⁾ Some filesystems have per-directory file count limits, or thresholds beyond which performance starts to degrade. To work around this, cache files are stored in subdirectory trees consisting of leading fragments of identifier MD5 hashes, configurable by FilesystemCache.dir.depth and FilesystemCache.dir.name_length.
⁽³⁾ Identifiers in filenames are MD5-hashed in order to allow for identifiers longer than the filesystem's filename length limit.
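The layout described above can be approximated as follows. This sketch assumes that each subdirectory name is a consecutive fragment of the hex-encoded MD5 hash, and the depth and name_length values used are illustrative, not necessarily the shipped defaults; the function name and the root path are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(root, kind, identifier, depth=3, name_length=2, suffix=""):
    """Build a cache file path from leading fragments of the identifier's MD5."""
    digest = hashlib.md5(identifier.encode("utf-8")).hexdigest()
    # depth directories, each named with name_length hex characters.
    fragments = [
        digest[i * name_length:(i + 1) * name_length] for i in range(depth)
    ]
    return PurePosixPath(root, kind, *fragments, digest + suffix)


# Path of a cached info file for a hypothetical identifier.
print(hashed_path("/var/cache/cantaloupe", "info", "cat.jpg", suffix=".json"))
```

Because the filename is always a 32-character hash plus a suffix, its length does not depend on the length of the identifier.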

Cache files are created with a .tmp extension and moved into place when closed for writing.
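The write-then-move pattern prevents readers from seeing a partially written file. A minimal Python sketch of the idea (the function name is hypothetical, and this is not Cantaloupe's Java implementation):

```python
import os


def write_cache_file(path: str, data: bytes) -> None:
    """Write data to path via a .tmp file so readers never see partial content."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, path)
```

If the process dies mid-write, only the .tmp file is left behind, which is the kind of detritus the clean option described under Maintenance is meant to remove.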

FilesystemCache is process-safe: it is safe to point multiple server instances at the same cache directory.

JdbcCache

JdbcCache caches derivative images and metadata in relational database tables. To use this cache, a JDBC driver for your database must be installed on the classpath.

JdbcCache has been tested with H2 1.4. It is known to not work with the official PostgreSQL driver, as of version 9.4.1207. Other databases may work, but are untested.

JdbcCache can be configured with the following options:

JdbcCache.url: JDBC connection URL; for example, jdbc:postgresql://localhost:5432/mydatabase.
JdbcCache.user: User to connect to the database as.
JdbcCache.password: Password to use when connecting to the database. Can be left blank if not needed.
JdbcCache.derivative_image_table: Table in which to cache derivative (post-processed) images.
JdbcCache.info_table: Table in which to cache information responses.

JdbcCache will not create its schema automatically—this must be done manually using the following commands, which may have to be altered slightly for your particular database:

CREATE TABLE IF NOT EXISTS {JdbcCache.derivative_image_table} (
  operations VARCHAR(4096) NOT NULL,
  image BLOB,
  last_accessed DATETIME
);

CREATE TABLE IF NOT EXISTS {JdbcCache.info_table} (
  identifier VARCHAR(4096) NOT NULL,
  info VARCHAR(8192) NOT NULL,
  last_accessed DATETIME
);

CREATE INDEX operations_idx ON {JdbcCache.derivative_image_table} (operations);
CREATE INDEX identifier_idx ON {JdbcCache.info_table} (identifier);

JdbcCache uses write transactions and is process-safe: it is safe to point multiple server instances at the same database tables.

AmazonS3Cache

AmazonS3Cache caches derivative images and metadata into an Amazon Simple Storage Service (S3) bucket. It can be configured with the following options:

AmazonS3Cache.access_key_id: An access key associated with your AWS account. (See AWS Security Credentials.)
AmazonS3Cache.secret_key: A secret key associated with your AWS account. (See AWS Security Credentials.)
AmazonS3Cache.bucket.name: Name of the bucket to contain cached content.
AmazonS3Cache.bucket.region: Name of a region to send requests to, such as us-east-1. Can be commented out or left blank to use a default region. (See S3 Regions.)
AmazonS3Cache.object_key_prefix: String to prepend to object keys—for example, to achieve a virtual folder hierarchy.

Note: Amazon S3 does not provide a last-accessed time in object metadata, meaning that the time-to-live will be on the basis of last-modified time (generally the same as creation time) instead.
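The practical effect is that a frequently requested object still expires once its age since last modification exceeds the time-to-live. A small sketch of that check, with a hypothetical function name and dates chosen for illustration:

```python
from datetime import datetime, timedelta, timezone


def is_expired(last_modified: datetime, ttl_seconds: int, now: datetime) -> bool:
    """True if the object was last modified more than ttl_seconds before now."""
    return now - last_modified > timedelta(seconds=ttl_seconds)


now = datetime(2017, 3, 1, tzinfo=timezone.utc)
written = now - timedelta(days=31)  # cached 31 days ago, possibly read since
print(is_expired(written, 2592000, now))  # TTL of 30 days
```

Here the object is treated as expired regardless of how recently it was read, because only its last-modified time is available.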

AzureStorageCache

AzureStorageCache caches derivative images and metadata into a Microsoft Azure Storage container. It can be configured with the following options:

AzureStorageCache.account_name: The name of your Azure account.
AzureStorageCache.account_key: A key to access your Azure Storage account.
AzureStorageCache.container_name: Name of the container in which to store cached content.
AzureStorageCache.object_key_prefix: String to prepend to object keys—for example, to achieve a virtual folder hierarchy.

Note: Azure Storage does not provide a last-accessed time in object metadata, meaning that the time-to-live will be on the basis of last-modified time (generally the same as creation time) instead.