
Data Persistence


In the 2020/21 update the BRERC Sites application was retired, but the principles described here remain valid.

Native data can be made completely persistent by means of timelines and archiving. Please study the explanation below if you are not already convinced that this is achievable in a live-booting environment.

Some Background

Diskless computing may seem rather precarious. In reality lasting persistence doesn't come from the working medium but from archives. This is how all enterprise database servers work, even though they generally incorporate disk-based elements.

LIVE Services needs persistence so it can host applications like BRERC Sites. Previously it was sufficient to recreate the database at start-up (and at 6am every day) using a dBase feed. This still happens, but now there is an additional mechanism, called checkpointing, to persist native data.

[Note that my own particular terminology in this area diverges slightly from the norm. I am using the term 'checkpointing' to denote the combination of commit points and backup and recovery provision.]

Checkpointing requires all database updates to be isolated in a transaction log. A checkpoint is when a batch of updates is marked for archival. Archiving is the process of backing up a batch of updates, to one or more (possibly remote) destinations, so that the work may be considered committed. Recovery to a particular point-in-time (typically the last consistent state) involves restoring archives and replaying the logged transactions.
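To make the idea concrete: the durable state at any moment is the last archive plus the tail of the log. A sketch of the recovery half, using an illustrative transaction_log table rather than the real schema (which is described in the next section):

  -- Point-in-time recovery in miniature (names are illustrative):
  -- restore the archived batches, then replay any updates logged up
  -- to the chosen moment, in their original order.
  SELECT *
    FROM transaction_log
   WHERE logged_at <= '2021-03-02 11:00:00'  -- the chosen recovery point
   ORDER BY id;                              -- each row re-applied in sequence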

The Sites Timeline

The BRERC Sites application has a mechanism called the 'timeline' or 'transaction log' (site_tl) which performs several important functions. It simulates ad hoc creation of attributes on a per-site basis - i.e. to an extent not possible even with spreadsheets. Currently note-based and event-based attributes are supported. It also contains a complete audit trail of changes applied to the site table. Whilst specific to sites, this mechanism could easily be adapted to other applications (like habitat or species).
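As an illustration only - the real schema is not reproduced here - a site_tl row might look something like this, with one row per change and ad hoc attributes stored name/value fashion:

  -- A sketch of a timeline row (column names are assumptions):
  CREATE TABLE site_tl (
    id         INTEGER PRIMARY KEY,  -- contiguous sequence (the 'rowid')
    site_id    INTEGER,              -- the site the entry applies to
    entry_type TEXT,                 -- 'update', 'note', 'event' or 'checkpoint'
    attr_name  TEXT,                 -- ad hoc attribute, e.g. a note heading
    attr_value TEXT,                 -- the value logged for that attribute
    logged_at  TEXT                  -- when the change was made
  );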

The site_tl table is maintained automatically by triggers associated with the site table, and is sufficient to recreate the site table whenever and wherever necessary. Crucially, then, site_tl already contains the data to be archived. Checkpoints (special rows) are inserted into the log to mark off the batches to be archived. Each one has an 'archival status' to indicate whether or not it was successfully archived. The default status (at creation) is 'pending', changing to 'ok' when archived.
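The trigger and checkpoint mechanics might be sketched as follows, assuming SQLite-style syntax and the illustrative site_tl columns above:

  -- Every change to site is mirrored into the timeline automatically.
  CREATE TRIGGER site_audit AFTER UPDATE ON site
  BEGIN
    INSERT INTO site_tl (site_id, entry_type, attr_name, attr_value, logged_at)
    VALUES (NEW.id, 'update', 'name', NEW.name, datetime('now'));
  END;

  -- A checkpoint is just a special row closing off a batch; its
  -- archival status starts as 'pending' and becomes 'ok' once archived.
  INSERT INTO site_tl (entry_type, attr_value, logged_at)
  VALUES ('checkpoint', 'pending', datetime('now'));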

Timeline Status
Here 'timeline' refers to a data entity with persistence (e.g. site). Status is 'empty' at the start, 'ready' when initialized, then 'consistent' once some transactions have been successfully logged and/or recovered. Consistent simply means that rowids run contiguously. The checkpoint date and time (in the log, see below) will show how up-to-date the timeline is - there is as yet no way to assign a status of 'complete'. A status of 'incomplete' means that logs are definitely missing, and should be investigated.
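Given the illustrative schema above, the contiguity test is simple: in a gap-free sequence the highest id equals the row count.

  -- Hypothetical status check (names follow the earlier sketch):
  SELECT CASE
           WHEN count(*) = 0       THEN 'empty'
           WHEN max(id) = count(*) THEN 'consistent'
           ELSE                         'incomplete'
         END AS status
    FROM site_tl;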

Checkpoint Log
For each timeline there are two entries in the checkpoint log. The first is a marker at the start of the data - it shows when the timeline was last (re)created, typically at the 6am general reload. At that point the database contains no data, so the rowid is 1. The second shows the latest of all subsequent checkpoints (if any), which enclose batches of real updates. An archival status of 'pending' should be investigated - it means that persistence extends only until the next reboot.
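Read together, the two entries bracket everything the timeline holds. A hypothetical query and result, purely for illustration (the checkpoint_log table and its columns are assumed names, not the actual schema):

  SELECT rowid_at, checkpointed_at, archival_status
    FROM checkpoint_log
   WHERE timeline = 'site'
   ORDER BY rowid_at;

  -- rowid_at  checkpointed_at      archival_status
  -- 1         2021-03-02 06:00:14  ok        <- creation marker
  -- 1042      2021-03-02 11:23:05  pending   <- latest checkpoint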

Archiving

The Backup Procedure
This is performed by the checkpointing shell script, which is scheduled to run once a minute, and proceeds as follows: if new transactions have been logged since the last checkpoint, a new checkpoint (status 'pending') is inserted to close off the batch; the batch is copied to the archive destination(s); and on success the checkpoint status is set to 'ok'.
Note that if archiving fails for any reason batches preserved locally are of no real use, and will be superseded automatically when archiving resumes.
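The database side of that cycle might look like this - a hedged reconstruction, not the script itself, reusing the illustrative names from earlier:

  -- 1. Find the last successfully archived checkpoint.
  SELECT max(id) FROM site_tl
   WHERE entry_type = 'checkpoint' AND attr_value = 'ok';

  -- 2. If newer transactions exist, close the batch with a new
  --    checkpoint, initially 'pending'.
  INSERT INTO site_tl (entry_type, attr_value, logged_at)
  VALUES ('checkpoint', 'pending', datetime('now'));

  -- 3. (In the shell script) dump the rows between the two
  --    checkpoints and copy them to the archive destination(s).

  -- 4. On success, mark the new checkpoint as archived.
  UPDATE site_tl SET attr_value = 'ok'
   WHERE id = (SELECT max(id) FROM site_tl
                WHERE entry_type = 'checkpoint');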

Recovery at 6am (Warm)
At 6am (or on demand) the database is recreated, so the sites data must be recovered. This is done without recourse to archived data, and therefore does not depend on archiving working. We need to preserve site_tl across the rebuild (the daily site_tl dump) and replay it against the freshly created site table.
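A sketch of the replay step, with the same illustrative names as before (the real recovery script is not shown here):

  -- Replay every logged update against the rebuilt site table, in
  -- original sequence order; checkpoint rows are skipped.
  SELECT site_id, attr_name, attr_value
    FROM site_tl
   WHERE entry_type <> 'checkpoint'
   ORDER BY id;   -- each row is re-applied to site in turn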
Recovery at Startup (Cold)
At startup (reboot) archived data will be required to recover fully with no data loss. We need to restore the archived batches, oldest first, to rebuild site_tl from scratch, then replay it exactly as in the warm case.
The first time around there were no archives - legacy data had to be loaded manually as part of the deployment procedure (now history).
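One point worth checking after a cold recovery: the furthest it can reach is the last successfully archived checkpoint, since anything logged after a 'pending' checkpoint dies with the working medium. For illustration:

  -- Hypothetical check of the recovery point after restoring archives:
  SELECT max(logged_at) AS recovered_to
    FROM site_tl
   WHERE entry_type = 'checkpoint' AND attr_value = 'ok';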

Archiving Failure
Checkpointing is designed to be fail-safe with regard to archiving - as far as possible. There is a measure of 'local persistence' provided by the daily site_tl dump, as explained above. If sites data is being edited, a watch needs to be kept on the checkpoint archival status to ensure it is 'ok'. If it is 'pending', the system must not be restarted until archiving has been re-established (status back to 'ok'). When archiving is re-established, all outstanding logged transactions will automatically be archived. This happens because, as explained above, archive batches always start after a successful checkpoint. Of course, until this happens sites data cannot be fully recovered in the event of a system crash. The only other thing to note is that whenever checkpointing does fail the system will not try again until new transactions appear - it would be undesirable to keep trying every minute after a failure.
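The watch described above amounts to a query like this (illustrative names again):

  -- Any rows returned mean archiving is behind: do not restart.
  SELECT id, logged_at
    FROM site_tl
   WHERE entry_type = 'checkpoint' AND attr_value = 'pending';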

Archiving Limitations in This Release
We also assume that the archive is complete. This is reasonable because a copy of each batch is emailed offsite and compared with the archived version.
