1. User guide

This is an overview of using Metadb. We assume familiarity with databases and the basics of SQL.

1.1. Getting started

Metadb extends PostgreSQL with features to support analytics such as streaming data sources, data model transforms, and historical data. The data contained in the Metadb database originally come from another place: a data source which could be, for example, a transaction-processing database or a sensor network. Metadb updates its database continuously based on state changes in external data sources.

1.2. Main tables

Tables generated by Metadb have at least these metadata columns, with names that begin with two underscores:

  • __id is a surrogate key that identifies a row in the table.

  • __start is the date and time when the row of data was generated.

  • __end is the date and time when the row of data became no longer current, or 9999-12-31 00:00:00+00 if the row is still current.

  • __current is a Boolean value that indicates whether the row is current.

  • __origin is an optional identifier used to group related data from one or more data sources. For example, Metadb can combine data from similar source tables into a single table in the analytic database, and stores an identifier in __origin to record where the data came from.

A main table in Metadb has two underscores at the end of its name (e.g., patrongroup__), and it contains both the current state and the history of all previous states provided by the data source. For instance, consider a main table that stores categories of patrons in a library:

SELECT __start, __end, __current, id, groupname, description FROM library.patrongroup__;
        __start         |          __end         | __current | id | groupname |      description
------------------------+------------------------+-----------+----+-----------+-----------------------
 2022-04-17 21:42:25-00 | 2022-04-18 19:27:18-00 | FALSE     | 15 | undergrad | Student
 2022-04-18 19:27:18-00 | 9999-12-31 00:00:00+00 | TRUE      | 15 | undergrad | Undergraduate Student
 2022-04-17 21:42:25-00 | 9999-12-31 00:00:00+00 | TRUE      | 10 | graduate  | Graduate Student
 2022-04-17 21:52:53-00 | 9999-12-31 00:00:00+00 | TRUE      |  9 | faculty   | Faculty Member
 2022-04-17 21:52:53-00 | 9999-12-31 00:00:00+00 | TRUE      | 12 | staff     | Staff Member

In this example, the "undergrad" group with id = 15 has two rows because it was modified on 2022-04-18 19:27:18-00, changing description from 'Student' to 'Undergraduate Student'.

1.3. Current tables

It is often desirable to limit a query to retrieving only current records. This can be done by filtering on __current = TRUE, for example:

SELECT id, groupname, description FROM library.patrongroup__ WHERE __current;

For convenience, since this filter is so often used, Metadb provides access to a current table which contains only current records. Every main table has a corresponding current table, which shares the same name but without the two underscores at the end. For instance, the following query is equivalent to the example above:

SELECT id, groupname, description FROM library.patrongroup;

A current table reflects the current state of each row in the data source, as of the last time the row was updated.

Note

In a few database client applications, current tables do not appear by default in the list of tables. This is because main tables and current tables have a parent-child relationship, and database user interfaces may show the current table as subsumed under the main table.

To take another example:

SELECT __id, __start, __origin, id, groupname, description FROM library.patrongroup;
 __id |        __start         | __origin | id | groupname |      description
------+------------------------+----------+----+-----------+-----------------------
    8 | 2022-04-18 19:27:18-00 | west     | 15 | undergrad | Undergraduate Student
    4 | 2022-04-17 17:42:25-00 | east     | 10 | graduate  | Graduate Student
    5 | 2022-04-17 17:52:53-00 | east     |  9 | faculty   | Faculty Member
    6 | 2022-04-17 17:52:53-00 | east     | 12 | staff     | Staff Member

There is a distinction between the __id and id columns. The __id column is a key defined by Metadb to identify each row uniquely; it is present in all tables generated by Metadb. In contrast, the id column only happens to be present in this sample table and has been provided by the data source; it may or may not be a key, and its name, content, or significance may change if schema changes are made in the data source.
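To illustrate the difference, the following query (a sketch using the sample table above) counts how many rows of the main table share each source-provided id. An id that has been modified, such as id = 15 above, appears in more than one row, whereas __id values never repeat:

SELECT id, count(*) AS row_count
    FROM library.patrongroup__
    GROUP BY id
    ORDER BY row_count DESC;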

1.4. Transformed tables

Metadb applies transformations to some kinds of data, which results in additional tables being created. One category of data that are transformed in this way is JSON objects. Suppose that our patrongroup current table contains the groupname and description data in JSON fields, for example:

SELECT __id, __start, id, jsondata FROM library.patrongroup;
 __id |        __start         | id |                              jsondata
------+------------------------+----+---------------------------------------------------------------------
    8 | 2022-04-18 19:27:18-00 | 15 | {"groupname": "undergrad", "description": "Undergraduate Student"}
    4 | 2022-04-17 17:42:25-00 | 10 | {"groupname": "graduate", "description": "Graduate Student"}
    5 | 2022-04-17 17:52:53-00 |  9 | {"groupname": "faculty", "description": "Faculty Member"}
    6 | 2022-04-17 17:52:53-00 | 12 | {"groupname": "staff", "description": "Staff Member"}

The JSON fields can be selected using SQL:

SELECT __id,
       __start,
       id,
       jsonb_extract_path_text(jsondata, 'groupname') AS groupname,
       jsonb_extract_path_text(jsondata, 'description') AS description
    FROM library.patrongroup;

However, Metadb also generates a transformed table called patrongroup__t. Transformed tables based on JSON objects contain JSON fields extracted into columns, which allows for simpler queries.

SELECT __id, __start, id, groupname, description FROM library.patrongroup__t;
 __id |        __start         | id | groupname |      description
------+------------------------+----+-----------+-----------------------
    2 | 2022-04-18 19:27:18-00 | 15 | undergrad | Undergraduate Student
    5 | 2022-04-17 17:42:25-00 | 10 | graduate  | Graduate Student
    4 | 2022-04-17 17:52:53-00 |  9 | faculty   | Faculty Member
    6 | 2022-04-17 17:52:53-00 | 12 | staff     | Staff Member

In the current version of Metadb, only top-level, scalar JSON fields are extracted into transformed tables.

Main tables are also transformed in the same way. In this case the main transformed table would be called patrongroup__t__.

Note that JSON data are treated as "schemaless," and fields are inferred from their presence in the data rather than read from a JSON schema. As a result, a column is only created from a JSON field if the field is present in at least one JSON record.
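Nested JSON fields can still be queried directly from the JSON column. For example, assuming a hypothetical nested object such as {"address": {"city": "Oxford"}} were present in jsondata, a path of keys can be passed to jsonb_extract_path_text:

SELECT __id,
       id,
       jsonb_extract_path_text(jsondata, 'address', 'city') AS city
    FROM library.patrongroup;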

1.5. Comparing table types

To summarize the types of tables that we have covered:

 Table name       | Table type                | Description
------------------+---------------------------+------------------------------------------------------
 patrongroup__    | Main table                | Current and historical records
 patrongroup      | Current table             | Current records only
 patrongroup__t__ | Main transformed table    | Transformed versions of the records in patrongroup__
 patrongroup__t   | Current transformed table | Transformed versions of the records in patrongroup

In addition to these table types, some partition tables have names that begin with zzz___. It is recommended that these tables not be used directly in queries, and access to them is generally disabled.

1.6. User workspaces

In general, users do not have privileges to modify or create tables in the database. However, Metadb creates a schema for each user where the user does have these privileges, and this serves as an individual workspace. The schema has the same name as the user name. This can be useful as an area for saving the results of queries or importing external data sets. For example, the user celia can create a table in the schema celia:

CREATE TABLE celia.westgroup AS
    SELECT * FROM library.patrongroup WHERE __origin = 'west';
SELECT * FROM celia.westgroup LIMIT 20;

By default, other users do not have privileges to access the table.

To grant the user rosalind read-only access to the schema and table:

GRANT USAGE ON SCHEMA celia TO rosalind;

GRANT SELECT ON celia.westgroup TO rosalind;

The user rosalind can then access the table.

1.7. Working with data types

This section offers a few brief recommendations for working with common data types.

Text and varchar

The text data type is recommended for strings of characters.

The varchar(n) type with length n can be used for older database clients that require the length to be specified.

Monetary data

For monetary amounts, numeric(19, 4) is usually a good choice. For exchange rates, numeric(19, 14) may be used.
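As a sketch, a workspace table holding monetary data might be declared with these types (the table and column names are only illustrative):

CREATE TABLE celia.invoice_amount (
    invoice_id bigint,
    amount numeric(19, 4),           -- monetary amount
    exchange_rate numeric(19, 14)    -- exchange rate
);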

Querying on a range of dates

There are various ways of querying within a range of dates, but the least error-prone is to use inequalities of the form start <= t < end, for example:

SELECT *
    FROM folio_circulation.loan__t
    WHERE '2023-01-01' <= loan_date AND loan_date < '2024-01-01';

1.8. Creating reports

An effective way to create a report is to package it as a database function. A database function can define a query and associated parameters. Users can then call the function, specifying a value for each parameter.

For example, suppose that the following query counts the number of loans in a library for each circulated item within a range of dates.

SELECT item_id,
       count(*) AS loan_count
    FROM folio_circulation.loan__t
    WHERE '2023-01-01' <= loan_date AND loan_date < '2024-01-01'
    GROUP BY item_id;

We can create a function to generalize this query. Instead of including the dates directly within the query, we will define them as parameters: start_date and end_date.

CREATE FUNCTION lisa.count_loans(
    start_date date DEFAULT '2000-01-01',
    end_date date DEFAULT '2050-01-01')
RETURNS TABLE(
    item_id uuid,
    loan_count bigint)
AS $$
SELECT item_id,
       count(*) AS loan_count
    FROM folio_circulation.loan__t
    WHERE start_date <= loan_date AND loan_date < end_date
    GROUP BY item_id
$$
LANGUAGE SQL
STABLE
PARALLEL SAFE;

Now the function can be called with different arguments to generate reports:

SELECT * FROM lisa.count_loans(start_date => '2022-01-01', end_date => '2023-01-01');

SELECT * FROM lisa.count_loans(start_date => '2023-01-01');

The user lisa who created this function can share it with other users in a similar way as described in the "User workspaces" section above.

GRANT USAGE ON SCHEMA lisa TO celia, rosalind;

GRANT EXECUTE ON FUNCTION lisa.count_loans TO celia, rosalind;

Note that the GRANT USAGE ON SCHEMA command does not have to be run again if it was done previously.

Defining shared functions in this way can be used together with a web-based database tool such as CloudBeaver to make reports available to a wider group of users.

1.9. Database views

Database views are not supported and should not be created in a Metadb database.

Views can create dependencies that may block required schema changes during data updates. They also hide query complexity, which in analytical workloads can be significant, behind what appears to be a simple table. It is recommended to use database functions instead, as described in the previous section.
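Where a view might otherwise be used to package a query, a database function can take its place. The following sketch reuses the sample table from earlier sections (and assumes id is an integer) to wrap a query in a parameterless function:

CREATE FUNCTION celia.patron_groups()
RETURNS TABLE(
    id integer,
    groupname text,
    description text)
AS $$
SELECT id, groupname, description
    FROM library.patrongroup
$$
LANGUAGE SQL
STABLE
PARALLEL SAFE;

SELECT * FROM celia.patron_groups();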

1.10. Querying system information

Metadb version

To show the current Metadb version:

SELECT mdbversion();

Data update times

Information about when certain tables were last updated is available in a table called metadb.table_update:

SELECT * FROM metadb.table_update ORDER BY schemaname, tablename;

System log

Metadb reports errors and other information in a logging table called metadb.log. For instance, to view all log messages from January 2023:

SELECT *
    FROM metadb.log
    WHERE '2023-01-01' <= log_time AND log_time < '2023-02-01'
    ORDER BY log_time;

The mdblog() function is provided for convenience and returns ordered log messages, from the past 24 hours by default (or optionally from a specified interval):

SELECT * FROM mdblog();

Status of query process

When executing a long-running query, the ps() function can be used to check on the status of the query process from another session:

SELECT * FROM ps();

2. Reference

2.1. Data type conversion

When reading data from a data source, Metadb will automatically adjust column data types in a number of cases:

  • A data type in the source has changed in a way that cannot be applied safely by Metadb without more information.

  • A data type in the source has changed to a type that is not supported by Metadb.

  • A data type in the source can be converted safely to a type that is more performant.

Each type conversion is applied in one of two ways: either by the stream processor, which converts records one at a time, or later by "inferring" types from previously updated data.

2.1.1. Record-at-a-time

These conversions are performed one record at a time by the stream processor. The target types are numeric, uuid, jsonb, and text; the source types from which values can be converted are boolean, smallint, integer, bigint, real, double precision, numeric, date, time, time with timezone, timestamp, timestamp with timezone, uuid, json, and jsonb.

2.1.2. Inferred from data

At present the only inferred type is uuid: columns of type text or varchar may be converted to uuid.

The command REFRESH INFERRED COLUMN TYPES can be used to infer data types.

Types can also be set manually via the ALTER TABLE command.
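For example, to set a column type to uuid manually (see Reference > Statements > ALTER TABLE; the table and column names here are only illustrative):

ALTER TABLE library.patron__ ALTER COLUMN patrongroup_id TYPE uuid;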

2.2. Functions

2.2.1. System information

mdblog(interval)

    Return type: table (log_time timestamptz(3), error_severity text, message text)

    Returns recent log messages during the specified interval, or during the past 24 hours if no interval is given

mdbversion()

    Return type: text

    Returns the Metadb version

ps()

    Return type: table (dbname text, username text, state text, realtime text, query text)

    Returns the status and elapsed running time of current query processes

Examples

Show log messages from the past hour, and from the past 24 hours:

SELECT * FROM mdblog('1 hour');
SELECT * FROM mdblog();

Show the current Metadb version:

SELECT mdbversion();

2.3. System tables

2.3.1. metadb.base_table

Metadb 1.2

The table metadb.base_table stores information about tables that are extracted from data sources or are transformed from those tables.

 Column name        | Column type | Description
--------------------+-------------+-----------------------------------------------------------------
 schema_name        | varchar(63) | The base table schema name
 table_name         | varchar(63) | The base table name
 source_name        | varchar(63) | Name of the data source the table is extracted from
 transformed        | boolean     | True if the table is transformed from another table
 parent_schema_name | varchar(63) | Schema name of the parent table, if this is a transformed table
 parent_table_name  | varchar(63) | Table name of the parent table, if this is a transformed table
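For example, to list the transformed tables derived from a hypothetical data source named sensor:

SELECT schema_name, table_name, parent_schema_name, parent_table_name
    FROM metadb.base_table
    WHERE source_name = 'sensor' AND transformed
    ORDER BY schema_name, table_name;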

2.3.2. metadb.log

The table metadb.log stores logging information for the system.

 Column name    | Column type    | Description
----------------+----------------+------------------------------------------------------------------------
 log_time       | timestamptz(3) | Timestamp when the log entry was written
 error_severity | text           | Logging level (from least to most severe): INFO, WARNING, ERROR, FATAL
 message        | text           | The log message

2.3.3. metadb.table_update

Metadb 1.2

The table metadb.table_update stores information about the updating of certain tables.

 Column name       | Column type | Description
-------------------+-------------+------------------------------------------------------------------
 schema_name       | varchar(63) | Schema name of the updated table
 table_name        | varchar(63) | Table name of the updated table
 last_update       | timestamptz | Timestamp when the table was last updated successfully
 elapsed_real_time | real        | Wall-clock time in seconds of the last completed update process

2.4. External SQL directives

Metadb allows scheduling external SQL files to run on a regular basis.

Each SQL statement should be separated from others by an empty line, and any tables created should not specify a schema name.

Comment lines beginning with --metadb: are used for special directives; each directive should be on a separate line.

It is suggested that each SQL file begin with a --metadb:table directive, followed by an empty line, for example:

--metadb:table library_patrons

DROP TABLE IF EXISTS library_patrons;

CREATE TABLE library_patrons AS
SELECT . . . ;

2.4.1. --metadb:table

The --metadb:table directive declares that the SQL file updates a specific table. This allows Metadb to report on the status of the table. The directive takes the form:

--metadb:table <table>

The specified table should not contain a schema name.

For example:

--metadb:table user_group

2.5. Statements

Metadb extends SQL with statements for configuring and administering the server. These statements are only available when connecting to the Metadb server (not the database).

2.5.1. ALTER DATA SOURCE

Change the configuration of a data source

ALTER DATA SOURCE source_name
    OPTIONS ( [ ADD | SET | DROP ] option ['value'] [, ... ] )
Description

ALTER DATA SOURCE changes connection settings for a data source.

Note

ALTER DATA SOURCE currently requires restarting the server before it will take effect.

Parameters

source_name

The name of an existing data source.

OPTIONS ( [ ADD | SET | DROP ] option ['value'] [, ... ] )

Connection settings and other configuration options for the data source.

Options

See CREATE DATA SOURCE

Examples

Change the consumer group:

ALTER DATA SOURCE sensor OPTIONS (SET consumergroup 'metadb_sensor_1');

2.5.2. ALTER TABLE

Metadb 1.2

Change a table definition

ALTER TABLE table_name
    ALTER COLUMN column_name TYPE data_type
Description

ALTER TABLE changes the definition of a table that is extracted from a data source.

Parameters

table_name

Schema-qualified name of a main table.

column_name

Name of a column to alter.

data_type

The new data type of the column. The only type currently supported is uuid.

Examples

Change a column type to uuid:

ALTER TABLE library.patron__ ALTER COLUMN patrongroup_id TYPE uuid;

2.5.3. AUTHORIZE

Enable access to tables generated from an external data source

AUTHORIZE SELECT
    ON ALL TABLES IN DATA SOURCE source_name
    TO role_specification
Description

The AUTHORIZE command grants access to tables. It differs from GRANT in that the authorization will also apply to tables created at a later time in the data source.

Note

AUTHORIZE currently requires restarting the server before it will take effect.

Parameters

source_name

The name of an existing data source.

role_specification

An existing role to be granted the authorization.

Examples
AUTHORIZE SELECT
    ON ALL TABLES IN DATA SOURCE sensor
    TO beatrice;

CREATE DATA ORIGIN

Define a new data origin

CREATE DATA ORIGIN origin_name
Description

CREATE DATA ORIGIN defines a new origin.

Note

CREATE DATA ORIGIN currently requires restarting the server before it will take effect.

Parameters

origin_name

A unique name for the data origin to be created.

Examples

Create a new origin test_origin:

CREATE DATA ORIGIN test_origin;

2.5.4. CREATE DATA SOURCE

Define a new external data source

CREATE DATA SOURCE source_name TYPE source_type
    OPTIONS ( option 'value' [, ... ] )
Description

CREATE DATA SOURCE defines connection settings for an external data source.

The new data source starts out in synchronizing mode, which pauses periodic transforms and external SQL. When the initial snapshot has finished streaming, the message "source snapshot complete (deadline exceeded)" will be written to the log. To complete the synchronization, stop the Metadb server, run metadb endsync, and then start the server again.

Parameters

source_name

A unique name for the data source to be created.

source_type

The type of data source. The only type currently supported is kafka.

OPTIONS ( option 'value' [, ... ] )

Connection settings and other configuration options for the data source.

Options for data source type "kafka"

brokers

Kafka bootstrap servers (comma-separated list).

security

Security protocol: 'ssl' or 'plaintext'. The default is 'ssl'.

topics

Regular expressions matching Kafka topics to read (comma-separated list).

consumergroup

Kafka consumer group ID.

schemapassfilter

Regular expressions matching schema names to accept (comma-separated list).

schemastopfilter

Regular expressions matching schema names to ignore (comma-separated list).

tablestopfilter

Regular expressions matching table names to ignore (comma-separated list).

trimschemaprefix

Prefix to remove from schema names.

addschemaprefix

Prefix to add to schema names.

module

Name of pre-defined configuration.

Examples

Create sensor as a kafka data source:

CREATE DATA SOURCE sensor TYPE kafka OPTIONS (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1\.',
    consumergroup 'metadb_sensor_1_1',
    addschemaprefix 'sensor_',
    tablestopfilter '^testing\.air_temp$,^testing\.air_temp_avg$'
);

2.5.5. CREATE USER

Define a new database user

CREATE USER user_name WITH option 'value' [, ... ]
Description

CREATE USER defines a new database user that will be managed by Metadb. It also creates a schema having the same name as the user, if the schema does not exist, and grants the user privileges on the schema. The schema is intended as a workspace for the user.

Parameters

user_name

The name of the new user.

WITH option 'value' [, ... ]

Configuration options for the new user.

Options

password

Sets the user’s password. (Required)

comment

Stores a comment about the user, e.g. the user’s real name. The comment can be viewed in psql using the \du+ command, or in other user interfaces.

Examples

Create a user wegg:

CREATE USER wegg WITH PASSWORD 'LZn2DCajcNHpGR3ZXWHD', COMMENT 'Silas Wegg';

2.5.6. DROP DATA SOURCE

Remove a data source configuration

DROP DATA SOURCE source_name
Description

DROP DATA SOURCE removes a data source configuration.

Parameters

source_name

The name of an existing data source.

Examples

Drop a data source sensor:

DROP DATA SOURCE sensor;

2.5.7. LIST

Show the value of a system variable

LIST name
Description

LIST shows the current setting of various system configurations and other variables.

Parameters

name

authorizations

Authorized users.

data_origins

Configured data origins.

data_sources

Configured data sources.

status

Current status of system components.

Examples
LIST status;

2.5.8. REFRESH INFERRED COLUMN TYPES

Metadb 1.2

Infer column types from data

REFRESH INFERRED COLUMN TYPES
Description

REFRESH INFERRED COLUMN TYPES scans all tables extracted from data sources, looking for text columns that contain only data of a more specific type such as uuid, and converts those columns to the specific type.

This command is run once automatically after an initial synchronization.

Examples

Rescan tables and adjust the column types:

REFRESH INFERRED COLUMN TYPES;

3. Server administration

3.1. System requirements

3.1.1. Hardware requirements

  • Metadb software:

    • Architecture: x86-64 (AMD64)

    • CPU: 2 cores

    • Memory: 1 GB

    • Local storage: 500 GB SSD

  • PostgreSQL database:

    • CPU: 4 cores

    • Memory: 32 GB

    • Storage: 2 TB SSD

3.1.2. Software requirements

  • Operating system: Debian 12 or later

  • PostgreSQL 15 or later

  • Required to build from source:

    • Go 1.21 or later

    • goyacc (installation instructions below)

    • Ragel 6.10 or later

    • GCC C compiler 9.3.0 or later

3.1.3. PostgreSQL configuration

  • checkpoint_timeout: 3600

  • cpu_tuple_cost: 0.03

  • default_statistics_target: 1000

  • effective_io_concurrency: 200

  • idle_in_transaction_session_timeout: 60000

  • idle_session_timeout: 86400000

  • maintenance_work_mem: 1000000

  • max_wal_size: 10240

  • shared_buffers: 1250000

  • statement_timeout: 3600000

  • work_mem: 350000

3.2. Building the software

First set the GOPATH environment variable to specify a path that can serve as the build workspace for Go, e.g.:

export GOPATH=$HOME/go

Then install goyacc:

go install golang.org/x/tools/cmd/goyacc@master

export PATH=$PATH:$GOPATH/bin

Then to build the server:

./build.sh

The build.sh script creates a bin/ subdirectory and builds the metadb executable there:

./bin/metadb help

It is suggested that a metadb user be created and the server run by that user, for example, in /home/metadb.

3.3. Server configuration

Metadb makes use of local storage in a data directory which is created using metadb with the init command. In this example we will create the data directory as data/:

metadb init -D data

This will also create a file metadb.conf in the data directory with the following structure:

[main]
host = host name of the PostgreSQL server
port = port number of the PostgreSQL server
database = database name
superuser = database superuser such as "postgres"
superuser_password = password of superuser
systemuser = database user that is the owner of the database
systemuser_password = password of systemuser
sslmode = should be set to "require" or stronger option

Metadb expects the database name to be metadb or to begin with metadb_; otherwise it logs a warning message.

This file should be edited to add database connection parameters, for example:

[main]
host = a.b.c
port = 5432
database = metadb
superuser = postgres
superuser_password = zpreCaWS7S79dt82zgvD
systemuser = mdbadmin
systemuser_password = ZHivGie5juxGJZmTObHU
sslmode = require

Metadb will assume that the database, superuser, and systemuser defined here already exist; so they should be created before continuing.
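For example, the database and its owner could be created beforehand in psql (the names and password shown are only the illustrative values from the example configuration above):

CREATE USER mdbadmin PASSWORD 'ZHivGie5juxGJZmTObHU';
CREATE DATABASE metadb OWNER mdbadmin;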

3.4. Backups

It is essential to make regular backups of Metadb and to test the backups.

In general, persistent data are stored in the database, and therefore the database is the most important system that should be backed up often.

The data directory contains the metadb.conf configuration file and is also used for temporary storage. The metadb.conf file should be backed up, or alternatively it should be possible to reconstruct the file’s contents.

3.5. Upgrading from a previous version

To upgrade from any previous version of Metadb, stop the server (if running), and then run the upgrade process in case changes are required:

metadb upgrade -D data

The upgrade process may, in some instances, take a significant amount of time to run. The database generally remains available to users during this period.

If no changes are needed, the process outputs:

metadb: "data" is up to date

3.6. Running the server

To start the server:

nohup metadb start -D data -l metadb.log &

The --memlimit option can be used to set a soft memory limit (in GiB) if needed, for example:

nohup metadb start -D data -l metadb.log --memlimit 2 &

The server listens on port 8550 by default, but this can be set using the --port option. The --debug option enables verbose logging.

To stop the server:

metadb stop -D data

Note that stopping or restarting the server may delay scheduled data updates or cause them to restart.

The server can be set up to run with systemd via a file such as /etc/systemd/system/metadb.service, for example:

[Unit]
Description=Metadb
After=network.target remote-fs.target

[Service]
Type=simple
User=metadb
ExecStart=/bin/bash -ce "exec /home/metadb/bin/metadb start -D /home/metadb/data -l /home/metadb/metadb.log"
Restart=on-abort

[Install]
WantedBy=multi-user.target

Then:

systemctl enable metadb

systemctl start metadb

3.7. Connecting to the server

The PostgreSQL terminal-based client, psql, is used to connect to a Metadb server that runs on the same host and listens on a specified port:

psql -d metadb -h localhost -X -p <port>

For example:

psql -d metadb -h localhost -X -p 8550

See Reference > Statements for commands that can be issued via psql.

3.8. Configuring a Kafka data source

3.8.1. Overview

Metadb currently supports reading Kafka messages in the format produced by the Debezium PostgreSQL connector for Kafka Connect. Configuration of Kafka, Kafka Connect, Debezium, and PostgreSQL logical decoding is beyond the scope of this documentation, but a few notes are included here.

Data flows in this direction:

  1. A source PostgreSQL database

  2. Kafka Connect/Debezium

  3. Kafka

  4. Metadb

  5. The analytics database

3.8.2. Creating a connector

To allow capturing data changes in the source PostgreSQL database, logical decoding has to be enabled, in particular by setting wal_level = logical in postgresql.conf.

Next we create a connector configuration file for Kafka Connect:

{
    "name": "sensor-1-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.dbname": "sourcedb",
        "database.hostname": "example.host.name",
        "database.password": "eHrkGrZL8mMJOFgToqqL",
        "database.port": "5432",
        "database.server.name": "metadb_sensor_1",
        "database.user": "dbuser",
        "plugin.name": "pgoutput",
        "snapshot.mode": "exported",
        "tasks.max": "1",
        "truncate.handling.mode": "include",
        "publication.autocreate.mode": "filtered",
        "heartbeat.interval.ms": "30000",
        "heartbeat.action.query": "UPDATE admin.heartbeat set last_heartbeat = now();"
    }
}

It is recommended to use the connector configuration settings heartbeat.interval.ms and heartbeat.action.query as above to avoid spikes in disk space consumption within the source database. (See the Debezium PostgreSQL connector documentation for more details.) The schemastopfilter option of the CREATE DATA SOURCE command is used to filter out the heartbeat table.

In the source database:

CREATE SCHEMA admin;

CREATE TABLE admin.heartbeat (last_heartbeat timestamptz PRIMARY KEY);

INSERT INTO admin.heartbeat (last_heartbeat) VALUES (now());

Then to create the connector:

curl -X POST -i -H "Accept: application/json" -H "Content-Type: application/json" \
     -d @connector.json https://kafka.connect.server/connectors

Note the 1 included in name and database.server.name in the connector configuration. This is suggested as a version number, which can be incremented if the data stream needs to be resynchronized with a new connector.

Metadb requires all streamed tables to have a primary key defined. Tables that do not meet this requirement should be filtered out in the Debezium PostgreSQL connector configuration by setting schema.exclude.list or table.exclude.list. Otherwise they will generate error messages in the Metadb log.

3.8.3. Monitoring replication

The replication slot disk usage must be monitored, because under certain error conditions it can grow too large and possibly fill up the disk. To show the disk usage (in the source database):

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS replication_slot_lag,
       active
    FROM pg_replication_slots;

It is recommended to allocate plenty of extra disk space in the source database.

3.8.4. Creating the data source

In Metadb, a data source is defined using the CREATE DATA SOURCE statement, for example:

CREATE DATA SOURCE sensor TYPE kafka OPTIONS (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1\.',
    consumergroup 'metadb_sensor_1_1',
    addschemaprefix 'sensor_',
    schemastopfilter 'admin'
);

3.8.5. Initial synchronization

Metadb 1.2

When a new data source is first configured using CREATE DATA SOURCE, Metadb automatically puts the database into synchronizing mode, just as if metadb sync had been run (see Server administration > Resynchronizing a data source). This has the effect of pausing periodic transforms and external SQL. When the initial snapshot has finished streaming, the message "source snapshot complete (deadline exceeded)" will be written to the log. Then, to complete this first synchronization, stop the Metadb server, and after that run metadb endsync:

metadb stop -D data

metadb endsync -D data --source sensor

Once "endsync" has finished running, start the Metadb server.

3.8.6. Deleting a connection

Sometimes a connection may have to be deleted and recreated (see Server administration > Resynchronizing a data source). After deleting a connection, the replication slot and publication in the source database should be dropped using:

SELECT pg_drop_replication_slot('debezium');

DROP PUBLICATION dbz_publication;

3.9. Resynchronizing a data source

If a Kafka data stream fails and cannot be resumed, it may be necessary to re-stream a complete snapshot of the data to Metadb. For example, a source database may become unsynchronized with the analytic database, requiring a new snapshot of the source database to be streamed. Metadb can accept re-streamed data in order to synchronize with the source, using the following procedure.

Note that during the synchronization process, the analytic database continues to be available to users. However, streaming updates will be slower than usual, and there may temporarily be missing records (until they are re-streamed) or "extra" records (recently deleted in the source database). Also, periodic transforms and external SQL are paused during synchronization.

Note

The instructions below use the Metadb commands "sync" and "endsync". In Metadb versions before 1.2, these commands were called "reset" and "clean", respectively.

  1. Update the topics and consumergroup configuration settings for the new data stream.

    ALTER DATA SOURCE sensor OPTIONS
        (SET topics '^metadb_sensor_2\.', SET consumergroup 'metadb_sensor_2_1');

    Do not restart the Metadb server but continue directly to Step 2.

  2. Stop the Metadb server and (before starting it again) run metadb sync. This may take some time to run.

    metadb stop -D data
    
    metadb sync -D data --source sensor
  3. Start the Metadb server to begin streaming the data.

  4. Once the new data have finished (or nearly finished) re-streaming, stop the Metadb server, and run metadb endsync to remove any old data that have not been refreshed by the new data stream.

    metadb endsync -D data --source sensor

    The timing of when "endsync" should be run is up to the administrator, but it must be run to complete the synchronization process. In most cases it will be more convenient for users if "endsync" is run too late (delaying removal of deleted records) rather than too early (removing records before they have been re-streamed).

    Metadb detects when snapshot data are no longer being received, and then writes "source snapshot complete (deadline exceeded)" to the log. This generally means it is a good time to run "endsync".

  5. Start the server.

    Until a failed stream is re-streamed by following the process above, the analytic database may continue to be unsynchronized with the source.

3.10. Creating database users

To create a new database user account:

CREATE USER wegg WITH PASSWORD 'LZn2DCajcNHpGR3ZXWHD', COMMENT 'Silas Wegg';

In addition to creating the user, this also creates a schema with the same name as the user. The schema is intended as a workspace for the user.
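For example, the user could store their own tables in this workspace schema (the table here is only illustrative):

CREATE TABLE wegg.notes (id integer, note text);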

Recommendations:
  • Each user account should be for an individual user and not shared by more than one person.

  • Prefer user names of 3 to 8 characters in length.

By default the user does not have access to data tables. To give the user access to all tables generated from a data source (including tables not yet created):

AUTHORIZE SELECT ON ALL TABLES IN DATA SOURCE sensor TO wegg;

Note

AUTHORIZE currently requires restarting the server before it will take effect.

3.11. Administrative database changes

It is possible to make administrative-level changes directly in the underlying PostgreSQL database, such as providing additional tables for users. However, the following guidelines should be followed strictly to avoid disrupting the operation of Metadb:

  1. No changes should be made to any database objects created by Metadb. If it becomes necessary to change these objects at the request of the Metadb maintainers, the server should be stopped first so that it does not operate with an out-of-date cache. If changes are made to the database objects inadvertently, the server should be stopped as soon as possible and not started again until the changes have been reversed.

  2. Changes generally should be made while logged in as a new database user (not a superuser) that is created specifically for that purpose.

  3. Any new database objects should be created in a new schema that will not coincide with a schema that may be created by Metadb. This can be ensured by always setting addschemaprefix in data source configurations and avoiding names with those prefixes when creating a new schema.

  4. Database views should not be created in the database.

4. External projects

4.1. FOLIO

4.1.1. MARC transform

Metadb transforms MARC records from the tables marc_records_lb and records_lb in schema folio_source_record to a tabular form which is stored in a new table, folio_source_record.marc__t. Only records considered to be current are transformed, where current is defined as having state = 'ACTUAL' and an identifier present in 999 ff $i.

This transform usually updates the table folio_source_record.marc__t every few hours. The time of the most recent update can be retrieved from the table metadb.table_update:

SELECT updated
    FROM metadb.table_update
    WHERE schemaname = 'folio_source_record' AND tablename = 'marc__t';

The MARC transform stores partition tables in the schema marctab. Users can ignore this schema, as all data are accessible via folio_source_record.marc__t.
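As a hypothetical example of querying the transformed records (assuming marc__t includes columns such as instance_hrid, field, sf, and content), the following would list subfields of the 245 (title statement) field for each instance:

SELECT instance_hrid, sf, content
    FROM folio_source_record.marc__t
    WHERE field = '245';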

4.1.2. Derived tables

FOLIO "derived tables" are automatically updated once per day, usually at about 03:00:00 UTC by default.

Note that the derived tables are based on a periodic snapshot of data, and for this reason they are generally not up-to-date.

4.1.3. Data model

FOLIO does not provide documentation for its internal data model, which Metadb tables are based on, but it does have some data documentation for its "storage module" APIs, which are roughly equivalent. This is located at https://dev.folio.org/reference/api/. The names of most storage modules end in -storage, but some modules use a different convention; for instance, the storage module for users is mod-users. (All module names begin with mod-.)

4.1.4. Migrating from LDP

This section contains notes related to migrating from LDP to Metadb.

4.1.4.1. Table names

Table names have changed and now are derived from FOLIO internal table names:

LDP table | Metadb table
acquisition_method | folio_orders.acquisition_method
acquisitions_memberships | folio_orders.acquisitions_unit_membership
acquisitions_units | folio_orders.acquisitions_unit
audit_circulation_logs | folio_audit.circulation_logs
circulation_cancellation_reasons | folio_circulation.cancellation_reason
circulation_check_ins | folio_circulation.check_in
circulation_fixed_due_date_schedules | folio_circulation.fixed_due_date_schedule
circulation_loan_history | folio_circulation.audit_loan
circulation_loan_policies | folio_circulation.loan_policy
circulation_loans | folio_circulation.loan
circulation_patron_action_sessions | folio_circulation.patron_action_session
circulation_patron_notice_policies | folio_circulation.patron_notice_policy
circulation_request_policies | folio_circulation.request_policy
circulation_request_preference | folio_circulation.user_request_preference
circulation_requests | folio_circulation.request
circulation_scheduled_notices | folio_circulation.scheduled_notice
circulation_staff_slips | folio_circulation.staff_slips
configuration_entries | folio_configuration.config_data
course_copyrightstatuses | folio_courses.coursereserves_copyrightstates
course_courselistings | folio_courses.coursereserves_courselistings
course_courses | folio_courses.coursereserves_courses
course_coursetypes | folio_courses.coursereserves_coursetypes
course_departments | folio_courses.coursereserves_departments
course_processingstatuses | folio_courses.coursereserves_processingstates
course_reserves | folio_courses.coursereserves_reserves
course_roles | folio_courses.coursereserves_roles
course_terms | folio_courses.coursereserves_terms
email_email | folio_email.email_statistics
feesfines_accounts | folio_feesfines.accounts
feesfines_comments | folio_feesfines.comments
feesfines_feefineactions | folio_feesfines.feefineactions
feesfines_feefines | folio_feesfines.feefines
feesfines_lost_item_fees_policies | folio_feesfines.lost_item_fee_policy
feesfines_manualblocks | folio_feesfines.manualblocks
feesfines_overdue_fines_policies | folio_feesfines.overdue_fine_policy
feesfines_owners | folio_feesfines.owners
feesfines_payments | folio_feesfines.payments
feesfines_refunds | folio_feesfines.refunds
feesfines_transfer_criterias | folio_feesfines.transfer_criteria
feesfines_transfers | folio_feesfines.transfers
feesfines_waives | folio_feesfines.waives
finance_budgets | folio_finance.budget
finance_expense_classes | folio_finance.expense_class
finance_fiscal_years | folio_finance.fiscal_year
finance_fund_types | folio_finance.fund_type
finance_funds | folio_finance.fund
finance_group_fund_fiscal_years | folio_finance.group_fund_fiscal_year
finance_groups | folio_finance.groups
finance_ledgers | folio_finance.ledger
finance_transactions | folio_finance.transaction
inventory_alternative_title_types | folio_inventory.alternative_title_type
inventory_bound_with_part | folio_inventory.bound_with_part
inventory_call_number_types | folio_inventory.call_number_type
inventory_campuses | folio_inventory.loccampus
inventory_classification_types | folio_inventory.classification_type
inventory_contributor_name_types | folio_inventory.contributor_name_type
inventory_contributor_types | folio_inventory.contributor_type
inventory_electronic_access_relationships | folio_inventory.electronic_access_relationship
inventory_holdings | folio_inventory.holdings_record
inventory_holdings_note_types | folio_inventory.holdings_note_type
inventory_holdings_sources | folio_inventory.holdings_records_source
inventory_holdings_types | folio_inventory.holdings_type
inventory_identifier_types | folio_inventory.identifier_type
inventory_ill_policies | folio_inventory.ill_policy
inventory_instance_formats | folio_inventory.instance_format
inventory_instance_note_types | folio_inventory.instance_note_type
inventory_instance_relationship_types | folio_inventory.instance_relationship_type
inventory_instance_relationships | folio_inventory.instance_relationship
inventory_instance_statuses | folio_inventory.instance_status
inventory_instance_types | folio_inventory.instance_type
inventory_instances | folio_inventory.instance
inventory_institutions | folio_inventory.locinstitution
inventory_item_damaged_statuses | folio_inventory.item_damaged_status
inventory_item_note_types | folio_inventory.item_note_type
inventory_items | folio_inventory.item
inventory_libraries | folio_inventory.loclibrary
inventory_loan_types | folio_inventory.loan_type
inventory_locations | folio_inventory.location
inventory_material_types | folio_inventory.material_type
inventory_modes_of_issuance | folio_inventory.mode_of_issuance
inventory_nature_of_content_terms | folio_inventory.nature_of_content_term
inventory_service_points | folio_inventory.service_point
inventory_service_points_users | folio_inventory.service_point_user
inventory_statistical_code_types | folio_inventory.statistical_code_type
inventory_statistical_codes | folio_inventory.statistical_code
invoice_invoices | folio_invoice.invoices
invoice_lines | folio_invoice.invoice_lines
invoice_voucher_lines | folio_invoice.voucher_lines
invoice_vouchers | folio_invoice.vouchers
notes | folio_notes.note
organization_addresses | folio_organizations.addresses
organization_categories | folio_organizations.categories
organization_contacts | folio_organizations.contacts
organization_emails | folio_organizations.emails
organization_interfaces | folio_organizations.interfaces
organization_organizations | folio_organizations.organizations
organization_phone_numbers | folio_organizations.phone_numbers
organization_urls | folio_organizations.urls
patron_blocks_user_summary | folio_patron_blocks.user_summary
perm_permissions | folio_permissions.permissions
perm_users | folio_permissions.permissions_users
po_alerts | folio_orders.alert
po_lines | folio_orders.po_line
po_order_invoice_relns | folio_orders.order_invoice_relationship
po_order_templates | folio_orders.order_templates
po_pieces | folio_orders.pieces
po_purchase_orders | folio_orders.purchase_order
po_receiving_history | (Not supported in Metadb)
po_reporting_codes | folio_orders.reporting_code
srs_error | folio_source_record.error_records_lb
srs_marc | folio_source_record.marc_records_lb
srs_marctab | folio_source_record.marc__t
srs_records | folio_source_record.records_lb
template_engine_template | folio_template_engine.template
user_addresstypes | folio_users.addresstype
user_departments | folio_users.departments
user_groups | folio_users.groups
user_proxiesfor | folio_users.proxyfor
user_users | folio_users.users
4.1.4.2. Column names

The data column in LDP contains JSON objects. In Metadb this column is named jsonb, or in some cases content, matching the FOLIO internal column names.
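To check which name a particular table uses, the standard information_schema catalog can be consulted; for example, for tables in the folio_users schema:

SELECT table_name, column_name
    FROM information_schema.columns
    WHERE table_schema = 'folio_users' AND column_name IN ('jsonb', 'content');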

4.1.4.3. Data types

In Metadb, UUIDs generally have the uuid data type. If a UUID column instead has the text data type, preserved from the source data, it should be cast using ::uuid in queries.

Columns with the json data type in LDP have been changed to use the jsonb data type in Metadb.
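As a hypothetical example of such a cast, suppose a user-created table stores group identifiers as text; they can be cast for comparison with a uuid column (the table wegg.members and its column group_id are illustrative):

SELECT g.id, g.group
    FROM folio_users.groups__t AS g
        JOIN wegg.members AS m ON g.id = m.group_id::uuid;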

4.1.4.4. JSON queries

Querying JSON in Metadb is very similar to querying it in LDP. For clarity, a few examples are given below.

JSON source data

To select JSON data extracted from a FOLIO source, LDP supports:

SELECT data FROM user_groups;

In Metadb, this can be written as:

SELECT jsonb FROM folio_users.groups;

Or, with easier-to-read formatting:

SELECT jsonb_pretty(jsonb) FROM folio_users.groups;

JSON fields: non-array data

For non-array JSON fields, extracting the data directly from JSON in LDP usually takes the form:

SELECT json_extract_path_text(data, 'group') FROM user_groups;

The form recommended for Metadb is:

SELECT jsonb_extract_path_text(jsonb, 'group') FROM folio_users.groups;

JSON fields: array data

To extract JSON arrays, the syntax for Metadb is similar to LDP. A lateral join can be used with the function jsonb_array_elements() to convert the elements of a JSON array to a set of rows, one row per array element.

For example, if the array elements are simple text strings:

CREATE TABLE instance_format_ids AS
SELECT id AS instance_id,
       instance_format_ids.jsonb #>> '{}' AS instance_format_id,
       instance_format_ids.ordinality
FROM folio_inventory.instance
    CROSS JOIN LATERAL jsonb_array_elements(jsonb_extract_path(jsonb, 'instanceFormatIds'))
        WITH ORDINALITY AS instance_format_ids (jsonb);

If the array elements are JSON objects:

CREATE TABLE holdings_notes AS
SELECT id AS holdings_id,
       jsonb_extract_path_text(notes.jsonb, 'holdingsNoteTypeId')::uuid
           AS holdings_note_type_id,
       jsonb_extract_path_text(notes.jsonb, 'note') AS note,
       jsonb_extract_path_text(notes.jsonb, 'staffOnly')::boolean AS staff_only,
       notes.ordinality
FROM folio_inventory.holdings_record
    CROSS JOIN LATERAL jsonb_array_elements(jsonb_extract_path(jsonb, 'notes'))
        WITH ORDINALITY AS notes (jsonb);

JSON fields as columns

LDP transforms simple, first-level JSON fields into columns, which can be queried as, for example:

SELECT id, "group", "desc" FROM user_groups;

The Metadb equivalent of this query is:

SELECT id, "group", "desc" FROM folio_users.groups__t;

Note that the double quotation marks are needed here only because group and desc are reserved words in SQL. Alternatively, they can be removed if the column names are prefixed with a table alias:

SELECT g.id, g.group, g.desc FROM folio_users.groups__t AS g;

Support for transforming subfields and arrays is planned in Metadb.

4.1.5. Configuring Metadb for FOLIO

When creating a FOLIO data source, use the module 'folio' option, and set trimschemaprefix to remove the tenant from schema names and addschemaprefix to add a folio_ prefix to the schema names. For example:

CREATE DATA SOURCE folio TYPE kafka OPTIONS (
    module 'folio',
    trimschemaprefix 'tenantname_',
    addschemaprefix 'folio_',
    brokers 'kafka:29092',
    topics '^metadb_folio_1\.',
    consumergroup 'metadb_folio_1_1',
    schemastopfilter 'admin'
);

It is recommended to use a separate Kafka cluster, rather than the FOLIO Kafka instance, until one has experience with administration of Kafka.

In the Debezium PostgreSQL connector configuration, the following exclusions are suggested:

"schema.exclude.list": "public,.*_mod_login,.*_mod_pubsub,.*pubsub_config,supertenant_mod_.*,.*_mod_kb_ebsco_java,.*_mod_data_export_spring"
"table.exclude.list": ".*__system,.*_mod_agreements.alternate_resource_name,.*_mod_service_interaction.dashboard_access,.*_mod_agreements.availability_constraint,.*_mod_agreements\\.package_description_url,.*_mod_agreements\\.content_type,.*_mod_agreements\\.entitlement_tag,.*_mod_agreements\\.erm_resource_tag,.*_mod_agreements\\.string_template,.*_mod_agreements\\.string_template_scopes,.*_mod_agreements\\.templated_url,.*_mod_oai_pmh\\.instances,.*_mod_remote_storage\\.original_locations,.*_mod_remote_storage\\.item_notes,.*app_setting,.*alternate_name,.*databasechangelog,.*databasechangeloglock,.*directory_entry_tag,.*license_document_attachment,.*license_supp_doc,.*license_tag,.*log_entry_additional_info,.*subscription_agreement_supp_doc,.*subscription_agreement_document_attachment,.*subscription_agreement_ext_lic_doc,.*subscription_agreement_tag,.*tenant_changelog,.*tenant_changelog_lock,.*marc_indexers.*,.*rmb_internal.*,.*rmb_job.*,.*_mod_agreements\\.match_key,.*system_changelog"

4.2. ReShare

4.2.1. Derived tables

ReShare "derived tables" are automatically updated once per day, usually at about 03:00:00 UTC by default.

Note that the derived tables are based on a periodic snapshot of data, and for this reason they are generally not up-to-date.

4.2.2. Configuring Metadb for ReShare

Before defining a ReShare data source, create a data origin for each consortial tenant. For example:

CREATE DATA ORIGIN tenant1;

CREATE DATA ORIGIN tenant2;

CREATE DATA ORIGIN tenant3;

Note

CREATE DATA ORIGIN currently requires restarting the server before it will take effect.

Then use the module 'reshare' option when creating the data source, and set addschemaprefix to add a reshare_ prefix to the schema names:

CREATE DATA SOURCE reshare TYPE kafka OPTIONS (
    module 'reshare',
    addschemaprefix 'reshare_',
    brokers 'kafka:29092',
    topics '^metadb_reshare_1\.',
    consumergroup 'metadb_reshare_1_1',
    schemastopfilter 'admin'
);

Note that the order of commands is important: The initial set of data origins should be created before the data source is created so that schema names of incoming data will be processed correctly. Later, whenever a new consortial tenant is to be added, it should be defined in Metadb using CREATE DATA ORIGIN (and the server restarted) before the tenant is added to ReShare.
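Once data are streaming, rows originating from each consortial tenant can be distinguished by the __origin metadata column; for example (the table name here is illustrative):

SELECT __origin, count(*)
    FROM reshare_directory.entry__
    GROUP BY __origin;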

In the Debezium PostgreSQL connector configuration, it is suggested that credentials (.+mod_login), the public schema, the Okapi supertenant (supertenant_mod_.+), and mod-pubsub data (pubsub_config,.+_mod_pubsub) be excluded using the schema.exclude.list setting.

4.3. MARC transform for LDP

Since Metadb 1.1, the MARC transform in Metadb can also be used with LDP (and LDLite). A command-line tool called marct is provided as a drop-in replacement for ldpmarc.

The system requirements are a subset of those for Metadb:

  • Local storage: 500 GB

  • Database storage: 500 GB

  • Operating system: Linux

  • PostgreSQL 15 or later

  • Go 1.20 or later

To build marct:

mkdir -p bin && go build -o bin ./cmd/marct

which creates a bin/ subdirectory and builds the marct executable there:

./bin/marct -h

In LDP, MARC data are read from the tables public.srs_marc and public.srs_records, and the transformed output is written to the table public.srs_marctab.

Typical usage is:

./bin/marct -D <datadir> -u <ldp_user>

where <datadir> is an LDP data directory containing ldpconf.json, and <ldp_user> is an LDP user to be granted SELECT privileges on the output table.

For example:

./bin/marct -D data -u ldp

Note that marct only grants privileges for a single user. If individual user accounts are configured for LDP, a shell script can be used to grant privileges to the users, for example:

users=/path/to/list/of/users.txt
while read -r u; do
    psql -c "GRANT SELECT ON public.srs_marctab TO $u ;"
done < "$users"

The first time marct runs, it will perform a "full update" of all of the MARC records. In subsequent runs, it will attempt to use "incremental update" to update only records that have changed since the previous run, which can dramatically reduce the running time if the number of changes is small.

However, if very many records have changed, an incremental update may take longer than a full update. If it appears that an incremental update will never finish, it should be canceled, and a full update should be run once before resuming incremental updates. This can be done with the -f command-line option, which disables incremental update and requires marct to perform a full update.