The Labrador Data Retriever API

CAFRI maintains a massive amount of spatial data in the cloud which is useful across multiple projects. Sharing this amount of data over things like Dropbox or Google Drive is impractical, and managing user access to individual files in order to only share smaller pieces sounds like a lot of work. So we built a bespoke data-sharing platform for use within our lab, because that sounded easier.

That platform is named Labrador (because it helps you retrieve your data). Most users will never actually interact with Labrador directly; typically, data is downloaded through the “client” package (creatively named labrador.client) which handles communication with the service.

This file walks through the high-level design of the Labrador API and gives instructions for managing it.

Labrador Anatomy

The API

Design

At its core, Labrador is an R-based API written using the plumber package. This library was chosen for an obvious reason: most people at ESF, if they know any programming language at all, know R. We could likely have built a more efficient version of the same API using Flask/Rocket/Node/Whatever, but doing so would dramatically limit the number of people within CAFRI who could make changes to the API. Building Labrador has never been anyone’s full-time job, so prioritizing for skills that the people using it might already have is important.

The API itself is structured as an R package, with a DESCRIPTION file listing the API version and R package dependencies. This lets the API be quickly validated using R CMD check, and ensures that API containers (below) have all required packages installed before the API is run. Code related to API functions is stored inside the R/ directory, while code related to running the actual API is in inst/.

Making a Request

When it comes to raster data access, Labrador thinks of data in terms of products and indices. Products are single raster data sets (for instance, “agb_1.1.0_FEMA_FranklinStLawrence2016” or “mag_2004_nys_lt_ftv_30m”) which are fixed in meaning; a given product name should always represent the exact same data. These raster layers are then split into a number of tiles (typically 432, though see the next paragraph) along a standard grid, resulting in a number of smaller rasters which perfectly align with all other products stored in Labrador. These tiles are referred to by their indices, which represent the coordinates of the top-left corner of the tile in our standard CRS. These tiles are stored in S3 in a bucket named “cafri-share”, and are named following the pattern <product>/<index>.tiff. Labrador has full read access to this bucket.
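
For example, here is roughly how a product name and an index combine into an S3 key (the index value below is made up for illustration; real indices encode the tile’s top-left corner in our standard CRS):

product <- "mag_2004_nys_lt_ftv_30m"
index   <- "1740000_2280000"  # hypothetical index value
s3_key  <- sprintf("s3://cafri-share/%s/%s.tiff", product, index)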

In some situations, such as with 1-meter resolution rasters, cropping tiles using the standard grid results in tiles which are too large to be delivered over the API without timing out. These rasters are instead cropped using a smaller grid, which subdivides the original tiles to reduce the output file size. In these situations, tiles are still named <product>/<index>.tiff, with index referring to the top-left corner.

When a Labrador instance comes online, it calls the file run_labrador.R. This file sets up the API along with some simple logging and then sits on port 8000, listening for requests.
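
In spirit, run_labrador.R boils down to something like the following (a simplified sketch, not the actual file; the path to the endpoint definitions is an assumption):

library(plumber)
pr <- plumb(system.file("plumber.R", package = "labrador"))  # endpoint definitions
pr$run(host = "0.0.0.0", port = 8000)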

When a request arrives, Labrador pounces into action. First, the API checks the HTTP headers for an authorization token, and, if found, checks it against a hash stored in our database. If a matching token is found, the request proceeds. This filter is not applied to the landing page at https://labrador.cafri-ny.org so that the API status may be checked in a web browser.
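
A filter implementing this check might look something like the following (a sketch; check_creds() is real and discussed under Caching, but the header handling details are assumptions):

#* @filter auth
function(req, res) {
  if (req$PATH_INFO == "/") return(plumber::forward())  # skip the landing page
  token <- req$HTTP_AUTHORIZATION
  if (!is.null(token) && check_creds(token)) {
    plumber::forward()  # token matched a stored hash; continue to the endpoint
  } else {
    res$status <- 401
    list(error = "Missing or invalid authorization token")
  }
}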

The details of what happens next depend on which endpoint is being queried. The majority of endpoints result in simple database lookups – using user-provided parameters to grab vector files, index numbers applicable to a given area, available products, and so on. Returning raster data is marginally more complicated, however. First, the API checks whether a data product is “restricted” (currently, only tax parcel data). If so, the API then checks to make sure your token has the proper permissions to access this data; otherwise, it will error out. Assuming that you have the proper permissions or the data is unrestricted, the API then moves along to actually producing the data. Using aws s3 cp, the API copies the requested tile (at a specific product and index value) to the /tmp directory, checks to make sure the copy was successful, and then streams the file to the user requesting it.
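
Put into code, that flow looks roughly like this (a sketch; is_restricted(), check_sudo(), and get_folder_name() are real helpers discussed under Caching, but the endpoint’s exact shape is an assumption):

function(req, res, product, index) {
  if (is_restricted(product) && !check_sudo(req$HTTP_AUTHORIZATION)) {
    res$status <- 403
    return(list(error = "Token lacks permission for this restricted product"))
  }
  key  <- sprintf("s3://cafri-share/%s/%s.tiff", get_folder_name(product), index)
  dest <- file.path("/tmp", paste0(product, "_", index, ".tiff"))
  ok   <- system2("aws", c("s3", "cp", key, dest)) == 0
  if (!ok || !file.exists(dest)) {
    res$status <- 500
    return(list(error = "Copying the tile from S3 failed"))
  }
  # Stream the tile back; /tmp cleanup is handled by the systemd timer (below)
  plumber::as_attachment(readBin(dest, "raw", file.size(dest)), basename(dest))
}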

This streaming appears (I believe) to happen asynchronously, freeing the API to begin processing the next request. This is fantastic for efficiency’s sake, but means that the tile must persist on the Labrador box until the user has finished downloading it. However, as some products (when all the tiles are downloaded at once) are larger than the Labrador box has disk space to support, we have a systemd job which runs every minute to delete files in /tmp which haven’t been accessed for more than 5 minutes.
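
Conceptually, that cleanup pass amounts to something like the following (a sketch of the idea, not necessarily the script’s exact invocation):

find /tmp/labrador -type f -amin +5 -delete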

Caching

We’ve begun to implement caching, using the memoise package and cachem::cache_mem(), for frequently made calls that involve network latency or database lookups. At the time of writing, this means that the following lookups might be using (potentially stale) cached values:

  • Checking whether or not the user has access to Labrador (check_creds()) – 10 minute timeout
  • Looking up the s3 name of the requested product (get_folder_name()) – 1 hour timeout
  • Checking if a product is restricted, like tax parcels (is_restricted()) – 10 minute timeout
  • Checking if a user has access to restricted data (check_sudo()) – 10 minute timeout

This means that changes in the database may take some time to become reflected in the live API. There is currently no way to manually clear the cache for these items; either restart the API, or wait for the cache to expire on its own.
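
The pattern for each of these lookups looks roughly like this (a simplified sketch; check_creds_impl is a hypothetical name for the uncached database lookup):

library(memoise)
# Cache check_creds() results in memory, expiring entries after 10 minutes
check_creds <- memoise(
  check_creds_impl,
  cache = cachem::cache_mem(max_age = 600)
)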

Server Infrastructure

The API runs on an AWS Lightsail instance. You will need to be added to our AWS environment to access it.

On the box, a few files are particularly important:

  • ~/labrador contains the actual Labrador API code itself. Within this repo are several other files used in running the API:
    • ~/labrador/Dockerfile is the file used to build the Labrador docker image. This file is used in-place (that is, Docker looks for /home/admin/labrador/Dockerfile to build the image).
    • ~/labrador/etc/ contains configuration files for the Docker containers hosting the API and for nginx. These files are not used in-place – to update the actual production versions of these files on the box, make sure to run sudo cp -r etc / from the labrador directory.
    • ~/labrador/usr/ similarly contains shell scripts necessary for the Labrador system, generally managed by systemd. Like with etc, files are not used in-place – to update the actual production versions of these files on the box, make sure to run sudo cp -r usr / from the labrador directory.

The Dockerfile builds an “image” that contains a version of R, the actual Labrador API code, and all other system resources used by the API. This image can be deployed on any computer, and can even be deployed multiple times. We rebuild this image manually when we need to update dependencies or make changes to the environment the API runs in. The image is then frozen in the form it was in when it was built.

The docker-compose files (inside etc) then use these images to create “containers”, which actually execute the API code. Every container using the same image looks exactly the same, and contains the same file structure with the same libraries installed, as the image did when it was built. These containers are (mostly) isolated from the rest of the server – it is difficult for the API (or any other code running inside the container) to impact files or processes running on the server outside of the container. The main exceptions are two “volumes” we mount to the containers at runtime – the containers share the /tmp directory with our server (and each other), to take advantage of the systemd timer (described below) that deletes stale files to preserve disk space, and share the /home/admin/labrador directory so that new containers have access to the newest version of the API, regardless of when the image was last built.

An advantage of this isolation is that the “ports” on the container, which apps use to communicate with each other and the Internet, are separate from the “ports” of the actual machine. For that reason, even though containers are all using the same image (which expects connections on port 8000), our docker-compose files specify different port mappings for each container. This way, we can have containers listening on port 8001 and 8002 (for the production API) without needing to specify that inside the container itself – we just redirect all messages sent to the server on 8001 to the appropriate container on 8000.

These details are completely irrelevant to the end-user, who is requesting data from labrador.cafri-ny.org over HTTPS, which runs on port 443 on all servers. nginx listens to all requests on that port and then funnels them appropriately – requests to labrador.cafri-ny.org are sent to the production containers (with the containers taking turns accepting requests), while requests to labrador.cafri-ny.org:9779 are sent to the development endpoint. Our nginx configuration handles this level of mapping, from user-facing URLs to the container ports defined in docker-compose.

Systemd

Within Labrador’s etc folder are a number of systemd unit files, which configure jobs managed by the operating system on the Lightsail box. Systemd is a standard init system used by the majority of Linux systems, and is a rather expansive program to wrap your head around; any of the many basic systemd introductions online will cover what we use here.

We don’t use any particularly advanced features of systemd. Most of our needs can be met via a few standard commands. For instance, to restart a systemd service, use the command:

sudo systemctl restart <name_of_service>

Where name_of_service is the full name of the unit file (for example, cert-renew.timer).

To enable a new systemd service (whose unit file is on the box at /etc/systemd/system/name_of_service.service) use:

sudo systemctl enable --now name_of_service.service

Enabling a service means that it will automatically start running whenever the Lightsail box is restarted. In order to run a service without enabling it (that is, without having it start back up when the box restarts), use systemctl start. To stop a service, use systemctl stop; to disable a service, use systemctl disable (or systemctl disable --now to stop it at the same time). To see recent logs from a service, use systemctl status (or journalctl -u name_of_service for a longer history).

Currently, there are three main systemd services related to Labrador running on the box:

  • cert-renew.timer renews the TLS/SSL certificate used to protect requests to Labrador on a monthly basis. When the timer activates, it calls the systemd service cert-renew.service, which then runs the script at /usr/local/bin/cert_renew.sh.
  • labrador-tmp-delete.timer deletes any files in /tmp/labrador which are older than five minutes. It does this by calling the systemd service labrador-tmp-delete.service, which handles the deletions directly.
  • docker-compose@.service takes advantage of systemd’s template units. Anything after the @ symbol in the service name is passed to the unit as a variable (the %i specifier) when the service is executed. This job uses that to run Docker Compose applications through a standard interface, including Labrador and Labrador-Dev. Any Docker Compose app with a docker-compose.yml file in /etc/docker/compose can be managed using this service; simply use docker-compose@name_of_folder as the service name when calling systemd commands (for instance, systemctl status docker-compose@labrador).

Autodog

The last component of the puzzle is the service which tracks available products and their definitions, named autodog. This is short for “Automatic Dogumentation”: the service used to build a documentation website, which has since been deprecated, but the pun is still very good.

These days, autodog gets a list of all products available in S3 and uses it to update the list of available products in our database. The service is managed via systemctl.

Note that autodog uses stored procedures in our postgres database to perform these lookups; to see the queries, look in inst/stored_procedures.psql. These procedures can be queried outside of the autodog context if useful.
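
For instance, calling one of these procedures from an R session on the box might look like this (the procedure name list_products() is made up; check inst/stored_procedures.psql for the real ones):

con <- DBI::dbConnect(
  RPostgres::Postgres(),
  host     = "dev-1-cafri.c3yxeexiauxi.us-east-1.rds.amazonaws.com",
  user     = "postgres",
  dbname   = "cafri_share",
  password = Sys.getenv("rds")  # stored in the .Renviron on the box
)
DBI::dbGetQuery(con, "SELECT * FROM list_products();")
DBI::dbDisconnect(con)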

Monitoring

Labrador uses the openmetrics R package to publish statistics on how long the API takes to answer requests, how many requests are being received, and a few other relevant data points. These metrics are “registered” as part of the API initialization script, and create a new endpoint at https://labrador.cafri-ny.org/metrics.
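
In terms of the startup sketch above, the registration amounts to wrapping the router before it runs (assuming openmetrics’ plumber integration; the exact wiring in our init script may differ):

pr <- register_plumber_metrics(pr)  # wraps the router and adds the /metrics endpoint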

These metrics are then ingested by another service running on the box, Prometheus. This service scrapes the most recent metrics every 5 seconds, as configured in the file /etc/prometheus/prometheus.yml. Metrics are then stored for 7 days.

These metrics can be seen and queried using Grafana, which runs on the Labrador box and can be accessed at http://labrador.cafri-ny.org:3000/. To access this URL, you’ll need to add your IP address to the whitelist in Lightsail (see our AWS docs, particularly the Lightsail section, for access instructions); to log in, use the username “admin” and the password stored in /home/admin/.grafana_readme on the Labrador box.

The Prometheus app itself is installed in /usr/local/bin/prometheus, configured by a file at /etc/prometheus/prometheus.yml, and managed via a systemd service file at /etc/systemd/system/prometheus.service. You can restart the service via sudo systemctl restart prometheus. Updating Prometheus requires downloading the newest Prometheus version, unpacking it via tar -xzvf, and then moving the downloaded prometheus and promtool apps to /usr/local/bin, as sketched below.
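
For example, updating to a hypothetical version X.Y.Z would look roughly like this (check the Prometheus releases page for the real current version):

wget https://github.com/prometheus/prometheus/releases/download/vX.Y.Z/prometheus-X.Y.Z.linux-amd64.tar.gz
tar -xzvf prometheus-X.Y.Z.linux-amd64.tar.gz
sudo mv prometheus-X.Y.Z.linux-amd64/prometheus prometheus-X.Y.Z.linux-amd64/promtool /usr/local/bin/
sudo systemctl restart prometheus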

  • TODO: Add information on how Grafana runs on the box. Was it installed directly? How to restart, how to update?

Doing Tricks

Query the Dev Endpoint

Run Sys.setenv("labrador_port" = "9779") in your R session and use labrador.client as normal. Note that this is not the port that labrador_dev is configured to listen on in docker-compose, but rather the port nginx routes to labrador_dev (configured in etc/nginx/sites-enabled). This is confusing.

Adding Users

  1. On the Lightsail box, enter the labrador directory and run Rscript -e "devtools::install()".
  2. Enter an R session (by running R) and run labrador::add_user().

Restarting the API

  1. On the Lightsail box, run sudo systemctl restart docker-compose@labrador.

Deploying Changes

  1. On the Lightsail box, enter the labrador directory and pull your changes to the box. Make sure the git branch in ~/labrador is the one with your changes.
  2. To test your changes on the development endpoint, run the following commands:
cd /etc/docker/compose/labrador_dev
docker-compose build --no-cache
sudo systemctl restart docker-compose@labrador_dev
  3. On your local machine, run Sys.setenv("labrador_port" = "9779") in R to send requests to the development endpoint.
  4. When satisfied with your testing, run the following on the Lightsail box to restart the production API:
cd /etc/docker/compose/labrador
docker-compose build --no-cache
sudo systemctl restart docker-compose@labrador

Reading Logs

  1. On the Lightsail box, cd to either /etc/docker/compose/labrador or /etc/docker/compose/labrador_dev (as appropriate).
  2. Run docker-compose logs to view the log file for that service.
  3. Useful options include --tail=#, to see the last # log lines, and -f to “follow” the logs and have them print to your terminal as new requests come in.

Adding Regions

To add a new region to Labrador, you first need to get the file containing the boundaries of the new region onto the Lightsail box. One way to achieve that, which will upload your file to the /tmp directory on the box, is to run this:

aws_access_pem_key="<path_to_your_pem_key>"
region_geometry="<path_to_your_vector_file>"

scp -i "$aws_access_pem_key" "$region_geometry" admin@54.144.185.0:/tmp/

Once your geometry is on the Lightsail box, ssh onto the box:

ssh -i "$aws_access_pem_key" admin@54.144.185.0

Start an R session (via R) on the Lightsail box. You can then upload your region using the labrador::add_region_to_db() function, whose arguments work like this:

labrador::add_region_to_db(
  sf::read_sf("/tmp/<path_to_region_file>"), 
  "<The name of your region, like 'state_shoreline'>",
  "<The group your region belongs to, like 'civil'>",
  "<An explanation of what your region represents>"
)

Please make sure to delete your geometry file from the Lightsail box after this!

Adding Products

See the documentation in https://github.com/cafri-labs/cafri-share.

Data should be available to download once the autodog service updates the list of products in the database, which happens roughly every hour. If you want to access your data faster than that, run sudo systemctl restart autodog to trigger an immediate update.

Renewing Certificates

We now have a systemd service running at midnight (UTC) on the first of every month to auto-renew our certificate. See etc/systemd/system/cert-renew.* and /usr/local/bin/cert_renew.sh. This will take our services down for a few minutes each month.

If you want to manually renew the certificate, ssh onto the lightsail box and run the following:

sudo systemctl stop docker-compose@labrador
sudo systemctl stop nginx.service
sudo certbot renew
sudo systemctl start nginx.service
sudo systemctl start docker-compose@labrador

Changing the Download Mode

labrador.client looks for an environment variable called "labrador_download_mode", which when set to "aws" allows us to (mostly) bypass HTTPS requests to the Labrador server and download tiles directly from S3. All other settings for "labrador_download_mode" result in normal operation, with all requests hitting the Labrador server and all data transfers occurring over HTTPS. This setting is meant to offer speedy downloads, without HTTPS overhead, only when we are operating from compute instances within our AWS cloud infrastructure with S3 access and the AWS CLI installed.
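
Conceptually, the client-side branch works like this (a hypothetical illustration; the function name and endpoint path are not the real labrador.client internals):

download_tile <- function(product, index, dest) {
  if (Sys.getenv("labrador_download_mode") == "aws") {
    # Inside our AWS infrastructure: copy straight from S3 with the AWS CLI
    key <- sprintf("s3://cafri-share/%s/%s.tiff", product, index)
    system2("aws", c("s3", "cp", key, dest))
  } else {
    # Normal operation: request the tile from the Labrador API over HTTPS
    url <- sprintf("https://labrador.cafri-ny.org/tile?product=%s&index=%s",
                   product, index)
    httr::GET(url, httr::write_disk(dest, overwrite = TRUE))
  }
  invisible(dest)
}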

Connect to the database

To connect to our RDS instance from the Labrador box, run the following command:

psql -U postgres -h dev-1-cafri.c3yxeexiauxi.us-east-1.rds.amazonaws.com

The password for this database is stored as the rds variable in the .Renviron on the box.

Our RDS is a bog-standard PostGIS installation, meaning it works like any standard Postgres database. Run \list to see the available databases, \dt to see the available tables in your current database, and \c <name> to change to a different database. Remember that SQL statements (though not these backslash meta-commands) need to end with a semicolon.

Grant a user “sudo” permissions

To grant a user superuser status – so they can access tax data, or any other protected products – run the following to connect to our RDS instance:

psql -U postgres -h dev-1-cafri.c3yxeexiauxi.us-east-1.rds.amazonaws.com

Then, in Postgres, run the following:

\c cafri_share
INSERT INTO sudo_accounts VALUES ('<email>');

Where <email> is the email associated with the user’s Labrador token.