LibraryCloud Support for LD4L (original source doc LibraryCloud Support for LD4L)

Goals

Background Reading

LibraryCloud Technical Architecture

LibraryCloud API Documentation

LibraryCloud Enhancements

LibraryCloud currently reads data from Aleph, Via, and OASIS, and runs records through a pipeline of normalization and enrichment steps. At the end of the pipeline, the normalized MODS records are published to an AWS queue (using SQS), from which they are written to a Solr index that supports the LibraryCloud Item API.

To support LD4L, we will implement the following enhancements:

A) Publish LibraryCloud messages to LD4L

LibraryCloud messages containing MODS records will be published to an SQS queue that is owned and controlled by an LD4L AWS account. LD4L can elect to receive all records (as part of a full weekly refresh) or just changed records (or both, in separate queues).

Implementation detail: The LibraryCloud ingest pipeline will be enhanced to support a publish-subscribe model, using AWS SNS (Simple Notification Service) and SQS. This will allow management and configuration of additional consumers through AWS, without requiring code changes.
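The publish-subscribe model can be illustrated with a minimal in-memory sketch. It does not call AWS; FanOutTopic is a hypothetical stand-in for an SNS topic, and each subscribed queue stands in for an SQS queue. The point it shows is that the publisher sends each message once and every subscriber receives its own copy, so adding a consumer such as LD4L is a subscription change rather than a change to the publishing code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical in-memory stand-in for an SNS topic that fans out to SQS-like queues. */
class FanOutTopic {
    private final List<Queue<String>> subscribers = new ArrayList<>();

    /** Registers a new consumer queue; the publishing code is unaffected. */
    Queue<String> subscribe() {
        Queue<String> queue = new ArrayDeque<>();
        subscribers.add(queue);
        return queue;
    }

    /** Delivers one copy of the message to every subscribed queue. */
    void publish(String message) {
        for (Queue<String> queue : subscribers) {
            queue.add(message);
        }
    }

    public static void main(String[] args) {
        FanOutTopic topic = new FanOutTopic();
        Queue<String> solrQueue = topic.subscribe(); // existing consumer feeding the Solr index
        Queue<String> ld4lQueue = topic.subscribe(); // additional consumer owned by LD4L
        topic.publish("<mods>...</mods>");
        System.out.println("solr=" + solrQueue.size() + " ld4l=" + ld4lQueue.size());
    }
}
```

In the real pipeline the subscriptions would be managed in the AWS configuration, which is what allows consumers to be added without code changes.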

B) Access to full MARC records

The MODS records published by LibraryCloud will be enhanced to include a link to the full MARC record from which the MODS record was created.

Implementation detail: The MARC records will be uploaded to S3 and the link saved in the MODS record as part of an enrichment step during the ingest process.
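As a rough sketch of what such an enrichment step might do, the code below adds a link to a MODS document using the standard Java DOM API. Several details are assumptions made for illustration only: the class name MarcLinkEnricher, the S3 key layout (marc/<recordId>.mrc), the bucket name, and the choice of a MODS location/url element to hold the link. The S3 upload itself is not shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical enrichment step that records where the source MARC record is stored. */
class MarcLinkEnricher {
    static final String MODS_NS = "http://www.loc.gov/mods/v3";

    /** Assumed S3 layout: one MARC object per record ID (ID is not URL-encoded here). */
    static String marcUrl(String bucket, String recordId) {
        return "https://" + bucket + ".s3.amazonaws.com/marc/" + recordId + ".mrc";
    }

    /** Appends <location><url>...</url></location> to the MODS root element. */
    static Document addMarcLink(Document mods, String url) {
        Element location = mods.createElementNS(MODS_NS, "location");
        Element urlElement = mods.createElementNS(MODS_NS, "url");
        urlElement.setAttribute("displayLabel", "Full MARC record");
        urlElement.setTextContent(url);
        location.appendChild(urlElement);
        mods.getDocumentElement().appendChild(location);
        return mods;
    }

    /** Parses a MODS XML string with namespace support enabled. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse MODS XML", e);
        }
    }

    public static void main(String[] args) {
        Document mods = parse("<mods xmlns=\"" + MODS_NS + "\"><titleInfo/></mods>");
        addMarcLink(mods, marcUrl("example-bucket", "001234"));
        System.out.println(mods.getElementsByTagNameNS(MODS_NS, "url").item(0).getTextContent());
    }
}
```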

C) SDK

We will implement a Java SDK to simplify the process of building an application that consumes LibraryCloud data. Developers using the SDK will only need to configure the queue from which they want to read records and implement a single function to process the data.
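The "configure a queue, implement one function" idea could look roughly like the sketch below. The names ModsRecordProcessor and QueueConsumer are hypothetical, not the SDK's actual API, and a plain in-memory queue stands in for the configured SQS queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** The single function a consumer application would implement (hypothetical name). */
@FunctionalInterface
interface ModsRecordProcessor {
    void process(String modsXml);
}

/** Hypothetical SDK runner: reads records from a queue and hands each one to the processor. */
class QueueConsumer {
    private final Queue<String> queue;
    private final ModsRecordProcessor processor;

    QueueConsumer(Queue<String> queue, ModsRecordProcessor processor) {
        this.queue = queue;
        this.processor = processor;
    }

    /** Processes everything currently in the queue and returns the number of records handled. */
    int drain() {
        int handled = 0;
        String record;
        while ((record = queue.poll()) != null) {
            processor.process(record);
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        queue.add("<mods>record 1</mods>");
        queue.add("<mods>record 2</mods>");
        // The application-specific part is just this one lambda.
        QueueConsumer consumer = new QueueConsumer(queue, mods -> System.out.println("Got " + mods));
        System.out.println("Handled " + consumer.drain() + " records");
    }
}
```

Keeping the application code down to one callback means queue polling, message unmarshalling, and error handling can live inside the SDK.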

Implementation detail: The SDK will be managed as a publicly available GitHub repo containing sample Java code, Maven configuration (pom.xml), Camel configuration files, and libraries for marshalling/unmarshalling LibraryCloud messages. It will also contain Vagrant scripts for spinning up EC2 servers on AWS to run the ingest process.

Open Questions / Other Options

Ingest contains only IDs

Consumers of data from the LibraryCloud pipeline might prefer to receive just a list of IDs, rather than the full MODS records, especially if, as in the LD4L case, they are primarily going to be working with the MARC data. Another option (in combination with what’s described above) would be to provide access to an SQS queue that contains only record IDs; LD4L would then use those record IDs to look up the MARC data using the LibraryCloud API. (Another extension would be to allow consumers to request a record in its original format directly from the LibraryCloud API, rather than first requesting the MODS and then separately looking up the original record.)

 

Use S3 for data in the ingest pipeline

Messages in the LibraryCloud pipeline currently cannot be larger than 256 KB due to SQS message size limits. Our current approach is to limit the number of MODS records in each message to keep the message size below 256 KB. Another option is to store the message payload in S3, which would remove the SQS limit on message size. Any added complexity would be confined to the standard marshalling/unmarshalling routines. The effect on throughput is unknown: there would be additional latency in retrieving and saving the payload data at each step, but it would likely be more efficient for enrichment steps to operate on larger chunks of data at a time.
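The current approach of capping the number of records per message can be sketched as a simple greedy batcher. This is an illustration only, not the pipeline's actual code: MessageBatcher is a hypothetical name, the limit uses the 256 KB figure quoted above, and the sketch ignores the bytes added by the message envelope, which a real implementation would need to budget for.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that packs records into batches that each fit under a byte limit. */
class MessageBatcher {
    /** Message size limit as stated in this document (256 KB). */
    static final int MAX_MESSAGE_BYTES = 256 * 1024;

    /** Greedily groups records, in order, so each batch's UTF-8 size is at most maxBytes. */
    static List<List<String>> batch(List<String> records, int maxBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentBytes = 0;
        for (String record : records) {
            int size = record.getBytes(StandardCharsets.UTF_8).length;
            if (size > maxBytes) {
                // A single oversized record cannot be sent inline at all.
                throw new IllegalArgumentException("Record of " + size + " bytes exceeds the limit");
            }
            if (currentBytes + size > maxBytes) {
                // Adding this record would overflow the batch, so start a new one.
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(record);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("<mods>a</mods>", "<mods>b</mods>", "<mods>c</mods>");
        System.out.println(batch(records, MAX_MESSAGE_BYTES).size() + " message(s)");
    }
}
```

The exception branch marks the case the S3 option would address: a record too large to send inline could instead be written to S3, with the message carrying only a pointer to it.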

Development tasks (add about 1.5 months to start of this work...)


Week of May 25

Week of June 1

Week of June 8

Week of June 15

Week of June 22

Week of June 29

Week of July 6

Week of July 13

Week of July 20

Week of July 27

Week of August 3

LD4L QA and Testing