
LibraryCloud Support for LD4L

Goals

  • Provide access to all data aggregated by LibraryCloud for inclusion in the LD4L triple-store

  • Pilot approach for allowing third-party users access to LibraryCloud data pipeline

Background Reading

LibraryCloud Technical Architecture

LibraryCloud API Documentation

LibraryCloud Enhancements

LibraryCloud currently reads data from Aleph, Via, and OASIS, and runs records through a pipeline of normalization and enrichment steps. At the end of the pipeline, the normalized MODS records are published to an AWS queue (using SQS), from which they are written to a Solr index that supports the LibraryCloud Item API.

To support LD4L, we will implement the following enhancements:

A) Publish LibraryCloud messages to LD4L

LibraryCloud messages containing MODS records will be published to an SQS queue that is owned and controlled by an LD4L AWS account. LD4L can elect to receive all records (as part of a full weekly refresh) or just changed records (or both, in separate queues).

Implementation detail: The LibraryCloud ingest pipeline will be enhanced to support a publish-subscribe model, using AWS SNS (Simple Notification Service) and SQS. This will allow management and configuration of additional consumers through AWS, without requiring code changes.
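As an illustration only (the topic and queue names below are placeholders, not the production configuration), wiring a consumer-owned SQS queue to an SNS topic with the AWS SDK for Java might look like this:

    import com.amazonaws.services.sns.AmazonSNSClient;
    import com.amazonaws.services.sns.util.Topics;
    import com.amazonaws.services.sqs.AmazonSQSClient;

    public class ConfigureFanout {
        public static void main(String[] args) {
            AmazonSNSClient sns = new AmazonSNSClient();
            AmazonSQSClient sqs = new AmazonSQSClient();

            // Topic the pipeline would publish normalized MODS messages to
            String topicArn = sns.createTopic("librarycloud-mods-updates").getTopicArn();

            // Consumer-owned queue, e.g. one created in the LD4L AWS account
            String queueUrl = sqs.createQueue("ld4l-mods-queue").getQueueUrl();

            // Subscribe the queue to the topic; this helper also sets the queue
            // policy that allows SNS to deliver messages to the queue
            Topics.subscribeQueue(sns, sqs, topicArn, queueUrl);
        }
    }

Because the subscription is AWS configuration rather than pipeline code, additional consumers (full refresh and delta queues, or other projects) could be added the same way without code changes.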

B) Access to full MARC records

The MODS records published by LibraryCloud will be enhanced to include a link to the full MARC record from which the MODS record was created.

Implementation detail: The MARC records will be uploaded to S3, and the link will be saved in the MODS record as part of an enrichment step during the ingest process.
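A minimal sketch of that enrichment step, assuming an illustrative bucket name and key layout (not yet decided), using the AWS SDK for Java:

    import java.io.File;
    import com.amazonaws.services.s3.AmazonS3Client;

    public class MarcS3Enrichment {
        private static final String BUCKET = "librarycloud-marc";  // illustrative name

        private final AmazonS3Client s3 = new AmazonS3Client();

        /** Upload the original MARC record and return the link to record in the MODS. */
        public String storeMarc(String recordId, File marcFile) {
            String key = "marc/" + recordId + ".xml";
            s3.putObject(BUCKET, key, marcFile);
            // This URL would be written into the MODS record (the exact MODS
            // element that will hold it is an open detail) during enrichment
            return "https://s3.amazonaws.com/" + BUCKET + "/" + key;
        }
    }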

C) SDK

We will implement a Java SDK to simplify the process of building an application that consumes LibraryCloud data. Users (e.g. developers) will only be required to configure the queue from which they want to read records and implement a single function to process the data.  

Implementation detail: The SDK will be managed as a publicly available GitHub repo containing sample Java code, Maven configuration (pom.xml), Camel configuration files, and libraries for marshalling/unmarshalling LibraryCloud messages. It will also contain Vagrant scripts for spinning up EC2 servers on AWS to run the ingest process.
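As a sketch of what that single function might look like for an SDK built on Camel (the class and method names here are illustrative, not the final SDK API), the application would supply one Processor and leave queue configuration to the SDK:

    import org.apache.camel.Exchange;
    import org.apache.camel.Processor;

    public class Ld4lRecordProcessor implements Processor {
        @Override
        public void process(Exchange exchange) throws Exception {
            // By this point the SDK's unmarshalling step would have turned the
            // SQS message body into a MODS payload
            String modsRecord = exchange.getIn().getBody(String.class);

            // Application-specific handling, e.g. converting MODS to LD4L triples
            System.out.println("Received record of length " + modsRecord.length());
        }
    }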

Open Questions / Other Options

Ingest contains only IDs

Consumers of data from the LibraryCloud pipeline might prefer to receive just a list of IDs, rather than the full MODS records, especially if, as in the LD4L case, they are primarily going to be working with the MARC data. Another option (in combination with what’s described above) would be to provide access to an SQS queue that contains only record IDs; LD4L would then use those record IDs to look up the MARC data using the LibraryCloud API. (A further extension would be to allow consumers to directly request a record in its original format from the LibraryCloud API, rather than first requesting the MODS and then separately looking up the original record.)
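A rough sketch of that ID-based flow (the queue name, API endpoint, and request parameter below are placeholders, not confirmed values):

    import java.net.URL;
    import java.util.List;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.Message;

    public class IdQueueConsumer {
        public static void main(String[] args) throws Exception {
            AmazonSQSClient sqs = new AmazonSQSClient();
            String queueUrl = sqs.getQueueUrl("ld4l-record-ids").getQueueUrl();

            List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();
            for (Message m : messages) {
                String recordId = m.getBody();
                // Placeholder URL: the actual Item API request for a record
                // (or its original MARC) would be substituted here
                URL itemUrl = new URL("http://api.lib.harvard.edu/v2/items?recordIdentifier=" + recordId);
                // ... fetch and process the record, then remove the message
                sqs.deleteMessage(queueUrl, m.getReceiptHandle());
            }
        }
    }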

 

Use S3 for data in the ingest pipeline

The messages in the LibraryCloud pipeline currently cannot be larger than 256 KB due to SQS message size limits. Our current approach is to limit the number of MODS records in each message to keep the message size below 256 KB. Another option is to store the message payload in S3, which would eliminate any limit on the size of the messages. Any added complexity would be confined to the standard marshalling/unmarshalling routines. How this would affect throughput is unknown: there would be additional latency in retrieving and saving the payload data at each step, but it would likely be more efficient for enrichment steps to operate on larger chunks of data at a time.
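A minimal sketch of the marshalling side of that option (the bucket name and pointer format are assumptions for illustration):

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    public class PayloadMarshaller {
        private static final int SQS_LIMIT_BYTES = 256 * 1024;
        private static final String BUCKET = "librarycloud-payloads";  // illustrative name

        private final AmazonS3Client s3 = new AmazonS3Client();

        /**
         * Returns the payload itself when it fits in an SQS message, or a small
         * pointer to an S3 object when it does not.
         */
        public String marshal(String messageId, String payload) {
            byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
            if (bytes.length <= SQS_LIMIT_BYTES) {
                return payload;
            }
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(bytes.length);
            s3.putObject(BUCKET, "payloads/" + messageId, new ByteArrayInputStream(bytes), meta);
            // The matching unmarshalling routine would detect this prefix and
            // fetch the real payload from S3 before the next pipeline step
            return "s3://" + BUCKET + "/payloads/" + messageId;
        }
    }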

Development tasks (add about 1.5 months to the start of this work)


Week of May 25

  • Test and deploy updated ingestion code (with collections API fix)

Week of June 1

  • Document LD4L architecture 
  • Fix Collection API Namespaces
  • Add search by Collection ID to Item API
  • Add search by Collection Name to Collection API
  • Define features for better Collections API builder

Week of June 8

  • Implement better Collections API builder

Week of June 15

  • Implement better Collections API builder

Week of June 22

  • Out (Vacation)
  • Collections API builder deploy and test

Week of June 29

  • Collections API builder deploy and test

Week of July 6

  • Create sample SDK project
  • Configure SQS->SNS->SQS distribution
  • Write code to unpack SNS-wrapped SQS messages 

Week of July 13

  • Upload original records to S3
  • Add reference to original records to MODS
  • Write code to fall back to S3 storage for large messages

Week of July 20

  • Deploy filter for unchanged records
  • Create queues for delta loads and full loads
  • LD4L QA and Testing

Week of July 27

  • LD4L QA and Testing

Week of August 3

  • LD4L QA and Testing
