LibraryCloud Support for LD4L
Provide access to all data aggregated by LibraryCloud for inclusion in the LD4L triple-store
Pilot approach for allowing third-party users access to LibraryCloud data pipeline
LibraryCloud Technical Architecture
LibraryCloud API Documentation
LibraryCloud currently reads data from Aleph, Via, and OASIS, and runs records through a pipeline of normalization and enrichment steps. At the end of the pipeline, the normalized MODS records are published to an AWS queue (using SQS), from which they are written to a Solr index that supports the LibraryCloud Item API.
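Conceptually, the pipeline is an ordered list of steps that each transform a record. The sketch below illustrates that shape only; the step names and interface are illustrative assumptions, not the actual LibraryCloud components.

```java
import java.util.List;

/**
 * Conceptual sketch of the ingest pipeline: each record passes through an
 * ordered list of normalization/enrichment steps before being published.
 * The Step interface and step bodies are illustrative, not the real code.
 */
public class PipelineSketch {
    /** One normalization or enrichment step. */
    interface Step {
        String apply(String record);
    }

    /** Runs a record through every step in order. */
    static String run(String record, List<Step> steps) {
        for (Step step : steps) {
            record = step.apply(record);
        }
        return record;
    }
}
```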
To support LD4L, we will implement the following enhancements:
LibraryCloud messages containing MODS records will be published to an SQS queue that is owned and controlled by an LD4L AWS account. LD4L can elect to receive all records (as part of a full weekly refresh) or just changed records (or both, in separate queues).
Implementation detail: The LibraryCloud ingest pipeline will be enhanced to support a publish-subscribe model, using AWS SNS (Simple Notification Service) and SQS. This will allow management and configuration of additional consumers through AWS, without requiring code changes.
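The value of the publish-subscribe model is that every subscribed queue receives its own copy of each message, so a new consumer (such as an LD4L-owned queue) can be added without touching the publisher. The in-memory sketch below illustrates that fan-out behavior; it is a conceptual stand-in, not the AWS SNS/SQS API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/**
 * In-memory sketch of SNS/SQS fan-out (not the AWS API): each message
 * published to a topic is copied to every subscribed queue, so consumers
 * can be added through configuration rather than code changes.
 */
public class FanOutSketch {
    /** Stands in for an SNS topic; subscribed queues stand in for SQS. */
    static class Topic {
        private final List<Queue<String>> subscribers = new ArrayList<>();

        Queue<String> subscribe() {
            Queue<String> queue = new ArrayDeque<>();
            subscribers.add(queue);
            return queue;
        }

        void publish(String message) {
            for (Queue<String> q : subscribers) {
                q.add(message); // every subscriber gets its own copy
            }
        }
    }
}
```

In this model, the "all records" and "changed records" options would simply be two topics, each with whatever queues (Solr indexing, LD4L, future consumers) are subscribed to it.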
The MODS records published by LibraryCloud will be enhanced to include a link to the full MARC record from which the MODS record was created
Implementation detail: The MARC records will be uploaded to S3 and the link saved in the MODS record as part of an enrichment step during the ingest process.
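The enrichment step might look like the sketch below. The bucket name, key scheme, and the choice of the MODS <location><url> element are illustrative assumptions, not the final design.

```java
/**
 * Sketch of the MARC-link enrichment step. The bucket name, key layout,
 * and MODS element used here are assumptions for illustration only.
 */
public class MarcLinkEnricher {
    // Hypothetical S3 bucket holding the uploaded MARC records.
    static final String BUCKET = "librarycloud-marc";

    /** Builds the S3 URL where the original MARC record would live. */
    static String marcUrl(String recordId) {
        return "https://" + BUCKET + ".s3.amazonaws.com/" + recordId + ".marc";
    }

    /** Inserts the link into a MODS record as a <location><url> element. */
    static String enrich(String mods, String recordId) {
        String link = "<location><url access=\"raw object\">"
                + marcUrl(recordId) + "</url></location>";
        return mods.replace("</mods>", link + "</mods>");
    }
}
```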
We will implement a Java SDK to simplify the process of building an application that consumes LibraryCloud data. Users (e.g. developers) will only be required to configure the queue from which they want to read records and implement a single function to process the data.
Implementation detail: The SDK will be managed as a publicly available GitHub repo containing sample Java code, Maven configuration (pom.xml), Camel configuration files, and libraries for marshalling/unmarshalling LibraryCloud messages. It will also contain Vagrant scripts for spinning up EC2 servers on AWS to run the ingest process.
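Ignoring the Camel and queue-configuration wiring, the contract the SDK would expose could be as small as the sketch below: the user implements one callback and the SDK handles reading from the queue. The interface and method names are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

/**
 * Sketch of the consumer contract the SDK would expose: the user supplies
 * the queue to read from and a single callback per record. Names are
 * illustrative, and the in-memory queue stands in for SQS.
 */
public class ConsumerSketch {
    /** The single function a user of the SDK implements. */
    interface RecordProcessor {
        void process(String modsRecord);
    }

    /** Drains the configured queue, handing each record to the processor. */
    static void run(Queue<String> queue, RecordProcessor processor) {
        String record;
        while ((record = queue.poll()) != null) {
            processor.process(record);
        }
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(List.of("<mods>a</mods>", "<mods>b</mods>"));
        // The "single function" the user writes is just this lambda.
        run(queue, record -> System.out.println("got: " + record));
    }
}
```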
Ingest contains only IDs
Consumers of data from the LibraryCloud pipeline might prefer to receive just a list of IDs rather than the full MODS records, especially if, as in the LD4L case, they are primarily going to be working with the MARC data. Another option (in combination with what's described above) would be to provide access to an SQS queue that contains only record IDs; LD4L would then use those record IDs to look up the MARC data using the LibraryCloud API. (A further extension would be to allow consumers to request a record in its original format directly from the LibraryCloud API, rather than first requesting the MODS record and then separately looking up the original record.)
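Under this option, the consumer's work reduces to turning each queued ID into an Item API request. The sketch below illustrates that; the URL pattern and the format parameter are assumptions for illustration, not the documented API.

```java
/**
 * Sketch of the IDs-only option: the queue carries record IDs and the
 * consumer looks each one up via the Item API. The endpoint path and
 * the "format" query parameter are hypothetical.
 */
public class IdLookupSketch {
    // Hypothetical Item API endpoint; the real base URL may differ.
    static final String API_BASE = "https://api.lib.harvard.edu/v2/items/";

    /** Builds a request URL for a record in a requested original format. */
    static String lookupUrl(String recordId, String format) {
        return API_BASE + recordId + "?format=" + format;
    }
}
```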
Use S3 for data in the ingest pipeline
The messages in the LibraryCloud pipeline currently cannot be larger than 256 KB due to the SQS message size limit. Our current approach is to limit the number of MODS records in each message to keep the message size below that limit. Another option is to store the message payload in S3, which would eliminate any limit on message size. Any added complexity would be confined to the standard marshalling/unmarshalling routines. How this would affect throughput is unknown: there would be additional latency in retrieving and saving the payload data at each step, but it would likely be more efficient for enrichment steps to operate on larger chunks of data at a time.
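This is the claim-check pattern: oversized payloads go to a store, and the queue message carries only a pointer. The sketch below uses an in-memory map as a stand-in for S3; the pointer format and key scheme are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/**
 * Sketch of the "payload in S3" option (the claim-check pattern).
 * A Map stands in for S3; the 256 KB threshold matches the SQS limit.
 * The s3:// pointer format and key scheme are illustrative.
 */
public class ClaimCheckSketch {
    static final int SQS_LIMIT = 256 * 1024;
    static final Map<String, byte[]> s3Stub = new HashMap<>();

    /** Producer side: send the payload inline if it fits, else a pointer. */
    static String outgoing(String payload) {
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= SQS_LIMIT) {
            return payload;
        }
        String key = "payload/" + UUID.randomUUID();
        s3Stub.put(key, bytes);     // "upload" the payload to S3
        return "s3://" + key;       // the queue message is just a claim check
    }

    /** Consumer side: resolve a pointer back to the full payload. */
    static String incoming(String message) {
        if (!message.startsWith("s3://")) {
            return message;
        }
        return new String(s3Stub.get(message.substring(5)), StandardCharsets.UTF_8);
    }
}
```

The added latency mentioned above corresponds to the extra put/get on the store for each oversized message at each pipeline step.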
Week of May 25
Week of June 1
Week of June 8
Week of June 15
Week of June 22
Week of June 29
Week of July 6
Week of July 13
Week of July 20
Week of July 27
Week of August 3
LD4L QA and Testing