Timeframe: Jan 1, 2014 - Dec 31, 2015

Original Project

The LD4L project is a collaboration of the Cornell University Library, the Harvard Library Innovation Lab, and the Stanford University Libraries, funded by a nearly $1 million, two-year grant from the Andrew W. Mellon Foundation.

The goal of the project is to create a Scholarly Resource Semantic Information Store (SRSIS) model that works both within individual institutions and through a coordinated, extensible network of Linked Open Data. The model is meant to capture the intellectual value that librarians and other domain experts and scholars add to information resources when they describe, annotate, organize, select, and use those resources, together with the social value evident from patterns of usage.

The public project wiki can be found at https://wiki.duraspace.org/pages/viewpage.action?pageId=41354028

The internal project wiki is at https://wiki.duraspace.org/display/ld4lPLAN/LD4L+Planning+Home

The related LD4P project is documented on the LD4P wiki page

Harvard's LD4L 1 goals

  1. Periodically process the Harvard bibliographic record set through both the Library of Congress BIBFRAME converter and the BIBFRAME post-processor being created by Cornell, and deliver the processed RDF record sets to Cornell
  2. In collaboration with Cornell, develop a procedure to enrich BIBFRAME records with Stackscore usage information, process the Harvard record set, and deliver incremental Stackscore triples to Cornell (a minimal sketch follows this list)
  3. Develop a procedure to enrich BIBFRAME records with links to person identifiers in Harvard Faculty Finder, process the Harvard record set, and deliver incremental triples to Cornell
  4. Evaluate eagle-i as a library linked data hosting environment. This project has been abandoned due to inadequate technical documentation on eagle-i.
  5. Report on early entity resolution experiments performed by Paolo Ciccarese in the first year of the grant.
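
Goal 2's deliverable is a file of incremental Stackscore triples. Below is a minimal sketch of producing such a file with rdflib; the usage namespace, the stackScore property name, and the Aleph URI are all illustrative placeholders, not the ontology the project actually delivered.

```python
# Sketch only: emit incremental Stackscore triples for BIBFRAME resources.
# The namespace and property are hypothetical stand-ins for the project's
# usage ontology; the Aleph URI is likewise illustrative.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

USAGE = Namespace("http://example.harvard.edu/ontology/usage#")  # hypothetical

def stackscore_triples(scores):
    """Build a graph of Stackscore triples from {bib URI: score} pairs."""
    g = Graph()
    g.bind("usage", USAGE)
    for bib_uri, score in scores.items():
        g.add((URIRef(bib_uri), USAGE.stackScore,
               Literal(score, datatype=XSD.decimal)))
    return g

g = stackscore_triples({"http://id.lib.harvard.edu/aleph/000000001": 87.5})
print(g.serialize(format="nt"))  # deliver as an incremental N-Triples file
```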

Harvard roadmap and deliverables (Year 2)

Year 2 Resources: Jonathan Kennedy (developer, 350 hours), Paul Deschner (developer, contributed time), Robin Wendler (metadata, contributed time), Michelle Durocher (metadata, LD4P, contributed time), Michael Vandermillen (Library Cloud support, contributed time), Jeff Licht (Library Cloud support, 50 hours)

PLAN (updated 1/15/16)

| Planned | Actual | Who | Task | Status | Notes |
| --- | --- | --- | --- | --- | --- |
| 9/15 | 11/1/15 | JK | Available to work on LD4L | done | |
| 9/30 | | JK | Modify eagle-i to remove non-OpenRDF dependencies | Abandoned | In December, Jonathan determined that he would be unable to accomplish this for lack of technical documentation on eagle-i; only the original eagle-i developers could do this. |
| 9/30 | | MD/JK | Create eagle-i configuration mapping file for BIBFRAME (or a subset of BIBFRAME if the whole thing is too much) | Abandoned | Harvard eagle-i/triple store abandoned. |
| 8/15 | 8/15/15 | RS | LTS: spin up an AWS server to run eagle-i | done | |
| 10/15 | | JK | Set up eagle-i triple store on the AWS machine and deploy the BIBFRAME mapping file to eagle-i | Abandoned | Harvard eagle-i/triple store abandoned. |
| 10/30 | | all | Test | Abandoned | Would have been: eagle-i instance installed and loaded with a TBD set of Harvard triples for evaluation. |
| 10/30 | 1/30/16 | JK | Develop component for enrichment with entity resolution for OCLC works and Faculty Finder | | |
| not planned | 12/7/15 | PD | Re-run Harvard MARC through the Cornell-specified version of the LC converter | done | |
| 10/30 | 12/21/15 | Cornell | LC BIBFRAME --> LD4L BIBFRAME converter available | done | |
| 11/15 | | JK/JL | Integrate with the Library Cloud SDK to auto-update our RDF with the Cornell converter | will not do | Will be part of the LD4L 2 grant. |
| 11/15 | 1/7/16 | PD | Reprocess our BIBFRAME and deliver the updated set to Cornell | done | Zip file of un-enriched LD4L BIBFRAME RDF. |
| 11/15 | 1/30/16 | JK | Enrich our RDF with entity resolution for Faculty Finder | in progress | |
| | | JK | Work on scaling the triple store to load all of our records, with Cornell | Abandoned | Harvard eagle-i/triple store abandoned. |
| | 12/15/15 | PD | Develop usage ontology | done | Usage ontology. |
| 12/1/15 | 1/15/16 | JK/PD | Enrich our RDF with Stackscore using the ontology | in progress | |
| | | MD/JK | Update our eagle-i mapping file to include new ontologies used in the Cornell converter | Abandoned | Harvard eagle-i/triple store abandoned. |
| 12/15/15 | 1/10/16 | Cornell | Final Cornell converter available | | |
| 12/15/15 | | PD | Reprocess our RDF with the new Cornell converter and our Faculty Finder and Stackscore enrichments | Cornell will do | |
| 12/15/15 | 1/30/16 | JK/PD | Ship new RDF (with Faculty Finder and Stackscore) to Cornell | | Zip file of final RDF, enriched with OCLC works, Faculty Finder links, and Stackscore. |
| 1/31/16 | | MV/JK | Expose our RDF at the record level from id.lib.harvard.edu/aleph/xxx/rdf | Abandoned | Would have been: Harvard-minted URIs exposed as linked data (open to project team only). Harvard eagle-i/triple store abandoned. |

 

MUST DO

  1. (Systems) Convert the Harvard open access MARC data set to BIBFRAME RDF
    1. (JK) Set up server for repo, AWS, eagle-i triple store
    2. (MV, JL) Implement update to Library Cloud to publish MARC in addition to MODS
    3. (JK, JL) Create scalable BIBFRAME converter as new publish target for Library Cloud
    4. (JK) Integrate "LD4L BIBFRAME converter" from Cornell/Stanford into Library Cloud pipeline
    5. (JK) Cron the BF pipeline with Library Cloud
    6. (JK) Develop reports for conversion failures (see the sketch after this list)
  2. (Ontology) (PD) Write up a summary of conversion issues
  3. (Ontology) (PD) Define an ontology for usage data, using OWL
  4. (Systems) (JK, PD) Add Stackscore to our BIBFRAME records using the new ontology
  5. (Systems) (JK) Install a triple store and load our BIBFRAME into the triple store
  6. (Systems) (JK) Expose a web service interface on the triple store serving our BIBFRAME URIs as well as Catalyst/VIVO ontology URIs
  7. (Ontology) (MD) Demonstration project with Harvard Film Archive
    1. Create RDF from FileMaker metadata and links to other sources on the web (à la Linked Jazz) using existing ontologies as much as possible
  8. Strawman Schedule:
    1. 5/15/15 - EAGLE-I instance set up at Harvard, scalability testing performed
    2. 7/1/15 – Cornell provide converter from MARC to “LD4L BIBFRAME ontology” – however that differs from the LC version – perhaps new java code developed by Cornell?
    3. 7/1/15 - Library Cloud feeding MARC->RDF records into eagle-i (if scalability of triple store OK) – This makes the Harvard feed an ongoing pipeline
    4. 8/1/15 – Harvard MARC triples enriched with Stackscore
    5. 8/15/15 – Harvard delivers first dump of RDF MARC triples to Cornell
    6. 8/15/15 – Harvard MARC triples enriched with links to faculty finder
    7. 9/15/15 -  Library Cloud feeding VIA/MODS records into VITRO, second dump to Cornell
    8. 10/15/15 – Some (TBD) Harvard MARC fields and/or VIA fields enriched with links to other sources (OCLC works records, VIAF, DBPedia, ??)
    9. 11/1/15 – Third dump to Cornell for testing
    10. 11/1/15 – Harvard SPARQL endpoint functional
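
Item 1.6, the conversion-failure reports, could start as small as the sketch below. It assumes a hypothetical log format in which each failed record produces a line like `ERROR <record-id> <message>`; the real converter's logging will differ.

```python
# Sketch only: summarize BIBFRAME conversion failures from a pipeline log.
# Assumes a hypothetical format in which each failed record produces a line
# "ERROR <record-id> <message>"; adapt the regex to the real converter's log.
import re
import sys
from collections import Counter

ERROR_LINE = re.compile(r"^ERROR\s+(\S+)\s+(.*)$")

def failure_report(log_path):
    by_message = Counter()
    failed_records = set()
    with open(log_path) as log:
        for line in log:
            match = ERROR_LINE.match(line)
            if match:
                failed_records.add(match.group(1))
                by_message[match.group(2)] += 1
    print(f"{len(failed_records)} records failed conversion")
    for message, count in by_message.most_common(10):
        print(f"{count:>7}  {message}")

if __name__ == "__main__":
    failure_report(sys.argv[1])
```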

VERY NICE TO DO

  1. (JK) Enrich our BIBFRAME data with URI links from Harvard faculty names to Catalyst faculty profiles
    1. Does this need minted intermediate URIs?
  2. (JK) Connect records to OCLC works
  3. Enrich our BIBFRAME data with URI links to VIAF for other persons
    1. Does this need minted intermediate URIs?
  4. (PD) If usage data is available from Cornell and Stanford, extract it from the Cornell and Stanford RDF. Use case 5.1: deploy it into StackLife at Harvard; use case 5.2: deploy it into Haystacks. (See the sketch after this list.)
  5. (MD,RW) - Define with Cornell and Stanford how named entities should be represented in LD4L RDF
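
For item 4, extracting usage data from a partner RDF dump is a single pass over the graph. A sketch follows, with a placeholder property URI (the real one depends on the shared usage ontology) and an assumed dump filename.

```python
# Sketch only: extract Stackscore values from a partner RDF dump for reuse
# in StackLife/Haystacks. The property URI and dump filename are placeholders
# pending the shared usage ontology.
from rdflib import Graph, Namespace

USAGE = Namespace("http://example.org/ld4l/usage#")  # hypothetical

g = Graph()
g.parse("cornell-dump.nt", format="nt")

for resource, score in g.subject_objects(USAGE.stackScore):
    print(resource, float(score))  # feed these into StackLife's score table
```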

NICE TO DO (could be Phase 2)

  1. If the Library of Congress releases a MODS->BIBFRAME XSLT, use Library Cloud as the source for our BIBFRAME (thus providing a pipeline of up-to-date data, and also adding VIA and OASIS records) - or just continue Paolo's VIA2RDF work with VIA records directly.
  2. Enrich our BIBFRAME data with URI links to TGN or DBPedia for geographic names (potentially using Dave Siegal’s MARC geotagger)
  3. Enrich our BIBFRAME data with URI links to LCSH (see the sketch after this list)
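
For item 3, one plausible route is the id.loc.gov label-lookup service, which answers a request for a known heading with a redirect to the matching authority URI. A sketch under that assumption; confirm the URL pattern against LC's documentation.

```python
# Sketch only: resolve an LCSH heading string to its id.loc.gov URI via the
# assumed label-lookup service (a request for a known label redirects to the
# matching authority). Verify the URL pattern before relying on it.
import urllib.error
import urllib.parse
import urllib.request

def lcsh_uri(label):
    url = ("https://id.loc.gov/authorities/subjects/label/"
           + urllib.parse.quote(label))
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request) as response:
            return response.url  # final URL after following the redirect
    except urllib.error.HTTPError:
        return None  # no authority matches this heading

print(lcsh_uri("Semantic Web"))
```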

Harvard sub-projects

Ontology creation

Harvard MARC->BIBFRAME conversion (Paul Deschner, Michelle Durocher): Apply the Library of Congress converter to create RDF triples from Harvard MARC records. Assess the results and prepare an RDF set for loading into a Harvard instance of the Scholarly Resource Semantic Information Store (SRSIS).

-convert all Harvard open access MARC to BF RDF

-deliver file of RDF to Jonathan for loading into a triple store

-enrich BF RDF with links for people, places, subjects

-provide feedback on adequacy of LC converter

-metadata experts evaluate the conversion

-group common problems

-recommendations to github bibframe group

-possibly add local extensions to converter or modify incoming MARC

-load into Harvard triple store

-propose Harvard metrics for "success" (see the sketch after this list)

-no current plan to enrich this RDF with links, but we could pull linkage from OCLC records (fields TBD; VIAF or ISNI IDs are present for some fields)
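
One concrete form the "metrics for success" could take: load a slice of converter output and count instances per BIBFRAME class, so the metadata experts can spot classes that are missing or wildly over-produced. A sketch, with the sample filename assumed:

```python
# Sketch only: count instances per class in a sample of converter output so
# reviewers can spot missing or over-produced BIBFRAME classes. The sample
# filename is assumed.
from collections import Counter
from rdflib import Graph
from rdflib.namespace import RDF

g = Graph()
g.parse("harvard-bibframe-sample.nt", format="nt")

classes = Counter(str(o) for o in g.objects(None, RDF.type))
for cls, count in classes.most_common():
    print(f"{count:>8}  {cls}")
```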

 

82% of 13.6 million bib records (roughly 11.2 million) can be matched to OCLC Work identifiers. Paul measured an 82% match rate between OCLC master records (and by extension Work IDs) and Aleph IDs. Each LD4L partner was asked to do this analysis. Paul wrote up the results here: https://docs.google.com/document/d/1k0Z3mnY3ex83CmXbiJM0w9PeO7-MzQCzmwhhE06I9JU/edit

If we ever want to consider MODS from Library Cloud as a source, there is an ontology: http://www.loc.gov/standards/mods/modsrdf/v1/modsrdf.owl

Evaluation tool for our BIBFRAME conversion work: https://marc2bf-eval.herokuapp.com/

Read-only version: http://marc2bf-eval-ro.herokuapp.com/

 

Ontology and enrichment process for usage data (Paul Deschner): Create an ontology for the type of usage data exposed in Library Cloud. Enrich the Harvard RDF set with this data.

-OWL ontology for rich usage data

-Add stackscore to our BF RDF records

 

-Present at Feb 2015 meeting, decide on next steps

-Add Stackscore to the BIBFRAME records created by running the LC converter.

Usage ontology roughly defined, but the team decided that Stackscore is the only viable element due to privacy and confidentiality concerns. A sketch of such an ontology follows.
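
The sketch below shows the shape of an OWL usage ontology reduced to the one element that survived the privacy review; the namespace and names are illustrative, not the delivered ontology.

```python
# Sketch only: an OWL usage ontology reduced to a single Stackscore datatype
# property. Names and namespace are illustrative, not the delivered ontology.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS, XSD

USAGE = Namespace("http://example.harvard.edu/ontology/usage#")  # hypothetical

g = Graph()
g.bind("usage", USAGE)
g.add((USAGE.stackScore, RDF.type, OWL.DatatypeProperty))
g.add((USAGE.stackScore, RDFS.label, Literal("StackScore", lang="en")))
g.add((USAGE.stackScore, RDFS.comment,
       Literal("Aggregate, anonymized usage score for a resource.", lang="en")))
g.add((USAGE.stackScore, RDFS.range, XSD.decimal))
print(g.serialize(format="turtle"))
```
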
Ontology and enrichment process for VIA data (Paolo Ciccarese): Build a converter that creates RDF triples from Harvard VIA records. Assess the results and prepare an RDF set for loading into the Scholarly Resource Semantic Information Store (SRSIS).

-convert VIA data to BF RDF and deliver files to Cornell

-enrich BF RDF with links for people, places, subjects

-pipeline for enrichment

-matching of placename data from Getty

-Evaluate Paolo's code, continue creating triples, and load them into the Harvard triple store

 

The code is here:
https://github.com/paolociccarese/via2rdf-java/

The instructions are here:
https://github.com/paolociccarese/via2rdf-java/wiki/Set-up

This project relies on two libraries I've created:
https://github.com/json-dp/json-dp-java (JSON format with provenance)
https://github.com/dpf-java (pipeline system)

The two libraries are more stable than the pipeline itself.
I am currently working on the pipeline (via2rdf-java) to have something to show for the workshop.

Use case 5.1: End user discovers popular resources. Demo StackLife and Haystacks at the Feb 2015 meeting and get feedback.

Stackscore is used in StackLife and Library Cloud.

Not yet enriched into our BIBFRAME data.

Use case 5.2: Collection manager wants to understand usage of a collection

Systems development

Set up a linked data system environment at Harvard (Jonathan Kennedy, Michael Vandermillen): This will likely be either a clone of Cornell's VITRO or of the Catalyst eagle-i system. Both have triple stores, but we need one scalable enough to handle all of Harvard's BIBFRAME records – could be one billion triples. Jonathan indicated he could do this too. LTS may want to provide an Amazon instance to run it.

-Create a linked data endpoint to resolve URIs and deliver RDF (see the sketch at the end of this sub-project)

-make it accessible to Cornell and Stanford

-set up triple store at Harvard with SPARQL capability

-scalability assessment

Set up eagle-i on an HMS server

-add visual interface to use bib record terms

-set up BIBFRAME pipeline to feed the repository

-add SPARQL endpoint

-make some use of reasoning

 

Meeting with eagle-i (Catalyst Informatics team) to plan server setup.
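
A minimal sketch of the planned URI-resolution endpoint (see the dash list above), serving record-level RDF in the id.lib.harvard.edu/aleph/xxx/rdf pattern from the roadmap. It answers from an in-memory rdflib graph rather than a production triple store, and the sample filename is assumed.

```python
# Sketch only: resolve record URIs and deliver RDF with basic content
# negotiation. The route mirrors the planned id.lib.harvard.edu/aleph/xxx/rdf
# pattern; the local sample file stands in for a real triple store.
from flask import Flask, Response, request
from rdflib import Graph, URIRef

app = Flask(__name__)
store = Graph()
store.parse("harvard-bibframe-sample.nt", format="nt")  # assumed local dump

@app.route("/aleph/<record_id>/rdf")
def resolve(record_id):
    uri = URIRef(f"http://id.lib.harvard.edu/aleph/{record_id}")
    # Collect every triple with the record URI as subject
    g = Graph()
    for triple in store.triples((uri, None, None)):
        g.add(triple)
    if len(g) == 0:
        return Response("not found", status=404)
    if "text/turtle" in request.headers.get("Accept", ""):
        return Response(g.serialize(format="turtle"), mimetype="text/turtle")
    return Response(g.serialize(format="xml"), mimetype="application/rdf+xml")

if __name__ == "__main__":
    app.run()
```
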
Load Harvard BIBFRAME, VIA, usage, and faculty data into the system, possibly supporting SPARQL queries (Paul Deschner, Jonathan Kennedy): Expose Harvard RDF through an HTTP SPARQL endpoint. Include faculty data from Harvard Catalyst, which uses the VIVO ontology.

-make Harvard RDF available through HTTP

-include VIVO triples from Harvard faculty finder in our data set

-use a triple store

-support SPARQL (see the query sketch below)
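
A sketch of how Cornell or Stanford might query the planned endpoint. The endpoint URL is a placeholder (no such service was deployed under this plan), and the query assumes the BIBFRAME 1.0 vocabulary in use at the time.

```python
# Hypothetical client query against the planned Harvard SPARQL endpoint.
# Endpoint URL is a placeholder; vocabulary assumed to be BIBFRAME 1.0.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://ld4l.lib.harvard.edu/sparql")  # placeholder
sparql.setQuery("""
    PREFIX bf: <http://bibframe.org/vocab/>
    SELECT ?work ?title WHERE {
        ?work a bf:Work ;
              bf:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["work"]["value"], row["title"]["value"])
```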

  
Enrich Harvard BIBFRAME data with more links (Jonathan Kennedy): Strings to things. For some set of authorities, update strings in the basic BIBFRAME to URIs. Start with people, places, and subjects.

-New code to look up VIAF URIs based on the LC number in our MARC, and add the URIs to our BIBFRAME data (see the sketch after this list)

-New code to look up Faculty Finder/VIVO URIs based on the VIAF ID linked above, and add the URIs to our BIBFRAME data

-New code to look up placename URIs from DBpedia based on geographic places in our MARC, and add the URIs to our BIBFRAME data

-New code to look up LCSH URIs based on the LC number in our MARC, and add the URIs to our BIBFRAME data
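
A sketch of the first bullet, mapping an LC authority number from our MARC to a VIAF cluster URI via VIAF's sourceID redirect; the "LC|<id>" URL pattern is assumed from VIAF documentation of the era, so verify it before relying on it.

```python
# Sketch only: map an LC name authority ID from MARC to a VIAF cluster URI
# using VIAF's sourceID redirect. The "LC|<id>" pattern is an assumption;
# verify against current VIAF documentation.
import urllib.error
import urllib.parse
import urllib.request

def viaf_uri(lc_number):
    source_id = urllib.parse.quote(f"LC|{lc_number}")
    request = urllib.request.Request(
        f"https://viaf.org/viaf/sourceID/{source_id}", method="HEAD")
    try:
        with urllib.request.urlopen(request) as response:
            return response.url  # redirect target is the VIAF cluster URI
    except urllib.error.HTTPError:
        return None  # no VIAF cluster for this LC ID

print(viaf_uri("n79021164"))  # example LC name authority ID
```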

  

Collaboration

  1. Participate in phone calls
    1. Weekly ontology call – Paul, Jonathan, Robin, Michelle
    2. Bi-weekly systems call – Jonathan, Randy Stern
    3. Monthly all-hands call – All of the above, plus Randy
    4. Annual meeting - plan for attending the LD4L meeting at Stanford 2/23-2/26
      1. Robin, Randy, Jonathan, Paolo, and Paul plan to attend

 

Timeline (from Dec. 2014 CNI presentation)

Project timeline, Jan-June 2015 (image)

Project timeline, July-Dec 2015 (image)