Class Guide: E08 – Publishing from Collections Using Linked Open Data Sources and Computational Publishing Pipelines
FSCI 2023
by Simon Worthington; Simon Bowie
Version v1.1
Published by: NextGenBooks - TIB
Last updated: 2023-07-30
Created: 2023-07-29
Language: English (United States)
Created by: Simon Worthington
Publishing from Collections
Publication prototypes
This is a hands-on class for participants with no prior experience of computational publishing using Jupyter Notebooks and linked open data (Wikidata and Wikibase). The class has three demonstration use cases for the auto-creation of catalog publications for exhibitions, publication listings, and a reader – made from multiple linked open data (LOD) sources and published as multi-format: web, PDF, ebook, etc.
Participants will be instructed in the use of the software pipeline and practice on three use cases:
A painting exhibition catalog
A publishers book catalog
City Climate Change Plan Reader
FSCI 2023 class: E08 – Publishing from Collections Using Linked Open Data Sources and Computational Publishing Pipelines - https://osf.io/t4j5a/
Coordinated by Simon Worthington - NFDI4Culture @Open Science Lab, TIB, Hannover
Simon Bowie, Centre for Postdigital Cultures | Institute for Creative Cultures | Coventry University. https://orcid.org/0000-0002-2437-589X (COPIM project)
Janneke Adema, Centre for Postdigital Cultures | Institute for Creative Cultures | Coventry University. https://orcid.org/0000-0001-7681-8448 (COPIM project)
A painting exhibition catalog: This demonstrates how Wikidata/Wikibase can be used to source content.
A publishers book catalog: Here the book catalog and metadata service Thoth is queried once a day to automatically update the publisher's catalog with new titles.
City Climate Change Plan Reader (experimental prototype): A research literature collator to create readers. The example shows how climate change literature can be searched by authors to create referenced readers to support work on City Climate Change Plan creation. Made with the FSCI Hackathon organizers #semanticClimate
The example workflows have been put together by researchers from the two research consortia NFDI4Culture – German National Research Data Infrastructure, and COPIM (Community-led Open Publication Infrastructures for Monographs) in consultation with the publisher Open Book Publishers, Cambridge (UK).
Class activities and the use cases
Course learning objectives
At the end of the course, participants will be able to:
Objective 1: Learn how to install and operate the pipeline workflow software tools used for retrieving and publishing LOD as multi-format. These include: GitHub, Wikidata, SPARQL Query, Jupyter Notebooks, Quarto multi-format renderer, and GitHub Pages.
Objective 2: Configure two of the ‘use case’ example publication GitHub projects using the pipeline workflow: paintings, and books. The use cases are demonstrations of how LOD can be retrieved from APIs - Wikidata and Thoth.
Objective 3: Gain an introduction to using Wikidata and Wikibase for storing and retrieving LOD, as well as a view of how different fields' data models are developed by the communities involved.
Course topics
This course will be presented over three days for 1½ hours each day and will cover these topics:
Topic 1: Day 1 - Pipeline install and test with Benchmark project. Introduction to running use case #1: ‘A painting exhibition catalog’ with the pipeline.
Topic 2: Day 2 - Introduction to running use case #2: ‘A publishers book catalog’ with the pipeline.
Topic 3: Day 3 - Introduction to use case #3: ‘City Climate Change Plan Reader (experimental prototype)’. Note: This use case is a work in progress with the FSCI Hackathon and is only for demonstration purposes – some parts do not yet work, and the session is instead for early-stage problem solving.
Course schedule and activities: Day 1-3
Note: Activities can be completed in advance of sessions or after sessions by participants.
Presentation details Day 1: Simon Worthington and Simon Bowie. Instruction and guidance on installing and configuring the pipeline software. Introduction to using Wikidata as a source for a publication's content and how to retrieve the content via an API with a Jupyter notebook. Instructions for the exercise of configuring use case #1 ‘A painting exhibition catalog’ - participants can use the GitHub repo in their own time and carry out the steps in the Class Guide provided.
Activities for participants:
The result of these activities will be learning how to create a multi-format output using Wikidata, Jupyter Notebooks, and Quarto. You will transform the ADA Painting Notebook to look like this example fork.
Installing and configuring the pipeline software using this Benchmark Notebook.
Running the use case #1 ‘A painting exhibition catalog’ in Quarto: A chance to learn about the basic workflow steps of multi-format publishing with the pipeline.
Introduction to Wikidata and SPARQL queries
Configuring Quarto: Style, metadata, output formats, adding markdown pages and new notebooks, etc. (a sample configuration sketch follows this list).
Rendering with Quarto as multi-format: PDF, eBook, DOCX, and HTML.
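For orientation, here is a minimal sketch of what the format section of a _quarto.yml might look like for this kind of multi-format publication. The title and chapter names are placeholders for illustration, not the use case's actual configuration:

    # illustrative fragment of a _quarto.yml configuration
    project:
      type: book
    book:
      title: "A painting exhibition catalog"  # placeholder title
      chapters:
        - index.qmd
        - paintings.ipynb                     # a Jupyter Notebook rendered as a chapter
    format:
      html: default    # website
      pdf: default     # print-ready PDF
      epub: default    # ebook
      docx: default    # word-processor format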
Presentation details Day 2: Simon Worthington, Simon Bowie, and Janneke Adema. Instructions for the exercise of configuring use case #2 ‘A publishers book catalog’. A brief introduction to Thoth, the book bibliographic and metadata service. An introduction to the experimental publishing work of COPIM and its successor project Open Book Futures.
Simon Worthington, Simon Bowie, and Janneke Adema will share insights and findings from the COPIM research on computational publishing - especially the collaboration with Open Book Publishers and how to integrate computational publishing workflows with conventional publishers' workflows.
Configure the Jupyter Notebook to retrieve different publishers' content from the Thoth API. The Notebook being used in the use case can optionally retrieve different publishers' book information.
Learn about the pipeline and its use with Thoth. A main addition here is that the Quarto repo has been set to automatically update once a day; see the scheduling sketch below. You can see the version created by Simon Bowie here: https://simonxix.github.io/scholarled_catalogue/
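One common way to schedule such a daily rebuild is a GitHub Actions workflow with a cron trigger. The sketch below is illustrative only - the workflow file name, schedule, and steps are assumptions, not the actual configuration of Simon Bowie's repository:

    # .github/workflows/daily-render.yml (hypothetical file name)
    name: daily-render
    on:
      schedule:
        - cron: '0 6 * * *'    # run once a day at 06:00 UTC
      workflow_dispatch:       # also allow manual runs
    jobs:
      render:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - uses: quarto-dev/quarto-actions/setup@v2
          - run: quarto render  # re-query Thoth and rebuild the multi-format output
          # a further step would commit/publish the rendered site to GitHub Pages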
Presentation details Day 3: Simon Worthington, Simon Bowie, and the Hackathon Team semanticClimate. Review of work on use cases #1 and #2. Introduction to use case #3: ‘City Climate Change Plan Reader (experimental prototype)’ with team semanticClimate, who will introduce their PyAmi pipeline.
This use case uses Wikibase as its LOD source. Wikibase is a self-hosted open-source version of Wikidata. A mockup of use case #3 will be made with semanticClimate; we will review the mockup and look at the challenges for a later follow-up prototype phase. The goals of this prototyping round are:
Create a demonstration mockup
Create a workflow diagram reviewed by peers
Create a roadmap and budget to create an MVP prototype
Activities for participants:
Review the mockup and workflow diagrams of use case #3: ‘City Climate Change Plan Reader'.
Take part in a discussion with contributors and team members from the FSCI Hackathon semanticClimate.
GitHub onboarding
Learning objective: Onboarding and familiarisation with using GitHub for publishing a website with GitHub Pages.
Turn on GitHub Pages - voilà you have a website :-)
First, go to the Settings tab in your repository; second, in the left menu go down to Pages; third, select the main branch and the /docs folder, and save. In a few minutes this will turn on your GitHub Pages website. Congrats!
The last setup step is to add the GitHub Pages URL to the front end information panel of your repository.
Navigate back to the Code view of your repo. At the top right you can add your GitHub Pages URL to the About information of your repository. Open the About area by clicking on the cog icon. Then, in the dialog window, tick the option to use your GitHub Pages address, and save.
System Installation
Ensure Git is installed and a GitHub account has been created before completing the install steps.
See section: 'GitHub onboarding'.
Use the Benchmark repository to test that your installation is functioning properly.
To clone this repository from GitHub, ensure that Git is installed on your local machine either as a command line interface (https://git-scm.com/) or through GitHub Desktop (https://desktop.github.com/).
Use either the CLI or GitHub Desktop to clone the repository into your preferred installation directory.
If using CLI, navigate in the terminal to your preferred installation directory and run:
git clone https://github.com/repo-address
Install prerequisites (without Docker)
To install all prerequisites for running this repository on your local machine, please follow the instructions below.
On Windows, pip or python may not run out of the box. Check the solution here to add the Python path for the Terminal (April '23).
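As a rough guide, the non-Docker setup amounts to installing Python, the notebook tooling, and Quarto. A minimal sketch, assuming Python 3 and pip are already installed (the repository's own instructions take precedence over this):

    # install the Python modules the pipeline uses (the same ones the Docker setup adds)
    pip install jupyter matplotlib pandas
    # Quarto is a separate install - download it from https://quarto.org
    # then check that it is working:
    quarto check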
Docker Installation
It's possible (though not required) to use Docker to run the environment for Jupyter Notebook running and Quarto rendering.
This process works in Linux but does not work in macOS due to a known issue: Quarto does not run properly in the Docker container on macOS because of the amd64 emulation in Docker Desktop for arm64 macOS. See the discussion at quarto-dev/quarto-cli#3308. This shouldn't occur in any other environment running Docker.
Once installed, navigate in the terminal to the directory for the cloned Git repository.
Run docker-compose up -d --build to start the containers.
The jupyterlab container runs a stand-alone version of JupyterLab on http://localhost:8888. This can be used to edit any Jupyter Notebook files in the repository. The JupyterLab instance runs with the password 'jupyterlab'.
The nginx container runs Nginx webserver and displays the static site that Quarto renders. This runs at http://localhost:1337.
The quarto container starts an Ubuntu 22.04 container, installs various things like Python, downloads and installs Quarto, and then adds Python modules like jupyter, matplotlib, and pandas. It then runs in the background so Quarto can be called on to render the qmd and ipynb files into the site/book like so:
docker exec -it quarto quarto render
When you're finished using the code, run docker-compose down to stop the containers.
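For orientation, here is a minimal sketch of how a docker-compose.yml with the three containers described above might be shaped. The image names, volume paths, and settings are assumptions for illustration - consult the repository's actual file:

    # hypothetical sketch of a three-container docker-compose.yml
    services:
      jupyterlab:
        image: jupyter/base-notebook        # assumed image
        ports:
          - "8888:8888"                     # JupyterLab at http://localhost:8888
        volumes:
          - ./:/home/jovyan/work            # mount the repository into JupyterLab
      nginx:
        image: nginx:latest
        ports:
          - "1337:80"                       # rendered site at http://localhost:1337
        volumes:
          - ./_site:/usr/share/nginx/html   # assumes Quarto renders to ./_site
      quarto:
        build: ./quarto                     # assumed build context: Ubuntu 22.04 + Python + Quarto
        container_name: quarto              # so `docker exec -it quarto quarto render` works
        tty: true                           # keep the container running in the background
        volumes:
          - ./:/workspace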
In VS Code Extensions (Menu: View > Extensions) install Python, Jupyter, and Quarto. Search for each term, click the install button, and follow the instructions.
Change the filter for subject depicted in the paintings
Change the number of items retrieved
Step-by-step instructions
Note: You will use the left-hand 'Query helper' GUI, where you can type in names of items. Sometimes you need to enter a term twice to get the correct item to appear.
Filter: collection (P195), Bavarian State Painting Collection (Q812285) - wdt:P195 wd:Q812285. Change to a painting collection of your choice - see the list of collections on Wikipedia.
Filter: depicts (P180), river (Q4022) - change to your preference
Limit: bottom left, set the limit on the number of items returned.
Play button - bottom left - runs the query and renders the results below
Image grid view :-)
Save your query: Options. Bottom right you will see a Link icon; click this and you can access a Short Link creator (but this is often not working). Alternatively, you can copy the URL of your query from the browser address bar. Save the link in your browser bookmark bar or in a text editor. Put together, the query these steps build is sketched below.
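The query that the steps above build looks roughly like this sketch (the query helper's generated SPARQL may differ in detail; the image property P18 is added here to supply pictures for the grid view):

    # paintings from the Bavarian State Painting Collection depicting a river
    SELECT ?item ?itemLabel ?image WHERE {
      ?item wdt:P195 wd:Q812285 .          # collection: Bavarian State Painting Collection
      ?item wdt:P180 wd:Q4022 .            # depicts: river
      OPTIONAL { ?item wdt:P18 ?image . }  # image, if one is available
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10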
Transferring a Wikidata SPARQL query to a Jupyter Notebook
We are using a pre-configured Jupyter Notebook that combines the SPARQL query and custom Python code. The Python code adds formatting to the SPARQL output to enable the Markdown rendering.
Wikidata's query builder cannot create all the Python needed for the Jupyter Notebook, which means a Python coder is needed to create the code.
In this example we are only editing three parameters: the collection, the depicted subject, and the limit.
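As a sketch of what such a notebook cell does, here is a minimal Python example using the SPARQLWrapper library, with the three parameters pulled out as variables. The variable names and Markdown formatting are illustrative, not the notebook's actual code:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # the three parameters edited in this example
    collection = "Q812285"  # painting collection (P195)
    depicts = "Q4022"       # depicted subject (P180)
    limit = 10              # number of items returned

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="fsci-e08-example/0.1")  # Wikidata asks for a user agent
    sparql.setQuery(f"""
    SELECT ?item ?itemLabel ?image WHERE {{
      ?item wdt:P195 wd:{collection} .
      ?item wdt:P180 wd:{depicts} .
      OPTIONAL {{ ?item wdt:P18 ?image . }}
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # emit a Markdown image line per painting, ready for Quarto to render
    for row in results["results"]["bindings"]:
        label = row["itemLabel"]["value"]
        image = row.get("image", {}).get("value")
        print(f"![{label}]({image})" if image else label)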
Load the whole repository folder into your editor.
The key file to edit in Quarto is _quarto.yml. This file contains the main configurations for your publication.
If you are using VS Code you can run Notebooks as well as use the Terminal to run Quarto commands and commit to GitHub.
Change repo address and make repo link visible in publication
Edit _quarto.yml
If you are working on a fork, the first thing you need to do is edit the repository address on line 19 - this will point the GitHub icon in your publication to your own GitHub repo.
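For reference, the setting in question looks something like this, assuming the project is configured as a Quarto book (the exact line number may differ in your fork):

    book:
      repo-url: https://github.com/your-username/your-fork  # replace with your own repo address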
If you are using GitHub Desktop, then just do the following steps:
Close GitHub Desktop and all other applications with files open in your current directory path.
Move the whole directory as mentioned above to the new directory location. (NB: The directory has to be completely moved.)
Open GitHub Desktop and click on the blue (!) "repository not found" icon. Then a dialog will open and you will see a "Locate..." button which will open a popup allowing you to direct its path to a new location.
Software (open-source)
Over 2023/24 the computational components will be added to the ADA Semantic Publishing Pipeline, the Vivliostyle Create Book markdown renderer will be introduced, and the platform will move from Quarto to the Jupyter Book computational book platform – https://github.com/NFDI4Culture/ada