Class Guide: E08 – Publishing from Collections Using Linked Open Data Sources and Computational Publishing Pipelines

FSCI 2023

by Simon Worthington; Simon Bowie

Version v1.1

Published by: NextGenBooks - TIB

Last updated: 2023-07-30

Created: 2023-07-29

Language: English (United States)

Created by: Simon Worthington

Publishing from Collections

Publication prototypes

This is a hands-on class for participants with no prior experience of computational publishing using Jupyter Notebooks and linked open data (Wikidata and Wikibase). The class has three demonstration use cases for the auto-creation of catalog publications for exhibitions, publication listings, and a reader – made from multiple linked open data (LOD) sources and published as multi-format output: web, PDF, ebook, etc.

Participants will be instructed in the use of the software pipeline and will practice on three use cases:

FSCI 2023 class: E08 – Publishing from Collections Using Linked Open Data Sources and Computational Publishing Pipelines - https://osf.io/t4j5a/

Coordinated by Simon Worthington - NFDI4Culture @Open Science Lab, TIB, Hannover

FSCI 2023 instructors

August 2023

Figure: Workflow for use case #1, Paintings - see full view here.

Important links for the class

Other Helpful Information


Document DOI: To be confirmed | Author: Simon Worthington https://orcid.org/0000-0002-8579-9717 | CC BY-SA 4.0 International.

All software used is open source and OSI-licence compliant. All content and other resources are open access with open licences.


To edit this document - request access by emailing simon.worthington@tib.eu.

Publication use cases

The example workflows have been put together by researchers from two research consortia, NFDI4Culture (German National Research Data Infrastructure) and COPIM (Community-led Open Publication Infrastructures for Monographs), in consultation with the publisher Open Book Publishers, Cambridge (UK).

Class activities and the use cases

Course learning objectives

At the end of the course, participants will be able to:

Course topics

This course will be presented over three days for 1½ hours each day and will cover these topics:

Course schedule and activities: Days 1-3

Note: Activities can be completed in advance of sessions or after sessions by participants.

Presentation details Day 1: Simon Worthington and Simon Bowie. Instruction and guidance on installing and configuring the pipeline software. Introduction to using Wikidata as a source for a publication's content and how to retrieve the content via an API with a Jupyter notebook. Instructions for the exercise of configuring use case #1, ‘A painting exhibition catalog’ - participants can use the GitHub repo in their own time and carry out the steps in the Class Guide provided.

Activities for participants:

The result of these activities will be learning how to create a multi-format output using Wikidata, Jupyter Notebooks, and Quarto. You will transform the ADA Painting Notebook to look like this example fork.

Presentation details Day 2: Simon Worthington, Simon Bowie, and Janneke Adema. Instructions for the exercise of configuring use case #2, ‘A publisher's book catalog’. A brief introduction to Thoth, the bibliographic and metadata service for books. An introduction to the experimental publishing work of COPIM and its successor project, Open Book Futures.

Simon Worthington, Simon Bowie, and Janneke Adema will share insights and findings from the COPIM research on computational publishing - especially the collaboration with Open Book Publishers and how to integrate computational publishing workflows with conventional publishers' workflows.

Activities for participants:

Presentation details Day 3: Simon Worthington, Simon Bowie, and the semanticClimate hackathon team. Review of work on use cases #1 and #2. Introduction to use case #3, ‘City Climate Change Plan Reader (experimental prototype)’, with team semanticClimate, where their PyAmi pipeline can be introduced.

This use case uses Wikibase as its LOD source. Wikibase is a self-hosted, open-source version of Wikidata. A mockup of use case #3 will be made with semanticClimate; we will review the mockup and look at the challenges for a follow-up prototype in a later phase. The goals of this prototyping round are:

Activities for participants:

GitHub onboarding

Learning objective: Onboarding and familiarisation with using GitHub for publishing a website with GitHub Pages.

Use the Benchmark repository as a sample repo to fork, clone, and turn on GitHub Pages: https://github.com/NFDI4Culture/ada-benchmark-notebook

GitHub support: https://support.github.com/

How GitHub works - YouTube video

Carry out the following steps

GitHub support links included.

System Installation

Ensure Git is installed and a GitHub account has been created before completing the install steps.

See section: 'GitHub onboarding'.

Use the Benchmark repository to test that your installation is functioning properly.

ADA Benchmark Notebook: https://github.com/NFDI4Culture/ada-benchmark-notebook

Support

Options for installation

For general purposes, use the manual install.

The Docker install is for when you are running multiple environments on your computer or carrying out long-term development.

Manual installation

Clone repository

To clone this repository from GitHub, ensure that Git is installed on your local machine either as a command line interface (https://git-scm.com/) or through GitHub Desktop (https://desktop.github.com/).

Use either the CLI or GitHub Desktop to clone the repository into your preferred installation directory.

If using CLI, navigate in the terminal to your preferred installation directory and run:

git clone https://github.com/repo-address
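For example, to clone the ADA Benchmark Notebook repository used later to test the installation, you would run:

git clone https://github.com/NFDI4Culture/ada-benchmark-notebook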

Install prerequisites (without Docker)

To install all prerequisites for running this repository on your local machine, please follow the instructions below.

First, install Python following the instructions at https://www.python.org/downloads/

Once Python is installed, navigate to the quarto_docker directory in terminal and run:

pip install -r requirements.txt

This should install all the required Python modules for running the Quarto rendering process.

Next, install the Quarto CLI following the instructions at https://quarto.org/docs/get-started/

Finally, install an environment for viewing and editing Jupyter Notebook files. This can be Visual Studio Code (https://code.visualstudio.com/), the open source fork VSCodium (https://vscodium.com/), or a dedicated Jupyter environment like JupyterLab (https://quarto.org/docs/get-started/hello/jupyter.html).
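As a quick sanity check (assuming standard installs), you can confirm the prerequisites are on your path by running:

python --version
quarto --version
jupyter --version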

Troubleshooting

https://quarto.org/docs/get-started/

Quarto help docs: https://quarto.org/

Note: The Quarto docs miss out that you need Python and JupyterLab installed, plus a working terminal to install them.

https://www.python.org/downloads/

https://jupyter.org/install

Also install pandas: py -m pip install pandas

You should also know what the Python interactive prompt looks like (>>>) and how to exit it (type exit() or quit(), or press Ctrl-D on Linux/macOS, Ctrl-Z then Enter on Windows): https://stackoverflow.com/questions/41524734/how-to-exit-python-script-in-command-prompt

If pip or python won't run on Windows, check the solution here to add the Python path for the terminal (April 2023).

Docker Installation

It's possible (though not required) to use Docker to run the environment for Jupyter Notebook running and Quarto rendering.

This process works on Linux but does not work on macOS due to a known issue: Quarto does not run properly in the Docker container on macOS because of the amd64 emulation in Docker Desktop for arm64 (Apple silicon) macOS. See discussion at quarto-dev/quarto-cli#3308. This shouldn't occur in any other environment running Docker.

To run in Docker, first install Docker Desktop following the instructions at https://docs.docker.com/desktop/.

Once installed, navigate in the terminal to the directory for the cloned Git repository.

Run docker-compose up -d --build to start the containers.

The jupyterlab container runs a stand-alone version of JupyterLab on http://localhost:8888. This can be used to edit any Jupyter Notebook files in the repository. The JupyterLab instance runs with the password 'jupyterlab'.

The nginx container runs the Nginx web server and displays the static site that Quarto renders. This runs at http://localhost:1337.

The quarto container starts an Ubuntu 22.04 container, installs prerequisites such as Python, downloads and installs Quarto, and then adds Python modules such as jupyter, matplotlib, and pandas. It then runs in the background so Quarto can be called on to render the .qmd and .ipynb files into the site/book like so:

docker exec -it quarto quarto render

When you're finished using the code, run docker-compose down to stop the containers.
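For orientation only, here is a sketch of what a docker-compose.yml covering the three containers described above might look like. The service definitions, images, ports, and paths are assumptions for illustration and not the repository's actual file:

services:
  jupyterlab:
    image: jupyter/minimal-notebook    # assumption: the repository may build its own image instead
    ports:
      - "8888:8888"                    # JupyterLab at http://localhost:8888
  nginx:
    image: nginx
    ports:
      - "1337:80"                      # static site at http://localhost:1337
    volumes:
      - ./_book:/usr/share/nginx/html:ro   # assumption: serve the site Quarto renders
  quarto:
    build: ./quarto_docker             # assumption: Ubuntu 22.04 plus Python, Quarto, and the Python modules
    container_name: quarto             # so 'docker exec -it quarto quarto render' finds the container
    tty: true                          # keep the container running in the background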

Visual Studio Code installation

Install Visual Studio Code (https://code.visualstudio.com/)

In Extensions (Menu: View > Extensions) install Python, Jupyter Notebooks, and Quarto. Search for each term and click the install button and follow instructions.

Download the following repository ADA Benchmark Notebook: https://github.com/NFDI4Culture/ada-benchmark-notebook

Open the repository in a new window in Visual Studio Code.

In the TERMINAL, run: pip install -r requirements.txt

Test Quarto: in the TERMINAL, run quarto check

[>] Checking Quarto installation......OK Version: 1.2.475 Jupyter: 4.11.2 Kernels: python3

[>] Checking Jupyter engine render....OK

[>] Checking R installation...........(None) Unable to locate an installed version of R. Install R from https://cloud.r-project.org/

Create a Wikidata query

Objective: To build a Wikidata query. See the example query: paintings, query link.

Notes: Wikidata Query (help)

Steps

  1. Go to the sample painting collection query here.

  2. Example parameters to change:

    1. Change the collection used

    2. Change the filter for the subject depicted in the paintings

    3. Change the number of items retrieved

  3. Step-by-step instructions

    Note: You will use the left-hand 'Query helper' GUI, where you can type in names of items. Sometimes you need to enter a term twice to get the correct item to appear.

    1. Enable split view with the 'i' info button at the top left.

    2. Filter: instance of P31, painting Q3305213 - wdt:P31 wd:Q3305213.

    3. Filter: collection P195, Bavarian State Painting Collection Q812285 - wdt:P195 wd:Q812285. Change to a painting collection of your choice - see list of collections on Wikipedia.

    4. Filter: depicts P180, river Q4022 - change to your preference

    5. Limit: at the bottom left, set the limit for the number of items returned.

    6. Play button (bottom left) - runs the query and renders the results below.

    7. Image grid view :-)

    8. Save your query: Options. At the bottom right you will see a Link icon; click this to access a Short Link creator (though this often does not work). Alternatively, you can copy the URL of your query from the browser address bar. Save the link in your browser bookmark bar or in a text editor.

Transferring a Wikidata SPARQL query to a Jupyter Notebook

A painting collection

We are using a pre-configured Jupyter Notebook that combines the SPARQL query and custom Python code. The Python code adds formatting to the SPARQL output to enable the Markdown rendering.

Wikidata's query builder cannot create all the Python needed for the Jupyter Notebook, which means a Python coder is needed to write that code.

In this example we are only editing three parameters:

  1. Collection

  2. Depicts

  3. Limit

The Wikidata query builder is only being used for previewing the query.

On lines 24 and 25 of the Notebook you can change the collection and depiction codes to your updated IDs:

            wdt:P195 wd:Q160236;   # Part of the Metropolitan Museum of Art collection

            wdt:P180 wd:Q4022.    # Depicts 'river'

Change the limit on line 49:

            LIMIT 27

Save and run the Notebook cell to see the results displayed in the same way as the Wikidata query preview.

Save the Notebook.
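For orientation, here is a minimal sketch (not the pre-configured ADA Painting Notebook itself) of how a notebook cell can send such a SPARQL query to the Wikidata endpoint and format the results as Markdown. The item and property IDs mirror those above; the use of the requests library and the Markdown layout are assumptions for illustration:

import requests
from IPython.display import Markdown, display

ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT ?item ?itemLabel ?image WHERE {
  ?item wdt:P31 wd:Q3305213;    # instance of: painting
        wdt:P195 wd:Q160236;    # collection: Metropolitan Museum of Art
        wdt:P180 wd:Q4022.      # depicts: river
  OPTIONAL { ?item wdt:P18 ?image. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 27
"""

# Ask the Wikidata Query Service for JSON results (a descriptive User-Agent is good practice).
response = requests.get(
    ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "FSCI-E08-class-example/0.1"},
)
response.raise_for_status()
rows = response.json()["results"]["bindings"]

# Turn each result into a Markdown image plus caption so Quarto can render it in the book output.
blocks = []
for row in rows:
    label = row["itemLabel"]["value"]
    image = row.get("image", {}).get("value")
    blocks.append(f"![{label}]({image})\n\n**{label}**" if image else f"**{label}**")

display(Markdown("\n\n".join(blocks)))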

Jupyter Notebooks: Setup, Editing, and Saving

General instructions for using Jupyter Notebooks.

Installation of Python and Jupyter Notebooks is covered in the 'System Installation' section.

The default editor we use is Visual Studio Code, but you can use other Notebook editors.

Steps

First, install the Python libraries listed in the requirements.txt file. See: https://note.nkmk.me/en/python-pip-install-requirements/

Use:

# install with 'pip install -r requirements.txt'

Then you can run, edit, and save Notebooks.

Editing Quarto

Visual Studio Code is one option as an editor, but you can use any editor suite that you like.

Install Visual Studio Code (VSC)

https://code.visualstudio.com/

Editing

Load the whole repository folder into your editor.

The key file to edit in Quarto is _quarto.yml. This file contains the main configurations for your publication.

If you are using VSC you can run Notebooks as well as use the Terminal to run Quarto commands, and commit to GitHub.

Change repo address and make repo link visible in publication

Edit _quarto.yml

If you are working on a fork, the first thing you need to do is edit the repository address on line 19 - this will point the GitHub icon in your publication to your own GitHub repo.

repo-url: https://github.com/my-repository-address

Add or move publication sections (chapters)

Edit _quarto.yml

To add new sections to your publication, just add file names to the chapter list after line 12, as in the sketch below.

The publication home page uses the file index.qmd.
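For orientation, a minimal sketch of the relevant part of _quarto.yml for a Quarto book follows; the title and the added chapter file name are placeholders, and the exact line numbers in your fork may differ:

project:
  type: book

book:
  title: "A Painting Exhibition Catalog"    # placeholder title
  repo-url: https://github.com/my-repository-address
  chapters:
    - index.qmd          # the publication home page
    - new-section.qmd    # placeholder: add further chapter files here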

Adding multi-format output formats

Edit _quarto.yml

For books the following output formats can be added: HTML, PDF, MS Word, EPUB, AsciiDoc.

https://quarto.org/docs/books/

Note: EPUB output needs a cover image added, for example:

epub:
  cover-image: cover.png
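For reference, a hedged sketch of a format block that combines several of these outputs (theme and cover file names are placeholders):

format:
  html:
    theme: cosmo             # see 'Change theme (style)' below
  pdf: default
  docx: default
  epub:
    cover-image: cover.png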

Change theme (style)

Edit _quarto.yml

https://quarto.org/docs/output-formats/html-themes.html

Add metadata (HTML)

Edit _quarto.yml

https://quarto.org/docs/reference/formats/html.html

Other settings

The other settings can be read about on the Quarto support pages, for example Book Structure.

Quarto Rendering

Quarto help docs: https://quarto.org/

Using Quarto to render multi-format output.

Use the command line to run Quarto: PowerShell, Git Bash, Cygwin, the macOS shell, or the terminal in Visual Studio Code (VSC), etc.

Execute the Quarto commands from the top level of your publication repository.
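For reference, a minimal example of the render commands: quarto render builds every format configured in _quarto.yml, the --to option limits output to a single format, and quarto preview serves a live preview in the browser:

quarto render
quarto render --to pdf
quarto preview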

Steps

Figure: Quarto rendering workflow. Courtesy of Quarto.

Troubleshooting

Problems encountered so far:

  1. Cover image: the file needs to be local for EPUB rendering.

  2. Python path on Windows: if pip or python won't run on Windows, check the solution here to add the Python path for the terminal (April 2023).

  3. CMD.EXE: Quarto won't run. See https://www.windows-faq.de/2017/02/27/unc-pfade-in-der-eingabeaufforderung-benutzen/ (no fix as yet - April 2023).

  4. Moving a repository locally: From Stackoverflow.

    If you are using GitHub Desktop, then just do the following steps:

    1. Close GitHub Desktop and all other applications with open files in your current directory path.

    2. Move the whole directory, as mentioned above, to the new directory location. (NB: the directory has to be completely moved.)

    3. Open GitHub Desktop and click on the blue (!) "repository not found" icon. Then a dialog will open and you will see a "Locate..." button which will open a popup allowing you to direct its path to a new location.

Software (open-source)

Over 2023/24 the computational components will be added to the ADA Semantic Publishing Pipeline, as well as introducing the Vivliostyle Create Book Markdown renderer and swapping from Quarto to the Jupyter Book computational book platform – https://github.com/NFDI4Culture/ada

AI Software (open-source)

To be confirmed: