Class Guide: E08 – Publishing from Collections Using Linked Open Data Sources and Computational Publishing Pipelines

FSCI 2023

by Simon Worthington; Simon Bowie

Version v1.1

Published by: NextGenBooks - TIB

Last updated: 2023-07-30

Created: 2023-07-29

Language: English (United States)

Created by: Simon Worthington

Publishing from Collections

Publication prototypes

This is a hands-on class for participants with no prior experience of computational publishing using Jupyter Notebooks and linked open data (Wikidata and Wikibase). The class has three demonstration use cases for the auto-creation of catalog publications for exhibitions, publication listings, and a reader – made from multiple linked open data (LOD) sources and published as multi-format output: web, PDF, ebook, etc.

Participants will be instructed in the use of the software pipeline and will practice on three use cases:

FSCI 2023 class: E08 – Publishing from Collections Using Linked Open Data Sources and Computational Publishing Pipelines - https://osf.io/t4j5a/

Coordinated by Simon Worthington - NFDI4Culture @Open Science Lab, TIB, Hannover

FSCI 2023 instructors

August 2023

Figure: Workflow for use case #1, Paintings - see full view here.

Important links for the class

Other Helpful Information


Document DOI: To be confirmed | Author: Simon Worthington https://orcid.org/0000-0002-8579-9717 | CC BY-SA 4.0 International.

All software used is open source and OSI-licence compliant. All content and other resources are open access with open licences.


To edit this document - request access by emailing simon.worthington@tib.eu.

Publication use cases

The example workflows have been put together by researchers from two research consortia, NFDI4Culture (German National Research Data Infrastructure) and COPIM (Community-led Open Publication Infrastructures for Monographs), in consultation with the publisher Open Book Publishers, Cambridge (UK).

Class activities and the use cases

Course learning objectives

At the end of the course, participants will be able to:

Course topics

This course will be presented over three days for 1½ hours each day and will cover these topics:

Course schedule and activities: Days 1-3

Note: Activities can be completed in advance of sessions or after sessions by participants.

Presentation details Day 1: Simon Worthington and Simon Bowie. Instruction and guidance on installing and configuring the pipeline software. Introduction to using Wikidata as a source for a publication's content and how to retrieve the content via an API with a Jupyter notebook. Instructions for the exercise of configuring use case #1, ‘A painting exhibition catalog’ - participants can use the GitHub repo in their own time and carry out the steps in the Class Guide provided.

Activities for participants:

The result of these activities will be learning how to create a multi-format output using Wikidata, Jupyter Notebooks, and Quarto. You will transform the ADA Painting Notebook to look like this example fork.

Presentation details Day 2: Simon Worthington, Simon Bowie, and Janneke Adema. Instructions for the exercise of configuring use case #2, ‘A publisher's book catalog’. A brief introduction to Thoth, the bibliographic and metadata service for books. An introduction to the experimental publishing work of COPIM and its successor project, Open Book Futures.

Simon Worthington, Simon Bowie, and Janneke Adema will share insights and findings from the COPIM research on computational publishing - especially the collaboration with Open Book Publishers and how to integrate computational publishing workflows with conventional publishers' workflows.

Activities for participants:

Presentation details Day 3: Simon Worthington, Simon Bowie, and the semanticClimate hackathon team. Review of work on use cases #1 and #2. Introduction to use case #3, ‘City Climate Change Plan Reader (experimental prototype)’, with team semanticClimate, where their PyAmi pipeline can be introduced.

This use case uses Wikibase as its LOD source. Wikibase is a self-hosted, open-source version of Wikidata. A mockup of use case #3 will be made with semanticClimate; we will review the mockup and look at the challenges for a follow-up prototype in a later phase. The goals of this prototyping round are:

Activities for participants:

GitHub onboarding

Learning objective: Onboarding and familiarisation with using GitHub for publishing a website with GitHub Pages.

Use the Benchmark repository as a sample repo to fork, clone, and turn on GitHub Pages: https://github.com/NFDI4Culture/ada-benchmark-notebook

GitHub support: https://support.github.com/

How GitHub works - YouTube video

Carry out the following steps

GitHub support links included.

System Installation

Ensure Git is installed and a GitHub account has been created before completing the install steps.

See section: 'GitHub onboarding'.

Use the Benchmark repository to test that your installation is functioning properly.

ADA Benchmark Notebook: https://github.com/NFDI4Culture/ada-benchmark-notebook

Support

Options for installation

For general purposes, use the manual install.

The Docker install is for when you are running multiple environments on your computer or carrying out long-term development.

Manual installation

Clone repository

To clone this repository from GitHub, ensure that Git is installed on your local machine either as a command line interface (https://git-scm.com/) or through GitHub Desktop (https://desktop.github.com/).

Use either the CLI or GitHub Desktop to clone the repository into your preferred installation directory.

If using CLI, navigate in the terminal to your preferred installation directory and run:

git clone https://github.com/repo-address
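For example, to clone the ADA Benchmark Notebook repository used later to test the installation, you would run:

git clone https://github.com/NFDI4Culture/ada-benchmark-notebook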

Install prerequisites (without Docker)

To install all prerequisites for running this repository on your local machine, please follow the instructions below.

First, install Python following the instructions at https://www.python.org/downloads/

Once Python is installed, navigate to the quarto_docker directory in terminal and run:

pip install -r requirements.txt

This should install all the required Python modules for running the Quarto rendering process.

Next, install the Quarto CLI following the instructions at https://quarto.org/docs/get-started/

Finally, install an environment for viewing and editing Jupyter Notebook files. This can be Visual Studio Code (https://code.visualstudio.com/), the open source fork VSCodium (https://vscodium.com/), or a dedicated Jupyter environment like JupyterLab (https://quarto.org/docs/get-started/hello/jupyter.html).
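As a quick sanity check (assuming standard installs), you can confirm the prerequisites are on your path by running:

python --version
quarto --version
jupyter --version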

Troubleshooting

https://quarto.org/docs/get-started/

Quarto help docs: https://quarto.org/

Note: The Quarto docs miss out that you need Python and JupyterLab installed, plus a working terminal to install them.

https://www.python.org/downloads/

https://jupyter.org/install

Also install pandas: py -m pip install pandas

You should also know what the Python interactive prompt looks like (>>>) and how to exit it (type exit() or quit(), or press Ctrl-D on Linux/macOS, Ctrl-Z then Enter on Windows): https://stackoverflow.com/questions/41524734/how-to-exit-python-script-in-command-prompt

If pip or python won't run on Windows, check the solution here to add the Python path for the terminal (April 2023).

Docker Installation

It's possible (though not required) to use Docker to run the environment for Jupyter Notebook running and Quarto rendering.

This process works on Linux but does not work on macOS due to a known issue: Quarto does not run properly in the Docker container on macOS because of the amd64 emulation in Docker Desktop for arm64 (Apple silicon) macOS. See discussion at quarto-dev/quarto-cli#3308. This shouldn't occur in any other environment running Docker.

To run in Docker, first install Docker Desktop following the instructions at https://docs.docker.com/desktop/.

Once installed, navigate in the terminal to the directory for the cloned Git repository.

Run docker-compose up -d --build to start the containers.

The jupyterlab container runs a stand-alone version of JupyterLab on http://localhost:8888. This can be used to edit any Jupyter Notebook files in the repository. The JupyterLab instance runs with the password 'jupyterlab'.

The nginx container runs the Nginx web server and displays the static site that Quarto renders. This runs at http://localhost:1337.

The quarto container starts an Ubuntu 22.04 container, installs prerequisites such as Python, downloads and installs Quarto, and then adds Python modules such as jupyter, matplotlib, and pandas. It then runs in the background so Quarto can be called on to render the .qmd and .ipynb files into the site/book like so:

docker exec -it quarto quarto render

When you're finished using the code, run docker-compose down to stop the containers.
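For orientation only, here is a sketch of what a docker-compose.yml covering the three containers described above might look like. The service definitions, images, ports, and paths are assumptions for illustration and not the repository's actual file:

services:
  jupyterlab:
    image: jupyter/minimal-notebook    # assumption: the repository may build its own image instead
    ports:
      - "8888:8888"                    # JupyterLab at http://localhost:8888
  nginx:
    image: nginx
    ports:
      - "1337:80"                      # static site at http://localhost:1337
    volumes:
      - ./_book:/usr/share/nginx/html:ro   # assumption: serve the site Quarto renders
  quarto:
    build: ./quarto_docker             # assumption: Ubuntu 22.04 plus Python, Quarto, and the Python modules
    container_name: quarto             # so 'docker exec -it quarto quarto render' finds the container
    tty: true                          # keep the container running in the background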

Visual Studio Code installation

Install Visual Studio Code (https://code.visualstudio.com/)

In Extensions (Menu: View > Extensions) install Python, Jupyter Notebooks, and Quarto. Search for each term and click the install button and follow instructions.

Download the following repository ADA Benchmark Notebook: https://github.com/NFDI4Culture/ada-benchmark-notebook

Open the repository in a new window in Visual Studio Code.

In the TERMINAL, run: pip install -r requirements.txt

Test Quarto: in the TERMINAL, run quarto check

[>] Checking Quarto installation......OK Version: 1.2.475 Jupyter: 4.11.2 Kernels: python3

[>] Checking Jupyter engine render....OK

[>] Checking R installation...........(None) Unable to locate an installed version of R. Install R from https://cloud.r-project.org/

Create a Wikidata query

Objective: To build a Wikidata query. See the example query: paintings, query link.

Notes: Wikidata Query (help)

Steps

  1. Go to the sample painting collection query here.

  2. Example parameters to change:

    1. Change the collection used

    2. Change the filter for the subject depicted in the paintings

    3. Change the number of items retrieved

  3. Step-by-step instructions

    Note: You will use the left-hand 'Query helper' GUI, where you can type in names of items. Sometimes you need to enter a term twice to get the correct item to appear.

    1. Enable split view with the 'i' info button at the top left.

    2. Filter: instance of P31, painting Q3305213 - wdt:P31 wd:Q3305213.

    3. Filter: collection P195, Bavarian State Painting Collection Q812285 - wdt:P195 wd:Q812285. Change to a painting collection of your choice - see list of collections on Wikipedia.

    4. Filter: depicts P180, river Q4022 - change to your preference

    5. Limit: at the bottom left, set the limit for the number of items returned.

    6. Play button (bottom left) - runs the query and renders the results below.

    7. Image grid view :-)

    8. Save your query: Options. At the bottom right you will see a Link icon; click this to access a Short Link creator (though this often does not work). Alternatively, you can copy the URL of your query from the browser address bar. Save the link in your browser bookmark bar or in a text editor.

Transferring a Wikidata SPARQL query to a Jupyter Notebook

A painting collection

We are using a pre-configured Jupyter Notebook that combines the SPARQL query and custom Python code. The Python code adds formatting to the SPARQL output to enable the Markdown rendering.

Wikidata's query builder cannot create all the Python needed for the Jupyter Notebook, which means a Python coder is needed to write that code.

In this example we are only editing three parameters:

  1. Collection

  2. Depicts

  3. Limit

The Wikidata query builder is only being used for previewing the query.

On lines 24 and 25 of the Notebook you can change the collection and depiction codes to your updated IDs:

            wdt:P195 wd:Q160236;   # Part of the Metropolitan Museum of Art collection

            wdt:P180 wd:Q4022.    # Depicts 'river'

Change the limit on line 49:

            LIMIT 27

Save and run the Notebook cell to see the results displayed in the same way as the Wikidata query preview.

Save the Notebook.
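For orientation, here is a minimal sketch (not the pre-configured ADA Painting Notebook itself) of how a notebook cell can send such a SPARQL query to the Wikidata endpoint and format the results as Markdown. The item and property IDs mirror those above; the use of the requests library and the Markdown layout are assumptions for illustration:

import requests
from IPython.display import Markdown, display

ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT ?item ?itemLabel ?image WHERE {
  ?item wdt:P31 wd:Q3305213;    # instance of: painting
        wdt:P195 wd:Q160236;    # collection: Metropolitan Museum of Art
        wdt:P180 wd:Q4022.      # depicts: river
  OPTIONAL { ?item wdt:P18 ?image. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 27
"""

# Ask the Wikidata Query Service for JSON results (a descriptive User-Agent is good practice).
response = requests.get(
    ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "FSCI-E08-class-example/0.1"},
)
response.raise_for_status()
rows = response.json()["results"]["bindings"]

# Turn each result into a Markdown image plus caption so Quarto can render it in the book output.
blocks = []
for row in rows:
    label = row["itemLabel"]["value"]
    image = row.get("image", {}).get("value")
    blocks.append(f"![{label}]({image})\n\n**{label}**" if image else f"**{label}**")

display(Markdown("\n\n".join(blocks)))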

Jupyter Notebooks: Setup, Editing, and Saving

General instructions for using Jupyter Notebooks.

Installation of Python and Jupyter Notebooks is covered in the 'System Installation' section.

The default editor we use is Visual Studio Code, but you can use other Notebook editors.

Steps

First, install the Python libraries listed in the requirements.txt file. See: https://note.nkmk.me/en/python-pip-install-requirements/

Use:

# install with 'pip install -r requirements.txt'

Then you can run, edit, and save Notebooks.

Editing Quarto

Visual Studio Code is one option as an editor, but you can use any editor suite that you like.

Install Visual Studio Code (VSC)

https://code.visualstudio.com/

Editing

Load the whole repository folder into your editor.

The key file to edit in Quarto is _quarto.yml. This file contains the main configurations for your publication.

If you are using VSC you can run Notebooks as well as use the Terminal to run Quarto commands, and commit to GitHub.

Change repo address and make repo link visible in publication

Edit _quarto.yml

If you are working on a fork, the first thing you need to do is edit the repository address on line 19 - this will point the GitHub icon in your publication to your own GitHub repo.

repo-url: https://github.com/my-repository-address

Add or move publication sections (chapters)

Edit _quarto.yml

To add new sections to your publication, just add file names to the chapter list after line 12, as in the sketch below.

The publication home page uses the file index.qmd.
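For orientation, a minimal sketch of the relevant part of _quarto.yml for a Quarto book follows; the title and the added chapter file name are placeholders, and the exact line numbers in your fork may differ:

project:
  type: book

book:
  title: "A Painting Exhibition Catalog"    # placeholder title
  repo-url: https://github.com/my-repository-address
  chapters:
    - index.qmd          # the publication home page
    - new-section.qmd    # placeholder: add further chapter files here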

Adding multi-format output formats

Edit _quarto.yml

For books the following output formats can be added: HTML, PDF, MS Word, EPUB, AsciiDoc.

https://quarto.org/docs/books/

Note: EPUB output needs a cover image added, for example:

epub:
  cover-image: cover.png
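For reference, a hedged sketch of a format block that combines several of these outputs (theme and cover file names are placeholders):

format:
  html:
    theme: cosmo             # see 'Change theme (style)' below
  pdf: default
  docx: default
  epub:
    cover-image: cover.png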

Change theme (style)

Edit _quarto.yml

https://quarto.org/docs/output-formats/html-themes.html

Add metadata (HTML)

Edit _quarto.yml

https://quarto.org/docs/reference/formats/html.html

Other settings

The other settings can be read about on the Quarto support pages, for example Book Structure.

Quarto Rendering

Quarto help docs: https://quarto.org/

Using Quarto to render multi-format output.

Use the command line to run Quarto: PowerShell, Git Bash, Cygwin, the macOS shell, or the terminal in Visual Studio Code (VSC), etc.

Execute the Quarto commands from the top level of your publication repository.
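For reference, a minimal example of the render commands: quarto render builds every format configured in _quarto.yml, the --to option limits output to a single format, and quarto preview serves a live preview in the browser:

quarto render
quarto render --to pdf
quarto preview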

Steps

Figure: Quarto rendering workflow. Courtesy of Quarto.

Troubleshooting

Problems encountered so far:

  1. Cover image: the file needs to be local for EPUB rendering.

  2. Python path on Windows: if pip or python won't run on Windows, check the solution here to add the Python path for the terminal (April 2023).

  3. CMD.EXE: Quarto won't run. See https://www.windows-faq.de/2017/02/27/unc-pfade-in-der-eingabeaufforderung-benutzen/ (no fix as yet - April 2023).

  4. Moving a repository locally: From Stackoverflow.

    If you are using GitHub Desktop, then just do the following steps:

    1. Close GitHub Desktop and all other applications with open files in your current directory path.

    2. Move the whole directory, as mentioned above, to the new directory location. (NB: the directory has to be completely moved.)

    3. Open GitHub Desktop and click on the blue (!) "repository not found" icon. Then a dialog will open and you will see a "Locate..." button which will open a popup allowing you to direct its path to a new location.

Software (open-source)

Over 2023/24 the computational components will be added to the ADA Semantic Publishing Pipeline, as well as introducing the Vivliostyle Create Book Markdown renderer and swapping from Quarto to the Jupyter Book computational book platform – https://github.com/NFDI4Culture/ada

AI Software (open-source)

To be confirmed: