import pandas as pd
from pathlib import Path
csv_path = Path("catalogues/sprengel_exhibitions.csv")
df = pd.read_csv(csv_path)
print(f"Records: {len(df)}")
df.head(10)
Data Analysis
DNB Pipeline — From Raw Records to Wikibase
This page documents the full pipeline, from DNB retrieval to Wikibase publication.
To run the code live during quarto render, set eval: true in the YAML header and make sure the .env file and the pipeline outputs are present.
Pipeline overview
DNB SRU API
│ (MARC21-XML)
▼
01_dnb_search.ipynb → sprengel_raw/page_*.xml
│
▼
02_dnb_filter_exhibitions.ipynb → sprengel_exhibitions.csv
│
├──▶ 03_dnb_cover_images.ipynb → images/*.jpg
│
▼
04_wikibase_data_model.ipynb → wikibase_property_map.json
│
▼
05_wikibase_upload.ipynb → Wikibase items
│
▼
06_mediawiki_upload.ipynb → MediaWiki files
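Before rendering with live code, it can help to see which stages of the overview have already produced output. The following is a minimal sketch, not part of the notebooks: the output patterns are copied from the diagram, while the helper names `pipeline_status` and `report`, and the idea of resolving all patterns against one base directory, are assumptions made for illustration.

```python
from pathlib import Path

# Output patterns copied from the overview diagram. Resolving them against a
# single base directory is an assumption of this sketch.
PIPELINE_OUTPUTS = {
    "01_dnb_search.ipynb": "sprengel_raw/page_*.xml",
    "02_dnb_filter_exhibitions.ipynb": "sprengel_exhibitions.csv",
    "03_dnb_cover_images.ipynb": "images/*.jpg",
    "04_wikibase_data_model.ipynb": "wikibase_property_map.json",
}


def pipeline_status(base_dir):
    """Return {notebook: number of files matching its output pattern}."""
    base = Path(base_dir)
    return {
        notebook: sum(1 for _ in base.glob(pattern))
        for notebook, pattern in PIPELINE_OUTPUTS.items()
    }


def report(base_dir="catalogues"):
    """Print one status line per notebook (hypothetical helper)."""
    for notebook, n_files in pipeline_status(base_dir).items():
        state = "ok" if n_files else "missing"
        print(f"{notebook:<34} {state:<8} ({n_files} file(s))")
```

Calling `report("catalogues")` prints one line per notebook. Notebooks 05 and 06 are left out because their results live in Wikibase and MediaWiki rather than on disk.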
Step 1 — Load the CSV
After running Notebooks 01 and 02, the exhibition records are available as a CSV file.
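A slightly more defensive way to load the file is sketched here. It assumes the CSV contains the `year` and `isbn` columns used in the later steps; the helper name `load_exhibitions` is hypothetical, and reading `isbn` as text is a precaution rather than something the notebooks are known to do.

```python
from pathlib import Path

import pandas as pd

# The two columns used later on this page.
REQUIRED_COLUMNS = ("year", "isbn")


def load_exhibitions(csv_path, required=REQUIRED_COLUMNS):
    """Load the exhibition CSV, failing early if expected columns are absent."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"{csv_path} not found; run Notebooks 01 and 02 first")
    # Read only the header row first so the column check is cheap.
    header = pd.read_csv(csv_path, nrows=0).columns
    missing = [col for col in required if col not in header]
    if missing:
        raise ValueError(f"{csv_path} lacks expected columns: {missing}")
    # Keep ISBNs as text so leading zeros and hyphens survive (assumption:
    # the column might otherwise be parsed as a number).
    return pd.read_csv(csv_path, dtype={"isbn": "string"})
```

With this helper, `df = load_exhibitions("catalogues/sprengel_exhibitions.csv")` either returns the DataFrame or raises an error that points at the missing pipeline step or column.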
Step 2 — Year distribution
import matplotlib.pyplot as plt
year_counts = df["year"].value_counts().sort_index()
year_counts.plot(kind="bar", figsize=(14, 4), title="Exhibition catalogues per year")
plt.xlabel("Year")
plt.ylabel("Number of catalogues")
plt.tight_layout()
plt.show()
Step 3 — Records with ISBN (cover image candidates)
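A non-empty ISBN field is only a rough indicator that a cover image can be looked up. A stricter filter could normalise each value and verify its check digit. The sketch below assumes each field holds a single ISBN-10 or ISBN-13, possibly with hyphens or spaces; `normalize_isbn` and `is_valid_isbn` are hypothetical helpers, not functions from the notebooks.

```python
import re


def normalize_isbn(raw):
    """Drop hyphens, spaces and other separators; keep digits and X."""
    return re.sub(r"[^0-9Xx]", "", str(raw)).upper()


def is_valid_isbn(raw):
    """Check the ISBN-10 or ISBN-13 check digit of a raw field value."""
    isbn = normalize_isbn(raw)
    if len(isbn) == 13 and isbn.isdigit():
        # ISBN-13: weights alternate 1, 3; the total must be divisible by 10.
        total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(isbn))
        return total % 10 == 0
    if len(isbn) == 10 and isbn[:9].isdigit() and (isbn[9].isdigit() or isbn[9] == "X"):
        # ISBN-10: weights 10 down to 1, "X" counts as 10; divisible by 11.
        digits = [int(d) for d in isbn[:9]] + [10 if isbn[9] == "X" else int(isbn[9])]
        total = sum(w * d for w, d in zip(range(10, 0, -1), digits))
        return total % 11 == 0
    return False
```

Applied to the DataFrame, `df["isbn"].map(is_valid_isbn).sum()` would give the number of records whose ISBN passes the checksum. Missing values normalise to an empty string and therefore count as invalid. The simpler count used on this page follows.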
has_isbn = df["isbn"].notna() & (df["isbn"] != "")
print(f"Records with ISBN: {has_isbn.sum()} / {len(df)}")
Step 4 — SPARQL summary from Wikibase
After running Notebooks 04 and 05, the data is in Wikibase. The query below gives a rough size check of the upload. Note that it counts every triple in the store, not distinct items:
import os
import requests
from dotenv import load_dotenv
load_dotenv(Path(".env"))
WB_URL = os.getenv("WB_URL", "https://wikibase.wbworkshop.tibwiki.io")
SPARQL_URL = os.getenv("SPARQL_URL", "https://query.wbworkshop.tibwiki.io/sparql")
# Use the confirmed-working count query (wdt: prefix not auto-resolved on local Wikibase).
# The pattern matches every triple, so the result is a triple count, not an item count.
SPARQL = """
SELECT (COUNT(?item) AS ?count) WHERE {
  ?item ?p ?o .
}
"""
try:
    resp = requests.get(
        SPARQL_URL,
        params={"query": SPARQL, "format": "json"},
        headers={"Accept": "application/sparql-results+json"},
        timeout=20,
    )
    resp.raise_for_status()
    count = resp.json()["results"]["bindings"][0]["count"]["value"]
    print(f"Triples in Wikibase: {count}")
except Exception as e:
    # Report the failure instead of aborting the render.
    print(f"SPARQL query failed: {e}")
Notebooks
The complete, documented notebooks are in catalogues/dnb-jupyter/:
| Notebook | Purpose |
|---|---|
| 00_wikibase_bot_test.ipynb | Test bot account credentials |
| 00_sparql_health_check.ipynb | Check that the SPARQL endpoint responds |
| 01_dnb_search.ipynb | Retrieve DNB records via SRU |
| 02_dnb_filter_exhibitions.ipynb | Parse MARC21-XML, filter, export CSV |
| 03_dnb_cover_images.ipynb | Retrieve cover images |
| 04_wikibase_data_model.ipynb | Upload data model to Wikibase |
| 05_wikibase_upload.ipynb | Upload exhibition records |
| 06_mediawiki_upload.ipynb | Upload cover images to MediaWiki |