Notebook 01 — DNB SRU Search

Project: Linked Open Exhibition — NFDI4Culture / Hochschule Hannover (BIM-126-02)
AI attribution: GitHub Copilot (Claude Sonnet 4.6)
Licence (DNB data): CC0 1.0 Universal — see dnb.de/metadataservice

Purpose: Count and retrieve all DNB book records related to Sprengel Museum Hannover and save them as raw MARC21-XML files.

Background: SRU

SRU (Search/Retrieve via URL) is an OASIS/Library of Congress standard protocol for performing bibliographic searches over HTTP. You send a URL with query parameters; the response is XML.

DNB SRU endpoint: https://services.dnb.de/sru/dnb

Key parameters:

Parameter	Value	Meaning
`operation`	`searchRetrieve`	perform a search
`version`	`1.1`	SRU protocol version
`query`	`Sprengel Museum and mat=books`	CQL query: keyword search, books only
`recordSchema`	`MARC21-xml`	return MARC21-XML records
`maximumRecords`	`100`	up to 100 records per request
`startRecord`	`1`, `101`, `201`, …	pagination offset
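To make the parameter table concrete, here is a minimal sketch of how these values combine into a single request URL. The helper name `build_sru_url` is hypothetical and not part of this notebook; the parameter names and values are the ones listed above, plus the `version` parameter that the notebook's requests also send.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SRU_BASE = "https://services.dnb.de/sru/dnb"

def build_sru_url(query, start_record=1, maximum_records=100):
    """Return the full SRU request URL for one page of results (hypothetical helper)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": query,
        "recordSchema": "MARC21-xml",
        "maximumRecords": maximum_records,
        "startRecord": start_record,
    }
    # urlencode percent-encodes the query, so spaces and "=" are safe in the URL
    return f"{SRU_BASE}?{urlencode(params)}"

url = build_sru_url("Sprengel Museum and mat=books", start_record=101)
print(url)

# Decoding the URL again recovers the original query string
print(parse_qs(urlsplit(url).query)["query"][0])
```

Pasting the printed URL into a browser is a quick way to look at a raw SRU response before writing any parsing code.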

This corresponds to searching for "Sprengel Museum", restricted to books, on the DNB portal (portal.dnb.de).

All DNB bibliographic metadata is released under CC0 1.0 Universal — it may be freely reused without attribution.

import requests
import time
import xml.etree.ElementTree as ET
from pathlib import Path

# Output directory for raw XML files
OUTPUT_DIR = Path("../sprengel_raw")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

SRU_BASE    = "https://services.dnb.de/sru/dnb"
QUERY       = "Sprengel Museum and mat=books"   # simple keyword search, books only
MAX_RECORDS = 100

print(f"Output directory : {OUTPUT_DIR.resolve()}")
print(f"Query            : {QUERY}")

Output directory : C:\git\linked-open-exhibition\catalogues\sprengel_raw
Query            : Sprengel Museum and mat=books

Step 1 — Count total records

Before downloading anything, ask the DNB how many records the query returns. The count is read from the `numberOfRecords` element of the response and should match the hit count shown for the same search on the DNB portal.

params = {
    "operation": "searchRetrieve",
    "version": "1.1",
    "query": QUERY,
    "recordSchema": "MARC21-xml",
    "maximumRecords": 1,   # we only need the count for now
    "startRecord": 1,
}

resp = requests.get(SRU_BASE, params=params, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.content)  # parse bytes so the XML declaration's encoding applies
ns_srw = {"srw": "http://www.loc.gov/zing/srw/"}
total  = int(root.findtext("srw:numberOfRecords", default="0", namespaces=ns_srw))

pages = (total + MAX_RECORDS - 1) // MAX_RECORDS
print(f"Total records matching query : {total}")
print(f"Pages to retrieve ({MAX_RECORDS}/page) : {pages}")

Total records matching query : 582
Pages to retrieve (100/page) : 6
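`raise_for_status()` only catches HTTP-level failures. SRU 1.1 can also report problems, such as a malformed query, inside an otherwise successful XML response, in a `diagnostics` element. The sketch below assumes the DNB endpoint follows that part of the standard; it reads the count and raises if diagnostics are present. The function `read_count` and both sample documents are hypothetical, hand-written for illustration and not captured from the DNB.

```python
import xml.etree.ElementTree as ET

NS = {
    "srw": "http://www.loc.gov/zing/srw/",
    "diag": "http://www.loc.gov/zing/srw/diagnostic/",
}

# Hand-written sample responses (assumptions, not real DNB output)
SAMPLE_OK = b"""<?xml version="1.0" encoding="UTF-8"?>
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <version>1.1</version>
  <numberOfRecords>582</numberOfRecords>
</searchRetrieveResponse>"""

SAMPLE_ERROR = b"""<?xml version="1.0" encoding="UTF-8"?>
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <version>1.1</version>
  <diagnostics>
    <diagnostic xmlns="http://www.loc.gov/zing/srw/diagnostic/">
      <uri>info:srw/diagnostic/1/10</uri>
      <message>Query syntax error</message>
    </diagnostic>
  </diagnostics>
</searchRetrieveResponse>"""

def read_count(xml_bytes):
    """Return numberOfRecords, or raise if the response carries SRU diagnostics."""
    root = ET.fromstring(xml_bytes)
    messages = [
        d.findtext("diag:message", default="", namespaces=NS)
        or d.findtext("diag:uri", default="unknown", namespaces=NS)
        for d in root.iterfind(".//diag:diagnostic", NS)
    ]
    if messages:
        raise RuntimeError("SRU diagnostic: " + "; ".join(messages))
    return int(root.findtext("srw:numberOfRecords", default="0", namespaces=NS))

print(read_count(SAMPLE_OK))

try:
    read_count(SAMPLE_ERROR)
except RuntimeError as err:
    print(err)
```

In the notebook, the same check could be applied to `resp.content` before trusting the count.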

Step 2 — Retrieve all pages and save as XML

Each page of up to 100 records is saved as a separate file in sprengel_raw/: page_001.xml, page_002.xml, and so on. The last page holds only the remaining records.

A 1-second pause between requests avoids overloading the DNB server.

for page in range(pages):
    start = page * MAX_RECORDS + 1
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": QUERY,
        "recordSchema": "MARC21-xml",
        "maximumRecords": MAX_RECORDS,
        "startRecord": start,
    }
    resp = requests.get(SRU_BASE, params=params, timeout=60)
    resp.raise_for_status()

    out_file = OUTPUT_DIR / f"page_{page+1:03d}.xml"
    out_file.write_bytes(resp.content)  # save raw bytes so the server's encoding is preserved
    print(f"Saved page {page+1}/{pages} → {out_file.name} (startRecord={start})")

    if page < pages - 1:
        time.sleep(1)  # be polite to the DNB server

print("\nAll pages retrieved.")

Saved page 1/6 → page_001.xml (startRecord=1)
Saved page 2/6 → page_002.xml (startRecord=101)
Saved page 3/6 → page_003.xml (startRecord=201)
Saved page 4/6 → page_004.xml (startRecord=301)
Saved page 5/6 → page_005.xml (startRecord=401)
Saved page 6/6 → page_006.xml (startRecord=501)

All pages retrieved.
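The `startRecord` values in the log follow from ceiling division plus SRU's 1-based record positions. As a minimal sketch, the arithmetic can be isolated in a small generator; `page_starts` is a hypothetical helper name, not something defined in this notebook.

```python
def page_starts(total, page_size=100):
    """Yield (page_number, startRecord) pairs; SRU record positions start at 1."""
    pages = (total + page_size - 1) // page_size  # ceiling division without floats
    for page in range(pages):
        yield page + 1, page * page_size + 1

print(list(page_starts(582)))
# → [(1, 1), (2, 101), (3, 201), (4, 301), (5, 401), (6, 501)]
```

An exact multiple of the page size yields no empty trailing page (`600` gives 6 pages), and a total of `0` yields no requests at all.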

Step 3 — Inspect the first record

Print the first MARC21-XML record from page 1 to verify the data looks correct.

first_page = OUTPUT_DIR / "page_001.xml"
root = ET.parse(first_page).getroot()

# Namespace prefixes used in the ElementTree paths below
ns = {
    "srw": "http://www.loc.gov/zing/srw/",
    "marc": "http://www.loc.gov/MARC21/slim",
}

records = root.findall(".//srw:recordData/marc:record", ns)
print(f"Records in first page: {len(records)}")

# Print a summary of the first record
rec = records[0]
idn   = rec.findtext("marc:controlfield[@tag='001']", namespaces=ns) or "—"
title = rec.findtext(".//marc:datafield[@tag='245']/marc:subfield[@code='a']", namespaces=ns) or "—"
# Publication year: try field 264 first, then fall back to field 260
year  = (rec.findtext(".//marc:datafield[@tag='264']/marc:subfield[@code='c']", namespaces=ns) or
         rec.findtext(".//marc:datafield[@tag='260']/marc:subfield[@code='c']", namespaces=ns) or "—")

print(f"\nFirst record:\n  IDN  : {idn}\n  Title: {title}\n  Year : {year}")

Records in first page: 100

First record:
  IDN  : 1375818457
  Title: Niki de Saint Phalle - Die Grotte
  Year : 2026
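The field lookups above all follow one pattern: a MARC tag plus a subfield code. The sketch below wraps that pattern in a small helper and runs it on an inline record, so it can be tried without downloaded files. The helper name `subfield` and the sample record, including its IDN and title, are invented for illustration and are not DNB data.

```python
import xml.etree.ElementTree as ET

MARC = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written sample record (an assumption, not a real DNB record)
SAMPLE_RECORD = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">0000000000</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
    <subfield code="b">a subtitle</subfield>
  </datafield>
  <datafield tag="264" ind1=" " ind2="1">
    <subfield code="a">Hannover</subfield>
    <subfield code="c">[2020]</subfield>
  </datafield>
</record>
"""

def subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    return record.findtext(
        f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']",
        namespaces=MARC,
    )

rec = ET.fromstring(SAMPLE_RECORD)
summary = {
    "idn": rec.findtext("marc:controlfield[@tag='001']", namespaces=MARC),
    "title": subfield(rec, "245", "a"),
    # Same fallback as in the notebook: 264 first, then 260
    "year": subfield(rec, "264", "c") or subfield(rec, "260", "c"),
}
print(summary)
```

Returning `None` for a missing subfield keeps the `or` fallback chain working the same way as in the cell above.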

Next step: Run 02_dnb_filter_exhibitions.ipynb to parse these XML files, filter for exhibition catalogues, and extract structured fields into a CSV.