Project: Linked Open Exhibition — NFDI4Culture / Hochschule Hannover (BIM-126-02)
AI attribution: GitHub Copilot (Claude Sonnet 4.6)
Licence (DNB data): CC0 1.0 Universal — see dnb.de/metadataservice
Purpose: Count and retrieve all DNB book records related to Sprengel Museum Hannover and save them as raw MARC21-XML files.
Background: SRU
SRU (Search/Retrieve via URL) is an OASIS/Library of Congress standard protocol for performing bibliographic searches over HTTP. You send a URL with query parameters; the response is XML.
DNB SRU endpoint: https://services.dnb.de/sru/dnb
Key parameters:
  Parameter        Value                            Meaning
  operation        searchRetrieve                   perform a search
  query            Sprengel Museum and mat=books    keyword search, books only
  recordSchema     MARC21-xml                       return MARC21-XML records
  maximumRecords   100                              up to 100 records per request
  startRecord      1, 101, 201, …                   pagination offset
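To make the parameters above concrete, here is a minimal sketch of how they combine into a single request URL. The helper name `build_sru_url` is hypothetical and not part of this notebook. The `version` value is taken from the request code further down, and the exact encoding produced by `requests` may differ slightly from what `urlencode` yields here.

```python
from urllib.parse import urlencode

SRU_BASE = "https://services.dnb.de/sru/dnb"


def build_sru_url(query, start_record=1, maximum_records=100):
    """Return the SRU request URL for one page of results (illustrative helper)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": query,
        "recordSchema": "MARC21-xml",
        "maximumRecords": maximum_records,
        "startRecord": start_record,
    }
    # urlencode percent-encodes the query: spaces become "+", "=" becomes "%3D"
    return f"{SRU_BASE}?{urlencode(params)}"


url = build_sru_url("Sprengel Museum and mat=books", start_record=101)
print(url)
```

Pasting the printed URL into a browser is a quick way to look at the raw XML response before running any of the cells below.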
This is the same search that the DNB portal (portal.dnb.de) runs for "Sprengel Museum" restricted to books.
All DNB bibliographic metadata is released under CC0 1.0 Universal — it may be freely reused without attribution.
import requests
import time
import xml.etree.ElementTree as ET
from pathlib import Path
# Output directory for raw XML files
OUTPUT_DIR = Path("../sprengel_raw")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

SRU_BASE = "https://services.dnb.de/sru/dnb"
QUERY = "Sprengel Museum and mat=books"  # simple keyword search, books only
MAX_RECORDS = 100

print(f"Output directory : {OUTPUT_DIR.resolve()}")
print(f"Query : {QUERY}")
Output directory : C:\git\linked-open-exhibition\catalogues\sprengel_raw
Query : Sprengel Museum and mat=books
Step 1 — Count total records
Before downloading anything, ask the DNB how many records the query returns. This is the same number that the DNB portal search shows for this query.
params = {
    "operation": "searchRetrieve",
    "version": "1.1",
    "query": QUERY,
    "recordSchema": "MARC21-xml",
    "maximumRecords": 1,  # we only need the count for now
    "startRecord": 1,
}

resp = requests.get(SRU_BASE, params=params, timeout=30)
resp.raise_for_status()

# numberOfRecords is a direct child of the SRU response root element
root = ET.fromstring(resp.text)
ns_srw = {"srw": "http://www.loc.gov/zing/srw/"}
total = int(root.findtext("srw:numberOfRecords", default="0", namespaces=ns_srw))

# Ceiling division: number of pages needed at MAX_RECORDS per page
pages = (total + MAX_RECORDS - 1) // MAX_RECORDS

print(f"Total records matching query : {total}")
print(f"Pages to retrieve (100/page) : {pages}")
Total records matching query : 582
Pages to retrieve (100/page) : 6
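The page count above comes from a ceiling division, and each page's startRecord follows from the page index. The sketch below works through that arithmetic for the reported total of 582. The function name `page_starts` is made up for illustration and does not appear in the notebook.

```python
def page_starts(total, page_size=100):
    """Return the 1-based startRecord value for every page needed to cover `total` records."""
    # Ceiling division without floats: (582 + 99) // 100 == 6
    pages = (total + page_size - 1) // page_size
    # SRU record positions are 1-based, so page 0 starts at 1, page 1 at 101, ...
    return [page * page_size + 1 for page in range(pages)]


starts = page_starts(582)
print(starts)  # [1, 101, 201, 301, 401, 501]
```

With 582 records, the last page (startRecord=501) holds only 82 records. A total that is an exact multiple of the page size does not produce an extra empty page.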
Step 2 — Retrieve all pages and save as XML
Each page of 100 records is saved as a separate file: sprengel_raw/page_001.xml, page_002.xml, etc.
A 1-second pause between requests avoids overloading the DNB server.
for page in range(pages):
    start = page * MAX_RECORDS + 1  # SRU record positions are 1-based
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": QUERY,
        "recordSchema": "MARC21-xml",
        "maximumRecords": MAX_RECORDS,
        "startRecord": start,
    }
    resp = requests.get(SRU_BASE, params=params, timeout=60)
    resp.raise_for_status()

    # Save the raw response unchanged, one file per page
    out_file = OUTPUT_DIR / f"page_{page + 1:03d}.xml"
    out_file.write_text(resp.text, encoding="utf-8")
    print(f"Saved page {page + 1}/{pages} → {out_file.name} (startRecord={start})")

    if page < pages - 1:
        time.sleep(1)  # be polite to the DNB server

print("\nAll pages retrieved.")
Saved page 1/6 → page_001.xml (startRecord=1)
Saved page 2/6 → page_002.xml (startRecord=101)
Saved page 3/6 → page_003.xml (startRecord=201)
Saved page 4/6 → page_004.xml (startRecord=301)
Saved page 5/6 → page_005.xml (startRecord=401)
Saved page 6/6 → page_006.xml (startRecord=501)
All pages retrieved.
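One way to check that the download is complete is to count the MARC records in each saved page and compare the sum with numberOfRecords. The sketch below shows the counting step on a tiny hand-written response. `SAMPLE` and `count_records` are hypothetical names, and the sample XML is invented for illustration; it is not real DNB output.

```python
import xml.etree.ElementTree as ET

NS = {
    "srw": "http://www.loc.gov/zing/srw/",
    "marc": "http://www.loc.gov/MARC21/slim",
}

# Invented, minimal SRU response with two empty MARC records
SAMPLE = """<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <numberOfRecords>2</numberOfRecords>
  <records>
    <record><recordData><record xmlns="http://www.loc.gov/MARC21/slim"/></recordData></record>
    <record><recordData><record xmlns="http://www.loc.gov/MARC21/slim"/></recordData></record>
  </records>
</searchRetrieveResponse>"""


def count_records(xml_text):
    """Return (declared_total, records_in_this_page) for one SRU response."""
    root = ET.fromstring(xml_text)
    declared = int(root.findtext("srw:numberOfRecords", default="0", namespaces=NS))
    found = len(root.findall(".//srw:recordData/marc:record", NS))
    return declared, found


declared, found = count_records(SAMPLE)
print(f"declared={declared}, found={found}")
```

Applied to the saved files, the same function could be called on `path.read_text(encoding="utf-8")` for each `page_*.xml` in OUTPUT_DIR, with the per-page counts summed and compared against the declared total.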
Step 3 — Inspect the first record
Print the first MARC21-XML record from page 1 to verify the data looks correct.
first_page = OUTPUT_DIR / "page_001.xml"
root = ET.parse(first_page).getroot()

# Namespace URIs, used in ElementTree's {uri}tag notation below
ns_srw = "http://www.loc.gov/zing/srw/"
ns_marc = "http://www.loc.gov/MARC21/slim"

records = root.findall(f".//{{{ns_srw}}}recordData/{{{ns_marc}}}record")
print(f"Records in first page: {len(records)}")

# Print a summary of the first record
rec = records[0]
idn = rec.findtext(f"{{{ns_marc}}}controlfield[@tag='001']") or "—"
title = rec.findtext(f".//{{{ns_marc}}}datafield[@tag='245']/{{{ns_marc}}}subfield[@code='a']") or "—"
# Publication year: prefer field 264$c, fall back to the older 260$c
year = (rec.findtext(f".//{{{ns_marc}}}datafield[@tag='264']/{{{ns_marc}}}subfield[@code='c']") or
        rec.findtext(f".//{{{ns_marc}}}datafield[@tag='260']/{{{ns_marc}}}subfield[@code='c']") or "—")
print(f"\nFirst record:\n  IDN  : {idn}\n  Title: {title}\n  Year : {year}")
Records in first page: 100
First record:
IDN : 1375818457
Title: Niki de Saint Phalle - Die Grotte
Year : 2026
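The same field lookups can be wrapped in a small function so they are easy to apply to any record, not only the first. This is a sketch under assumptions: `summarize` and `SAMPLE_RECORD` are hypothetical names, the sample record is invented rather than taken from the DNB, and missing fields are returned as None instead of a placeholder string.

```python
import xml.etree.ElementTree as ET

MARC = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented MARC21-XML record that has a 260$c but no 264$c
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">0000000000</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">1999</subfield>
  </datafield>
</record>"""


def summarize(rec):
    """Return IDN, title and year of one MARC record element as a dict."""

    def first(*paths):
        # Return the first non-empty text among the given paths, else None
        for path in paths:
            text = rec.findtext(path, namespaces=MARC)
            if text:
                return text
        return None

    return {
        "idn": first("marc:controlfield[@tag='001']"),
        "title": first("marc:datafield[@tag='245']/marc:subfield[@code='a']"),
        # Prefer 264$c, fall back to 260$c, as in the cell above
        "year": first(
            "marc:datafield[@tag='264']/marc:subfield[@code='c']",
            "marc:datafield[@tag='260']/marc:subfield[@code='c']",
        ),
    }


summary = summarize(ET.fromstring(SAMPLE_RECORD))
print(summary)
```

Passing a prefix-to-URI mapping via `namespaces` keeps the paths shorter than the `{uri}tag` notation, at the cost of one extra dictionary.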
Next step: Run 02_dnb_filter_exhibitions.ipynb to parse these XML files, filter for exhibition catalogues, and extract structured fields into a CSV.