Scraping and Indexing the Bundesgerichtshof Entscheidungsdatenbank
In this article we walk through the workflow of scraping, pre-processing, and indexing a custom dataset for use in search applications like Mosaic.
As an example, we use the Entscheidungsdatenbank (decision database) of the German Bundesgerichtshof.
Scraping
The site features a paginated table, with each page showing 10 entries. We observe that the last number in the URL changes predictably when we switch pages:
Page 2: https://www.bundesgerichtshof.de/SiteGlobals/Forms/Suche/EntscheidungssucheBGH_Formular.html?gtp=19723396_list%253D2
Page 5: https://www.bundesgerichtshof.de/SiteGlobals/Forms/Suche/EntscheidungssucheBGH_Formular.html?gtp=19723396_list%253D5
By inspecting the underlying HTML code we notice the simple structure of the table:
<tr>
  <td>1. Strafsenat</td>
  <td>17.09.2025</td>
  <td>05.02.2026</td>
  <td>1 StR 324/25</td>
  <td>
    <p class="more">
      <a href="/SharedDocs/Entscheidungen/DE/Strafsenate/1_StS/2025/1_StR_324-25.pdf?__blob=publicationFile&v=1" target="_blank" title="">1 StR 324/25</a>
    </p>
  </td>
</tr>
We easily find the Senat, some dates, the Aktenzeichen, and finally the link to the PDF document we want. Therefore, we can write a simple script that iterates over all pages and collects this information.
import re
import requests

baseurl = 'https://www.bundesgerichtshof.de/SiteGlobals/Forms/Suche/EntscheidungssucheBGH_Formular.html?gtp=19723396_list='
start_index = 1
end_index = 8230

pattern = r'<tr>\s*<td>\s*(?P<senat>.*?)\s*</td>\s*<td>\s*(?P<datum>\d{2}\.\d{2}\.\d{4})\s*</td>\s*<td>.*?</td>\s*<td>\s*(?P<aktenzeichen>.*?)\s*</td>\s*<td>.*?href="(?P<pdf_link>[^"]+?\.pdf[^"]*?)".*?</td>\s*</tr>'

results = []
for index in range(start_index, end_index + 1):
    # 1. Fetch the raw page content
    url = f"{baseurl}{index}"
    response = requests.get(url)

    # 2. Parse the HTML using the regex pattern
    # re.DOTALL is important here to match across newlines in the HTML
    matches = re.finditer(pattern, response.text, re.DOTALL)
    for match in matches:
        data = match.groupdict()
        # 3. Clean the data and construct the absolute URL
        results.append({
            'Senat': data['senat'].strip(),
            'Datum': data['datum'].strip(),
            'Aktenzeichen': data['aktenzeichen'].strip(),
            'url': "https://www.bundesgerichtshof.de" + data['pdf_link']
        })
This script collects a list of links to the individual PDF documents. To make these documents searchable, we download them and extract the plain text.
Downloading
To download the individual documents, we loop over the results list from the previous script, apply the following function to each URL, and store the resulting plain text along with each document's metadata.
import io

import pdfplumber
import requests

def fetch_and_extract_text(pdf_url):
    response = requests.get(pdf_url)
    # Process the PDF directly from RAM using io.BytesIO
    with pdfplumber.open(io.BytesIO(response.content)) as pdf:
        full_text = [page.extract_text() for page in pdf.pages]
    # Join pages and strip whitespace
    return "\n".join(filter(None, full_text)).strip()
The plain text gets extracted by the pdfplumber package, so this step is pretty easy.
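The download loop itself might look like the following sketch; it assumes the results list and the fetch_and_extract_text function from above and stores the extracted text under a plain_text key, which we will reference again during indexing.
# Sketch of the download loop, assuming `results` and
# fetch_and_extract_text() from above; error handling kept minimal.
for entry in results:
    try:
        entry['plain_text'] = fetch_and_extract_text(entry['url'])
    except Exception as e:
        # Skip documents that cannot be downloaded or parsed
        entry['plain_text'] = ''
        print(f"Failed to process {entry['url']}: {e}")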
Pre-processing
We store the plain texts along with metadata in a Python list. To process our data further, we need to write this data to a file, preferably a .parquet file. For all intents and purposes, a .parquet file is a lot like a .csv file: both contain a single table of data. However, .parquet stores the data in a compressed binary format, which makes it suitable for huge datasets and a good fit for tasks like this.
For indexing, we need to add some metadata to our dataset: a unique id for each document, and the language of the plain text. For this dataset, all documents are in German (language code deu).
import pandas as pd

df = pd.DataFrame(results)

# Mosaic requires a unique id for each document
df['id'] = "bgh_" + df.index.astype(str)
# Mosaic indices usually contain the language metadata
df['language'] = 'deu'

df.to_parquet("bgh.parquet", index=False)
This creates the bgh.parquet file containing the plain texts along with their metadata, and we are ready to start indexing.
Indexing
To search through millions of documents without iterating over each word in every document, we create an inverted index. This data structure contains a mapping between every word in our dataset (each word is a key in our map), and the list of documents that contain that word.
Performing a lookup on an inverted index takes roughly constant time, independent of the number of documents in the dataset: we look up the desired key in our inverted index and get back the subset of documents we need to consider.
Within that subset we can use an algorithm like BM25 to rank the most relevant documents.
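To illustrate the idea, here is a toy inverted index in plain Python. This is only a sketch of the concept with made-up example texts; the structures Lucene builds on disk are far more sophisticated.
from collections import defaultdict

# Toy collection of document ids and texts (illustrative only)
docs = {
    1: "revision des angeklagten verworfen",
    2: "revision der staatsanwaltschaft",
    3: "beschluss des senats",
}

# Build the inverted index: word -> set of document ids
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted_index[word].add(doc_id)

# A query only touches the posting lists of its terms,
# not every document in the collection
print(inverted_index["revision"])  # {1, 2}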
Performing a search on indexed data is handled by Mosaic through Apache Lucene.
For indexing, we use the OpenWebSearch.eu fork of the Apache Spark indexer with a single docker run command:
sudo docker run --rm \
  -v "$PWD/tmp":/tmp \
  -v "$PWD/data":/data:Z \
  -v "$PWD/spark-properties.conf":/opt/spark/conf/spark-defaults.conf \
  opencode.it4i.eu:5050/openwebsearcheu-public/open-web-indexer \
  --description "Bundesgerichtshof-Urteildatenbank" \
  --assign-identifiers \
  --input-format parquet --output-format ciff \
  --id-col id --content-col plain_text \
  /data/bgh.parquet /data/index/
For this command to work, we need to set up the following directories and files: tmp/, data/, and spark-properties.conf. The data/ directory contains the .parquet file we generated earlier; Spark will place the final index in a data/index/ directory. This index/ directory must NOT exist before starting the indexing process, otherwise the process will fail at the end.
The spark-properties.conf file sets the memory limits for the indexer. Adjust these based on your available memory.
spark.driver.memory=900g
spark.executor.memory=900g
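The directory layout described above can be prepared with a few lines of Python, shown here as a sketch whose paths mirror the docker command:
from pathlib import Path
import shutil

# Create the directories mounted into the container
Path("tmp").mkdir(exist_ok=True)
Path("data").mkdir(exist_ok=True)

# Place the parquet file from the pre-processing step into data/
shutil.copy("bgh.parquet", "data/bgh.parquet")

# Do NOT create data/index/ yourself; the indexer creates it and
# fails if it already exists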
Once the Spark indexer has completed successfully, the final index consists of the existing data/bgh.parquet file and the generated data/index/index.ciff.gz file.
To use that index in Mosaic, put both of these files into a folder, place this folder in the Mosaic resources directory, and start Mosaic's build process.
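That last step could be scripted roughly as follows; the target path mosaic/resources/bgh is a hypothetical example and needs to be adjusted to your Mosaic installation.
import shutil
from pathlib import Path

# Hypothetical target folder inside the Mosaic resources directory
target = Path("mosaic/resources/bgh")
target.mkdir(parents=True, exist_ok=True)

# Copy the parquet file and the CIFF index next to each other
shutil.copy("data/bgh.parquet", target / "bgh.parquet")
shutil.copy("data/index/index.ciff.gz", target / "index.ciff.gz")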