Scraping and Indexing the Bundesgerichtshof Entscheidungsdatenbank
In this article we walk through the workflow of scraping, pre-processing, and indexing a custom dataset for use in search applications like Mosaic.
As an example, we use the Entscheidungsdatenbank (decision database) of the German Bundesgerichtshof.
Scraping
The site features a paginated table, with each page showing 10 entries. We observe that the last number in the URL changes predictably when we switch pages:
Page 2: https://www.bundesgerichtshof.de/SiteGlobals/Forms/Suche/EntscheidungssucheBGH_Formular.html?gtp=19723396_list%253D2
Page 5: https://www.bundesgerichtshof.de/SiteGlobals/Forms/Suche/EntscheidungssucheBGH_Formular.html?gtp=19723396_list%253D5
By inspecting the underlying HTML code we notice the simple structure of the table:
<tr>
  <td>1. Strafsenat</td>
  <td>17.09.2025</td>
  <td>05.02.2026</td>
  <td>1 StR 324/25</td>
  <td>
    <p class="more">
      <a href="/SharedDocs/Entscheidungen/DE/Strafsenate/1_StS/2025/1_StR_324-25.pdf?__blob=publicationFile&v=1" target="_blank" title="">1 StR 324/25</a>
    </p>
  </td>
</tr>
We easily find the Senat, some dates, the Aktenzeichen, and finally the link to the PDF document we want. Therefore, we can write a simple script that iterates over all pages and collects this information.
import re
import requests

baseurl = 'https://www.bundesgerichtshof.de/SiteGlobals/Forms/Suche/EntscheidungssucheBGH_Formular.html?gtp=19723396_list='
start_index = 1
end_index = 8230

pattern = r'<tr>\s*<td>\s*(?P<senat>.*?)\s*</td>\s*<td>\s*(?P<datum>\d{2}\.\d{2}\.\d{4})\s*</td>\s*<td>.*?</td>\s*<td>\s*(?P<aktenzeichen>.*?)\s*</td>\s*<td>.*?href="(?P<pdf_link>[^"]+?\.pdf[^"]*?)".*?</td>\s*</tr>'

results = []
for index in range(start_index, end_index + 1):
    # 1. Fetch the raw page content
    url = f"{baseurl}{index}"
    response = requests.get(url)

    # 2. Parse the HTML using the regex pattern
    # re.DOTALL is important here to match across newlines in the HTML
    matches = re.finditer(pattern, response.text, re.DOTALL)
    for match in matches:
        data = match.groupdict()
        # 3. Clean the data and construct the absolute URL
        results.append({
            'Senat': data['senat'].strip(),
            'Datum': data['datum'].strip(),
            'Aktenzeichen': data['aktenzeichen'].strip(),
            'url': "https://www.bundesgerichtshof.de" + data['pdf_link']
        })
This script collects a list of links to the individual PDF documents. To make these documents searchable, we download them and extract the plain text.
Downloading
To download the individual documents, we loop over the results list from the previous script, apply the following function to each URL, and store the resulting plain text along with each document's metadata.
import io

import pdfplumber
import requests

def fetch_and_extract_text(pdf_url):
    response = requests.get(pdf_url)
    # Process the PDF directly from RAM using io.BytesIO
    with pdfplumber.open(io.BytesIO(response.content)) as pdf:
        full_text = [page.extract_text() for page in pdf.pages]
    # Join pages and strip whitespace
    return "\n".join(filter(None, full_text)).strip()
The plain text gets extracted by the pdfplumber package, so this step is pretty easy.
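The download loop itself might look like the following sketch; it assumes the results list and the fetch_and_extract_text function from above and stores the extracted text under a plain_text key, which we will reference again during indexing.
# Sketch of the download loop, assuming `results` and
# fetch_and_extract_text() from above; error handling kept minimal.
for entry in results:
    try:
        entry['plain_text'] = fetch_and_extract_text(entry['url'])
    except Exception as e:
        # Skip documents that cannot be downloaded or parsed
        entry['plain_text'] = ''
        print(f"Failed to process {entry['url']}: {e}")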
Pre-processing
We store the plain texts along with metadata in a Python list. To process our data further, we need to write this data to a file, preferably a .parquet file. For all intents and purposes, a .parquet file is a lot like a .csv file: both contain a single table of data. However, .parquet stores the data in a compressed binary format, which makes it suitable for huge datasets and a good fit for tasks like this.
For indexing, we need to add some metadata to our dataset: a unique id for each document, and the language of the plain text. For this dataset, all documents are in German (language code deu).
import pandas as pd

df = pd.DataFrame(results)

# Mosaic requires a unique id for each document
df['id'] = "bgh_" + df.index.astype(str)
# Mosaic indices usually contain the language metadata
df['language'] = 'deu'

df.to_parquet("bgh.parquet", index=False)
This creates the bgh.parquet file containing the plain texts along with their metadata, and we are ready to start indexing.
Indexing
To search through millions of documents without iterating over each word in every document, we create an inverted index. This data structure contains a mapping between every word in our dataset (each word is a key in our map), and the list of documents that contain that word.
Performing a lookup on an inverted index takes roughly constant time, independent of the number of documents in the dataset: we look up the desired key in our inverted index and get back the subset of documents we need to consider.
Within that subset we can use an algorithm like BM25 to rank the most relevant documents.
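To illustrate the idea, here is a toy inverted index in plain Python. This is only a sketch of the concept with made-up example texts; the structures Lucene builds on disk are far more sophisticated.
from collections import defaultdict

# Toy collection of document ids and texts (illustrative only)
docs = {
    1: "revision des angeklagten verworfen",
    2: "revision der staatsanwaltschaft",
    3: "beschluss des senats",
}

# Build the inverted index: word -> set of document ids
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted_index[word].add(doc_id)

# A query only touches the posting lists of its terms,
# not every document in the collection
print(inverted_index["revision"])  # {1, 2}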
Performing a search on indexed data is handled by Mosaic through Apache Lucene.
For indexing, we use the OpenWebSearch.eu fork of the Apache Spark indexer with a single docker run command:
sudo docker run --rm \
  -v "$PWD/tmp":/tmp \
  -v "$PWD/data":/data:Z \
  -v "$PWD/spark-properties.conf":/opt/spark/conf/spark-defaults.conf \
  opencode.it4i.eu:5050/openwebsearcheu-public/open-web-indexer \
  --description "Bundesgerichtshof-Urteildatenbank" \
  --assign-identifiers \
  --input-format parquet --output-format ciff \
  --id-col id --content-col plain_text \
  /data/bgh.parquet /data/index/
For this command to work, we need to set up the following directories and files: tmp/, data/, and spark-properties.conf. The data/ directory contains the .parquet file we generated earlier; Spark will place the final index in a data/index/ directory. This index/ directory must NOT exist before starting the indexing process, otherwise the process will fail at the end.
The spark-properties.conf file sets the memory limits for the indexer. Adjust these based on your available memory.
spark.driver.memory=900g
spark.executor.memory=900g
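The directory layout described above can be prepared with a few lines of Python, shown here as a sketch whose paths mirror the docker command:
from pathlib import Path
import shutil

# Create the directories mounted into the container
Path("tmp").mkdir(exist_ok=True)
Path("data").mkdir(exist_ok=True)

# Place the parquet file from the pre-processing step into data/
shutil.copy("bgh.parquet", "data/bgh.parquet")

# Do NOT create data/index/ yourself; the indexer creates it and
# fails if it already exists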
Once the Spark indexer has completed successfully, the final index consists of the existing data/bgh.parquet file and the generated data/index/index.ciff.gz file.
To use that index in Mosaic, put both of these files into a folder, place this folder in the Mosaic resources directory, and start Mosaic's build process.
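That last step could be scripted roughly as follows; the target path mosaic/resources/bgh is a hypothetical example and needs to be adjusted to your Mosaic installation.
import shutil
from pathlib import Path

# Hypothetical target folder inside the Mosaic resources directory
target = Path("mosaic/resources/bgh")
target.mkdir(parents=True, exist_ok=True)

# Copy the parquet file and the CIFF index next to each other
shutil.copy("data/bgh.parquet", target / "bgh.parquet")
shutil.copy("data/index/index.ciff.gz", target / "index.ciff.gz")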