Retrieval-augmented generation (RAG) promises to bridge complex legal statutes and public understanding, yet hallucination remains a critical barrier in real-world use. Because statutes evolve and provisions frequently cross-reference, maintaining temporal currency and citation awareness is essential, favoring up-to-date sources over static parametric memory. To study these issues, we focus on the under-examined domain of South Korean fire safety regulation—a complex web of fragmented legislation, dense cross-references, and vague decrees. We introduce SearchFireSafety, the first RAG-oriented question-answering (QA) resource for this domain. It includes: (i) 941 real-world, open-ended QA pairs from public inquiries (2023–2025); (ii) a corpus of 4,437 legal documents from 117 statutes with a citation graph; and (iii) synthetic single-hop (Yes/No) and multi-hop (MCQA) benchmarks targeting legal reasoning and uncertainty.
Experiments with four retrieval strategies and five Korean-capable LLMs show that: (1) multilingual dense retrievers excel due to the domain's mix of Korean, English loanwords, and Sino-Korean terms (i.e., Chinese characters); (2) grounding LLMs with SearchFireSafety substantially improves factual accuracy; but (3) multi-hop reasoning still fails to resolve conflicting provisions or recognize informational gaps. Our results affirm that RAG is necessary but not yet sufficient for legal QA, and we offer SearchFireSafety as a rigorous testbed to drive progress in Legal AI.
| Key | Example | Notes |
|---|---|---|
| `question_id` | 0 | Unique identifier. |
| `qna_doc_id` | 1AA‑2304‑0225487-… | Government archive ID. |
| `original_text` | | Full civil‑petition reply text. |
| `law_references` | ["소방시설 … 시행령 별표4", …] | Manually extracted citations. |
| `question_raw` | | Original citizen question. |
| `question` | | Question text cleansed with GPT‑4o. |
| `answer` | | Gold answer written by an NFA official. |
| `matched_doc_id` | [2, 1259, 1292] | IDs of supporting statute chunks. |
| `semantic_ids` | | Machine‑readable statute IDs. |
| `has_matched_docs` | | Whether supporting docs were found. |
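For orientation, here is a minimal sketch of loading and inspecting the QA pairs, assuming the split ships as newline‑delimited JSON with the keys above (the path is illustrative; adjust it to where the file actually lives):

```python
import json

# Load the QA pairs from a JSONL file (one JSON object per line).
with open("data/train_queries.jsonl", encoding="utf-8") as f:
    qa_pairs = [json.loads(line) for line in f]

row = qa_pairs[0]
print(row["question"])        # cleansed question text
print(row["answer"])          # gold answer written by an NFA official
print(row["matched_doc_id"])  # supporting statute chunk IDs, e.g. [2, 1259, 1292]
```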
| Key | Example | Explanation |
|---|---|---|
| `doc_id` | 0 | Sequential integer. |
| `semantic_id` | NFPC‑101_1조 | Law‑code + article slug. |
| `collection_name` | 소방시설 설치 및 관리에 관한 법률 | Parent statute. |
| `law_level` | 행정규칙 | Law level (법률 act / 시행령 enforcement decree / 행정규칙 administrative rule, etc.). |
| `law_name` | | Full statute title. |
| `chapter`, `chapter_description`, `chapter_body` | | Text content. |
| `deleted` | false | `true` if the provision has been repealed. |
| `related_chapters` | | Cross‑links to related statutes. |
| `matched_doc_id_merged` | [1291, 1292] | IDs of related documents. |
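The cross‑link fields make it straightforward to expand a document set along the citation graph. A minimal sketch, assuming the corpus is a JSONL file with the keys above (the `expand_with_links` helper here is illustrative, not the repository's implementation):

```python
import json

def load_corpus(path: str) -> dict[int, dict]:
    """Map doc_id -> full document row."""
    with open(path, encoding="utf-8") as f:
        return {row["doc_id"]: row for row in map(json.loads, f)}

def expand_with_links(doc_ids: list[int], corpus: dict[int, dict]) -> list[int]:
    """Add one hop of cited/related documents to an initial retrieval set."""
    expanded = list(doc_ids)
    for doc_id in doc_ids:
        for linked_id in corpus.get(doc_id, {}).get("matched_doc_id_merged", []):
            if linked_id not in expanded:
                expanded.append(linked_id)
    return expanded

corpus = load_corpus("data/law_docs.jsonl")
print(expand_with_links([1290], corpus))
```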
Open the notebook and replace `url_list` with statute (법령) page URLs from https://www.law.go.kr/LSW/main.html.
Run all cells—each URL is downloaded, parsed chapter‑by‑chapter, and the result is printed:
```python
url_list = [
    "https://www.law.go.kr/법령/소방기본법",
    "https://www.law.go.kr/법령/주택법",
]
```

Example output:

```
[OK] https://www.law.go.kr/...소방기본법 → 132 item(s)
```
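Roughly, the per‑URL loop does something like the sketch below; `parse_law_page` is a hypothetical stand‑in for the notebook's actual HTML parsing:

```python
import json

def crawl(url_list: list[str], out_path: str) -> None:
    doc_id = 0
    with open(out_path, "w", encoding="utf-8") as fout:
        for url in url_list:
            # parse_law_page (hypothetical) returns (semantic_id, chapter_body) pairs
            items = parse_law_page(url)
            for semantic_id, chapter_body in items:
                record = {"doc_id": doc_id, "semantic_id": semantic_id,
                          "chapter_body": chapter_body}
                fout.write(json.dumps(record, ensure_ascii=False) + "\n")
                doc_id += 1
            print(f"[OK] {url} → {len(items)} item(s)")
```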
The notebook writes a newline‑delimited JSON (*.jsonl) file containing:
{"doc_id":0, "semantic_id":"소방기본법 제1장 1조", "chapter_body":"..."}Benchmark TF‑IDF, BM25, BGE‑m3 (or any dense retreivers supported by SentenceTransformer in HuggingFace), DPR (or any subset) on a docs / queries pair.
```bash
python retrieval_eval.py \
    --docs data/law_docs.jsonl \
    --queries data/train_queries.jsonl \
    --methods tfidf,bm25,bge,dpr \
    --topk 100 \
    --expand_links \
    --device cuda:0
# --expand_links: RAG‑style link expansion
# --device: cpu / cuda:<id> / mps
```

| Key flags | Purpose |
|---|---|
| `--tfidf_max_features` | Vocabulary size (default 120,000). |
| `--bm25_k1`, `--bm25_b` | BM25 hyper‑parameters. |
| `--bge_model_name` | Any Sentence‑Transformers model (default `BAAI/bge-m3`). |
| `--dpr_context_encoder_path`, `--dpr_question_encoder_path` | Local DPR checkpoints. |
| `--batch_size` | GPU memory vs. speed trade‑off. |
The script prints per‑method Recall@{1,2,3,5,10,20,100} & MRR, then saves ir_metrics.csv.
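For reference, the metrics reduce to the usual IR definitions; a minimal sketch, assuming gold document IDs per query (as in `matched_doc_id`) and a ranked list of retrieved IDs:

```python
def recall_at_k(ranked: list[int], gold: set[int], k: int) -> float:
    """Fraction of gold documents found among the top-k retrieved documents."""
    return len(set(ranked[:k]) & gold) / max(len(gold), 1)

def mrr(ranked: list[int], gold: set[int]) -> float:
    """Reciprocal rank of the first gold document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

print(recall_at_k([2, 7, 1259], {2, 1259, 1292}, k=3))  # 2/3 ≈ 0.67
print(mrr([7, 2, 1259], {2, 1259, 1292}))               # 1/2 = 0.5
```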
Run one or many LLMs with optional retrieval augmentation.
```bash
python inference.py \
    --docs data/doc.jsonl \
    --queries data/dev_queries.jsonl \
    --retriever bge \
    --topk 5 \
    --models LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct \
             openai:gpt-4o \
    --out_dir rag_outputs \
    --max_new_tokens 8192 \
    --expand_with_links \
    --openai_key_file ~/.openai_key
# --retriever: tfidf / bm25 / bge
# --topk 0 disables retrieval (no RAG)
```

Open‑source models are loaded via `AutoModelForCausalLM`; OpenAI models are indicated by the `openai:` prefix, as sketched below.
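A minimal sketch of how that dispatch could look (the `generate` helper is hypothetical; the actual logic lives in `inference.py`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model_name: str, prompt: str, max_new_tokens: int = 512) -> str:
    if model_name.startswith("openai:"):
        # OpenAI-hosted model, e.g. "openai:gpt-4o"
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model=model_name.removeprefix("openai:"),
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    # Open-source checkpoint from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```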
`--oracle` makes the generator see the gold document IDs instead of the retrieved ones (an upper bound).
Each model creates one JSONL file:
```
rag_outputs/
    exaone-3.5-7.8b-instruct_bge_k5.jsonl
    gpt-4o_bge_k5.jsonl
```
Every line copies the original query row and adds:
"model_answer": "..."Useful for LLMs that wrap answers with proprietary tags (e.g. <think> … </think>).
```bash
python data_postprocessing.py rag_outputs/llama-3-70b_bge_k5.jsonl \
    --inplace \
    --strip-think \
    --extra-regex "<noise>.*?</noise>"
# --inplace: overwrite the input file
# --strip-think: drop <think> blocks
```

| Option | Explanation |
|---|---|
| `--strip-prompts` / `--no-strip-prompts` | Remove `assistant :` prefixes etc. |
| `--new-field cleaned_answer` | Keep the raw answer and add a cleaned field instead of overwriting. |
| `--backup` | Save a `.bak` copy before modifying. |
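The `<think>`‑stripping step itself is essentially a regex substitution; a bare‑bones sketch (the script's real cleaning rules may be broader):

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think(path_in: str, path_out: str) -> None:
    """Remove <think> ... </think> blocks from the model_answer field."""
    with open(path_in, encoding="utf-8") as fin, \
         open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            row = json.loads(line)
            row["model_answer"] = THINK_RE.sub("", row["model_answer"]).strip()
            fout.write(json.dumps(row, ensure_ascii=False) + "\n")
```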
Compute ROUGE, BERTScore, LLM Judge, and Win‑Rate (model vs. gold) in one go.
```bash
python eval.py \
    --input rag_outputs/exaone-3.5-7.8b-instruct_bge_k5.jsonl \
    --output results/exaone-3.5-7.8b-instruct_bge_k5_scores.jsonl \
    --metrics bert,rouge,llm,winrate \
    --oracle_docs data/doc.jsonl \
    --openai_api_key $OPENAI_API_KEY
```

| Metric flag | Description |
|---|---|
| `bert` | BERTScore with `beomi/kcbert-base` (faithfulness & similarity). |
| `rouge` | ROUGE‑1/2/L/Lsum (with stemming). |
| `llm` | LLM judge, discrete pass/fail (0 or 1). |
| `winrate` | LLM pairwise A/B comparison (model vs. gold). |
The script appends metric columns to each row, writes a new JSONL, and prints corpus‑level means:
```
Dataset-mean ▶ BERTScore=0.8423 · ROUGE-1 F1=0.6710 · LLM‑Score=0.79 · Win‑Rate=0.62
```
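The corpus‑level means are plain averages over the per‑row metric columns; a small sketch, assuming a column name such as `llm_score` (the exact column names written by `eval.py` may differ):

```python
import json

def dataset_mean(path: str, column: str) -> float:
    """Average a numeric metric column over all rows of a scored JSONL file."""
    with open(path, encoding="utf-8") as f:
        values = [json.loads(line)[column] for line in f]
    return sum(values) / len(values)

print(dataset_mean("results/exaone-3.5-7.8b-instruct_bge_k5_scores.jsonl", "llm_score"))
```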