Commit 8a285de

Create 2025-2-25-AutoGrep-Automated-Generation-and-Filtering-of-Semgrep-Rules-from-Vulnerability-Patches.md
1 parent 068a8f3 commit 8a285de

1 file changed

+157
-0
lines changed

@@ -0,0 +1,157 @@
1+
---
2+
layout: post
3+
title: "AutoGrep: Automated Generation and Filtering of Semgrep Rules from Vulnerability Patches"
4+
---
5+
## Abstract
6+
7+
This article presents [AutoGrep](https://github.com/lambdasec/autogrep), an automated system for generating and filtering high-quality security rules for static analysis tools. Motivated by recent licensing changes in the Semgrep ecosystem, AutoGrep addresses the critical need for maintaining and expanding permissively licensed security rules. By leveraging Large Language Models (LLMs) and a multi-stage filtering pipeline, AutoGrep transforms vulnerability patches into precise, generalizable security rules while eliminating duplicates and overly specific patterns.
8+
9+
## 1. Introduction
10+
11+
### 1.1 Background
12+
13+
Static Application Security Testing (SAST) tools play a crucial role in modern software security. Semgrep, a popular open-source SAST tool, has been widely adopted thanks to its effectiveness and extensive rule set. However, recent changes to Semgrep's licensing model have created challenges for the security community and led to the emergence of alternatives such as OpenGrep.
14+
15+
### 1.2 Motivation
16+
17+
The recent transition of Semgrep's official rules to non-permissive licensing has created a significant gap in the open-source security ecosystem. This change has prompted the creation of OpenGrep, a community fork supported by security vendors, highlighting the need for permissively licensed security rules. Traditional manual rule curation is time-consuming and requires constant maintenance to remain effective.
18+
19+
### 1.3 Contributions
20+
21+
This article makes four contributions. It:
22+
23+
1. Introduces an automated pipeline for generating Semgrep rules from vulnerability patches
24+
2. Presents a novel filtering system that ensures rule quality and generalizability
25+
3. Demonstrates the effectiveness of using LLMs for security rule generation
26+
4. Provides a significant set of permissively licensed security rules across multiple languages
27+
28+
## 2. System Architecture
29+
30+
### 2.1 Overview
31+
32+
AutoGrep consists of two main components:
33+
34+
1. **Rule Generation Pipeline**: Analyzes vulnerability patches and generates corresponding Semgrep rules using LLM-based pattern extraction
35+
2. **Rule Filtering System**: Validates and filters generated rules through multiple quality checks
36+
37+
### 2.2 Rule Generation
38+
39+
The rule generation process involves:
40+
41+
1. Patch Analysis
42+
- Extraction of changed code segments
43+
- Language detection
44+
- Context analysis
45+
46+
2. LLM-Based Rule Creation
47+
- Pattern identification
48+
- Rule structure generation
49+
- Metadata enhancement
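
The patch-analysis step above can be sketched as follows. This is a minimal illustration, not AutoGrep's actual implementation: the names `FileChange`, `detect_language`, and `extract_changes`, the extension map, and the sample patch are all assumptions made for this example. The sketch reads a unified diff, guesses the language from the file extension, and separates removed (pre-fix) lines from added (post-fix) lines.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Hypothetical extension-to-language map; a real pipeline would cover more languages.
EXTENSION_LANGUAGES = {
    ".py": "python", ".js": "javascript", ".java": "java", ".go": "go", ".rb": "ruby",
}


@dataclass
class FileChange:
    path: str
    language: str
    removed: list = field(default_factory=list)  # lines present before the fix
    added: list = field(default_factory=list)    # lines introduced by the fix


def detect_language(path: str) -> str:
    """Guess the language from the file extension ('unknown' if unmapped)."""
    return EXTENSION_LANGUAGES.get(PurePosixPath(path).suffix.lower(), "unknown")


def extract_changes(patch: str) -> list:
    """Split a unified diff into per-file removed and added lines.

    Simplification: any line starting with '--- ' or '+++ ' is treated as a
    file header, so changed lines that happen to start that way are misread.
    """
    changes = []
    current = None
    for line in patch.splitlines():
        if line.startswith("+++ "):
            # Header of the post-fix file, e.g. '+++ b/app/db.py'
            path = line[4:].strip()
            if path.startswith("b/"):
                path = path[2:]
            current = FileChange(path=path, language=detect_language(path))
            changes.append(current)
        elif line.startswith("--- ") or current is None:
            continue  # pre-fix header, or preamble before the first file
        elif line.startswith("+"):
            current.added.append(line[1:])
        elif line.startswith("-"):
            current.removed.append(line[1:])
    return changes


# Invented sample patch: a SQL injection fix in a hypothetical file.
SAMPLE_PATCH = """\
--- a/app/db.py
+++ b/app/db.py
@@ -10,3 +10,3 @@
 def find_user(cursor, name):
-    cursor.execute("SELECT * FROM users WHERE name = '%s'" % name)
+    cursor.execute("SELECT * FROM users WHERE name = %s", (name,))
     return cursor.fetchone()
"""

changes = extract_changes(SAMPLE_PATCH)
for change in changes:
    print(change.path, change.language, len(change.removed), len(change.added))
```

The removed lines are the natural input for pattern identification, since they contain the vulnerable construct, while the added lines show what a fixed version looks like.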
50+
51+
### 2.3 Rule Filtering
52+
53+
The filtering pipeline implements multiple stages:
54+
55+
1. **Duplicate Detection**
56+
- Uses sentence embeddings for semantic similarity
57+
- Identifies and removes redundant rules
58+
59+
2. **Quality Evaluation**
60+
- LLM-based assessment of rule generalizability
61+
- Elimination of project-specific patterns
62+
63+
3. **Validation**
64+
- Testing against original vulnerabilities
65+
- Verification of fix detection
66+
67+
## 3. Dataset and Methodology
68+
69+
### 3.1 Dataset
70+
71+
We used the MoreFixes dataset, a comprehensive collection of CVE fix commits:
72+
73+
- Total Patches: 39,931
74+
- Unique CVEs: 26,617
75+
- Source Repositories: 6,945
76+
77+
### 3.2 Processing Pipeline
78+
79+
```mermaid
80+
graph TD
81+
A[39,931 Initial Patches] --> B[Language Detection & Processing]
82+
B --> C[3,591 Generated Rules]
83+
C --> D[Duplicate Filtering]
84+
D --> E[Quality Assessment]
85+
E --> F[645 Final Rules]
86+
87+
style A fill:#f0f8ff,stroke:#333,stroke-width:2px
88+
style C fill:#f5f5dc,stroke:#333,stroke-width:2px
89+
style F fill:#f0fff0,stroke:#333,stroke-width:2px
90+
```
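
The funnel in the diagram can be modelled as a list of filtering stages applied in order, with the surviving count recorded after each one. The sketch below is a toy illustration under stated assumptions: `run_pipeline` and both stage functions are hypothetical names, exact-match deduplication stands in for embedding similarity, and the "has a metavariable" check is only a crude placeholder for the LLM-based quality assessment.

```python
def run_pipeline(items: list, stages: list) -> tuple:
    """Apply each (name, function) stage in order and record surviving counts."""
    counts = [("input", len(items))]
    for name, stage in stages:
        items = stage(items)
        counts.append((name, len(items)))
    return items, counts


def drop_duplicates(rules: list) -> list:
    """Toy stand-in: exact-match deduplication on the pattern text."""
    seen, unique = set(), []
    for rule in rules:
        if rule["pattern"] not in seen:
            seen.add(rule["pattern"])
            unique.append(rule)
    return unique


def drop_overly_specific(rules: list) -> list:
    """Toy stand-in for quality assessment: require at least one metavariable."""
    return [rule for rule in rules if "$" in rule["pattern"]]


# Invented candidate rules for the example.
candidates = [
    {"id": "eval-1", "pattern": "eval($X)"},
    {"id": "eval-2", "pattern": "eval($X)"},
    {"id": "acme-check", "pattern": 'acme_internal_check("tenant-42")'},
    {"id": "yaml-load", "pattern": "yaml.load($DATA)"},
]

final_rules, counts = run_pipeline(
    candidates,
    [("duplicate_filtering", drop_duplicates),
     ("quality_assessment", drop_overly_specific)],
)
for stage_name, remaining in counts:
    print(f"{stage_name}: {remaining}")
```

Recording the count after every stage is what makes a per-stage breakdown of removed rules possible.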
91+
92+
## 4. Results and Analysis
93+
94+
### 4.1 Generation Statistics
95+
96+
Initial Dataset:
97+
- 39,931 patches processed
98+
- 26,617 unique CVEs
99+
- 6,945 source repositories
100+
101+
Generation Results:
102+
- 3,591 rules generated
103+
- Coverage across 20 programming languages
104+
105+
### 4.2 Filtering Results
106+
107+
Rule Reduction Analysis:
108+
- Initial Rules: 3,591 (100%)
109+
- Duplicates Removed: 386 (10.75%)
110+
- Trivial Rules Removed: 5 (0.14%)
111+
- Overly Specific Rules Removed: 2,555 (71.15%)
112+
- Final Rules: 645 (17.96%)
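
The percentages above follow directly from the counts. A short check, using only the figures reported in this section (the variable and function names are illustrative):

```python
initial_rules = 3591
removed = {"duplicates": 386, "trivial": 5, "overly_specific": 2555}

# Rules that survive every filter.
final_rules = initial_rules - sum(removed.values())


def share(count: int, total: int = initial_rules) -> float:
    """Percentage of the initial rule set, rounded to two decimals."""
    return round(100 * count / total, 2)


for stage, count in removed.items():
    print(f"{stage}: {count} ({share(count)}%)")
print(f"final: {final_rules} ({share(final_rules)}%)")
```

The three removal categories and the final set account for all 3,591 generated rules, so the four percentages sum to 100%.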
113+
114+
### 4.3 Language Distribution
115+
116+
The final rule set spans 20 programming languages, providing broad coverage across different technology stacks. Key languages include:
117+
- Python
118+
- JavaScript
119+
- Java
120+
- Go
121+
- Ruby
122+
123+
## 5. Discussion
124+
125+
### 5.1 Rule Quality Analysis
126+
127+
The high percentage of rules removed due to over-specificity (71.15%) demonstrates the importance of our filtering pipeline in ensuring rule generalizability. The relatively low number of trivial rules (0.14%) suggests that the LLM-based generation process produces substantive patterns.
128+
129+
### 5.2 Effectiveness of Automation
130+
131+
Reducing 39,931 patches to 3,591 candidate rules and then to 645 retained rules (roughly 1.6% of the input patches) shows that automated rule generation is workable at this scale. The steep reduction reflects strict quality control, while the surviving rules still span 20 programming languages.
132+
133+
### 5.3 Comparison with Manual Curation
134+
135+
Compared to traditional manual rule creation, automated generation offers:
136+
- Significantly faster generation time
137+
- Consistent quality through automated validation
138+
- Broader language coverage
139+
- Reduced maintenance overhead
140+
141+
## 6. Conclusion and Future Work
142+
143+
AutoGrep demonstrates the feasibility of automated security rule generation and filtering at scale. The system successfully processes a large dataset of vulnerability patches to produce a focused set of high-quality, permissively licensed security rules.
144+
145+
### Future Work
146+
147+
1. Enhancement of rule generation accuracy through improved LLM prompting
148+
2. Expansion of supported languages and vulnerability types
149+
3. Integration of community feedback mechanisms
150+
4. Development of automated rule update procedures
151+
152+
## References
153+
154+
1. [Semgrep Project](https://github.com/semgrep/semgrep)
155+
2. [OpenGrep Project](https://github.com/opengrep/opengrep)
156+
3. [Patched Codes Semgrep Rules](https://github.com/patched-codes/semgrep-rules)
157+
4. [MoreFixes Dataset](https://zenodo.org/records/13983082)
