---
layout: post
title: "AutoGrep: Automated Generation and Filtering of Semgrep Rules from Vulnerability Patches"
---
## Abstract

This article presents [AutoGrep](https://github.com/lambdasec/autogrep), an automated system for generating and filtering high-quality security rules for static analysis tools. Motivated by recent licensing changes in the Semgrep ecosystem, AutoGrep addresses the critical need for maintaining and expanding permissively licensed security rules. By leveraging Large Language Models (LLMs) and a multi-stage filtering pipeline, AutoGrep transforms vulnerability patches into precise, generalizable security rules while eliminating duplicates and overly specific patterns.

## 1. Introduction

### 1.1 Background

Static Application Security Testing (SAST) tools play a crucial role in modern software security. Semgrep, a popular open-source SAST tool, has gained widespread adoption due to its effectiveness and extensive rule set. However, recent changes in Semgrep's licensing model have created challenges for the security community, leading to the emergence of alternative solutions such as OpenGrep.

### 1.2 Motivation

The recent transition of Semgrep's official rules to a non-permissive license has left a significant gap in the open-source security ecosystem. This change prompted the creation of OpenGrep, a community fork backed by security vendors, and underscored the need for permissively licensed security rules. Traditional manual rule curation is time-consuming and requires constant maintenance to remain effective.

### 1.3 Contributions
We make the following contributions:

1. An automated pipeline for generating Semgrep rules from vulnerability patches
2. A novel filtering system that ensures rule quality and generalizability
3. A demonstration that LLMs can be used effectively for security rule generation
4. A substantial set of permissively licensed security rules spanning multiple languages

## 2. System Architecture

### 2.1 Overview

AutoGrep consists of two main components:

1. **Rule Generation Pipeline**: Analyzes vulnerability patches and generates corresponding Semgrep rules using LLM-based pattern extraction
2. **Rule Filtering System**: Validates and filters generated rules through multiple quality checks

### 2.2 Rule Generation

The rule generation process involves:

1. Patch Analysis
   - Extraction of changed code segments
   - Language detection
   - Context analysis

2. LLM-Based Rule Creation
   - Pattern identification
   - Rule structure generation
   - Metadata enhancement
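As a rough illustration of the patch-analysis step, the sketch below pulls the removed and added lines out of a unified diff and assembles a prompt for rule creation. The `EXT_TO_LANG` map, prompt wording, and function names are illustrative assumptions, not AutoGrep's actual code or prompts; the model call itself is deliberately left out.

```python
# Sketch of patch analysis: split a unified diff into removed (vulnerable)
# and added (fixed) lines, detect the language from the file extension, and
# build an LLM prompt. All names here are hypothetical, not AutoGrep's API.

EXT_TO_LANG = {".py": "python", ".js": "javascript", ".java": "java", ".go": "go"}

def extract_hunks(patch_text):
    """Return (removed, added) line lists from a unified diff body."""
    removed, added = [], []
    for line in patch_text.splitlines():
        if line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
        elif line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
    return removed, added

def build_prompt(patch_text, filename):
    """Assemble a rule-generation prompt from a patch (illustrative wording)."""
    lang = EXT_TO_LANG.get(filename[filename.rfind("."):], "generic")
    removed, added = extract_hunks(patch_text)
    return (
        f"Write a Semgrep rule ({lang}) that matches the vulnerable pattern "
        f"removed by this patch but not the fixed code.\n"
        "Vulnerable lines:\n" + "\n".join(removed) + "\n"
        "Fixed lines:\n" + "\n".join(added)
    )
```

In the real pipeline, the prompt would also carry surrounding context from the file, since a bare hunk is often too little for the model to produce a generalizable pattern.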

### 2.3 Rule Filtering

The filtering pipeline implements multiple stages:

1. **Duplicate Detection**
   - Uses sentence embeddings for semantic similarity
   - Identifies and removes redundant rules

2. **Quality Evaluation**
   - LLM-based assessment of rule generalizability
   - Elimination of project-specific patterns

3. **Validation**
   - Testing against original vulnerabilities
   - Verification of fix detection
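The duplicate-detection stage can be sketched as a greedy keep-first pass over similarity scores. AutoGrep uses sentence embeddings; to keep this example self-contained, token-count vectors stand in for real embeddings, and the `0.9` threshold is an assumed value, not AutoGrep's setting.

```python
# Sketch of duplicate filtering: embed each rule, keep it only if it is not
# too similar to an already-kept rule. Token-count vectors are a toy
# stand-in for the sentence embeddings the pipeline actually uses.
import math
from collections import Counter

def embed(text):
    """Toy embedding: sparse token-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe(rules, threshold=0.9):
    """Keep the first rule of each near-duplicate cluster."""
    kept, vecs = [], []
    for rule in rules:
        v = embed(rule)
        if all(cosine(v, w) < threshold for w in vecs):
            kept.append(rule)
            vecs.append(v)
    return kept
```

Swapping `embed` for a real sentence-embedding model changes only the vector source; the greedy comparison loop stays the same.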

## 3. Dataset and Methodology

### 3.1 Dataset

We utilized the MoreFixes dataset, a comprehensive collection of CVE fix commits:

- Total patches: 39,931
- Unique CVEs: 26,617
- Source repositories: 6,945

### 3.2 Processing Pipeline

```mermaid
graph TD
    A[39,931 Initial Patches] --> B[Language Detection & Processing]
    B --> C[3,591 Generated Rules]
    C --> D[Duplicate Filtering]
    D --> E[Quality Assessment]
    E --> F[645 Final Rules]

    style A fill:#f0f8ff,stroke:#333,stroke-width:2px
    style C fill:#f5f5dc,stroke:#333,stroke-width:2px
    style F fill:#f0fff0,stroke:#333,stroke-width:2px
```
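The validation stage at the end of this funnel reduces to a simple predicate: a generated rule survives only if a scan of the pre-patch file produces findings and a scan of the post-patch file produces none. The dicts below imitate the shape of `semgrep --json` output, which reports findings under a top-level `results` list; the exact command line shown in the comment is an assumption, not AutoGrep's invocation.

```python
# A rule is kept only if it flags the vulnerable (pre-patch) file and is
# silent on the fixed (post-patch) file. The scan dicts mimic the shape of
# `semgrep --json` output, which lists findings under "results".

def rule_passes(vuln_scan: dict, fixed_scan: dict) -> bool:
    return bool(vuln_scan.get("results")) and not fixed_scan.get("results")

# In the real pipeline the dicts would come from invocations along the
# lines of (assumed, not AutoGrep's exact command):
#   semgrep scan --config rule.yaml vulnerable.py --json
```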

## 4. Results and Analysis

### 4.1 Generation Statistics

Initial dataset:

- 39,931 patches processed
- 26,617 unique CVEs
- 6,945 source repositories

Generation results:

- 3,591 rules generated
- Coverage across 20 programming languages

### 4.2 Filtering Results

Rule reduction analysis:

- Initial rules: 3,591 (100%)
- Duplicates removed: 386 (10.75%)
- Trivial rules removed: 5 (0.14%)
- Overly specific rules removed: 2,555 (71.15%)
- Final rules: 645 (17.96%)
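The funnel above is internally consistent, which can be checked directly:

```python
# Sanity-check the filtering funnel reported above.
initial = 3591
duplicates, trivial, too_specific = 386, 5, 2555

final = initial - duplicates - trivial - too_specific
assert final == 645

# The percentages match the reported figures.
assert round(duplicates / initial * 100, 2) == 10.75
assert round(too_specific / initial * 100, 2) == 71.15
assert round(final / initial * 100, 2) == 17.96
```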

### 4.3 Language Distribution

The final rule set spans 20 programming languages, providing broad coverage across different technology stacks. Key languages include:

- Python
- JavaScript
- Java
- Go
- Ruby

## 5. Discussion

### 5.1 Rule Quality Analysis

The high percentage of rules removed for over-specificity (71.15%) demonstrates the importance of the filtering pipeline in ensuring rule generalizability. The very low number of trivial rules (0.14%) suggests that the LLM-based generation process produces substantive patterns.

### 5.2 Effectiveness of Automation

The successful distillation of 39,931 patches into 645 high-quality rules demonstrates the effectiveness of automated rule generation. The reduction ratio indicates strong quality control while retaining significant coverage.

### 5.3 Comparison with Manual Curation

Compared to traditional manual rule creation, the automated approach offers:

- Significantly faster generation
- Consistent quality through automated validation
- Broader language coverage
- Reduced maintenance overhead

## 6. Conclusion and Future Work

AutoGrep demonstrates the feasibility of automated security rule generation and filtering at scale. The system successfully processes a large dataset of vulnerability patches to produce a focused set of high-quality, permissively licensed security rules.

### Future Work

1. Enhancement of rule generation accuracy through improved LLM prompting
2. Expansion of supported languages and vulnerability types
3. Integration of community feedback mechanisms
4. Development of automated rule update procedures

## References

1. [Semgrep Project](https://github.com/semgrep/semgrep)
2. [OpenGrep Project](https://github.com/opengrep/opengrep)
3. [Patched Codes Semgrep Rules](https://github.com/patched-codes/semgrep-rules)
4. [MoreFixes Dataset](https://zenodo.org/records/13983082)