fix(server): support dedicated backend data structures and serialization logic for vector index. #2913
Open: hahahahbenny wants to merge 21 commits into apache:vector-index from hahahahbenny:vector-index
…che#2893)
* docs(pd): update test commands and improve documentation clarity
* Update README.md
---------
Co-authored-by: imbajin <jin@apache.org>

* update(store): fix some problem and clean up code
  - chore(store): clean some comments
  - chore(store): using Slf4j instead of System.out to print log
  - update(store): update more reasonable timeout setting
  - update(store): add close method for CopyOnWriteCache to avoid potential memory leak
  - update(store): delete duplicated beginTx() statement
  - update(store): extract parameter for compaction thread pool(move to configuration file in the future)
  - update(store): add default logic in AggregationFunctions
  - update(store): fix potential concurrency problem in QueryExecutor
* Update hugegraph-store/hg-store-common/src/main/java/org/apache/hugegraph/store/query/func/AggregationFunctions.java
---------
Co-authored-by: Peng Junzhi <78788603+Pengzna@users.noreply.github.com>

* fix(store): fix duplicated definition log root

…p ci & remove duplicate module (apache#2910)
* add missing license and remove binary license.txt
* remove dist in commons
* fix tinkerpop test open graph panic and other bugs
* empty commit to trigger ci

…fields to the index label.

# This is the 1st commit message:
add Licensed to files
# This is the commit message apache#2:
feat(server): support vector index in graphdb (apache#2856)
* feat(server): Add the vector index type and the detection of related fields to the index label.
* fix code format
* add annsearch API
* add doc to explain the plan delete redundency in vertexapi
Purpose of the PR
To support vector indexing in HugeGraph, dedicated backend data structures and serialization logic need to be added.
Main Changes
In-Memory Data Structure
- `HugeVectorIndexMap`: `sequence` (offset) and `IndexVectorState` (dirty flag, metadata).

Type-System Extensions
- `IndexType`: `VECTOR`
- `HugeType`: `VECTOR_INDEX_MAP`
- `HugeType`: `VECTOR_SEQUENCE`

2.1 Column family design
VECTOR_INDEX_MAP
- `elemId`: ID of the vertex being indexed.
- `sequence`: `long`.
- `vectorStateCode = IndexVectorState.code()`: state of the vector index (e.g., BUILDING / FLUSHED / DELETING).
- Layout: `[1B type][4B indexId][4B vectorId][4B elemId][(optional) VLong expiredTime][8B sequence][1B vectorStateCode]`

VECTOR_SEQUENCE
- `vectorId` encoded as `int`.
- Layout: `[1B dirty_prefix][4B indexId][8B sequence][4B vectorId]`

2.2 vector index state machine
stateDiagram-v2
    [*] --> BUILDING: user writes vector<br/>GraphTransaction.commit()
    BUILDING --> FLUSHED: VectorIndexManager consumes<br/>and flushes to snapshot
    FLUSHED --> BUILDING: user modifies vector<br/>update operation
    FLUSHED --> DELETING: user deletes vector<br/>delete operation
    BUILDING --> DELETING: vector under construction deleted<br/>delete operation
    DELETING --> BUILDING: deleted vector re-written<br/>write operation
    DELETING --> [*]: VectorIndexManager consumes deletion<br/>physically purged from RocksDB

On-Disk Binary Layout
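The two byte layouts listed under the column family design can be illustrated with a minimal sketch. This is not the PR's serializer. The class name, method names, and the `TYPE_VECTOR_INDEX_MAP` and `DIRTY_PREFIX` values are hypothetical. Big-endian byte order (the `ByteBuffer` default) is an assumption, and the optional `VLong expiredTime` field is omitted.

```java
import java.nio.ByteBuffer;

class VectorLayoutSketch {

    // Hypothetical placeholder codes: the real values come from the PR's
    // HugeType and dirty-prefix definitions, which are not shown here.
    static final byte TYPE_VECTOR_INDEX_MAP = 0x01;
    static final byte DIRTY_PREFIX = 0x02;

    // [1B type][4B indexId][4B vectorId][4B elemId][8B sequence][1B vectorStateCode]
    // The optional VLong expiredTime field is left out of this sketch.
    static byte[] encodeIndexMap(int indexId, int vectorId, int elemId,
                                 long sequence, byte vectorStateCode) {
        // ByteBuffer writes big-endian by default (assumed byte order).
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 4 + 4 + 8 + 1);
        buf.put(TYPE_VECTOR_INDEX_MAP)
           .putInt(indexId)
           .putInt(vectorId)
           .putInt(elemId)
           .putLong(sequence)
           .put(vectorStateCode);
        return buf.array();
    }

    // [1B dirty_prefix][4B indexId][8B sequence][4B vectorId]
    static byte[] encodeSequence(int indexId, long sequence, int vectorId) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 8 + 4);
        buf.put(DIRTY_PREFIX)
           .putInt(indexId)
           .putLong(sequence)
           .putInt(vectorId);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] map = encodeIndexMap(7, 42, 1001, 5L, (byte) 0);
        byte[] seq = encodeSequence(7, 5L, 42);
        System.out.println("index-map record: " + map.length + " bytes");
        System.out.println("sequence record: " + seq.length + " bytes");
    }
}
```

With fixed-width big-endian fields, non-negative `sequence` values compare byte-wise in the same order as they do numerically, which suits a column family holding monotonically increasing sequence IDs. Whether the PR relies on that ordering is not stated in the description.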
Serializer entry points:
Target column families:
- `cf_vector_index_map` – stores the state of each vector index.
- `cf_vector_seq_index` – stores monotonically increasing sequence IDs.

Test Coverage
`VectorIndexSerializerTest` locks down the byte-level format for:
Guarantees backward compatibility if the format ever changes.
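The transitions in the vector index state machine (section 2.2) can be written down as a small lookup table. The sketch below is illustrative only: `VectorStateSketch`, its nested `State` enum, and both methods are hypothetical names, not the PR's `IndexVectorState` API. The entry edge (`[*]` to BUILDING) and exit edge (DELETING to `[*]`) are noted in comments rather than modeled.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

class VectorStateSketch {

    // The three states named in the PR description.
    enum State { BUILDING, FLUSHED, DELETING }

    // Allowed transitions, copied from the state diagram.
    static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        // A new vector enters as BUILDING on GraphTransaction.commit().
        ALLOWED.put(State.BUILDING, EnumSet.of(State.FLUSHED, State.DELETING));
        ALLOWED.put(State.FLUSHED, EnumSet.of(State.BUILDING, State.DELETING));
        // DELETING either returns to BUILDING on a re-write or leaves the
        // machine when the deletion is consumed and the entry is purged.
        ALLOWED.put(State.DELETING, EnumSet.of(State.BUILDING));
    }

    static boolean canTransition(State from, State to) {
        return ALLOWED.get(from).contains(to);
    }

    // Returns the new state, or throws if the diagram has no such edge.
    static State transition(State from, State to) {
        if (!canTransition(from, to)) {
            throw new IllegalStateException(from + " -> " + to + " is not allowed");
        }
        return to;
    }

    public static void main(String[] args) {
        State s = State.BUILDING;
        s = transition(s, State.FLUSHED);   // flushed to snapshot
        s = transition(s, State.DELETING);  // user deletes the vector
        System.out.println("final state: " + s);
    }
}
```

Keeping the edges in one table makes an illegal move such as DELETING to FLUSHED fail fast instead of silently persisting an inconsistent state code.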
Verifying these changes
Does this PR potentially affect the following parts?
Documentation Status
- Doc - TODO
- Doc - Done
- Doc - No Need