Add auth support, split-by-type option, and pagination for large graphs #26
base: main
Conversation
- Add `--username` and `--password` options for FalkorDB authentication
- Add `--split-by-type`/`--no-split-by-type` flags to control CSV file organization
- Implement pagination to export all nodes and edges (fixes 10k limit)
- Add progress feedback during export
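For context, a typical invocation exercising these options might look like the following (graph name and credentials are placeholders, not taken from the PR):

```bash
# Authenticated export with one CSV per node label / edge type (the default)
python main.py social --username myuser --password mypass

# Authenticated export into a single nodes.csv and edges.csv
python main.py social --username myuser --password mypass --no-split-by-type
```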
Walkthrough

Export tool now accepts optional …

Changes
Sequence Diagram(s)

sequenceDiagram
participant CLI
participant Export as export_graph
participant DB as FalkorDB
participant CSV as CSV Writer
rect rgb(245,250,255)
note over CLI,Export: CLI invokes export with auth & options
CLI->>Export: graph_name, host, port, username?, password?, split_by_type
end
rect rgb(225,245,225)
note over Export,DB: Batched Node Retrieval (last_id)
loop while batch returned
Export->>DB: Query nodes WHERE id > last_id LIMIT batch_size
DB-->>Export: node batch
Export->>Export: Accumulate nodes by label / log progress
end
end
rect rgb(255,245,230)
note over Export,CSV: Conditional Node Output
alt split_by_type = True
Export->>CSV: Write per-label node CSVs
else
Export->>CSV: Write combined `nodes.csv` (with label column)
end
end
rect rgb(225,245,225)
note over Export,DB: Batched Edge Retrieval (last_id)
loop while batch returned
Export->>DB: Query edges WHERE id > last_id LIMIT batch_size
DB-->>Export: edge batch
Export->>Export: Accumulate edges by type / log progress
end
end
rect rgb(255,245,230)
note over Export,CSV: Conditional Edge Output
alt split_by_type = True
Export->>CSV: Write per-type edge CSVs
else
Export->>CSV: Write combined `edges.csv` (with type column)
end
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45–60 minutes
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Codecov Report

❌ Patch coverage is …
Additional details and impacted files

```
@@ Coverage Diff @@
##           main      #26      +/-   ##
==========================================
- Coverage   86.53%   85.43%   -1.11%
==========================================
  Files           1        1
  Lines          52      103      +51
==========================================
+ Hits           45       88      +43
- Misses          7       15       +8
```
Flags with carried forward coverage won't be shown.
- Document authentication options (--username, --password)
- Explain --split-by-type and --no-split-by-type modes
- Add comprehensive usage examples
- Document pagination support for large graphs
- Update output section with file naming patterns
Actionable comments posted: 6
🧹 Nitpick comments (3)
main.py (3)
59-75: Memory usage still O(total). Stream to CSV per batch.

For very large graphs, accumulate-then-write is memory heavy. Append per batch with the header written once.
```python
# per label/type file: write each batch
df.to_csv(filename, mode="a", index=False, header=not os.path.exists(filename))
```

Optionally precompute the column union to stabilize headers.
Also applies to: 115-130
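As a rough sketch of that append-per-batch approach (helper name and call site are illustrative, not from main.py):

```python
import os
import pandas as pd

def flush_batch(rows, filename):
    """Append one batch of dict rows to a CSV, writing the header only on the first write."""
    if not rows:
        return
    pd.DataFrame(rows).to_csv(
        filename, mode="a", index=False, header=not os.path.exists(filename)
    )

# Inside the pagination loop, per label (or per edge type):
#   flush_batch(batch_nodes_for_label, f"nodes_{label}.csv")
# then clear the per-label buffer so memory stays bounded.
```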
148-158: Simplify boolean flag with BooleanOptionalAction.

You don’t need two flags; argparse can generate --split-by-type/--no-split-by-type automatically.
```diff
-    parser.add_argument(
-        "--split-by-type",
-        action="store_true",
-        default=True,
-        help="Create separate CSV files per node label and edge type (default: True)"
-    )
-    parser.add_argument(
-        "--no-split-by-type",
-        dest="split_by_type",
-        action="store_false",
-        help="Create single CSV files for all nodes and edges"
-    )
+    parser.add_argument(
+        "--split-by-type",
+        dest="split_by_type",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help="Create separate CSV files per node label and edge type (use --no-split-by-type to combine)"
+    )
```

Requires Python 3.9+.
59-64: Optional: mitigate CSV injection when opening in Excel.

If properties can start with =, +, -, @, consider prefixing dangerous values before writing.
```python
def sanitize_cell(v):
    if isinstance(v, str) and v and v[0] in "=+-@":
        return "'" + v
    return v

# apply before DataFrame(...)
rows = [{k: sanitize_cell(v) for k, v in row.items()} for row in rows]
```

Also applies to: 115-121
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
main.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test (3.12)
🔇 Additional comments (1)
main.py (1)
7-16: No issues found — FalkorDB auth kwargs are correct.

The FalkorDB Python client accepts host, port, username, and password as constructor kwargs, and the code correctly passes these parameters. No changes needed.
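For illustration, that constructor usage looks roughly like this (values are placeholders):

```python
from falkordb import FalkorDB

# Connection via constructor kwargs, as described in the comment above
db = FalkorDB(host="localhost", port=6379, username="myuser", password="mypass")
g = db.select_graph("social")
```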
```python
    while True:
        nodes_result = g.ro_query(
            f"MATCH (n) RETURN ID(n), labels(n), properties(n) SKIP {skip} LIMIT {batch_size}"
        )
```
🧩 Analysis chain
Pagination via SKIP/LIMIT scales poorly at high offsets; consider ID-based pagination and parameters.
Large SKIPs become O(N). Prefer WHERE ID(x) > last_id ORDER BY ID(x) LIMIT $limit. Also parameterize skip/limit to avoid string interpolation.
Example pattern:
```python
# Nodes
q = "MATCH (n) WHERE ID(n) > $last_id RETURN ID(n), labels(n), properties(n) ORDER BY ID(n) LIMIT $limit"
params = {"last_id": last_id, "limit": batch_size}
# Repeat, updating last_id from the final record
```

Also applies to: 83-86
🌐 Web query:
Does FalkorDB support parameterized queries in Python (ro_query with params)? Recommended pattern for paginating large result sets (ID-based) in FalkorDB/RedisGraph?
💡 Result:
Short answers
- Yes — FalkorDB supports parameterized queries and you can pass params from Python. The GRAPH.QUERY docs show the parameterized syntax and a Python example; the read-only command GRAPH.RO_QUERY exists as well, and the client follows the same query/params pattern. [1][2]
Recommended pattern for paginating large result sets (ID / cursor based)
- Use cursor (ID or indexed timestamp) pagination rather than OFFSET/LIMIT. Keep an indexed, monotonic property on nodes (e.g. created_at or a numeric id) and paginate with a WHERE clause plus ORDER BY and LIMIT:
- Example Cypher (use your node property name instead of my_id):
- MATCH (n:MyLabel) WHERE n.my_id > $cursor ORDER BY n.my_id ASC LIMIT $limit RETURN n
- Pass cursor and limit as parameters (so queries are cached and safe). On the first page use cursor = 0 (or null-handling) and each page set cursor = last_row.my_id from the previous page.
- Benefits: uses an index so the database jumps to the next range (fast, stable under inserts), avoids scanning/skipping rows (OFFSET) and reduces memory/IO pressure.
- Caveats:
- Ensure the paginating property is indexed and monotonic (or use created_at if strictly increasing).
- If you need strict snapshot semantics across many pages, you’ll need additional application-level handling because cursor pagination is eventually consistent with concurrent writes.
- Use FalkorDB configuration limits (RESULTSET_SIZE / timeouts) to avoid runaway queries on unexpectedly large requests. [3][4][5]
References
- GRAPH.QUERY (parameterized query examples, Python). [1] https://docs.falkordb.com/commands/graph.query.html
- GRAPH.RO_QUERY (read-only query command). [2] https://docs.falkordb.com/commands/graph.ro-query.html
- FalkorDB configuration (RESULTSET_SIZE, timeouts). [3] https://docs.falkordb.com/getting-started/configuration.html
- Pagination best-practices / avoid OFFSET (cursor/id pagination recommended). [4] https://configzen.com/blog/master-pagination-mysql-pitfalls
- Cursor-based pagination overview (why it’s preferred for large sets). [5] https://www.carens.dev/a-friendly-guide-to-pagination
If you want, I can show the exact Python call form for graph.ro_query with params using the FalkorDB Python client.
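As a rough illustration of that call form, a cursor-style loop with parameters could look like the following; the `uid` property, the starting cursor value, and the `.result_set` attribute are assumptions for the sketch, not details taken from this PR:

```python
from falkordb import FalkorDB

db = FalkorDB(host="localhost", port=6379)
g = db.select_graph("social")

batch_size = 10000
cursor = -1  # assumes uid is a non-negative, monotonically increasing, indexed property
while True:
    result = g.ro_query(
        "MATCH (n) WHERE n.uid > $cursor "
        "RETURN n.uid, labels(n), properties(n) "
        "ORDER BY n.uid LIMIT $limit",
        {"cursor": cursor, "limit": batch_size},
    )
    rows = result.result_set
    if not rows:
        break
    # ... accumulate or stream the rows ...
    cursor = rows[-1][0]  # advance the cursor to the last uid seen
```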
I'll inspect the actual code in main.py to verify the current pagination implementation and determine whether the refactoring suggestion is applicable and correctly targeted.
🏁 Script executed:
```bash
#!/bin/bash
# Inspect the code at lines 25-28 and 83-86 in main.py
echo "=== Lines 25-28 ===" && sed -n '25,28p' main.py
echo ""
echo "=== Lines 83-86 ===" && sed -n '83,86p' main.py
echo ""
echo "=== Context around lines 25-28 (20-35) ===" && sed -n '20,35p' main.py
echo ""
echo "=== Context around lines 83-86 (75-95) ===" && sed -n '75,95p' main.py
```

Length of output: 1486
Replace SKIP/LIMIT with parameterized cursor-based pagination using indexed node/edge properties, not internal IDs.
FalkorDB's internal IDs are not immutable—higher IDs may be reassigned to lower positions after deletion, making ID-based cursors unreliable for pagination. Instead:
- Use indexed properties (e.g., `created_at`, `uuid`, or a stable `node_id` attribute) for cursor pagination.
- Parameterize queries to avoid string interpolation and improve performance.
For nodes (lines 25–28):
```python
q = "MATCH (n) WHERE n.id > $last_id RETURN ID(n), labels(n), properties(n) ORDER BY n.id LIMIT $limit"
params = {"last_id": last_id, "limit": batch_size}
nodes_result = g.ro_query(q, params)
```

For edges (lines 83–86):
```python
q = "MATCH (a)-[e]->(b) WHERE e.id > $last_id RETURN ID(e), TYPE(e), ID(a), ID(b), properties(e) ORDER BY e.id LIMIT $limit"
params = {"last_id": last_id, "limit": batch_size}
edges_result = g.ro_query(q, params)
```

Ensure the pagination property (n.id, e.id) is indexed and monotonically increasing.
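If a property like n.id is used as the cursor, it also needs an index; in FalkorDB indexes are created per label, so something along these lines would be required for each exported label (label and property names here are illustrative):

```python
# One index per label whose nodes carry the cursor property
g.query("CREATE INDEX FOR (p:Person) ON (p.id)")
```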
```python
            labels = record[1]
            props = record[2] or {}

            # Handle nodes with multiple labels or no labels
            if labels:
                for label in labels:
                    node = {"id": node_id}
                    node.update(props)
                    nodes_by_label[label].append(node)
            else:
```
Add a labels(n) type guard to avoid iterating characters.
If labels(n) yields a string, the current loop will iterate per-char. Guard it.
```diff
-            labels = record[1]
+            labels = record[1] or []
+            if not isinstance(labels, (list, tuple)):
+                labels = [labels]
```

🤖 Prompt for AI Agents
In main.py around lines 35 to 44 the code may iterate a string label one
character at a time; before the loop ensure labels is treated as a sequence of
labels (not a str) — e.g. if labels is a str, wrap it as a single-item list (or
otherwise convert into the expected iterable) and then run the existing for
label in labels logic; keep the existing None/empty handling intact.
```python
                    node = {"id": node_id}
                    node.update(props)
                    nodes_by_label[label].append(node)
            else:
                # Handle nodes without labels
                node = {"id": node_id}
                node.update(props)
                nodes_by_label["unlabeled"].append(node)

    # Export each node label to its own CSV file
    for label, nodes in nodes_by_label.items():
        filename = f"nodes_{label}.csv"
        pd.DataFrame(nodes).to_csv(filename, index=False)
        print(f"✅ Exported {len(nodes)} nodes with label '{label}' to {filename}")
```
Props can overwrite IDs — use non-colliding field names.
props may contain keys like 'id', overwriting your synthetic identifiers and corrupting exports.
Apply:
```diff
-                    node = {"id": node_id}
+                    node = {"node_id": node_id}
                     node.update(props)
 ...
-                node = {"id": node_id}
+                node = {"node_id": node_id}
                 node.update(props)
 ...
-            edge = {
-                "id": edge_id,
-                "from_id": from_id,
-                "to_id": to_id
-            }
+            edge = {
+                "edge_id": edge_id,
+                "from_id": from_id,
+                "to_id": to_id
+            }
             edge.update(props)
```

Also applies to: 98-105
🤖 Prompt for AI Agents
In main.py around lines 41 to 48 (and likewise at lines 98 to 105), props may
contain an "id" key and will overwrite the synthetic node_id when you call
node.update(props); fix by building nodes so props cannot collide with the
synthetic id—create the node dict with the synthetic id under "id" (or use a
reserved key like "_id") and merge props into a separate key or filter out "id"
from props before updating (e.g., copy props minus "id" or nest props under
"props") so the synthetic identifier is preserved.
```python
        filename = f"nodes_{label}.csv"
        pd.DataFrame(nodes).to_csv(filename, index=False)
        print(f"✅ Exported {len(nodes)} nodes with label '{label}' to {filename}")
```
🛠️ Refactor suggestion | 🟠 Major
Sanitize filenames to avoid OS/path issues.
Label/type may contain spaces or separators. Sanitize before composing filenames.
```diff
-        filename = f"nodes_{label}.csv"
+        filename = f"nodes_{safe_filename(label)}.csv"
 ...
-        filename = f"edges_{edge_type}.csv"
+        filename = f"edges_{safe_filename(edge_type)}.csv"
```

Add once near imports:
```python
import re

def safe_filename(name: str) -> str:
    name = str(name).strip().replace(" ", "_")
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
    return name[:200] or "unnamed"
```

Also applies to: 118-120
🤖 Prompt for AI Agents
In main.py around lines 62 to 64 (and also apply to lines 118 to 120), filenames
are composed directly from label/type which may contain spaces or unsafe
characters; add the provided safe_filename helper near the imports (and import
re) and use it when building filenames (e.g., filename =
f"nodes_{safe_filename(label)}.csv") so labels are sanitized, truncated to a
safe length, and invalid chars replaced.
```python
        all_nodes = []
        for label, nodes in nodes_by_label.items():
            for node in nodes:
                node["label"] = label
                all_nodes.append(node)
        if all_nodes:
            pd.DataFrame(all_nodes).to_csv("nodes.csv", index=False)
            print(f"✅ Exported {len(all_nodes)} nodes to nodes.csv")
```
Avoid mutating records and avoid 'label'/'type' key collisions.
Use copy with reserved column names to prevent clobbering user props.
```diff
-    for label, nodes in nodes_by_label.items():
-        for node in nodes:
-            node["label"] = label
-            all_nodes.append(node)
+    for label, nodes in nodes_by_label.items():
+        for node in nodes:
+            all_nodes.append({**node, "node_label": label})
 ...
-    for edge_type, edges in edges_by_type.items():
-        for edge in edges:
-            edge["type"] = edge_type
-            all_edges.append(edge)
+    for edge_type, edges in edges_by_type.items():
+        for edge in edges:
+            all_edges.append({**edge, "edge_type": edge_type})
```

Also applies to: 124-129
🤖 Prompt for AI Agents
In main.py around lines 67-74 (and similarly 124-129) you are mutating the
original node dicts and potentially clobbering user keys named "label" or
"type"; instead, create a shallow copy of each node (e.g., node_copy =
node.copy()) and add a reserved column name (use a prefixed name unlikely to
collide, e.g., "_label" or "__label" and "_type"/"__type") to the copy before
appending; then write the list of copies to CSV so original records remain
unmodified and user properties are preserved.
```python
    parser.add_argument("--username", help="FalkorDB username (optional)")
    parser.add_argument("--password", help="FalkorDB password (optional)")
    parser.add_argument(
```
Passwords on the command line leak via ps/history — prefer env var or prompt.
Accepting --password on the command line is risky. Fall back to an environment variable or an interactive prompt when a username is set.
```python
import os, getpass
...
args = parser.parse_args()
if args.username and not args.password:
    args.password = os.getenv("FALKORDB_PASSWORD") or getpass.getpass("FalkorDB password: ")
```

Document FALKORDB_PASSWORD in the README.
Also applies to: 160-163
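With that change, usage would look something like this (the FALKORDB_PASSWORD variable name follows the suggestion above and is not an existing convention of the tool):

```bash
# Non-interactive: the password comes from the environment
FALKORDB_PASSWORD='s3cret' python main.py social --username myuser

# Interactive: omit the password and the tool prompts via getpass
python main.py social --username myuser
```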
🤖 Prompt for AI Agents
In main.py around lines 145-147 and 160-163, avoid accepting plaintext passwords
via the --password CLI flag; when a username is provided and no password was
supplied on the command line, read the password from the FALKORDB_PASSWORD
environment variable or prompt interactively using getpass.getpass() instead.
Update argument parsing to treat --password as optional (or remove it) and add
post-parse logic that if args.username and not args.password then set
args.password = os.getenv("FALKORDB_PASSWORD") or prompt the user; ensure
imports for os and getpass are added and that interactive prompting only occurs
when stdin is a TTY. Also update the README to document the FALKORDB_PASSWORD
environment variable and the interactive prompt behavior.
Actionable comments posted: 0
🧹 Nitpick comments (5)
README.md (5)
97-101: Tighten auth examples to avoid teaching insecure habits.

Great examples. Add a one-line caution under the auth examples to note CLI password exposure; consider showing a safer pattern (e.g., prompt) if the tool supports it.
Suggested tweak:
**Export with authentication:**

```diff
-python main.py social --username myuser --password mypass
+python main.py social --username myuser --password '<enter-password>'
```

Note: Passing passwords on the command line can leak via history/`ps`. Prefer an interactive prompt or environment configuration if supported.

Also applies to: 102-106, 107-111, 112-116, 117-121

---

125-143: Document how unlabeled nodes are handled in split mode.

If unlabeled nodes exist, specify the filename convention (e.g., nodes_UNLABELED.csv) and whether multiple labels produce one row per node (union of labels) or a label-joined representation.

Proposed addition:

```diff
 #### Node files: `nodes_<label>.csv`
 Each node label gets its own file (e.g., `nodes_Person.csv`, `nodes_Company.csv`):
 - `id`: Node ID
 - Additional columns for each node property
+Notes:
+- Unlabeled nodes are written to `nodes_UNLABELED.csv` (if present).
+- Nodes with multiple labels appear in each corresponding label file.
```
153-160: Clarify column unification and missing values in combined mode.

State that the schema is the union of properties across labels/types and that missing properties are left blank/NA, so users know how to parse.
Suggested text:
```diff
 #### nodes.csv
 All nodes in a single file:
 - `id`: Node ID
 - `label`: Node label/type
 - Additional columns for each node property
+Schema notes: columns are the union of all encountered node properties; properties absent on a node are left empty.
@@
 #### edges.csv
 All edges in a single file:
 - `id`: Edge ID
 - `type`: Relationship type
 - `from_id`: Source node ID
 - `to_id`: Target node ID
 - Additional columns for each edge property
+Schema notes: columns are the union of all encountered edge properties; properties absent on an edge are left empty.
```
161-164: Add --batch-size CLI flag for configurable pagination (currently hardcoded to 10,000).

Verified: batch_size is hardcoded to 10,000 in main.py (lines 21, 79) with no CLI or environment variable override. While this matches the README documentation, adding a `--batch-size` flag would improve usability for users with memory constraints or specific performance requirements.
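A possible shape for such a flag, assuming export_graph were extended with a batch_size parameter (neither exists in the current code):

```python
parser.add_argument(
    "--batch-size",
    type=int,
    default=10000,
    help="Number of nodes/edges fetched per query (default: 10000)",
)

# Later, once export_graph accepts it:
# export_graph(args.graph_name, args.host, args.port,
#              args.username, args.password, args.split_by_type,
#              batch_size=args.batch_size)
```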
76-93: Add documentation clarifications for flag exclusivity and password security.

The code correctly implements mutually exclusive flags via `dest="split_by_type"` with opposite actions (`store_true` vs. `store_false`), with `default=True`. However, the README help text does not clarify this relationship or the default behavior. Additionally, the implementation passes passwords directly from the CLI with no environment variable or interactive input alternative, making the security warning in the proposed diff valid and actionable.

Apply the suggested documentation update to clarify flag behavior and add the security note.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
README.md (2 hunks)
🔇 Additional comments (2)
README.md (2)
20-22: Nice additions to call out.

Feature bullets align with the PR scope; no issues.
80-81: Wording LGTM.

“by label/type” matches the two output modes described below.
- Test authentication parameter acceptance
- Test split-by-type mode (both True and False)
- Test combined mode output (single nodes.csv and edges.csv)
- Test default behavior (split mode)
- Test pagination with larger graphs (100 nodes)
Actionable comments posted: 1
🧹 Nitpick comments (3)
test_main.py (3)
247-254: Test doesn't validate authentication functionality.

The test passes `None` for both username and password, which only verifies that the function accepts these parameters but doesn't test actual authentication behavior. Consider adding test cases that (a sketch of the failure case follows the list):
- Use real credentials when the FalkorDB test instance has authentication enabled
- Test authentication failure scenarios with invalid credentials
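A sketch of the invalid-credentials case, assuming a FalkorDB test instance that actually enforces authentication (which the current fixtures do not provide):

```python
import pytest
from main import export_graph

def test_export_rejects_bad_credentials():
    """Hypothetical: requires an auth-enabled FalkorDB instance on localhost:6379."""
    with pytest.raises(Exception):
        export_graph("social", "localhost", 6379,
                     username="wrong_user", password="wrong_pass")
```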
256-272: Use exact counts for more precise assertions.

The assertions use `>=` for minimum counts, but since the test_graph fixture creates a deterministic dataset (exactly 3 nodes and 2 edges), the test should assert exact values for more precision.

Apply this diff to strengthen the assertions:
```diff
-        assert len(nodes_df) >= 3, "Should have at least 3 nodes (2 Person + 1 Company)"
+        assert len(nodes_df) == 3, "Should have exactly 3 nodes (2 Person + 1 Company)"

         # Verify edges.csv has type column
         edges_df = pd.read_csv("edges.csv")
         assert "type" in edges_df.columns, "edges.csv should have type column"
-        assert len(edges_df) >= 2, "Should have at least 2 edges (WORKS_FOR + KNOWS)"
+        assert len(edges_df) == 2, "Should have exactly 2 edges (WORKS_FOR + KNOWS)"
```
297-298: Replace bare except with specific exceptions.

The bare `except` clauses on lines 297 and 321 are overly broad and can hide unexpected errors. Additionally, silently ignoring exceptions makes debugging harder.

Apply this diff to catch specific exceptions:
```diff
         # Clear any existing data
         try:
             g.query("MATCH (n) DETACH DELETE n")
-        except:
-            pass
+        except Exception as e:
+            # Graph might not exist yet or be empty
+            print(f"Warning: Failed to clear graph: {e}")

         # Create 100 nodes (simulating pagination, though batch size is 10000)
```

And similarly for the cleanup section:
```diff
         # Cleanup
         try:
             g.query("MATCH (n) DETACH DELETE n")
-        except:
-            pass
+        except Exception as e:
+            print(f"Warning: Failed to cleanup graph: {e}")
```

Also applies to: 321-322
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
test_main.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test_main.py (1)
main.py (1)
export_graph (7-135)
🪛 Ruff (0.14.2)
test_main.py
247-247: Unused method argument: temp_dir
(ARG002)
256-256: Unused method argument: temp_dir
(ARG002)
274-274: Unused method argument: temp_dir
(ARG002)
288-288: Unused method argument: temp_dir
(ARG002)
297-297: Do not use bare except
(E722)
297-298: try-except-pass detected, consider logging the exception
(S110)
321-321: Do not use bare except
(E722)
321-322: try-except-pass detected, consider logging the exception
(S110)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test (3.12)
🔇 Additional comments (1)
test_main.py (1)
274-286: LGTM!

Good test coverage for verifying the default split-by-type behavior, including negative assertions to ensure combined files aren't created.
```diff
 import pandas as pd

-def export_graph(graph_name, host, port):
+def export_graph(graph_name, host, port, username=None, password=None, split_by_type=True):
```
use url instead
```python
    while True:
        nodes_result = g.ro_query(
            f"MATCH (n) RETURN ID(n), labels(n), properties(n) SKIP {skip} LIMIT {batch_size}"
```
It will be faster if, instead of skip, you filter with id(n) > last id
id is not necessarily sortable... I think current approach is "good enough"...
this query specifically, under the hood, scans the nodes by id
it will be slow when the graph is very large
…nation

- Use redis:// URL format for connection instead of separate parameters
- Replace SKIP-based pagination with ID-based filtering (WHERE ID(n) > last_id)
- Add ORDER BY ID(n)/ID(e) for consistent pagination
- Improves performance on large graphs by avoiding expensive SKIP operations
The ID-based pagination approach failed because FalkorDB/Cypher doesn't support ORDER BY on ID() function. Reverting to SKIP-based pagination while keeping the URL connection improvement from AviAvni's review.
Actionable comments posted: 1
♻️ Duplicate comments (8)
main.py (8)
27-31: Parameterize queries instead of string interpolation.

Avoid f-string SQL/Cypher; improves safety and plan caching.
```diff
-        nodes_result = g.ro_query(
-            f"MATCH (n) WHERE ID(n) > {last_id} RETURN ID(n), labels(n), properties(n) ORDER BY ID(n) LIMIT {batch_size}"
-        )
+        nodes_query = (
+            "MATCH (n) WHERE ID(n) > $last_id "
+            "RETURN ID(n), labels(n), properties(n) "
+            "ORDER BY ID(n) LIMIT $limit"
+        )
+        nodes_result = g.ro_query(nodes_query, {"last_id": last_id, "limit": batch_size})
```

```diff
-        edges_result = g.ro_query(
-            f"MATCH (a)-[e]->(b) WHERE ID(e) > {last_id} RETURN ID(e), TYPE(e), ID(a), ID(b), properties(e) ORDER BY ID(e) LIMIT {batch_size}"
-        )
+        edges_query = (
+            "MATCH (a)-[e]->(b) WHERE ID(e) > $last_id "
+            "RETURN ID(e), TYPE(e), ID(a), ID(b), properties(e) "
+            "ORDER BY ID(e) LIMIT $limit"
+        )
+        edges_result = g.ro_query(edges_query, {"last_id": last_id, "limit": batch_size})
```

Also applies to: 87-91
66-66: Sanitize filenames derived from labels/types.

Avoid spaces/unsafe chars in filenames.
Add helper near imports:
```diff
+import re
+
+def safe_filename(name: str) -> str:
+    name = str(name).strip().replace(" ", "_")
+    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
+    return name[:200] or "unnamed"
```

Use it:
```diff
-        filename = f"nodes_{label}.csv"
+        filename = f"nodes_{safe_filename(label)}.csv"
```

```diff
-        filename = f"edges_{edge_type}.csv"
+        filename = f"edges_{safe_filename(edge_type)}.csv"
```

Also applies to: 124-124
71-75: Don’t mutate records; avoid 'label'/'type' collisions.

Use copies and reserved column names to prevent clobbering user props.
```diff
-    all_nodes = []
-    for label, nodes in nodes_by_label.items():
-        for node in nodes:
-            node["label"] = label
-            all_nodes.append(node)
+    all_nodes = []
+    for label, nodes in nodes_by_label.items():
+        for node in nodes:
+            all_nodes.append({**node, "node_label": label})
```

```diff
-    all_edges = []
-    for edge_type, edges in edges_by_type.items():
-        for edge in edges:
-            edge["type"] = edge_type
-            all_edges.append(edge)
+    all_edges = []
+    for edge_type, edges in edges_by_type.items():
+        for edge in edges:
+            all_edges.append({**edge, "edge_type": edge_type})
```

Also applies to: 129-133
27-61: Cursor pagination over internal IDs can miss/duplicate rows; use a monotonic, indexed property.

Internal IDs are not guaranteed stable across deletes; prefer WHERE n.my_cursor > $cursor ORDER BY n.my_cursor LIMIT $limit (same for edges).
If you have a property like created_at or uid, switch to it:
```python
# Example (nodes)
q = ("MATCH (n) WHERE EXISTS(n.uid) AND n.uid > $cursor "
     "RETURN ID(n), labels(n), properties(n) ORDER BY n.uid LIMIT $limit")
params = {"cursor": last_uid, "limit": batch_size}
```

Confirm which properties are suitable and indexed.
Also applies to: 87-119
38-45: Guard labels type to avoid iterating characters.

labels can be a string; iterating would split into chars.
```diff
-            labels = record[1]
+            labels = record[1] or []
+            if not isinstance(labels, (list, tuple)):
+                labels = [labels]
```
44-51: Protect synthetic IDs from collision with props.

props may contain 'id'; node.update(props) will overwrite it.
```diff
-                node = {"id": node_id}
-                node.update(props)
+                node = {"node_id": node_id}
+                node.update({k: v for k, v in props.items() if k != "node_id"})
 ...
-            node = {"id": node_id}
-            node.update(props)
-            nodes_by_label["unlabeled"].append(node)
+            node = {"node_id": node_id}
+            node.update({k: v for k, v in props.items() if k != "node_id"})
+            nodes_by_label["__unlabeled__"].append(node)
```
103-109: Edge props can overwrite IDs; use non‑colliding keys.

Same collision risk as nodes.
```diff
-        edge = {
-            "id": edge_id,
-            "from_id": from_id,
-            "to_id": to_id
-        }
-        edge.update(props)
+        edge = {
+            "edge_id": edge_id,
+            "from_id": from_id,
+            "to_id": to_id
+        }
+        edge.update({k: v for k, v in props.items() if k not in ("edge_id", "from_id", "to_id")})
```
151-163: Avoid plaintext --password; support env var or interactive prompt.

CLI args leak via ps/history. Prefer FALKORDB_PASSWORD or getpass when username is set.
```diff
     args = parser.parse_args()
-    export_graph(args.graph_name, args.host, args.port, args.username, args.password, args.split_by_type)
+    import os, getpass, sys
+    if args.username and not args.password:
+        args.password = os.getenv("FALKORDB_PASSWORD")
+        if not args.password and sys.stdin.isatty():
+            args.password = getpass.getpass("FalkorDB password: ")
+    export_graph(args.graph_name, args.host, args.port, args.username, args.password, args.split_by_type)
```

Also document FALKORDB_PASSWORD in README.
Also applies to: 166-168
🧹 Nitpick comments (2)
main.py (2)
62-79: Memory scale: stream to CSV instead of accumulating the whole graph.

Very large graphs will blow memory. Write per batch (append mode) and clear buffers.
I can provide a targeted patch that flushes each batch to CSV with header=not exists and mode="a" per label/type, keeping memory bounded.
Also applies to: 120-136
154-164: Simplify flags with BooleanOptionalAction; consider exposing batch size.

Cleaner UX and configurability.
- Replace the paired flags with:
- parser.add_argument("--split-by-type", action=argparse.BooleanOptionalAction, default=True)
- Optionally add:
- parser.add_argument("--batch-size", type=int, default=10000)
```python
    if username and password:
        url = f"redis://{username}:{password}@{host}:{port}"
    elif username:
        url = f"redis://{username}@{host}:{port}"
    else:
        url = f"redis://{host}:{port}"
```
Percent‑encode credentials in URL (special chars break the URI).
Usernames/passwords containing @/:?#& etc. will yield invalid URLs.
Apply:
```diff
-    if username and password:
-        url = f"redis://{username}:{password}@{host}:{port}"
-    elif username:
-        url = f"redis://{username}@{host}:{port}"
+    if username and password:
+        u = quote(username, safe="")
+        p = quote(password, safe="")
+        url = f"redis://{u}:{p}@{host}:{port}"
+    elif username:
+        u = quote(username, safe="")
+        url = f"redis://{u}@{host}:{port}"
     else:
         url = f"redis://{host}:{port}"
```

Add import once:
```diff
-from collections import defaultdict
+from collections import defaultdict
+from urllib.parse import quote
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    if username and password:
        u = quote(username, safe="")
        p = quote(password, safe="")
        url = f"redis://{u}:{p}@{host}:{port}"
    elif username:
        u = quote(username, safe="")
        url = f"redis://{u}@{host}:{port}"
    else:
        url = f"redis://{host}:{port}"
```
🤖 Prompt for AI Agents
In main.py around lines 10-15, the code builds a redis URL with raw
username/password which breaks when credentials contain reserved chars; import
urllib.parse.quote_plus once at the top and use it to percent-encode username
and password (e.g. encoded_user = quote_plus(username, safe='') and encoded_pass
= quote_plus(password, safe='')) before inserting into the f-strings, keeping
the same conditional branches (use encoded_user and encoded_pass where
appropriate) so the generated redis:// URL is always valid.
Actionable comments posted: 0
🧹 Nitpick comments (3)
main.py (3)
7-17: Consider adding connection error handling.

The function directly connects to FalkorDB without handling potential connection failures (invalid credentials, unreachable host, etc.). Adding a try-except block would provide better user feedback.
Example:
```python
try:
    db = FalkorDB(url=url)
    g = db.select_graph(graph_name)
except Exception as e:
    print(f"❌ Failed to connect to FalkorDB: {e}")
    raise
```
23-23: Consider extracting batch size as a module constant.

The batch size `10000` is hardcoded in both node and edge pagination loops (lines 23 and 81). Defining it once as a module-level constant would improve maintainability.

```diff
+BATCH_SIZE = 10000
+
 def export_graph(graph_name, host, port, username=None, password=None, split_by_type=True):
     ...
-    batch_size = 10000
+    batch_size = BATCH_SIZE
```
149-154: Redundant default in --split-by-type argument.

The `default=True` is unnecessary when using `action="store_true"` because argparse already defaults unset boolean flags to `False`, and the `--no-split-by-type` flag handles the False case.

```diff
     parser.add_argument(
         "--split-by-type",
         action="store_true",
-        default=True,
         help="Create separate CSV files per node label and edge type (default: True)"
     )
```

Note: The current implementation works correctly; this is just a minor cleanup.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
main.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test (3.12)
🔇 Additional comments (1)
main.py (1)
164-164: LGTM!

The function call correctly passes all new parameters including authentication credentials and the split_by_type flag.
Actionable comments posted: 1
♻️ Duplicate comments (1)
test_main.py (1)
289-323: Pagination test still doesn't validate multi-batch retrieval.

This test creates only 100 nodes, but the batch size in main.py is 10,000, so pagination logic (multiple SKIP/LIMIT iterations) is never exercised. The previous review flagged this and was marked as addressed in commit bcf6139, but the test still doesn't validate multi-batch pagination.
🧹 Nitpick comments (1)
test_main.py (1)
298-299: Use specific exception type or log the error.

The bare `except` clause should specify the expected exception type (e.g., `except Exception as e:`) or log the error for debugging. This applies to both cleanup blocks in this test.

Apply this diff to improve error handling:
```diff
         # Clear any existing data
         try:
             g.query("MATCH (n) DETACH DELETE n")
-        except:
-            pass
+        except Exception:
+            pass  # Graph might not exist yet
```

And similarly for the cleanup block at lines 320-323:
```diff
         # Cleanup
         try:
             g.query("MATCH (n) DETACH DELETE n")
-        except:
-            pass
+        except Exception:
+            pass  # Ignore cleanup errors
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
test_main.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test_main.py (1)
main.py (1)
export_graph (7-137)
🪛 Ruff (0.14.2)
test_main.py
241-241: Unused method argument: temp_dir
(ARG002)
245-245: Do not assert blind exception: Exception
(B017)
248-248: Unused method argument: temp_dir
(ARG002)
257-257: Unused method argument: temp_dir
(ARG002)
275-275: Unused method argument: temp_dir
(ARG002)
289-289: Unused method argument: temp_dir
(ARG002)
298-298: Do not use bare except
(E722)
298-299: try-except-pass detected, consider logging the exception
(S110)
322-322: Do not use bare except
(E722)
322-323: try-except-pass detected, consider logging the exception
(S110)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test (3.12)
🔇 Additional comments (3)
test_main.py (3)
241-246: LGTM! Connection error test improved.

The change to test `localhost:1` is more reliable than the previous approach. The static analysis warnings about unused `temp_dir` and blind exception are false positives—`temp_dir` is used by pytest for its side effect (directory change and cleanup), and the blind exception assert is acceptable for verifying connection failures.
257-273: LGTM! Comprehensive combined mode test.

This test properly validates the `split_by_type=False` behavior by checking for combined files, required columns (`label` and `type`), and minimum data counts. Good coverage.
275-287: LGTM! Good default behavior validation.

This test correctly validates that `split_by_type=True` is the default by verifying separate files are created and combined files are not. The negative assertions are particularly valuable.
```python
    def test_export_with_auth_params(self, test_graph, temp_dir):
        """Test that export_graph accepts username and password parameters."""
        # Test that function accepts auth parameters (even if server doesn't require them)
        export_graph(test_graph, "localhost", 6379, username=None, password=None)

        # Should create the expected files
        assert os.path.exists("nodes_Person.csv"), "nodes_Person.csv should be created"
        assert os.path.exists("edges_WORKS_FOR.csv"), "edges_WORKS_FOR.csv should be created"
```
Test with actual username/password values.
This test passes username=None, password=None, which only verifies the function signature accepts auth parameters but doesn't test that credentials are actually used. Consider adding a test with real values (e.g., username="test_user", password="test_pass") to verify the authentication URL construction works correctly, even if the test server doesn't enforce auth.
Apply this diff to strengthen the test:
```diff
     def test_export_with_auth_params(self, test_graph, temp_dir):
-        """Test that export_graph accepts username and password parameters."""
-        # Test that function accepts auth parameters (even if server doesn't require them)
-        export_graph(test_graph, "localhost", 6379, username=None, password=None)
+        """Test that export_graph accepts and handles username and password parameters."""
+        # Test with actual auth parameters (even if server doesn't require them)
+        # This verifies the URL construction logic handles credentials correctly
+        export_graph(test_graph, "localhost", 6379, username="test_user", password="test_pass")

         # Should create the expected files
         assert os.path.exists("nodes_Person.csv"), "nodes_Person.csv should be created"
         assert os.path.exists("edges_WORKS_FOR.csv"), "edges_WORKS_FOR.csv should be created"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    def test_export_with_auth_params(self, test_graph, temp_dir):
        """Test that export_graph accepts and handles username and password parameters."""
        # Test with actual auth parameters (even if server doesn't require them)
        # This verifies the URL construction logic handles credentials correctly
        export_graph(test_graph, "localhost", 6379, username="test_user", password="test_pass")

        # Should create the expected files
        assert os.path.exists("nodes_Person.csv"), "nodes_Person.csv should be created"
        assert os.path.exists("edges_WORKS_FOR.csv"), "edges_WORKS_FOR.csv should be created"
```
🧰 Tools
🪛 Ruff (0.14.2)
248-248: Unused method argument: temp_dir
(ARG002)
🤖 Prompt for AI Agents
In test_main.py around lines 248 to 255, the test only passes username=None and
password=None which doesn't verify that credentials are actually used; update
the test to also call export_graph with concrete credentials (e.g.,
username="test_user", password="test_pass"), run the export, and keep the
existing assertions that the expected files (nodes_Person.csv and
edges_WORKS_FOR.csv) are created; ensure you do this as a separate call within
the same test (or a new test) so the function's authentication parameters are
exercised without removing the original None-based check.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
test_main.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test_main.py (1)
main.py (1)
export_graph (7-137)
🪛 Ruff (0.14.2)
test_main.py
241-241: Unused method argument: temp_dir
(ARG002)
250-250: Unused method argument: temp_dir
(ARG002)
268-268: Unused method argument: temp_dir
(ARG002)
282-282: Unused method argument: temp_dir
(ARG002)
291-291: Do not use bare except
(E722)
291-292: try-except-pass detected, consider logging the exception
(S110)
315-315: Do not use bare except
(E722)
315-316: try-except-pass detected, consider logging the exception
(S110)
🔇 Additional comments (3)
test_main.py (3)
250-266: LGTM!

This test properly validates the combined mode functionality by verifying both the presence of combined CSV files and the expected structure with label/type columns.
268-280: LGTM!

This test correctly validates that `split_by_type=True` is the default behavior by checking both positive cases (separate files exist) and negative cases (combined files don't exist).
241-248: This review comment is incorrect.

The suggestion to add test calls with actual credentials (`username="test_user"`, `password="test_pass"`) would fail against the real test FalkorDB instance, which has no authentication enabled. The tests use real FalkorDB connections (as shown by the `falkordb_connection` fixture connecting to `localhost:6379`), not mocks.

The current test is appropriate for this integration test environment: it verifies the function accepts the credential parameters and successfully creates output files with `username=None, password=None`, which correctly matches the test server's configuration. The `export_graph` function already contains proper credential URL construction logic (main.py lines 10-15) that handles all three cases (both credentials, username-only, and no credentials).

To test credential handling with authentication would require either: (1) mocking the FalkorDB connection, (2) configuring test infrastructure with auth enabled, or (3) unit testing the URL construction logic directly. None of these changes are necessary given the current integration test approach.
Likely an incorrect or invalid review comment.
```python
    def test_export_pagination_large_graph(self, falkordb_connection, temp_dir):
        """Test that pagination works with larger datasets."""
        # Create a graph with more nodes than would fit in a single batch
        large_graph_name = "large_test_graph"
        g = falkordb_connection.select_graph(large_graph_name)

        # Clear any existing data
        try:
            g.query("MATCH (n) DETACH DELETE n")
        except:
            pass

        # Create 100 nodes (simulating pagination, though batch size is 10000)
        g.query("UNWIND range(1, 100) AS i CREATE (:TestNode {id: i, name: 'Node' + toString(i)})")

        # Create some edges
        g.query("MATCH (a:TestNode), (b:TestNode) WHERE a.id < b.id AND b.id - a.id = 1 CREATE (a)-[:NEXT]->(b)")

        export_graph(large_graph_name, "localhost", 6379)

        # Verify all nodes were exported
        assert os.path.exists("nodes_TestNode.csv"), "nodes_TestNode.csv should be created"
        nodes_df = pd.read_csv("nodes_TestNode.csv")
        assert len(nodes_df) == 100, f"Should have 100 nodes, got {len(nodes_df)}"

        # Verify edges were exported
        assert os.path.exists("edges_NEXT.csv"), "edges_NEXT.csv should be created"
        edges_df = pd.read_csv("edges_NEXT.csv")
        assert len(edges_df) == 99, f"Should have 99 edges, got {len(edges_df)}"

        # Cleanup
        try:
            g.query("MATCH (n) DETACH DELETE n")
        except:
            pass
```
🧩 Analysis chain
Test doesn't validate multi-batch pagination.
Although marked as addressed in a previous review, this test still creates only 100 nodes while batch_size in main.py is 10,000. This means pagination logic (multiple iterations with SKIP/LIMIT) is never exercised—all data is fetched in a single batch.
Consider one of these approaches:
- Mock the batch_size to a smaller value (e.g., 10) for testing
- Rename the test to `test_export_moderate_dataset` to reflect what it actually validates
- Create a separate pagination-specific test with a mocked batch size
Option 1: Mock batch_size to properly test pagination
+ @pytest.fixture(autouse=False)
+ def mock_batch_size(self, monkeypatch):
+ """Mock batch size for pagination testing."""
+ import main
+ # Store original batch_size if it becomes a module-level constant
+ # For now, we'll need to patch it within the function
+ yield
+
def test_export_pagination_large_graph(self, falkordb_connection, temp_dir):
"""Test that pagination works with larger datasets."""
+ # Patch batch_size to test pagination with smaller dataset
+ import main
+ original_code = main.export_graph.__code__
+ # Note: Proper mocking would require refactoring main.py to make batch_size configurable
+
# Create a graph with more nodes than would fit in a single batch
large_graph_name = "large_test_graph"Better approach: Refactor main.py to accept batch_size as a parameter, then test with a small value like 10.
Looking at the review comment, I need to verify whether the test actually fails to exercise pagination logic. Let me gather information about the batch_size and export_graph implementation.
🏁 Script executed:
#!/bin/bash
# Find and examine main.py for batch_size and export_graph implementation
echo "=== Searching for batch_size definition ==="
rg -n "batch_size" --type=py -C 3
echo -e "\n=== Finding export_graph function ==="
ast-grep --pattern $'def export_graph($$$) {
$$$
}'
echo -e "\n=== Alternative: search for export_graph function definition ==="
rg -n "def export_graph" --type=py -A 30Length of output: 3254
🏁 Script executed:
```bash
#!/bin/bash
# Examine main.py for batch_size definition and export_graph implementation
echo "=== Searching for batch_size ==="
rg -n "batch_size" --type=py -C 2
echo -e "\n=== Searching for export_graph function ==="
rg -n "def export_graph" --type=py -A 40
```

Length of output: 3261
Test does not exercise multi-batch pagination logic.
The review is correct: batch_size is hardcoded to 10,000 in main.py (lines 23, 81), and the test creates only 100 nodes. This means the pagination while loop executes once and exits (100 < 10,000), never testing the SKIP/LIMIT iteration behavior.
The proposed diff in the review is incomplete. Since batch_size is a hardcoded constant inside export_graph, it cannot be easily mocked without refactoring.
Recommended fixes:
- Refactor `export_graph` to accept `batch_size` as a parameter with a sensible default (10,000), then pass a small value (e.g., 10) in this test.
- Rename the test to `test_export_moderate_dataset` to reflect that it validates correctness on a subset of data, not pagination behavior.
- Create a separate `test_export_pagination` test only after refactoring to support configurable batch size (see the sketch after this list).
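A rough sketch of that refactor and the pagination test it would enable; the signature change and test body are hypothetical, not existing code:

```python
# main.py: make the batch size configurable, keeping the current value as default
def export_graph(graph_name, host, port, username=None, password=None,
                 split_by_type=True, batch_size=10000):
    ...

# test_main.py: with batch_size=10, the 100-node fixture forces multiple batches
import pandas as pd
from main import export_graph

def test_export_pagination_multiple_batches(falkordb_connection, temp_dir):
    export_graph("large_test_graph", "localhost", 6379, batch_size=10)
    nodes_df = pd.read_csv("nodes_TestNode.csv")
    assert len(nodes_df) == 100
```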
🧰 Tools
🪛 Ruff (0.14.2)
282-282: Unused method argument: temp_dir
(ARG002)
291-291: Do not use bare except
(E722)
291-292: try-except-pass detected, consider logging the exception
(S110)
315-315: Do not use bare except
(E722)
315-316: try-except-pass detected, consider logging the exception
(S110)
Changes
Problem Solved
The exporter was previously limited to 10,000 nodes/edges due to FalkorDB's default result limit. This PR adds pagination to fetch all data in batches.
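In outline, the batching described here looks like the following simplified sketch (the `.result_set` access is an assumption about the FalkorDB client result object, and the real code also processes labels/properties per record):

```python
batch_size = 10000
skip = 0
all_records = []
while True:
    result = g.ro_query(
        f"MATCH (n) RETURN ID(n), labels(n), properties(n) SKIP {skip} LIMIT {batch_size}"
    )
    records = result.result_set
    if not records:
        break
    all_records.extend(records)
    skip += batch_size  # fetch the next page until an empty batch is returned
```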
New Features
Summary by CodeRabbit
New Features
Documentation
Tests