Conversation

@shahar-biron shahar-biron commented Oct 28, 2025

Changes

  • Add --username and --password options for FalkorDB authentication
  • Add --split-by-type/--no-split-by-type flags to control CSV file organization
  • Implement pagination to export all nodes and edges (fixes 10k limit issue)
  • Add progress feedback during export

Problem Solved

The exporter was previously limited to 10,000 nodes/edges due to FalkorDB's default result limit. This PR adds pagination to fetch all data in batches.
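
As a rough illustration of the batching idea (a minimal sketch, not the exact code in main.py; fetch_all_nodes is a hypothetical helper name and graph is assumed to be a FalkorDB Graph handle):

# Keep requesting fixed-size pages until a short page signals the end, so the
# export is no longer capped by FalkorDB's default result limit.
def fetch_all_nodes(graph, batch_size=10_000):
    all_rows = []
    skip = 0
    while True:
        result = graph.ro_query(
            f"MATCH (n) RETURN ID(n), labels(n), properties(n) "
            f"SKIP {skip} LIMIT {batch_size}"
        )
        rows = result.result_set  # list of records returned by the client
        all_rows.extend(rows)
        if len(rows) < batch_size:  # last (possibly partial) page reached
            break
        skip += batch_size
    return all_rows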

New Features

  • Authentication: Optional username/password for secure FalkorDB connections
  • Flexible output: Choose between separate files per type or combined files
  • Progress tracking: Real-time feedback on export progress

Summary by CodeRabbit

  • New Features

    • Optional authentication for exports and CLI flags for username/password and split-by-type
    • Configurable output: per-label/type CSVs (default) or consolidated nodes.csv/edges.csv
    • Batched, paginated exports for large graphs with per-label/type progress logging and final summary counts
  • Documentation

    • README and CLI docs updated with auth, pagination, split/combined examples
  • Tests

    • Added tests for auth, combined-mode, default split-mode, and pagination on large graphs

coderabbitai bot commented Oct 28, 2025

Walkthrough

Export tool now accepts optional username/password, a split_by_type toggle, and connects via a constructed URL. Nodes and edges are retrieved in batched, paginated queries, accumulated by label/type, and written either to per-label/type CSVs or combined nodes.csv/edges.csv. CLI, README, and tests updated.

Changes

  • Core export functionality (main.py): export_graph() signature changed to (..., username=None, password=None, split_by_type=True); FalkorDB connection created from a constructed URL; nodes and edges fetched in batched, paginated queries (last_id, batch_size=10000) with progress logging; accumulated by label/type; conditional per-label/per-type CSV outputs or combined nodes.csv/edges.csv; CLI options added: --username, --password, --split-by-type/--no-split-by-type.
  • Documentation / Examples (README.md): Usage, examples, and output descriptions updated to document authentication, pagination behavior, default per-label/per-type exports, and combined-file mode (--no-split-by-type).
  • Tests (test_main.py): Removed connection-error test; added tests for auth parameters, default split-by-type behavior, combined-file mode (split_by_type=False) validating label/type columns, and a pagination test for larger graph export; tests assert expected CSV files and basic row counts.
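
For orientation, a minimal sketch of how the updated function might be invoked programmatically, assuming the signature summarized above (values are placeholders):

from main import export_graph

export_graph(
    "social",            # graph_name
    "localhost", 6379,   # host, port
    username=None,       # optional authentication
    password=None,
    split_by_type=True,  # per-label/type CSVs; False writes nodes.csv / edges.csv
)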

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant Export as export_graph
    participant DB as FalkorDB
    participant CSV as CSV Writer

    rect rgb(245,250,255)
    note over CLI,Export: CLI invokes export with auth & options
    CLI->>Export: graph_name, host, port, username?, password?, split_by_type
    end

    rect rgb(225,245,225)
    note over Export,DB: Batched Node Retrieval (last_id)
    loop while batch returned
        Export->>DB: Query nodes WHERE id > last_id LIMIT batch_size
        DB-->>Export: node batch
        Export->>Export: Accumulate nodes by label / log progress
    end
    end

    rect rgb(255,245,230)
    note over Export,CSV: Conditional Node Output
    alt split_by_type = True
        Export->>CSV: Write per-label node CSVs
    else
        Export->>CSV: Write combined `nodes.csv` (with label column)
    end
    end

    rect rgb(225,245,225)
    note over Export,DB: Batched Edge Retrieval (last_id)
    loop while batch returned
        Export->>DB: Query edges WHERE id > last_id LIMIT batch_size
        DB-->>Export: edge batch
        Export->>Export: Accumulate edges by type / log progress
    end
    end

    rect rgb(255,245,230)
    note over Export,CSV: Conditional Edge Output
    alt split_by_type = True
        Export->>CSV: Write per-type edge CSVs
    else
        Export->>CSV: Write combined `edges.csv` (with type column)
    end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–60 minutes

  • Inspect pagination loops and last_id handling to ensure no skips/duplicates.
  • Verify URL construction and safe handling/escaping of username/password.
  • Confirm CSV headers/naming and consistency across split vs combined modes.
  • Review added tests for deterministic setup/teardown and correct assertions.

Poem

I hop through batches, nibble rows, 🐇
Labels, types in tidy rows,
URL stitched, credentials set,
I print progress — no regret,
Carrot CSVs for all who chose.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check | ✅ Passed | The pull request title "Add auth support, split-by-type option, and pagination for large graphs" directly and accurately reflects the three main features being introduced in this changeset. The title identifies authentication support (username/password parameters), the split-by-type option for CSV output organization, and pagination to handle large graphs—all of which are confirmed as primary objectives in the PR summary. The title is concise, clear, and uses specific terminology that communicates the nature of the changes, allowing reviewers to understand the primary intent at a glance without being vague or overly broad.
Docstring Coverage | ✅ Passed | Docstring coverage is 85.71%, which is sufficient. The required threshold is 80.00%.

@shahar-biron shahar-biron requested a review from gkorland October 28, 2025 16:27
codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 89.15663% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.43%. Comparing base (83d98fe) to head (64de99c).

Files with missing lines Patch % Lines
main.py 89.15% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #26      +/-   ##
==========================================
- Coverage   86.53%   85.43%   -1.11%     
==========================================
  Files           1        1              
  Lines          52      103      +51     
==========================================
+ Hits           45       88      +43     
- Misses          7       15       +8     
Flag Coverage Δ
unittests 85.43% <89.15%> (-1.11%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
main.py 85.43% <89.15%> (-1.11%) ⬇️

- Document authentication options (--username, --password)
- Explain --split-by-type and --no-split-by-type modes
- Add comprehensive usage examples
- Document pagination support for large graphs
- Update output section with file naming patterns

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (3)
main.py (3)

59-75: Memory usage still O(total). Stream to CSV per batch.

For very large graphs, accumulate-then-write is memory heavy. Append per batch with header once.

# per label/type file: write each batch
df.to_csv(filename, mode="a", index=False, header=not os.path.exists(filename))

Optionally precompute column union to stabilize headers.

Also applies to: 115-130


148-158: Simplify boolean flag with BooleanOptionalAction.

You don’t need two flags; argparse can generate --split-by-type/--no-split-by-type automatically.

-    parser.add_argument(
-        "--split-by-type",
-        action="store_true",
-        default=True,
-        help="Create separate CSV files per node label and edge type (default: True)"
-    )
-    parser.add_argument(
-        "--no-split-by-type",
-        dest="split_by_type",
-        action="store_false",
-        help="Create single CSV files for all nodes and edges"
-    )
+    parser.add_argument(
+        "--split-by-type",
+        dest="split_by_type",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help="Create separate CSV files per node label and edge type (use --no-split-by-type to combine)"
+    )

Requires Python 3.9+.


59-64: Optional: mitigate CSV injection when opening in Excel.

If properties can start with =,+,-,@, consider prefixing dangerous values before writing.

def sanitize_cell(v):
    if isinstance(v, str) and v and v[0] in "=+-@":
        return "'" + v
    return v
# apply before DataFrame(...)
rows = [{k: sanitize_cell(v) for k, v in row.items()} for row in rows]

Also applies to: 115-121

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 83d98fe and 22232b0.

📒 Files selected for processing (1)
  • main.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (3.12)
🔇 Additional comments (1)
main.py (1)

7-16: No issues found — FalkorDB auth kwargs are correct.

The FalkorDB Python client accepts host, port, username, and password as constructor kwargs, and the code correctly passes these parameters. No changes needed.

Comment on lines +25 to +28
    while True:
        nodes_result = g.ro_query(
            f"MATCH (n) RETURN ID(n), labels(n), properties(n) SKIP {skip} LIMIT {batch_size}"
        )

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Pagination via SKIP/LIMIT scales poorly at high offsets; consider ID-based pagination and parameters.

Large SKIPs become O(N). Prefer WHERE ID(x) > last_id ORDER BY ID(x) LIMIT $limit. Also parameterize skip/limit to avoid string interpolation.

Example pattern:

# Nodes
q = "MATCH (n) WHERE ID(n) > $last_id RETURN ID(n), labels(n), properties(n) ORDER BY ID(n) LIMIT $limit"
params = {"last_id": last_id, "limit": batch_size}
# Repeat updating last_id from final record

Also applies to: 83-86


🌐 Web query:

Does FalkorDB support parameterized queries in Python (ro_query with params)? Recommended pattern for paginating large result sets (ID-based) in FalkorDB/RedisGraph?

💡 Result:

Short answers

  • Yes — FalkorDB supports parameterized queries and you can pass params from Python. The GRAPH.QUERY docs show the parameterized syntax and a Python example; the read-only command GRAPH.RO_QUERY exists as well, and the client follows the same query/params pattern. [1][2]

Recommended pattern for paginating large result sets (ID / cursor based)

  • Use cursor (ID or indexed timestamp) pagination rather than OFFSET/LIMIT. Keep an indexed, monotonic property on nodes (e.g. created_at or a numeric id) and paginate with a WHERE clause plus ORDER BY and LIMIT:
    • Example Cypher (use your node property name instead of my_id):
      • MATCH (n:MyLabel) WHERE n.my_id > $cursor ORDER BY n.my_id ASC LIMIT $limit RETURN n
    • Pass cursor and limit as parameters (so queries are cached and safe). On the first page use cursor = 0 (or null-handling) and each page set cursor = last_row.my_id from the previous page.
  • Benefits: uses an index so the database jumps to the next range (fast, stable under inserts), avoids scanning/skipping rows (OFFSET) and reduces memory/IO pressure.
  • Caveats:
    • Ensure the paginating property is indexed and monotonic (or use created_at if strictly increasing).
    • If you need strict snapshot semantics across many pages, you’ll need additional application-level handling because cursor pagination is eventually consistent with concurrent writes.
    • Use FalkorDB configuration limits (RESULTSET_SIZE / timeouts) to avoid runaway queries on unexpectedly large requests. [3][4][5]

References

  1. GRAPH.QUERY (parameterized query examples, Python). [1] https://docs.falkordb.com/commands/graph.query.html
  2. GRAPH.RO_QUERY (read-only query command). [2] https://docs.falkordb.com/commands/graph.ro-query.html
  3. FalkorDB configuration (RESULTSET_SIZE, timeouts). [3] https://docs.falkordb.com/getting-started/configuration.html
  4. Pagination best-practices / avoid OFFSET (cursor/id pagination recommended). [4] https://configzen.com/blog/master-pagination-mysql-pitfalls
  5. Cursor-based pagination overview (why it’s preferred for large sets). [5] https://www.carens.dev/a-friendly-guide-to-pagination

If you want, I can show the exact Python call form for graph.ro_query with params using the FalkorDB Python client.


I'll verify the actual code in main.py to ensure the refactoring suggestion is applicable. Once I have the code, I'll confirm the exact pagination pattern and determine whether the refactoring is accurate and appropriate.


🏁 Script executed:

#!/bin/bash
# Inspect the code at lines 25-28 and 83-86 in main.py
echo "=== Lines 25-28 ===" && sed -n '25,28p' main.py
echo ""
echo "=== Lines 83-86 ===" && sed -n '83,86p' main.py
echo ""
echo "=== Context around lines 25-28 (20-35) ===" && sed -n '20,35p' main.py
echo ""
echo "=== Context around lines 83-86 (75-95) ===" && sed -n '75,95p' main.py

Length of output: 1486


Replace SKIP/LIMIT with parameterized cursor-based pagination using indexed node/edge properties, not internal IDs.

FalkorDB's internal IDs are not immutable—higher IDs may be reassigned to lower positions after deletion, making ID-based cursors unreliable for pagination. Instead:

  1. Use indexed properties (e.g., created_at, uuid, or a stable node_id attribute) for cursor pagination.
  2. Parameterize queries to avoid string interpolation and improve performance.

For nodes (lines 25–28):

q = "MATCH (n) WHERE n.id > $last_id RETURN ID(n), labels(n), properties(n) ORDER BY n.id LIMIT $limit"
params = {"last_id": last_id, "limit": batch_size}
nodes_result = g.ro_query(q, params)

For edges (lines 83–86):

q = "MATCH (a)-[e]->(b) WHERE e.id > $last_id RETURN ID(e), TYPE(e), ID(a), ID(b), properties(e) ORDER BY e.id LIMIT $limit"
params = {"last_id": last_id, "limit": batch_size}
edges_result = g.ro_query(q, params)

Ensure the pagination property (n.id, e.id) is indexed and monotonically increasing.

Comment on lines +35 to +44
            labels = record[1]
            props = record[2] or {}

            # Handle nodes with multiple labels or no labels
            if labels:
                for label in labels:
                    node = {"id": node_id}
                    node.update(props)
                    nodes_by_label[label].append(node)
            else:

⚠️ Potential issue | 🟠 Major

labels(n) type guard to avoid iterating characters.

If labels(n) yields a string, the current loop will iterate per-char. Guard it.

-            labels = record[1]
+            labels = record[1] or []
+            if not isinstance(labels, (list, tuple)):
+                labels = [labels]
🤖 Prompt for AI Agents
In main.py around lines 35 to 44 the code may iterate a string label one
character at a time; before the loop ensure labels is treated as a sequence of
labels (not a str) — e.g. if labels is a str, wrap it as a single-item list (or
otherwise convert into the expected iterable) and then run the existing for
label in labels logic; keep the existing None/empty handling intact.

Comment on lines +41 to +48
                    node = {"id": node_id}
                    node.update(props)
                    nodes_by_label[label].append(node)
            else:
                # Handle nodes without labels
                node = {"id": node_id}
                node.update(props)
                nodes_by_label["unlabeled"].append(node)

⚠️ Potential issue | 🔴 Critical

Props can overwrite IDs — use non-colliding field names.

props may contain keys like 'id', overwriting your synthetic identifiers and corrupting exports.

Apply:

-                    node = {"id": node_id}
+                    node = {"node_id": node_id}
                     node.update(props)
...
-                node = {"id": node_id}
+                node = {"node_id": node_id}
                 node.update(props)
...
-            edge = {
-                "id": edge_id,
-                "from_id": from_id,
-                "to_id": to_id
-            }
+            edge = {
+                "edge_id": edge_id,
+                "from_id": from_id,
+                "to_id": to_id
+            }
             edge.update(props)

Also applies to: 98-105

🤖 Prompt for AI Agents
In main.py around lines 41 to 48 (and likewise at lines 98 to 105), props may
contain an "id" key and will overwrite the synthetic node_id when you call
node.update(props); fix by building nodes so props cannot collide with the
synthetic id—create the node dict with the synthetic id under "id" (or use a
reserved key like "_id") and merge props into a separate key or filter out "id"
from props before updating (e.g., copy props minus "id" or nest props under
"props") so the synthetic identifier is preserved.

Comment on lines +62 to +64
filename = f"nodes_{label}.csv"
pd.DataFrame(nodes).to_csv(filename, index=False)
print(f"✅ Exported {len(nodes)} nodes with label '{label}' to {filename}")

🛠️ Refactor suggestion | 🟠 Major

Sanitize filenames to avoid OS/path issues.

Label/type may contain spaces or separators. Sanitize before composing filenames.

-            filename = f"nodes_{label}.csv"
+            filename = f"nodes_{safe_filename(label)}.csv"
...
-            filename = f"edges_{edge_type}.csv"
+            filename = f"edges_{safe_filename(edge_type)}.csv"

Add once near imports:

import re

def safe_filename(name: str) -> str:
    name = str(name).strip().replace(" ", "_")
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
    return name[:200] or "unnamed"

Also applies to: 118-120

🤖 Prompt for AI Agents
In main.py around lines 62 to 64 (and also apply to lines 118 to 120), filenames
are composed directly from label/type which may contain spaces or unsafe
characters; add the provided safe_filename helper near the imports (and import
re) and use it when building filenames (e.g., filename =
f"nodes_{safe_filename(label)}.csv") so labels are sanitized, truncated to a
safe length, and invalid chars replaced.

Comment on lines +67 to +74
        all_nodes = []
        for label, nodes in nodes_by_label.items():
            for node in nodes:
                node["label"] = label
                all_nodes.append(node)
        if all_nodes:
            pd.DataFrame(all_nodes).to_csv("nodes.csv", index=False)
            print(f"✅ Exported {len(all_nodes)} nodes to nodes.csv")

⚠️ Potential issue | 🟠 Major

Avoid mutating records and avoid 'label'/'type' key collisions.

Use copy with reserved column names to prevent clobbering user props.

-        for label, nodes in nodes_by_label.items():
-            for node in nodes:
-                node["label"] = label
-                all_nodes.append(node)
+        for label, nodes in nodes_by_label.items():
+            for node in nodes:
+                all_nodes.append({**node, "node_label": label})
...
-        for edge_type, edges in edges_by_type.items():
-            for edge in edges:
-                edge["type"] = edge_type
-                all_edges.append(edge)
+        for edge_type, edges in edges_by_type.items():
+            for edge in edges:
+                all_edges.append({**edge, "edge_type": edge_type})

Also applies to: 124-129

🤖 Prompt for AI Agents
In main.py around lines 67-74 (and similarly 124-129) you are mutating the
original node dicts and potentially clobbering user keys named "label" or
"type"; instead, create a shallow copy of each node (e.g., node_copy =
node.copy()) and add a reserved column name (use a prefixed name unlikely to
collide, e.g., "_label" or "__label" and "_type"/"__type") to the copy before
appending; then write the list of copies to CSV so original records remain
unmodified and user properties are preserved.

Comment on lines +145 to +147
parser.add_argument("--username", help="FalkorDB username (optional)")
parser.add_argument("--password", help="FalkorDB password (optional)")
parser.add_argument(

⚠️ Potential issue | 🟠 Major

Passwords on the command line leak via ps/history — prefer env var or prompt.

Accepting --password is risky. Fallback to env or interactive prompt when username is set.

import os, getpass
...
args = parser.parse_args()
if args.username and not args.password:
    args.password = os.getenv("FALKORDB_PASSWORD") or getpass.getpass("FalkorDB password: ")

Document FALKORDB_PASSWORD in README.

Also applies to: 160-163

🤖 Prompt for AI Agents
In main.py around lines 145-147 and 160-163, avoid accepting plaintext passwords
via the --password CLI flag; when a username is provided and no password was
supplied on the command line, read the password from the FALKORDB_PASSWORD
environment variable or prompt interactively using getpass.getpass() instead.
Update argument parsing to treat --password as optional (or remove it) and add
post-parse logic that if args.username and not args.password then set
args.password = os.getenv("FALKORDB_PASSWORD") or prompt the user; ensure
imports for os and getpass are added and that interactive prompting only occurs
when stdin is a TTY. Also update the README to document the FALKORDB_PASSWORD
environment variable and the interactive prompt behavior.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
README.md (5)

97-101: Tighten auth examples to avoid teaching insecure habits.

Great examples. Add a one-line caution under the auth examples to note CLI password exposure; consider showing a safer pattern (e.g., prompt) if the tool supports it.

Suggested tweak:

 **Export with authentication:**
 ```bash
-python main.py social --username myuser --password mypass
+python main.py social --username myuser --password '<enter-password>'

+Note: Passing passwords on the command line can leak via history/ps. Prefer an interactive prompt or environment configuration if supported.



Also applies to: 102-106, 107-111, 112-116, 117-121

125-143: Document how unlabeled nodes are handled in split mode.

If unlabeled nodes exist, specify the filename convention (e.g., nodes_UNLABELED.csv) and whether multiple labels produce one row per node (union of labels) or a label-joined representation.



Proposed addition:

 #### Node files: `nodes_<label>.csv`
 Each node label gets its own file (e.g., `nodes_Person.csv`, `nodes_Company.csv`):
 - `id`: Node ID
 - Additional columns for each node property
+Notes:
+- Unlabeled nodes are written to `nodes_UNLABELED.csv` (if present).
+- Nodes with multiple labels appear in each corresponding label file.

153-160: Clarify column unification and missing values in combined mode.

State that the schema is the union of properties across labels/types and that missing properties are left blank/NA, so users know how to parse.

Suggested text:

 #### nodes.csv
 All nodes in a single file:
 - `id`: Node ID
 - `label`: Node label/type
 - Additional columns for each node property
+Schema notes: columns are the union of all encountered node properties; properties absent on a node are left empty.
@@
 #### edges.csv  
 All edges in a single file:
 - `id`: Edge ID
 - `type`: Relationship type
 - `from_id`: Source node ID
 - `to_id`: Target node ID
 - Additional columns for each edge property
+Schema notes: columns are the union of all encountered edge properties; properties absent on an edge are left empty.

161-164: Add --batch-size CLI flag for configurable pagination (currently hardcoded to 10,000).

Verified: batch_size is hardcoded to 10,000 in main.py (lines 21, 79) with no CLI or environment variable override. While this matches the README documentation, adding a --batch-size flag would improve usability for users with memory constraints or specific performance requirements.
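
A possible shape for such a flag, shown as a hedged sketch (the option does not exist in this PR, and export_graph would need a matching batch_size parameter for it to take effect):

import argparse

parser = argparse.ArgumentParser(description="Export a FalkorDB graph to CSV")
parser.add_argument("graph_name")
parser.add_argument(
    "--batch-size",
    type=int,
    default=10000,
    help="Number of nodes/edges fetched per paginated query (default: 10000)",
)

# Demo parse; in the real CLI this value would be threaded into export_graph(...)
args = parser.parse_args(["social", "--batch-size", "500"])
print(args.batch_size)  # -> 500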


76-93: Add documentation clarifications for flag exclusivity and password security.

The code correctly implements mutually exclusive flags via dest="split_by_type" with opposite actions (store_true vs. store_false), with default=True. However, the README help text does not clarify this relationship or the default behavior. Additionally, the implementation passes passwords directly from the CLI with no environment variable or interactive input alternative, making the security warning in the proposed diff valid and actionable.

Apply the suggested documentation update to clarify flag behavior and add the security note.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22232b0 and a7005f6.

📒 Files selected for processing (1)
  • README.md (2 hunks)
🔇 Additional comments (2)
README.md (2)

20-22: Nice additions to call out.

Feature bullets align with the PR scope; no issues.


80-81: Wording LGTM.

“by label/type” matches the two output modes described below.

- Test authentication parameter acceptance
- Test split-by-type mode (both True and False)
- Test combined mode output (single nodes.csv and edges.csv)
- Test default behavior (split mode)
- Test pagination with larger graphs (100 nodes)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
test_main.py (3)

247-254: Test doesn't validate authentication functionality.

The test passes None for both username and password, which only verifies that the function accepts these parameters but doesn't test actual authentication behavior. Consider adding test cases that:

  • Use real credentials when the FalkorDB test instance has authentication enabled
  • Test authentication failure scenarios with invalid credentials
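
A sketch of what the second case could look like, assuming a FalkorDB instance with authentication actually enabled (it would not pass against the unauthenticated server used by this suite):

import pytest
from main import export_graph

def test_export_with_invalid_credentials():
    # Bad credentials should surface as a connection/authentication error
    with pytest.raises(Exception):
        export_graph("social", "localhost", 6379,
                     username="wrong_user", password="wrong_pass")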

256-272: Use exact counts for more precise assertions.

The assertions use >= for minimum counts, but since the test_graph fixture creates a deterministic dataset (exactly 3 nodes and 2 edges), the test should assert exact values for more precision.

Apply this diff to strengthen the assertions:

-        assert len(nodes_df) >= 3, "Should have at least 3 nodes (2 Person + 1 Company)"
+        assert len(nodes_df) == 3, "Should have exactly 3 nodes (2 Person + 1 Company)"
         
         # Verify edges.csv has type column
         edges_df = pd.read_csv("edges.csv")
         assert "type" in edges_df.columns, "edges.csv should have type column"
-        assert len(edges_df) >= 2, "Should have at least 2 edges (WORKS_FOR + KNOWS)"
+        assert len(edges_df) == 2, "Should have exactly 2 edges (WORKS_FOR + KNOWS)"

297-298: Replace bare except with specific exceptions.

The bare except clauses on lines 297 and 321 are overly broad and can hide unexpected errors. Additionally, silently ignoring exceptions makes debugging harder.

Apply this diff to catch specific exceptions:

         # Clear any existing data
         try:
             g.query("MATCH (n) DETACH DELETE n")
-        except:
-            pass
+        except Exception as e:
+            # Graph might not exist yet or be empty
+            print(f"Warning: Failed to clear graph: {e}")
         
         # Create 100 nodes (simulating pagination, though batch size is 10000)

And similarly for the cleanup section:

         # Cleanup
         try:
             g.query("MATCH (n) DETACH DELETE n")
-        except:
-            pass
+        except Exception as e:
+            print(f"Warning: Failed to cleanup graph: {e}")

Also applies to: 321-322

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a7005f6 and 53f16a3.

📒 Files selected for processing (1)
  • test_main.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test_main.py (1)
main.py (1)
  • export_graph (7-135)
🪛 Ruff (0.14.2)
test_main.py

247-247: Unused method argument: temp_dir

(ARG002)


256-256: Unused method argument: temp_dir

(ARG002)


274-274: Unused method argument: temp_dir

(ARG002)


288-288: Unused method argument: temp_dir

(ARG002)


297-297: Do not use bare except

(E722)


297-298: try-except-pass detected, consider logging the exception

(S110)


321-321: Do not use bare except

(E722)


321-322: try-except-pass detected, consider logging the exception

(S110)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (3.12)
🔇 Additional comments (1)
test_main.py (1)

274-286: LGTM!

Good test coverage for verifying the default split-by-type behavior, including negative assertions to ensure combined files aren't created.

import pandas as pd

def export_graph(graph_name, host, port):
def export_graph(graph_name, host, port, username=None, password=None, split_by_type=True):

use url instead


    while True:
        nodes_result = g.ro_query(
            f"MATCH (n) RETURN ID(n), labels(n), properties(n) SKIP {skip} LIMIT {batch_size}"

It will be faster if, instead of SKIP, you filter with ID(n) > the last seen ID

shahar-biron (Author)

id is not necessarily sortable... I think current approach is "good enough"...


This query specifically scans the nodes by ID under the hood; it will be slow when the graph is very large.

…nation

- Use redis:// URL format for connection instead of separate parameters
- Replace SKIP-based pagination with ID-based filtering (WHERE ID(n) > last_id)
- Add ORDER BY ID(n)/ID(e) for consistent pagination
- Improves performance on large graphs by avoiding expensive SKIP operations
The ID-based pagination approach failed because FalkorDB/Cypher doesn't support
ORDER BY on ID() function. Reverting to SKIP-based pagination while keeping the
URL connection improvement from AviAvni's review.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (8)
main.py (8)

27-31: Parameterize queries instead of string interpolation.

Avoid f-string SQL/Cypher; improves safety and plan caching.

-        nodes_result = g.ro_query(
-            f"MATCH (n) WHERE ID(n) > {last_id} RETURN ID(n), labels(n), properties(n) ORDER BY ID(n) LIMIT {batch_size}"
-        )
+        nodes_query = (
+            "MATCH (n) WHERE ID(n) > $last_id "
+            "RETURN ID(n), labels(n), properties(n) "
+            "ORDER BY ID(n) LIMIT $limit"
+        )
+        nodes_result = g.ro_query(nodes_query, {"last_id": last_id, "limit": batch_size})
-        edges_result = g.ro_query(
-            f"MATCH (a)-[e]->(b) WHERE ID(e) > {last_id} RETURN ID(e), TYPE(e), ID(a), ID(b), properties(e) ORDER BY ID(e) LIMIT {batch_size}"
-        )
+        edges_query = (
+            "MATCH (a)-[e]->(b) WHERE ID(e) > $last_id "
+            "RETURN ID(e), TYPE(e), ID(a), ID(b), properties(e) "
+            "ORDER BY ID(e) LIMIT $limit"
+        )
+        edges_result = g.ro_query(edges_query, {"last_id": last_id, "limit": batch_size})

Also applies to: 87-91


66-66: Sanitize filenames derived from labels/types.

Avoid spaces/unsafe chars in filenames.

Add helper near imports:

+import re
+
+def safe_filename(name: str) -> str:
+    name = str(name).strip().replace(" ", "_")
+    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
+    return name[:200] or "unnamed"

Use it:

-            filename = f"nodes_{label}.csv"
+            filename = f"nodes_{safe_filename(label)}.csv"
-            filename = f"edges_{edge_type}.csv"
+            filename = f"edges_{safe_filename(edge_type)}.csv"

Also applies to: 124-124


71-75: Don’t mutate records; avoid 'label'/'type' collisions.

Use copies and reserved column names to prevent clobbering user props.

-        all_nodes = []
-        for label, nodes in nodes_by_label.items():
-            for node in nodes:
-                node["label"] = label
-                all_nodes.append(node)
+        all_nodes = []
+        for label, nodes in nodes_by_label.items():
+            for node in nodes:
+                all_nodes.append({**node, "node_label": label})
-        all_edges = []
-        for edge_type, edges in edges_by_type.items():
-            for edge in edges:
-                edge["type"] = edge_type
-                all_edges.append(edge)
+        all_edges = []
+        for edge_type, edges in edges_by_type.items():
+            for edge in edges:
+                all_edges.append({**edge, "edge_type": edge_type})

Also applies to: 129-133


27-61: Cursor pagination over internal IDs can miss/duplicate rows; use a monotonic, indexed property.

Internal IDs are not guaranteed stable across deletes; prefer WHERE n.my_cursor > $cursor ORDER BY n.my_cursor LIMIT $limit (same for edges).

If you have a property like created_at or uid, switch to it:

# Example (nodes)
q = ("MATCH (n) WHERE EXISTS(n.uid) AND n.uid > $cursor "
     "RETURN ID(n), labels(n), properties(n) ORDER BY n.uid LIMIT $limit")
params = {"cursor": last_uid, "limit": batch_size}

Confirm which properties are suitable and indexed.

Also applies to: 87-119


38-45: Guard labels type to avoid iterating characters.

labels can be a string; iterating would split into chars.

-            labels = record[1]
+            labels = record[1] or []
+            if not isinstance(labels, (list, tuple)):
+                labels = [labels]

44-51: Protect synthetic IDs from collision with props.

props may contain 'id'; node.update(props) will overwrite it.

-                    node = {"id": node_id}
-                    node.update(props)
+                    node = {"node_id": node_id}
+                    node.update({k: v for k, v in props.items() if k != "node_id"})
...
-                node = {"id": node_id}
-                node.update(props)
-                nodes_by_label["unlabeled"].append(node)
+                node = {"node_id": node_id}
+                node.update({k: v for k, v in props.items() if k != "node_id"})
+                nodes_by_label["__unlabeled__"].append(node)

103-109: Edge props can overwrite IDs; use non‑colliding keys.

Same collision risk as nodes.

-            edge = {
-                "id": edge_id,
-                "from_id": from_id,
-                "to_id": to_id
-            }
-            edge.update(props)
+            edge = {
+                "edge_id": edge_id,
+                "from_id": from_id,
+                "to_id": to_id
+            }
+            edge.update({k: v for k, v in props.items() if k not in ("edge_id", "from_id", "to_id")})

151-163: Avoid plaintext --password; support env var or interactive prompt.

CLI args leak via ps/history. Prefer FALKORDB_PASSWORD or getpass when username is set.

     args = parser.parse_args()
 
-    export_graph(args.graph_name, args.host, args.port, args.username, args.password, args.split_by_type)
+    import os, getpass, sys
+    if args.username and not args.password:
+        args.password = os.getenv("FALKORDB_PASSWORD")
+        if not args.password and sys.stdin.isatty():
+            args.password = getpass.getpass("FalkorDB password: ")
+    export_graph(args.graph_name, args.host, args.port, args.username, args.password, args.split_by_type)

Also document FALKORDB_PASSWORD in README.

Also applies to: 166-168

🧹 Nitpick comments (2)
main.py (2)

62-79: Memory scale: stream to CSV instead of accumulating whole graph.

Very large graphs will blow memory. Write per batch (append mode) and clear buffers.

I can provide a targeted patch that flushes each batch to CSV with header=not exists and mode="a" per label/type, keeping memory bounded.

Also applies to: 120-136
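
A minimal sketch of the batch-flush idea described above (illustrative names, not the actual patch): append each batch to its per-label CSV, write the header only when the file does not exist yet, and drop the buffer afterwards so memory stays bounded.

import os
import pandas as pd

def flush_node_batch(label, rows):
    """Append one batch of node dicts to nodes_<label>.csv and clear the buffer."""
    if not rows:
        return
    filename = f"nodes_{label}.csv"
    pd.DataFrame(rows).to_csv(
        filename,
        mode="a",
        index=False,
        header=not os.path.exists(filename),  # header only for the first batch
    )
    rows.clear()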


154-164: Simplify flags with BooleanOptionalAction; consider exposing batch size.

Cleaner UX and configurability.

  • Replace the paired flags with:
    • parser.add_argument("--split-by-type", action=argparse.BooleanOptionalAction, default=True)
  • Optionally add:
    • parser.add_argument("--batch-size", type=int, default=10000)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53f16a3 and bcf6139.

📒 Files selected for processing (1)
  • main.py (2 hunks)

Comment on lines +10 to +15
    if username and password:
        url = f"redis://{username}:{password}@{host}:{port}"
    elif username:
        url = f"redis://{username}@{host}:{port}"
    else:
        url = f"redis://{host}:{port}"

⚠️ Potential issue | 🟠 Major

Percent‑encode credentials in URL (special chars break the URI).

Usernames/passwords containing @/:?#& etc. will yield invalid URLs.

Apply:

-    if username and password:
-        url = f"redis://{username}:{password}@{host}:{port}"
-    elif username:
-        url = f"redis://{username}@{host}:{port}"
+    if username and password:
+        u = quote(username, safe="")
+        p = quote(password, safe="")
+        url = f"redis://{u}:{p}@{host}:{port}"
+    elif username:
+        u = quote(username, safe="")
+        url = f"redis://{u}@{host}:{port}"
     else:
         url = f"redis://{host}:{port}"

Add import once:

-from collections import defaultdict
+from collections import defaultdict
+from urllib.parse import quote
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    if username and password:
-        url = f"redis://{username}:{password}@{host}:{port}"
-    elif username:
-        url = f"redis://{username}@{host}:{port}"
-    else:
-        url = f"redis://{host}:{port}"
+    if username and password:
+        u = quote(username, safe="")
+        p = quote(password, safe="")
+        url = f"redis://{u}:{p}@{host}:{port}"
+    elif username:
+        u = quote(username, safe="")
+        url = f"redis://{u}@{host}:{port}"
+    else:
+        url = f"redis://{host}:{port}"
🤖 Prompt for AI Agents
In main.py around lines 10-15, the code builds a redis URL with raw
username/password which breaks when credentials contain reserved chars; import
urllib.parse.quote_plus once at the top and use it to percent-encode username
and password (e.g. encoded_user = quote_plus(username, safe='') and encoded_pass
= quote_plus(password, safe='')) before inserting into the f-strings, keeping
the same conditional branches (use encoded_user and encoded_pass where
appropriate) so the generated redis:// URL is always valid.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
main.py (3)

7-17: Consider adding connection error handling.

The function directly connects to FalkorDB without handling potential connection failures (invalid credentials, unreachable host, etc.). Adding a try-except block would provide better user feedback.

Example:

try:
    db = FalkorDB(url=url)
    g = db.select_graph(graph_name)
except Exception as e:
    print(f"❌ Failed to connect to FalkorDB: {e}")
    raise

23-23: Consider extracting batch size as a module constant.

The batch size 10000 is hardcoded in both node and edge pagination loops (lines 23 and 81). Defining it once as a module-level constant would improve maintainability.

+BATCH_SIZE = 10000
+
 def export_graph(graph_name, host, port, username=None, password=None, split_by_type=True):
     ...
-    batch_size = 10000
+    batch_size = BATCH_SIZE

149-154: Redundant default in --split-by-type argument.

The default=True is unnecessary when using action="store_true" because argparse already defaults unset boolean flags to False, and the --no-split-by-type flag handles the False case.

     parser.add_argument(
         "--split-by-type",
         action="store_true",
-        default=True,
         help="Create separate CSV files per node label and edge type (default: True)"
     )

Note: The current implementation works correctly; this is just a minor cleanup.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcf6139 and 991a69b.

📒 Files selected for processing (1)
  • main.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (3.12)
🔇 Additional comments (1)
main.py (1)

164-164: LGTM!

The function call correctly passes all new parameters including authentication credentials and the split_by_type flag.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
test_main.py (1)

289-323: Pagination test still doesn't validate multi-batch retrieval.

This test creates only 100 nodes, but the batch size in main.py is 10,000, so pagination logic (multiple SKIP/LIMIT iterations) is never exercised. The previous review flagged this and was marked as addressed in commit bcf6139, but the test still doesn't validate multi-batch pagination.

🧹 Nitpick comments (1)
test_main.py (1)

298-299: Use specific exception type or log the error.

The bare except clause should specify the expected exception type (e.g., except Exception as e:) or log the error for debugging. This applies to both cleanup blocks in this test.

Apply this diff to improve error handling:

         # Clear any existing data
         try:
             g.query("MATCH (n) DETACH DELETE n")
-        except:
-            pass
+        except Exception:
+            pass  # Graph might not exist yet

And similarly for the cleanup block at lines 320-323:

         # Cleanup
         try:
             g.query("MATCH (n) DETACH DELETE n")
-        except:
-            pass
+        except Exception:
+            pass  # Ignore cleanup errors
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 991a69b and 1f7cded.

📒 Files selected for processing (1)
  • test_main.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test_main.py (1)
main.py (1)
  • export_graph (7-137)
🪛 Ruff (0.14.2)
test_main.py

241-241: Unused method argument: temp_dir

(ARG002)


245-245: Do not assert blind exception: Exception

(B017)


248-248: Unused method argument: temp_dir

(ARG002)


257-257: Unused method argument: temp_dir

(ARG002)


275-275: Unused method argument: temp_dir

(ARG002)


289-289: Unused method argument: temp_dir

(ARG002)


298-298: Do not use bare except

(E722)


298-299: try-except-pass detected, consider logging the exception

(S110)


322-322: Do not use bare except

(E722)


322-323: try-except-pass detected, consider logging the exception

(S110)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (3.12)
🔇 Additional comments (3)
test_main.py (3)

241-246: LGTM! Connection error test improved.

The change to test localhost:1 is more reliable than the previous approach. The static analysis warnings about unused temp_dir and blind exception are false positives—temp_dir is used by pytest for its side effect (directory change and cleanup), and the blind exception assert is acceptable for verifying connection failures.


257-273: LGTM! Comprehensive combined mode test.

This test properly validates the split_by_type=False behavior by checking for combined files, required columns (label and type), and minimum data counts. Good coverage.


275-287: LGTM! Good default behavior validation.

This test correctly validates that split_by_type=True is the default by verifying separate files are created and combined files are not. The negative assertions are particularly valuable.

Comment on lines +248 to +255
    def test_export_with_auth_params(self, test_graph, temp_dir):
        """Test that export_graph accepts username and password parameters."""
        # Test that function accepts auth parameters (even if server doesn't require them)
        export_graph(test_graph, "localhost", 6379, username=None, password=None)

        # Should create the expected files
        assert os.path.exists("nodes_Person.csv"), "nodes_Person.csv should be created"
        assert os.path.exists("edges_WORKS_FOR.csv"), "edges_WORKS_FOR.csv should be created"

⚠️ Potential issue | 🟡 Minor

Test with actual username/password values.

This test passes username=None, password=None, which only verifies the function signature accepts auth parameters but doesn't test that credentials are actually used. Consider adding a test with real values (e.g., username="test_user", password="test_pass") to verify the authentication URL construction works correctly, even if the test server doesn't enforce auth.

Apply this diff to strengthen the test:

     def test_export_with_auth_params(self, test_graph, temp_dir):
-        """Test that export_graph accepts username and password parameters."""
-        # Test that function accepts auth parameters (even if server doesn't require them)
-        export_graph(test_graph, "localhost", 6379, username=None, password=None)
+        """Test that export_graph accepts and handles username and password parameters."""
+        # Test with actual auth parameters (even if server doesn't require them)
+        # This verifies the URL construction logic handles credentials correctly
+        export_graph(test_graph, "localhost", 6379, username="test_user", password="test_pass")
         
         # Should create the expected files
         assert os.path.exists("nodes_Person.csv"), "nodes_Person.csv should be created"
         assert os.path.exists("edges_WORKS_FOR.csv"), "edges_WORKS_FOR.csv should be created"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    def test_export_with_auth_params(self, test_graph, temp_dir):
-        """Test that export_graph accepts username and password parameters."""
-        # Test that function accepts auth parameters (even if server doesn't require them)
-        export_graph(test_graph, "localhost", 6379, username=None, password=None)
-        # Should create the expected files
-        assert os.path.exists("nodes_Person.csv"), "nodes_Person.csv should be created"
-        assert os.path.exists("edges_WORKS_FOR.csv"), "edges_WORKS_FOR.csv should be created"
+    def test_export_with_auth_params(self, test_graph, temp_dir):
+        """Test that export_graph accepts and handles username and password parameters."""
+        # Test with actual auth parameters (even if server doesn't require them)
+        # This verifies the URL construction logic handles credentials correctly
+        export_graph(test_graph, "localhost", 6379, username="test_user", password="test_pass")
+        # Should create the expected files
+        assert os.path.exists("nodes_Person.csv"), "nodes_Person.csv should be created"
+        assert os.path.exists("edges_WORKS_FOR.csv"), "edges_WORKS_FOR.csv should be created"
🧰 Tools
🪛 Ruff (0.14.2)

248-248: Unused method argument: temp_dir

(ARG002)

🤖 Prompt for AI Agents
In test_main.py around lines 248 to 255, the test only passes username=None and
password=None which doesn't verify that credentials are actually used; update
the test to also call export_graph with concrete credentials (e.g.,
username="test_user", password="test_pass"), run the export, and keep the
existing assertions that the expected files (nodes_Person.csv and
edges_WORKS_FOR.csv) are created; ensure you do this as a separate call within
the same test (or a new test) so the function's authentication parameters are
exercised without removing the original None-based check.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f7cded and 64de99c.

📒 Files selected for processing (1)
  • test_main.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test_main.py (1)
main.py (1)
  • export_graph (7-137)
🪛 Ruff (0.14.2)
test_main.py

241-241: Unused method argument: temp_dir

(ARG002)


250-250: Unused method argument: temp_dir

(ARG002)


268-268: Unused method argument: temp_dir

(ARG002)


282-282: Unused method argument: temp_dir

(ARG002)


291-291: Do not use bare except

(E722)


291-292: try-except-pass detected, consider logging the exception

(S110)


315-315: Do not use bare except

(E722)


315-316: try-except-pass detected, consider logging the exception

(S110)

🔇 Additional comments (3)
test_main.py (3)

250-266: LGTM!

This test properly validates the combined mode functionality by verifying both the presence of combined CSV files and the expected structure with label/type columns.


268-280: LGTM!

This test correctly validates that split_by_type=True is the default behavior by checking both positive cases (separate files exist) and negative cases (combined files don't exist).


241-248: This review comment is incorrect.

The suggestion to add test calls with actual credentials (username="test_user", password="test_pass") would fail against the real test FalkorDB instance, which has no authentication enabled. The tests use real FalkorDB connections (as shown by the falkordb_connection fixture connecting to localhost:6379), not mocks.

The current test is appropriate for this integration test environment: it verifies the function accepts the credential parameters and successfully creates output files with username=None, password=None, which correctly matches the test server's configuration. The export_graph function already contains proper credential URL construction logic (main.py lines 10-15) that handles all three cases (both credentials, username-only, and no credentials).

To test credential handling with authentication would require either: (1) mocking the FalkorDB connection, (2) configuring test infrastructure with auth enabled, or (3) unit testing the URL construction logic directly. None of these changes are necessary given the current integration test approach.

Likely an incorrect or invalid review comment.
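
For completeness, option (3) could look like the sketch below, assuming the URL building were factored out of export_graph into a small helper; build_url here is hypothetical, not an existing function in main.py:

def build_url(host, port, username=None, password=None):
    # Mirrors the credential branches described in the reviews above
    if username and password:
        return f"redis://{username}:{password}@{host}:{port}"
    if username:
        return f"redis://{username}@{host}:{port}"
    return f"redis://{host}:{port}"

def test_build_url_with_credentials():
    assert build_url("localhost", 6379, "u", "p") == "redis://u:p@localhost:6379"

def test_build_url_without_credentials():
    assert build_url("localhost", 6379) == "redis://localhost:6379"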

Comment on lines +282 to +316
    def test_export_pagination_large_graph(self, falkordb_connection, temp_dir):
        """Test that pagination works with larger datasets."""
        # Create a graph with more nodes than would fit in a single batch
        large_graph_name = "large_test_graph"
        g = falkordb_connection.select_graph(large_graph_name)

        # Clear any existing data
        try:
            g.query("MATCH (n) DETACH DELETE n")
        except:
            pass

        # Create 100 nodes (simulating pagination, though batch size is 10000)
        g.query("UNWIND range(1, 100) AS i CREATE (:TestNode {id: i, name: 'Node' + toString(i)})")

        # Create some edges
        g.query("MATCH (a:TestNode), (b:TestNode) WHERE a.id < b.id AND b.id - a.id = 1 CREATE (a)-[:NEXT]->(b)")

        export_graph(large_graph_name, "localhost", 6379)

        # Verify all nodes were exported
        assert os.path.exists("nodes_TestNode.csv"), "nodes_TestNode.csv should be created"
        nodes_df = pd.read_csv("nodes_TestNode.csv")
        assert len(nodes_df) == 100, f"Should have 100 nodes, got {len(nodes_df)}"

        # Verify edges were exported
        assert os.path.exists("edges_NEXT.csv"), "edges_NEXT.csv should be created"
        edges_df = pd.read_csv("edges_NEXT.csv")
        assert len(edges_df) == 99, f"Should have 99 edges, got {len(edges_df)}"

        # Cleanup
        try:
            g.query("MATCH (n) DETACH DELETE n")
        except:
            pass

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Test doesn't validate multi-batch pagination.

Although marked as addressed in a previous review, this test still creates only 100 nodes while batch_size in main.py is 10,000. This means pagination logic (multiple iterations with SKIP/LIMIT) is never exercised—all data is fetched in a single batch.

Consider one of these approaches:

  1. Mock the batch_size to a smaller value (e.g., 10) for testing
  2. Rename the test to test_export_moderate_dataset to reflect what it actually validates
  3. Create a separate pagination-specific test with a mocked batch size

Option 1: Mock batch_size to properly test pagination

+    @pytest.fixture(autouse=False)
+    def mock_batch_size(self, monkeypatch):
+        """Mock batch size for pagination testing."""
+        import main
+        # Store original batch_size if it becomes a module-level constant
+        # For now, we'll need to patch it within the function
+        yield
+
     def test_export_pagination_large_graph(self, falkordb_connection, temp_dir):
         """Test that pagination works with larger datasets."""
+        # Patch batch_size to test pagination with smaller dataset
+        import main
+        original_code = main.export_graph.__code__
+        # Note: Proper mocking would require refactoring main.py to make batch_size configurable
+        
         # Create a graph with more nodes than would fit in a single batch
         large_graph_name = "large_test_graph"

Better approach: Refactor main.py to accept batch_size as a parameter, then test with a small value like 10.


Looking at the review comment, I need to verify whether the test actually fails to exercise pagination logic. Let me gather information about the batch_size and export_graph implementation.


🏁 Script executed:

#!/bin/bash
# Find and examine main.py for batch_size and export_graph implementation
echo "=== Searching for batch_size definition ==="
rg -n "batch_size" --type=py -C 3

echo -e "\n=== Finding export_graph function ==="
ast-grep --pattern $'def export_graph($$$) {
  $$$
}'

echo -e "\n=== Alternative: search for export_graph function definition ==="
rg -n "def export_graph" --type=py -A 30

Length of output: 3254


🏁 Script executed:

#!/bin/bash
# Examine main.py for batch_size definition and export_graph implementation
echo "=== Searching for batch_size ==="
rg -n "batch_size" --type=py -C 2

echo -e "\n=== Searching for export_graph function ==="
rg -n "def export_graph" --type=py -A 40

Length of output: 3261


Test does not exercise multi-batch pagination logic.

The review is correct: batch_size is hardcoded to 10,000 in main.py (lines 23, 81), and the test creates only 100 nodes. This means the pagination while loop executes once and exits (100 < 10,000), never testing the SKIP/LIMIT iteration behavior.

The proposed diff in the review is incomplete. Since batch_size is a hardcoded constant inside export_graph, it cannot be easily mocked without refactoring.

Recommended fixes:

  1. Refactor export_graph to accept batch_size as a parameter with a sensible default (10,000), then pass a small value (e.g., 10) in this test.
  2. Rename the test to test_export_moderate_dataset to reflect that it validates correctness on a subset of data, not pagination behavior.
  3. Create a separate test_export_pagination test only after refactoring to support configurable batch size.
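
A sketch of fix (1) from the list above, assuming export_graph gained a batch_size parameter (a hypothetical refactor, not part of this PR):

import pandas as pd
from main import export_graph

def test_export_multi_batch_pagination(falkordb_connection, temp_dir):
    g = falkordb_connection.select_graph("large_test_graph")
    g.query("UNWIND range(1, 100) AS i CREATE (:TestNode {id: i})")

    # batch_size=10 forces the node loop to paginate roughly 10 times for 100 nodes
    export_graph("large_test_graph", "localhost", 6379, batch_size=10)

    nodes_df = pd.read_csv("nodes_TestNode.csv")
    assert len(nodes_df) == 100
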
🧰 Tools
🪛 Ruff (0.14.2)

282-282: Unused method argument: temp_dir

(ARG002)


291-291: Do not use bare except

(E722)


291-292: try-except-pass detected, consider logging the exception

(S110)


315-315: Do not use bare except

(E722)


315-316: try-except-pass detected, consider logging the exception

(S110)

@shahar-biron shahar-biron requested a review from AviAvni October 29, 2025 09:40