Skip to content

Conversation

@bveeramani
Copy link
Member

Thank you for contributing to Ray! 🚀
Please review the Ray Contribution Guide before opening a pull request.

⚠️ Remove these instructions before submitting your PR.

💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete.

Description

Briefly describe what this PR accomplishes and why it's needed.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner December 23, 2025 07:35
# distributed==2023.6.1; python_version < '3.12'
# dask[complete]==2025.5.0; python_version >= '3.12'
# distributed==2025.5.0; python_version >= '3.12'
# # pymars>=0.8.3; python_version < "3.12"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All test requirements commented out, no packages installed

The entire data-test-requirements.txt file has all package dependencies commented out. Every line is either a comment (prefixed with #) or blank. When install-dependencies.sh installs this file with DATA_PROCESSING_TESTING=1, no packages will actually be installed. This will cause data processing tests to fail with missing import errors for packages like tensorflow-datasets, pytest-repeat, pytest-mock, fastavro, google-cloud-bigquery, and many others that tests depend on.

Fix in Cursor Fix in Web

# # Install MongoDB
# sudo apt-get purge -y mongodb*
# sudo apt-get install -y mongodb
# sudo rm -rf /var/lib/mongodb/mongod.lock
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MongoDB installation commented out breaks database tests

The MongoDB installation code is commented out in the Dockerfile, but the datamongo.build.wanda.yaml config still uses ARROW_MONGO_VERSION=0.5.* and the Docker image is expected to support MongoDB tests. This will cause MongoDB-related data tests to fail due to the database not being installed in the container.

Fix in Cursor Fix in Web

# if [[ -n "$ARROW_MONGO_VERSION" ]]; then
# # Older versions of Arrow Mongo require an older version of NumPy.
# pip install numpy==1.23.5
# fi
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy version pinning for Arrow Mongo removed

The NumPy version pinning code for older Arrow Mongo versions is commented out. The datamongo.build.wanda.yaml still uses ARROW_MONGO_VERSION=0.5.*, and the comment explicitly states "Older versions of Arrow Mongo require an older version of NumPy." Without this pip install numpy==1.23.5, pymongoarrow==0.5.* may encounter compatibility issues or import errors with newer NumPy versions that get installed by default.

Fix in Cursor Fix in Web

@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Dec 23, 2025
@bveeramani bveeramani marked this pull request as draft December 23, 2025 07:42
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the CI configuration for data tests by switching to a new base Docker image (oss-ci-base_test) and standardizing the build steps in the Dockerfiles. While these changes improve consistency, there's a critical issue in python/requirements/ml/data-test-requirements.txt where all dependencies have been commented out, which will likely cause the data tests to fail. Additionally, there's some commented-out code in ci/docker/data.build.Dockerfile that should be removed for better maintainability.

Comment on lines +1 to +109
# numpy
# pandas
# pyarrow
# polars
# dask
# modin
# python-snappy

# # Integrations
# pyiceberg[sql-sqlite]==0.10.0
# delta-sharing
# hudi==0.4.0

# ## Lance
# pylance==0.22

# ## Avro
# fastavro

# ## Clickhouse
# clickhouse-connect



# # Webdataset
# msgpack
# webdataset

# # Mongo
# pymongo
# pymongoarrow
# bson


# # Snowflake
# snowflake-connector-python>=3.15.0

# # BigQueryDatasource
# google-cloud-bigquery
# google-cloud-core
# google-cloud-bigquery-storage
# google-api-core

# # KafkaDatasource
# kafka-python

# # MCAPDatasource
# mcap


# # TorchDatasource
# torch

# # TorchVision preprocessor
# torchvision


# # `test_sql`
# pymysql

# # TFRecords
# tensorflow
# tensorflow-datasets==4.9.3
# tensorflow-metadata
# crc32c==2.3
# # tfx-bsl

# pyspark
# transformers
# datasets>=3.0.2

# Pillow
# soundfile
# # decord
# av

# fsspec
# requests
# aiohttp
# pydantic
# packaging
# cloudpickle
# rich
# tqdm
# IPython
# ipywidgets
# PyYAML
# datasketches
# getdaft==0.4.3
# pytest
# testcontainers[kafka]


# pytest-repeat
# pytest-mock



# raydp==1.7.0b20250423.dev0

# deltalake==0.9.0

# pybase64

# dask[complete]==2023.6.1; python_version < '3.12'
# distributed==2023.6.1; python_version < '3.12'
# dask[complete]==2025.5.0; python_version >= '3.12'
# distributed==2025.5.0; python_version >= '3.12'
# # pymars>=0.8.3; python_version < "3.12"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

All packages in this requirements file have been commented out. This file is used to install dependencies for data processing tests when DATA_PROCESSING_TESTING=1 is set. With all lines commented, no dependencies will be installed, which will almost certainly cause the tests to fail due to missing packages. Please uncomment the dependencies that are required for the tests to run.

Comment on lines +26 to +34
# if [[ -n "$ARROW_MONGO_VERSION" ]]; then
# # Older versions of Arrow Mongo require an older version of NumPy.
# pip install numpy==1.23.5
# fi

# # Install MongoDB
# sudo apt-get purge -y mongodb*
# sudo apt-get install -y mongodb
# sudo rm -rf /var/lib/mongodb/mongod.lock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block of code related to ARROW_MONGO_VERSION and MongoDB installation has been commented out. If this logic is no longer needed, it should be removed completely instead of being commented out to improve code clarity and maintainability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

2 participants