-
Notifications
You must be signed in to change notification settings - Fork 7k
[DNR] #59624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[DNR] #59624
Conversation
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
| # distributed==2023.6.1; python_version < '3.12' | ||
| # dask[complete]==2025.5.0; python_version >= '3.12' | ||
| # distributed==2025.5.0; python_version >= '3.12' | ||
| # # pymars>=0.8.3; python_version < "3.12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All test requirements commented out, no packages installed
The entire data-test-requirements.txt file has all package dependencies commented out. Every line is either a comment (prefixed with #) or blank. When install-dependencies.sh installs this file with DATA_PROCESSING_TESTING=1, no packages will actually be installed. This will cause data processing tests to fail with missing import errors for packages like tensorflow-datasets, pytest-repeat, pytest-mock, fastavro, google-cloud-bigquery, and many others that tests depend on.
| # # Install MongoDB | ||
| # sudo apt-get purge -y mongodb* | ||
| # sudo apt-get install -y mongodb | ||
| # sudo rm -rf /var/lib/mongodb/mongod.lock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MongoDB installation commented out breaks database tests
The MongoDB installation code is commented out in the Dockerfile, but the datamongo.build.wanda.yaml config still uses ARROW_MONGO_VERSION=0.5.* and the Docker image is expected to support MongoDB tests. This will cause MongoDB-related data tests to fail due to the database not being installed in the container.
| # if [[ -n "$ARROW_MONGO_VERSION" ]]; then | ||
| # # Older versions of Arrow Mongo require an older version of NumPy. | ||
| # pip install numpy==1.23.5 | ||
| # fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPy version pinning for Arrow Mongo removed
The NumPy version pinning code for older Arrow Mongo versions is commented out. The datamongo.build.wanda.yaml still uses ARROW_MONGO_VERSION=0.5.*, and the comment explicitly states "Older versions of Arrow Mongo require an older version of NumPy." Without this pip install numpy==1.23.5, pymongoarrow==0.5.* may encounter compatibility issues or import errors with newer NumPy versions that get installed by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the CI configuration for data tests by switching to a new base Docker image (oss-ci-base_test) and standardizing the build steps in the Dockerfiles. While these changes improve consistency, there's a critical issue in python/requirements/ml/data-test-requirements.txt where all dependencies have been commented out, which will likely cause the data tests to fail. Additionally, there's some commented-out code in ci/docker/data.build.Dockerfile that should be removed for better maintainability.
| # numpy | ||
| # pandas | ||
| # pyarrow | ||
| # polars | ||
| # dask | ||
| # modin | ||
| # python-snappy | ||
|
|
||
| # # Integrations | ||
| # pyiceberg[sql-sqlite]==0.10.0 | ||
| # delta-sharing | ||
| # hudi==0.4.0 | ||
|
|
||
| # ## Lance | ||
| # pylance==0.22 | ||
|
|
||
| # ## Avro | ||
| # fastavro | ||
|
|
||
| # ## Clickhouse | ||
| # clickhouse-connect | ||
|
|
||
|
|
||
|
|
||
| # # Webdataset | ||
| # msgpack | ||
| # webdataset | ||
|
|
||
| # # Mongo | ||
| # pymongo | ||
| # pymongoarrow | ||
| # bson | ||
|
|
||
|
|
||
| # # Snowflake | ||
| # snowflake-connector-python>=3.15.0 | ||
|
|
||
| # # BigQueryDatasource | ||
| # google-cloud-bigquery | ||
| # google-cloud-core | ||
| # google-cloud-bigquery-storage | ||
| # google-api-core | ||
|
|
||
| # # KafkaDatasource | ||
| # kafka-python | ||
|
|
||
| # # MCAPDatasource | ||
| # mcap | ||
|
|
||
|
|
||
| # # TorchDatasource | ||
| # torch | ||
|
|
||
| # # TorchVision preprocessor | ||
| # torchvision | ||
|
|
||
|
|
||
| # # `test_sql` | ||
| # pymysql | ||
|
|
||
| # # TFRecords | ||
| # tensorflow | ||
| # tensorflow-datasets==4.9.3 | ||
| # tensorflow-metadata | ||
| # crc32c==2.3 | ||
| # # tfx-bsl | ||
|
|
||
| # pyspark | ||
| # transformers | ||
| # datasets>=3.0.2 | ||
|
|
||
| # Pillow | ||
| # soundfile | ||
| # # decord | ||
| # av | ||
|
|
||
| # fsspec | ||
| # requests | ||
| # aiohttp | ||
| # pydantic | ||
| # packaging | ||
| # cloudpickle | ||
| # rich | ||
| # tqdm | ||
| # IPython | ||
| # ipywidgets | ||
| # PyYAML | ||
| # datasketches | ||
| # getdaft==0.4.3 | ||
| # pytest | ||
| # testcontainers[kafka] | ||
|
|
||
|
|
||
| # pytest-repeat | ||
| # pytest-mock | ||
|
|
||
|
|
||
|
|
||
| # raydp==1.7.0b20250423.dev0 | ||
|
|
||
| # deltalake==0.9.0 | ||
|
|
||
| # pybase64 | ||
|
|
||
| # dask[complete]==2023.6.1; python_version < '3.12' | ||
| # distributed==2023.6.1; python_version < '3.12' | ||
| # dask[complete]==2025.5.0; python_version >= '3.12' | ||
| # distributed==2025.5.0; python_version >= '3.12' | ||
| # # pymars>=0.8.3; python_version < "3.12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All packages in this requirements file have been commented out. This file is used to install dependencies for data processing tests when DATA_PROCESSING_TESTING=1 is set. With all lines commented, no dependencies will be installed, which will almost certainly cause the tests to fail due to missing packages. Please uncomment the dependencies that are required for the tests to run.
| # if [[ -n "$ARROW_MONGO_VERSION" ]]; then | ||
| # # Older versions of Arrow Mongo require an older version of NumPy. | ||
| # pip install numpy==1.23.5 | ||
| # fi | ||
|
|
||
| # # Install MongoDB | ||
| # sudo apt-get purge -y mongodb* | ||
| # sudo apt-get install -y mongodb | ||
| # sudo rm -rf /var/lib/mongodb/mongod.lock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description
Related issues
Additional information