Description
What feature or enhancement are you proposing?
Thank you so much for generously providing the entire training dataset! I believe there's an aspect of the N1 sub-dataset that could be further optimized to make it even more usable.
During my data inspection, I noticed an inconsistency in the ordering of task entries between two files:
In the file
InternData-N1-mini\vln_n1\traj_data\matterport3d_d435i\B6ByNegPMKs\trajectory_14\meta\tasks.jsonl,
the task instructions are structured as follows:
{"task_index": 0, "task": "{\"sub_instruction\": \"Walk straight ahead, passing the black office chair on your left and the whiteboard on your right. Stop at the end of the corridor where the wall meets the floor.\", \"sub_indexes\": [0, 49], \"revised_sub_instruction\": \"March forward with purpose, gliding past the sleek obsidian chair to your left and the chalk-clad board to your right. Arrive at the corridor’s terminus where the wall folds into the floor, signaling the endpoint.\"}"}
{"task_index": 1, "task": "{\"sum_instruction\": \"March forward with purpose, gliding past the sleek obsidian chair to your left and the chalk-clad board to your right. Arrive at the corridor's terminus where the wall folds into the floor, signaling the endpoint.\", \"sum_indexes\": [0, 49]}"}
Whereas in the file
InternData-N1-mini\vln_n1\traj_data\matterport3d_d435i\1LXtFkjw3qL\trajectory_2\meta\tasks.jsonl,
the order is reversed:
{"task_index": 0, "task": "{\"sum_instruction\": \"Maintain a straight course, with the sleek ebony chair drifting by to your left and flowing ivory drapes swaying on the right—halt where the hallway meets the gentle curve of the staircase.\", \"sum_indexes\": [0, 121]}"}
{"task_index": 1, "task": "{\"sub_instruction\": \"Walk straight ahead, passing the black armchair on your left and the white curtains on your right. Stop at the end of the hallway where the staircase begins.\", \"sub_indexes\": [0, 121], \"revised_sub_instruction\": \"Maintain a straight course, with the sleek ebony chair drifting by to your left and flowing ivory drapes swaying on the right—halt where the hallway meets the gentle curve of the staircase.\"}"}
Specifically, the entry with task_index: 0 corresponds to the "sub_instruction" in the first file but to the "sum_instruction" in the second file. This inconsistent ordering may negatively impact fine-tuning efforts for models that rely on consistent task indexing across scenes.
It would be very helpful if this inconsistency could be explicitly noted in the dataset documentation, enabling future users to perform appropriate preprocessing and avoid potential issues during training.
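In the meantime, a possible workaround is to identify each entry by the keys of its nested payload rather than by its task_index. The following is only a minimal sketch, assuming every payload contains either a "sub_instruction" or a "sum_instruction" key as in the two files quoted above; the function name classify_tasks is hypothetical and not part of the dataset tooling.

```python
import json


def classify_tasks(lines):
    """Group task entries by kind ("sub" or "sum") using the keys of the
    nested payload, not the position implied by task_index.

    `lines` is any iterable of JSONL strings, e.g. an open tasks.jsonl file.
    """
    grouped = {"sub": [], "sum": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        # The "task" field is itself a JSON-encoded string, so decode it again.
        payload = json.loads(entry["task"])
        if "sub_instruction" in payload:
            kind = "sub"
        elif "sum_instruction" in payload:
            kind = "sum"
        else:
            # Assumption: only these two kinds exist; fail loudly otherwise.
            raise ValueError(f"unrecognized task payload keys: {sorted(payload)}")
        grouped[kind].append({"task_index": entry["task_index"], **payload})
    return grouped
```

Passing an open file handle for a tasks.jsonl file (for example, `with open(path, encoding="utf-8") as f: grouped = classify_tasks(f)`) should then give the same grouping regardless of which entry comes first in the file.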
Motivation
I am suggesting this so that future users of the N1 sub-dataset can apply it more easily. (This might be a bug. I'm not sure, so I'm posting it under the Enhancement category.)
Additional information
No response