
Commit 1e61de3

Authored by Hoder-zyf, WinstonLiyt, Xu, Jensen246, and you-n-g
fix: refine the prompt to force complete code & refine the logic of running (#1069)
* change refine prompt for full code
* fix: fix the logic of running
* refine prompt
* fix some bugs
* fix
* add two guidelines
* refactor the code
* make costeer evaluator more logical
* refine eval prompt
* make costeer eval prompt markdown
* update code diff prompt
* correct pipeline
* feat: add apply_patch utility and update ret.py with patch functionality (#1071)
* restore to the right version
* fix the docstring
* fix extract_output fcn
* add inplace parameter to apply patch
* remove enable_runner_iteration and make the eval prompt same as main
* refine runner eval prompt based on main
* Update rdagent/scenarios/data_science/dev/runner/prompts.yaml
* add wait_retry
* refactor: move enable_runner_code_diff to DSRunnerCoSTEERSettings as diff_mode
* reformat and remove enable_runner_code_diff

---------

Co-authored-by: yuanteli <[email protected]>
Co-authored-by: Xu <[email protected]>
Co-authored-by: Jensen Lee <[email protected]>
Co-authored-by: you-n-g <[email protected]>
Co-authored-by: Qizheng Li <[email protected]>
1 parent 958ea8a commit 1e61de3

7 files changed: +735 −78 lines changed

rdagent/core/experiment.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -184,7 +184,7 @@ def link_all_files_in_folder_to_workspace(data_path: Path, workspace_path: Path)
         workspace_data_file_path = workspace_path / data_file_path.name
         if workspace_data_file_path.exists():
             workspace_data_file_path.unlink()
-        if platform.system() == "Linux":
+        if platform.system() in ("Linux", "Darwin"):
             os.symlink(data_file_path, workspace_data_file_path)
         if platform.system() == "Windows":
             os.link(data_file_path, workspace_data_file_path)
```
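For context, a minimal runnable sketch of the resulting behavior (the standalone helper below is hypothetical; in the repository this logic lives inside `link_all_files_in_folder_to_workspace`): `platform.system()` returns `"Darwin"` on macOS, so the symlink branch now covers macOS as well as Linux, while Windows keeps using hard links.

```python
import os
import platform
from pathlib import Path


def link_data_file(data_file_path: Path, workspace_path: Path) -> None:
    """Hypothetical standalone helper mirroring the changed branch above."""
    workspace_data_file_path = workspace_path / data_file_path.name
    if workspace_data_file_path.exists():
        workspace_data_file_path.unlink()
    if platform.system() in ("Linux", "Darwin"):
        # POSIX systems (Linux, macOS) can create symlinks without special privileges.
        os.symlink(data_file_path, workspace_data_file_path)
    if platform.system() == "Windows":
        # Symlink creation on Windows may require elevated rights; hard links avoid that.
        os.link(data_file_path, workspace_data_file_path)
```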

rdagent/scenarios/data_science/dev/runner/__init__.py
Lines changed: 28 additions & 10 deletions

```diff
@@ -21,8 +21,9 @@
 from rdagent.log import rdagent_logger as logger
 from rdagent.oai.llm_utils import APIBackend, md5_hash
 from rdagent.scenarios.data_science.dev.runner.eval import DSCoSTEERCoSTEEREvaluator
-from rdagent.utils.agent.ret import PythonBatchEditOut
+from rdagent.utils.agent.ret import PythonBatchEditOut, PythonBatchPatchOut
 from rdagent.utils.agent.tpl import T
+from rdagent.utils.workflow import wait_retry
 
 
 class DSRunnerCoSTEERSettings(DSCoderCoSTEERSettings):
@@ -33,40 +34,57 @@ class Config:
 
     max_seconds: int = DS_RD_SETTING.full_timeout
     env_type: str = "docker"
+    diff_mode: bool = False
     # TODO: extract a function for env and conf.
 
 
 class DSRunnerMultiProcessEvolvingStrategy(MultiProcessEvolvingStrategy):
+    @wait_retry(retry_n=5)
     def implement_one_task(
         self,
         target_task: CoSTEERTask,
         queried_knowledge: CoSTEERQueriedKnowledge | None = None,
         workspace: FBWorkspace | None = None,
         prev_task_feedback: CoSTEERSingleFeedback | None = None,
     ) -> dict[str, str]:
+
         if prev_task_feedback is None:
-            # if no prev_tak_feedback, it is the first loop; we do not make any changes and goto evaluators directly.
+            # if no prev_task_feedback, it is the first loop; we do not make any changes and goto evaluators directly.
             return {}
+
+        # Output Agent Map
+        output_map = {
+            True: (PythonBatchPatchOut.get_spec(), PythonBatchPatchOut.extract_output),
+            False: (
+                PythonBatchEditOut.get_spec(with_del=False),
+                PythonBatchEditOut.extract_output,
+            ),
+        }
+        output_spec, extract_output_fn = output_map[self.settings.diff_mode]
+
         if prev_task_feedback.hyperparameter_tuning_decision:
-            task_information_str = target_task.get_task_information()
-            # 1. code
+            # Use system_refine for hyperparameter tuning
             system_prompt = T(".prompts:DSCoSTEER.system_refine").r(
-                out_spec=PythonBatchEditOut.get_spec(with_del=False),
+                out_spec=output_spec,
+                diff_mode=self.settings.diff_mode,
             )
         else:
             task_information_str = target_task.get_task_information()
-            # 1. code
-            system_prompt = T(".prompts:DSCoSTEER.system_debugger").r(
+            # Use system_debugger for error fixing and debugging
+            system_prompt = T(".prompts:DSCoSTEER.system_refine").r(
                 task_desc=task_information_str,
-                out_spec=PythonBatchEditOut.get_spec(with_del=False),
+                out_spec=output_spec,
+                diff_mode=self.settings.diff_mode,
             )
+
+        # Generate user prompt for both cases
         user_prompt = T(".prompts:DSCoSTEER.user").r(
             code=workspace.all_codes,
             feedback=prev_task_feedback,
             hyperparameter_tuning_suggestion=prev_task_feedback.hyperparameter_tuning_suggestion,
         )
 
-        batch_edit = PythonBatchEditOut.extract_output(
+        batch_edit = extract_output_fn(
             APIBackend().build_messages_and_create_chat_completion(
                 user_prompt=user_prompt,
                 system_prompt=system_prompt,
@@ -112,7 +130,7 @@ def __init__(
         settings = DSRunnerCoSTEERSettings()
         es = DSRunnerMultiProcessEvolvingStrategy(scen=scen, settings=settings)
 
-        # In runner, we don't need very big loops, so we set max_loop to 3
+        # In runner, we don't need very big loops, so we set max_loop to runner_max_loop
         super().__init__(
             *args,
             settings=settings,
```
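`wait_retry` is imported from `rdagent.utils.workflow`, whose implementation is not shown in this diff. A rough sketch of the semantics its usage (`@wait_retry(retry_n=5)`) suggests — retry the decorated call up to `retry_n` times, pausing between attempts — might look like the following; the sleep interval and the caught exception type are assumptions:

```python
import functools
import time


def wait_retry(retry_n: int = 5, sleep_seconds: float = 1.0):
    """Hypothetical stand-in for rdagent.utils.workflow.wait_retry."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retry_n):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == retry_n - 1:
                        raise  # all retries exhausted; propagate the last error
                    time.sleep(sleep_seconds)

        return wrapper

    return decorator
```

Dispatching through `output_map` keyed on `self.settings.diff_mode` keeps the two output formats (patch-style `PythonBatchPatchOut` versus full-file `PythonBatchEditOut`) behind a single `(spec, extractor)` pair, so the rest of `implement_one_task` stays format-agnostic.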
rdagent/scenarios/data_science/dev/runner/prompts.yaml
Lines changed: 95 additions & 64 deletions

```diff
@@ -1,57 +1,60 @@
 DSCoSTEER_eval:
   system: |-
-    You are a data scientist responsible for evaluating all the code.
-
-    ## Target Task Description
-    The user is trying to build a data science solution in the following scenario:
+    {% include "scenarios.data_science.share:scen.role" %}
+    {% if is_sub_enabled %}
+    You will be provided with:
+    1. `Code base`: The code base of the solution
+    2. `The stdout of code execution and testing`: The generated stdout when executing the code base and corresponding testing
+    3. `The time spent on code execution`: The time spent on the code execution
+    4. `The timeout of code execution`: the time limitation of the code execution
+    5. `The percent of timeout used`: the percentage of the time limitation used
+    Your task is to perform the following evaluation(s):
+
+    # Evaluation 1: Code Correctness
+    ## Scenario
+    The code is focusing on the following scenario:
     {{ scenario }}
 
-    The task is as follows:
+    ## Target Task Description
+    The code is focusing on the following task:
     {{ task_desc }}
 
+    ## Runtime Environment
     You have following environment to run the code:
     {{ runtime_environment }}
 
-    The whole workflow includes multiple stages, such as:
-    - Data loading
-    - Feature engineering
-    - Model training
-    - Ensembling
-
-    ## You'll be provided with the following information about a solution to the Target Task
-    `code base`: The code base of the solution
-    `the stdout of code execution and testing`: The generated stdout when executing the code base and corresponding testing
-    `the time spent on code execution`: The time spent on the code execution
-    `the timeout of code execution`: the time limitation of the code execution
-    `the percent of timeout used`: the percentage of the time limitation used
-
-    ## Your task is to provide feedback on the solution to the Target Task
-    In the feedback response,
-    Evaluate the code base based on several aspects, including execution, return checking, and code quality. After your evaluation, make a clear decision to either accept or reject the solution in the `final_decision` section.
-
+    ## Evaluation Guidelines
+    1. Evaluate the code base based on several aspects, including execution correctness, return checking, and code quality.
+    2. Ensure the code does not contain any incorrect, fabricated, or deceptive operations, such as mocking data, scores, or results.
+    3. Confirm that the prediction file (`submission.csv`) is generated using only the test dataset, and its format matches the sample submission.
+    If the code does not satisfy the requirements:
+    - Set "final_decision" to false.
+    - Set "hyperparameter_tuning_decision" to false.
+    - Set "hyperparameter_tuning_suggestion" to an empty string.
+    If the code satisfies the requirements:
+    - Proceed to the next evaluation.
+
+    # Evaluation 2: Hyperparameter
+    ## Evaluation Description
     The user will provide you the time spent on the whole code execution and the timeout of the code execution. You should decide whether the hyperparameter is reasonable based on the time.
     For example, if the code uses only a very small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters.
-
     You should also notice other resources utilization hyper-parameters,
     For example, if you are using a GPU with large memory, and the batch size is set very low, you should suggest increasing the batch size if it is not reasonable.
 
-    Please provide your feedback in two key-value pairs:
-    "hyperparameter_tuning_decision": <true/false>
-    "hyperparameter_tuning_suggestion": <suggestion in plain text for hyperparameter tuning, e.g., increase n_estimators to 1000, increase epochs to 100, increase batch size to 64, give an empty string if decide not to tune the hyperparameter>
-    [Notice]
-    - You should only suggest the hyperparameter tuning if the code applies early stopping strategy because increasing the training time blindly may lead to overfitting. Once you found the code didn't apply early stopping strategy, you should not suggest to tune the hyperparameter.
-    - Your suggestion should be reasonable and include not only the target hyperparameter but also the hyperparameter sets.
-    - Your suggestion should have a strong chance of improving the model's performance. Focus on the most obvious and impactful opportunities for quick improvement by leveraging more training time. Don't explore hyperparameters with low confidence. If there are no obvious and impactful opportunities and the code runs well, please accept it.
-    - Once you decide to tune the hyperparameter you should set "final_decision" to false.
-    [Format]
-    - "hyperparameter_tuning_suggestion" should begin with a clear observation, followed by your suggestion. For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still going down and early stopping was not activated. Only 15% of the allowed time was used. [Suggestion] We recommend increasing epochs to 100 to avoid underfitting and further improve model performance."
-
-    {% if is_sub_enabled %}
-    The user will provide you the whole code base, some logs generated during the execution of the whole workflow. Your evaluation scope includes whether the workflow code:
-    1. Executes successfully, correctly organizing components and generating a final submission.
-    2. Generates predictions in the correct format, ensuring they align with the **sample submission** structure!
-
-    Please respond with your feedback in the following JSON format and order
+    ## Evaluation Guidelines
+    1. The code execution time or resource utilization suggests that there is room for improvement in the hyperparameters.
+    2. The code must apply early stopping strategy already (in order to prevent overfitting).
+    3. Your suggestion should have a strong chance of improving the model's performance. Focus on the most obvious and impactful opportunities for quick improvement by leveraging more training time. Don't explore hyperparameters with low confidence. If there are no obvious and impactful opportunities and the code runs well, please accept it.
+    If the code satisfies the requirements:
+    - Set "hyperparameter_tuning_decision" to true.
+    - Set "final_decision" to false.
+    - Provide a reasonable suggestion in "hyperparameter_tuning_suggestion". The "hyperparameter_tuning_suggestion" should begin with a clear observation, followed by your suggestion. For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still going down and early stopping was not activated. Only 15% of the allowed time was used. [Suggestion] We recommend increasing epochs to 100 to avoid underfitting and further improve model performance."
+    If the code does not satisfy the requirements:
+    - Set "hyperparameter_tuning_decision" to false.
+    - Set "hyperparameter_tuning_suggestion" to an empty string.
+
+    ## Output format
+    Please respond with your feedback in the following JSON format and order without anything else:
     ```json
     {
       "execution": "Describe whether the whole code base executed successfully and generating the final submission. Include any errors or issues encountered, and retain all error messages and traceback details.",
@@ -81,62 +84,90 @@ DSCoSTEER_eval:
   # NOTE: when is_sub_enabled == False, we don't have any checking about the return. So it is just placeholder currently
 
   user: |-
-    --------- code base ---------
+    # Code base
     {{ code }}
-    --------- the stdout of code execution and testing ---------
+
+    ## Stdout of code execution and testing
     {{ stdout }}
-    --------- the time spent on code execution ---------
+
+    # The time spent on code execution and timeout
     {{ time_spent }}
-    --------- the timeout of code execution ---------
+
+    ## The timeout of code execution
     {{ timeout }}
-    --------- the percent of timeout used ---------
+
+    ## The percent of timeout used
     {{ percent_of_timeout_used }}
 
 DSCoSTEER:
   system_debugger: |-
-    You are a world-class data scientist and machine learning engineer with deep expertise in statistics, mathematics, and computer science.
-    You have finished the implementation of the whole workflow which has executed well on a sampled dataset. However, the user has reported that the workflow failed to execute on the full dataset.
-
-    Your current job is to debug the whole code base, try to correct the errors, and ensure that the workflow can execute successfully on the full dataset.
-    The user will provide your the whole code base and some feedback generated during the execution of the whole workflow. Please identify the issues and provide the corrected code.
-
-    Task description:
+    {% include "scenarios.data_science.share:scen.role" %}
+    You have finished the implementation of the whole workflow which has executed well on a sampled dataset. Now we are working on the full dataset.
+    The user has reported that the workflow failed to execute on the full dataset.
+    You will be provided with:
+    1. Code base.
+    2. Task description, which is the task the code is trying to solve.
+    3. Feedback generated during the execution of the whole workflow.
+    4. Suggestions for hyperparameter tuning.
+    Your job is to debug the whole code base, try to correct the errors, and ensure that the workflow can execute successfully on the full dataset.
+
+    ## Task description
     {{ task_desc }}
 
-    Your modified code should follow the minimal changes principle. You should only modify the code that is necessary to fix the issues but not affect any other parts of the code. Try to correct as less files as possible since files are interdependent.
+    ## Instructions
+    1. Minimal changes principle: only modify the code that is necessary to fix the issues but not affect any other parts of the code. Try to correct as few files as possible since files are interdependent.
+    {% if diff_mode %}
+    2. You must output in Code Diff format. The detailed format specification is as follows.
+    {% else %}
+    2. You must output the COMPLETE and FULL code. Do not truncate, summarize, or omit any parts of the code. Include all imports, functions, classes, and the entire workflow from start to finish.
+    {% endif %}
 
     ## Output Format
     {% if out_spec %}
     {{ out_spec }}
    {% else %}
-    Please response the code in the following json format. Here is an example structure for the JSON output:
+    Please respond with the code in the following JSON format without anything else.
     {
       "code": "The Python code as a string."
     }
     {% endif %}
-  system_refine: |-
-    You are a world-class data scientist and machine learning engineer with deep expertise in statistics, mathematics, and computer science.
-    You have finished the implementation of the whole workflow which has executed well on a sampled dataset. However, the user has reported that the hyperparameters are not reasonable and the code didn't make the best use of the time limit.
-    Your current job is to refine the whole code base, try to refine the hyperparameters.
 
-    The user will provide your the whole code base, some feedback generated during the execution of the whole workflow and some suggestions for hyperparameter tuning.
-    Your modified code should follow the minimal changes principle. Only modify the hyperparameters that is necessary.
+  system_refine: |-
+    {% include "scenarios.data_science.share:scen.role" %}
+    You have finished the implementation of the whole workflow which has executed well on a sampled dataset. Now we are working on the full dataset.
+    The user has reported that the hyperparameters are not reasonable and the code didn't make the best use of the time limit.
+    You will be provided with:
+    1. Code base.
+    2. Feedback generated during the execution of the whole workflow.
+    3. Suggestions for hyperparameter tuning.
+    Your task is to refine the code base and modify the hyperparameters based on the feedback and suggestions.
+
+    ## Instructions
+    1. Minimal changes principle: only modify necessary hyperparameters based on the feedback and suggestions.
+    {% if diff_mode %}
+    2. You must output in Code Diff format. The detailed format specification is as follows.
+    {% else %}
+    2. You must output the COMPLETE and FULL code. Do not truncate, summarize, or omit any parts of the code. Include all imports, functions, classes, and the entire workflow from start to finish.
+    {% endif %}
 
     ## Output Format
     {% if out_spec %}
     {{ out_spec }}
     {% else %}
-    Please response the code in the following json format. Here is an example structure for the JSON output:
+    Please respond with the code in the following JSON format without anything else.
     {
       "code": "The Python code as a string."
     }
     {% endif %}
+
   user: |-
-    --------- code base ---------
+    # Code Base
     {{ code }}
-    --------- feedback ---------
+
+    ## Feedback
     {{ feedback }}
+
     {% if hyperparameter_tuning_suggestion is not none %}
-    --------- hyperparameter tuning suggestions ---------
+    ## Hyperparameter Tuning Suggestion
     {{ hyperparameter_tuning_suggestion }}
     {% endif %}
```
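The evaluator prompt now pins the response to a strict JSON contract. A minimal consumer-side validation sketch under that contract (the helper itself is hypothetical; the four key names come from the prompt above, and the checks mirror its guidelines):

```python
import json


def parse_evaluator_feedback(raw_response: str) -> dict:
    """Hypothetical validator for the evaluator's JSON contract described above."""
    feedback = json.loads(raw_response)
    # Keys named in the prompt's output-format section.
    for key in (
        "execution",
        "final_decision",
        "hyperparameter_tuning_decision",
        "hyperparameter_tuning_suggestion",
    ):
        if key not in feedback:
            raise ValueError(f"evaluator response missing required key: {key}")
    # Per the guidelines, deciding to tune forces final_decision to false,
    # and declining to tune requires an empty suggestion string.
    if feedback["hyperparameter_tuning_decision"] and feedback["final_decision"]:
        raise ValueError("a tuning decision must set final_decision to false")
    if not feedback["hyperparameter_tuning_decision"] and feedback["hyperparameter_tuning_suggestion"]:
        raise ValueError("suggestion must be an empty string when tuning is declined")
    return feedback
```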
