|
1 | 1 | DSCoSTEER_eval: |
2 | 2 | system: |- |
3 | | - You are a data scientist responsible for evaluating all the code. |
4 | | -
|
5 | | - ## Target Task Description |
6 | | - The user is trying to build a data science solution in the following scenario: |
| 3 | + {% include "scenarios.data_science.share:scen.role" %} |
| 4 | + {% if is_sub_enabled %} |
| 5 | + You will be provided with: |
| 6 | + 1. `Code base`: The code base of the solution |
| 7 | + 2. `The stdout of code execution and testing`: The generated stdout when executing the code base and corresponding testing |
| 8 | + 3. `The time spent on code execution`: The time spent on the code execution |
| 9 | + 4. `The timeout of code execution`: the time limitation of the code execution |
| 10 | + 5. `The percent of timeout used`: the percentage of the time limitation used |
| 11 | + Your task is to perform the following evaluation(s): |
| 12 | +
|
| 13 | + # Evaluation 1: Code Correctness |
| 14 | + ## Scenario |
| 15 | + The code is focusing on the following scenario: |
7 | 16 | {{ scenario }} |
8 | 17 |
|
9 | | - The task is as follows: |
| 18 | + ## Target Task Description |
| 19 | + The code is focusing on the following task: |
10 | 20 | {{ task_desc }} |
11 | 21 |
|
| 22 | + ## Runtime Environment |
12 | 23 | You have the following environment to run the code: |
13 | 24 | {{ runtime_environment }} |
14 | 25 |
|
15 | | - The whole workflow includes multiple stages, such as: |
16 | | - - Data loading |
17 | | - - Feature engineering |
18 | | - - Model training |
19 | | - - Ensembling |
20 | | -
|
21 | | - ## You'll be provided with the following information about a solution to the Target Task |
22 | | - `code base`: The code base of the solution |
23 | | - `the stdout of code execution and testing`: The generated stdout when executing the code base and corresponding testing |
24 | | - `the time spent on code execution`: The time spent on the code execution |
25 | | - `the timeout of code execution`: the time limitation of the code execution |
26 | | - `the percent of timeout used`: the percentage of the time limitation used |
27 | | - |
28 | | - ## Your task is to provide feedback on the solution to the Target Task |
29 | | - In the feedback response, |
30 | | - Evaluate the code base based on several aspects, including execution, return checking, and code quality. After your evaluation, make a clear decision to either accept or reject the solution in the `final_decision` section. |
31 | | -
|
| 26 | + ## Evaluation Guidelines |
| 27 | + 1. Evaluate the code base based on several aspects, including execution correctness, return checking, and code quality. |
| 28 | + 2. Ensure the code does not contain any incorrect, fabricated, or deceptive operations, such as mocking data, scores, or results. |
| 29 | + 3. Confirm that the prediction file (`submission.csv`) is generated using only the test dataset, and its format matches the sample submission. |
| 30 | + If the code does not satisfy the requirements: |
| 31 | + - Set "final_decision" to false. |
| 32 | + - Set "hyperparameter_tuning_decision" to false. |
| 33 | + - Set "hyperparameter_tuning_suggestion" to an empty string. |
| 34 | + If the code satisfies the requirements: |
| 35 | + - Proceed to the next evaluation. |
| 36 | +
|
| 37 | + # Evaluation 2: Hyperparameter |
| 38 | + ## Evaluation Description |
32 | 39 | The user will provide you the time spent on the whole code execution and the timeout of the code execution. You should decide whether the hyperparameter is reasonable based on the time. |
33 | 40 | For example, if the code uses only a very small portion of the allowed time, and hyperparameters like `n_estimators` or `epochs` have low values, with early stopping not being triggered and possible signs of underfitting, you should suggest increasing these hyperparameters. |
34 | | -
|
35 | 41 | You should also pay attention to other resource-utilization hyperparameters. |
36 | 42 | For example, if you are using a GPU with large memory, and the batch size is set very low, you should suggest increasing the batch size if it is not reasonable. |
37 | 43 |
|
38 | | - Please provide your feedback in two key-value pairs: |
39 | | - "hyperparameter_tuning_decision": <true/false> |
40 | | - "hyperparameter_tuning_suggestion": <suggestion in plain text for hyperparameter tuning, e.g., increase n_estimators to 1000, increase epochs to 100, increase batch size to 64, give an empty string if decide not to tune the hyperparameter> |
41 | | - [Notice] |
42 | | - - You should only suggest the hyperparameter tuning if the code applies early stopping strategy because increasing the training time blindly may lead to overfitting. Once you found the code didn't apply early stopping strategy, you should not suggest to tune the hyperparameter. |
43 | | - - Your suggestion should be reasonable and include not only the target hyperparameter but also the hyperparameter sets. |
44 | | - - Your suggestion should have a strong chance of improving the model's performance. Focus on the most obvious and impactful opportunities for quick improvement by leveraging more training time. Don't explore hyperparameters with low confidence. If there are no obvious and impactful opportunities and the code runs well, please accept it. |
45 | | - - Once you decide to tune the hyperparameter you should set "final_decision" to false. |
46 | | - [Format] |
47 | | - - "hyperparameter_tuning_suggestion" should begin with a clear observation, followed by your suggestion. For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still going down and early stopping was not activated. Only 15% of the allowed time was used. [Suggestion] We recommend increasing epochs to 100 to avoid underfitting and further improve model performance." |
48 | | -
|
49 | | - {% if is_sub_enabled %} |
50 | | - The user will provide you the whole code base, some logs generated during the execution of the whole workflow. Your evaluation scope includes whether the workflow code: |
51 | | - 1. Executes successfully, correctly organizing components and generating a final submission. |
52 | | - 2. Generates predictions in the correct format, ensuring they align with the **sample submission** structure! |
53 | | - |
54 | | - Please respond with your feedback in the following JSON format and order |
| 44 | + ## Evaluation Guidelines |
| 45 | + 1. The code execution time or resource utilization suggests that there is room for improvement in the hyperparameters. |
| 46 | + 2. The code must already apply an early stopping strategy (in order to prevent overfitting). |
| 47 | + 3. Your suggestion should have a strong chance of improving the model's performance. Focus on the most obvious and impactful opportunities for quick improvement by leveraging more training time. Don't explore hyperparameters with low confidence. If there are no obvious and impactful opportunities and the code runs well, please accept it. |
| 48 | + If the code satisfies the requirements: |
| 49 | + - Set "hyperparameter_tuning_decision" to true. |
| 50 | + - Set "final_decision" to false. |
| 51 | + - Provide a reasonable suggestion in "hyperparameter_tuning_suggestion". The "hyperparameter_tuning_suggestion" should begin with a clear observation, followed by your suggestion. For example: "[Observation] The maximum number of epochs was reached, but the validation loss is still going down and early stopping was not activated. Only 15% of the allowed time was used. [Suggestion] We recommend increasing epochs to 100 to avoid underfitting and further improve model performance." |
| 52 | + If the code does not satisfy the requirements: |
| 53 | + - Set "hyperparameter_tuning_decision" to false. |
| 54 | + - Set "hyperparameter_tuning_suggestion" to an empty string. |
| 55 | +
|
| 56 | + ## Output format |
| 57 | + Please respond with your feedback in the following JSON format and order without anything else: |
55 | 58 | ```json |
56 | 59 | { |
57 | 60 | "execution": "Describe whether the whole code base executed successfully and generated the final submission. Include any errors or issues encountered, and retain all error messages and traceback details.", |
@@ -81,62 +84,90 @@ DSCoSTEER_eval: |
81 | 84 | # NOTE: when is_sub_enabled == False, we don't have any checking about the return. So it is just placeholder currently |
82 | 85 |
|
83 | 86 | user: |- |
84 | | - --------- code base --------- |
| 87 | + # Code base |
85 | 88 | {{ code }} |
86 | | - --------- the stdout of code execution and testing --------- |
| 89 | +
|
| 90 | + ## Stdout of code execution and testing |
87 | 91 | {{ stdout }} |
88 | | - --------- the time spent on code execution --------- |
| 92 | +
|
| 93 | + # The time spent on code execution and timeout |
89 | 94 | {{ time_spent }} |
90 | | - --------- the timeout of code execution --------- |
| 95 | +
|
| 96 | + ## The timeout of code execution |
91 | 97 | {{ timeout }} |
92 | | - --------- the percent of timeout used --------- |
| 98 | +
|
| 99 | + ## The percent of timeout used |
93 | 100 | {{ percent_of_timeout_used }} |
94 | 101 |
|
95 | 102 | DSCoSTEER: |
96 | 103 | system_debugger: |- |
97 | | - You are a world-class data scientist and machine learning engineer with deep expertise in statistics, mathematics, and computer science. |
98 | | - You have finished the implementation of the whole workflow which has executed well on a sampled dataset. However, the user has reported that the workflow failed to execute on the full dataset. |
99 | | -
|
100 | | - Your current job is to debug the whole code base, try to correct the errors, and ensure that the workflow can execute successfully on the full dataset. |
101 | | - The user will provide your the whole code base and some feedback generated during the execution of the whole workflow. Please identify the issues and provide the corrected code. |
102 | | -
|
103 | | - Task description: |
| 104 | + {% include "scenarios.data_science.share:scen.role" %} |
| 105 | + You have finished the implementation of the whole workflow which has executed well on a sampled dataset. Now we are working on the full dataset. |
| 106 | + The user has reported that the workflow failed to execute on the full dataset. |
| 107 | + You will be provided with: |
| 108 | + 1. Code base. |
| 109 | + 2. Task description, which is the task the code is trying to solve. |
| 110 | + 3. Feedback generated during the execution of the whole workflow. |
| 111 | + 4. Suggestions for hyperparameter tuning. |
| 112 | + Your job is to debug the whole code base, try to correct the errors, and ensure that the workflow can execute successfully on the full dataset. |
| 113 | +
|
| 114 | + ## Task description |
104 | 115 | {{ task_desc }} |
105 | 116 |
|
106 | | - Your modified code should follow the minimal changes principle. You should only modify the code that is necessary to fix the issues but not affect any other parts of the code. Try to correct as less files as possible since files are interdependent. |
| 117 | + ## Instructions |
| 118 | + 1. Minimal changes principle: only modify the code that is necessary to fix the issues, without affecting any other parts of the code. Try to correct as few files as possible since files are interdependent. |
| 119 | + {% if diff_mode %} |
| 120 | + 2. You must output in Code Diff format. The detailed format specification is as follows. |
| 121 | + {% else %} |
| 122 | + 2. You must output the COMPLETE and FULL code. Do not truncate, summarize, or omit any parts of the code. Include all imports, functions, classes, and the entire workflow from start to finish. |
| 123 | + {% endif %} |
107 | 124 |
|
108 | 125 | ## Output Format |
109 | 126 | {% if out_spec %} |
110 | 127 | {{ out_spec }} |
111 | 128 | {% else %} |
112 | | - Please response the code in the following json format. Here is an example structure for the JSON output: |
| 129 | + Please respond with the code in the following JSON format without anything else. |
113 | 130 | { |
114 | 131 | "code": "The Python code as a string." |
115 | 132 | } |
116 | 133 | {% endif %} |
117 | | - system_refine: |- |
118 | | - You are a world-class data scientist and machine learning engineer with deep expertise in statistics, mathematics, and computer science. |
119 | | - You have finished the implementation of the whole workflow which has executed well on a sampled dataset. However, the user has reported that the hyperparameters are not reasonable and the code didn't make the best use of the time limit. |
120 | | - Your current job is to refine the whole code base, try to refine the hyperparameters. |
121 | 134 |
|
122 | | - The user will provide your the whole code base, some feedback generated during the execution of the whole workflow and some suggestions for hyperparameter tuning. |
123 | | - Your modified code should follow the minimal changes principle. Only modify the hyperparameters that is necessary. |
| 135 | + system_refine: |- |
| 136 | + {% include "scenarios.data_science.share:scen.role" %} |
| 137 | + You have finished the implementation of the whole workflow which has executed well on a sampled dataset. Now we are working on the full dataset. |
| 138 | + The user has reported that the hyperparameters are not reasonable and the code didn't make the best use of the time limit. |
| 139 | + You will be provided with: |
| 140 | + 1. Code base. |
| 141 | + 2. Feedback generated during the execution of the whole workflow. |
| 142 | + 3. Suggestions for hyperparameter tuning. |
| 143 | + Your task is to refine the code base and modify the hyperparameters based on the feedback and suggestions. |
| 144 | +
|
| 145 | + ## Instructions |
| 146 | + 1. Minimal changes principle: only modify necessary hyperparameters based on the feedback and suggestions. |
| 147 | + {% if diff_mode %} |
| 148 | + 2. You must output in Code Diff format. The detailed format specification is as follows. |
| 149 | + {% else %} |
| 150 | + 2. You must output the COMPLETE and FULL code. Do not truncate, summarize, or omit any parts of the code. Include all imports, functions, classes, and the entire workflow from start to finish. |
| 151 | + {% endif %} |
124 | 152 |
|
125 | 153 | ## Output Format |
126 | 154 | {% if out_spec %} |
127 | 155 | {{ out_spec }} |
128 | 156 | {% else %} |
129 | | - Please response the code in the following json format. Here is an example structure for the JSON output: |
| 157 | + Please respond with the code in the following JSON format without anything else. |
130 | 158 | { |
131 | 159 | "code": "The Python code as a string." |
132 | 160 | } |
133 | 161 | {% endif %} |
| 162 | +
|
134 | 163 | user: |- |
135 | | - --------- code base --------- |
| 164 | + # Code Base |
136 | 165 | {{ code }} |
137 | | - --------- feedback --------- |
| 166 | +
|
| 167 | + ## Feedback |
138 | 168 | {{ feedback }} |
| 169 | +
|
139 | 170 | {% if hyperparameter_tuning_suggestion is not none %} |
140 | | - --------- hyperparameter tuning suggestions --------- |
| 171 | + ## Hyperparameter Tuning Suggestion |
141 | 172 | {{ hyperparameter_tuning_suggestion }} |
142 | 173 | {% endif %} |
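
The new `DSCoSTEER_eval` prompt describes a two-stage decision: Evaluation 1 (code correctness) gates Evaluation 2 (hyperparameter tuning). A minimal Python sketch of how those rules combine follows. It is illustrative only and not part of this change: the names `EvalInputs` and `decide` and the boolean inputs are hypothetical. Setting `final_decision` to true when no tuning is suggested is an assumption, read from the guideline "if there are no obvious and impactful opportunities and the code runs well, please accept it".

```python
from dataclasses import dataclass


@dataclass
class EvalInputs:
    """Hypothetical summary of what the evaluator concluded from the prompt inputs."""
    code_is_correct: bool       # Evaluation 1 passed (execution, no mocked data, valid submission.csv)
    uses_early_stopping: bool   # Evaluation 2, guideline 2
    has_headroom: bool          # Evaluation 2, guideline 1 (time or resources left unused)
    suggestion: str = ""        # "[Observation] ... [Suggestion] ..." text, if any


def decide(inputs: EvalInputs) -> dict:
    """Map the two evaluations onto the three decision fields named in the prompt."""
    if not inputs.code_is_correct:
        # Evaluation 1 failed: reject and skip hyperparameter tuning entirely.
        return {
            "final_decision": False,
            "hyperparameter_tuning_decision": False,
            "hyperparameter_tuning_suggestion": "",
        }

    # Evaluation 2: tune only if every guideline holds and a suggestion exists.
    tune = (
        inputs.uses_early_stopping
        and inputs.has_headroom
        and bool(inputs.suggestion.strip())
    )
    return {
        # Assumption: correct code with nothing to tune is accepted.
        "final_decision": not tune,
        "hyperparameter_tuning_decision": tune,
        "hyperparameter_tuning_suggestion": inputs.suggestion if tune else "",
    }
```

Under these assumptions, a correct solution that used little of its time budget and already applies early stopping is sent back for refinement (`final_decision` false, `hyperparameter_tuning_decision` true), while one without early stopping is accepted as-is.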