Merged
43 commits
da36f43
init commit
Jun 27, 2025
979640a
remove the 5-fold spec from prompts
Jun 27, 2025
2c87022
refine the hyperparameter specification
Jun 27, 2025
ccdb471
do not sample data
Jun 27, 2025
84bf563
a small spelling issue
TPLin22 Jun 27, 2025
13be390
refine prompt to avoid submission cheating
TPLin22 Jun 27, 2025
4ca0411
do not sample data
Jun 27, 2025
c122816
simplify code
Jun 27, 2025
ffec796
refine the coder evaluator prompt
Jun 27, 2025
ffe70ca
refine wording
RolandMinrui Jun 27, 2025
b1f03f2
remove runtime from proposal
Jun 27, 2025
771e7e8
refine wording
Jun 27, 2025
55d8d03
refine prompt
Jun 27, 2025
3619c95
add gpu info in runtime_info.py
Jun 27, 2025
3f487fe
Merge branch 'main' of https://github.com/microsoft/RD-Agent into min…
Jun 30, 2025
6ec2080
modify the spec
Jun 30, 2025
7d27e09
add router and add refinement exp gen
Jul 1, 2025
b669365
fix prompt bug
Jul 2, 2025
bbb8bcf
Merge branch 'main' of https://github.com/microsoft/RD-Agent into min…
Jul 2, 2025
49d9686
use rule-based logic for router
Jul 2, 2025
43255d6
complete the prompt
Jul 2, 2025
1995f6a
Merge branch 'main' of https://github.com/microsoft/RD-Agent into min…
Jul 3, 2025
8944273
fix circular import bug
Jul 3, 2025
81d284a
fix bug
Jul 3, 2025
a18e454
make refine_decision optional
Jul 3, 2025
408e7ab
update pipeline prompts: (1) add scenary: in an iterative cooding loo…
Hoder-zyf Jul 3, 2025
beb3bf8
fix a small bug
peteryang1 Jul 3, 2025
93a3acd
fix a small bug
peteryang1 Jul 4, 2025
3a15f5c
Merge branch 'main' into minrui/fix_hyperparameter_problems
peteryangms Jul 4, 2025
6d9607a
rdagent/scenarios/data_science/loop.py back to the original version
Hoder-zyf Jul 4, 2025
8312380
refactor: replace _get_exp_gen with default_exp_gen for exp generation
you-n-g Jul 4, 2025
ed984eb
import
you-n-g Jul 4, 2025
ceb6335
refactor: make the __init__ back to main
Hoder-zyf Jul 4, 2025
833be8f
fix small bugs
Hoder-zyf Jul 4, 2025
2e6d190
fix bugs for proposal_version
Hoder-zyf Jul 4, 2025
71e68c6
move refine into runner
peteryangms Jul 4, 2025
2b8a2ed
Merge branch 'xuyang1/help_minrui_hyppp' into minrui/fix_hyperparamet…
peteryangms Jul 4, 2025
e56ebfd
check early stop
peteryangms Jul 4, 2025
7caad02
Merge branch 'main' into minrui/fix_hyperparameter_problems
peteryangms Jul 5, 2025
eb9ec5d
EDA improvement & coder classes number
peteryangms Jul 7, 2025
2ebcc35
fix CI
peteryangms Jul 8, 2025
65deb7d
slightly refine the prompt
Jul 8, 2025
1edf3a9
remove rule_base_eval and remove useless prompt
peteryangms Jul 8, 2025
1 change: 1 addition & 0 deletions rdagent/components/coder/data_science/pipeline/__init__.py
@@ -95,6 +95,7 @@ def implement_one_task(
queried_former_failed_knowledge=queried_former_failed_knowledge[0],
out_spec=PythonAgentOut.get_spec(),
runtime_environment=runtime_environment,
hyperparameter_spec=T("scenarios.data_science.share:spec.hyperparameter").r(),
spec=T("scenarios.data_science.share:component_spec.Pipeline").r(),
enable_model_dump=DS_RD_SETTING.enable_model_dump,
)
11 changes: 10 additions & 1 deletion rdagent/components/coder/data_science/pipeline/prompts.yaml
@@ -9,6 +9,11 @@ pipeline_coder:
## The runtime environment your code will run on
{{ runtime_environment }}

## Hyperparameters Specification
Follow the hyperparameter choices if they are specified in the task description, unless prior attempts have demonstrated that they are ineffective or incorrect.
In that case, refer to the guidelines below for appropriate adjustments:
{{ hyperparameter_spec }}

## Specification your code should follow
{{ spec }}

@@ -118,7 +123,10 @@ pipeline_eval:
Step 1: Executes successfully without any errors. Please distinguish between the errors and warnings.

Step 2: Correctly generates a final submission in the correct format, ensuring: they align with the submission structure, the index names and column names should match the sample, and the items should not be empty or apparently incorrect.

- Carefully check that the submission file and any reported scores are genuinely produced by a real model training and inference process.
- Any attempt to bypass model training or inference—such as generating random predictions, hard-coding outputs, or otherwise fabricating results—should be considered cheating and must result in evaluation failure.
- Any attempt to sample a subset of training data for efficiency is not allowed. All training data must be loaded and used.
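The Step 2 format checks above could be sketched as a small validator. This is an illustrative sketch only; the function name, file paths, and the exact checks are assumptions, not code from this PR:

```python
import pandas as pd


def check_submission_format(submission_path: str, sample_path: str) -> list[str]:
    """Compare a submission file against the sample submission and
    return a list of format problems; an empty list means it looks valid."""
    problems = []
    sub = pd.read_csv(submission_path)
    sample = pd.read_csv(sample_path)
    # Column names and order must match the sample submission exactly.
    if list(sub.columns) != list(sample.columns):
        problems.append(
            f"column mismatch: {list(sub.columns)} vs {list(sample.columns)}"
        )
    # Items should not be empty.
    if len(sub) == 0:
        problems.append("submission has no rows")
    elif sub.isna().any().any():
        problems.append("submission contains empty values")
    return problems
```

Note that, per the evaluator prompt, only the format is checked here; detecting fabricated predictions requires inspecting the training code itself, not the CSV.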

Step 3: Aligns with the competition requirements. This includes:
- CAREFULLY ANALYZE WHETHER THE EXPERIMENTAL SETUP AND CODE MAY CAUSE MISALIGNMENT BETWEEN VALIDATION AND TEST PERFORMANCE.
- Confirm strict adherence to the competition's evaluation rules listed in `scenario`:
@@ -137,6 +145,7 @@ pipeline_eval:
[Note]
1. Model performance is NOT a concern in this evaluation—only correct execution and formatting matter.
2. You only check the format of the submission since we only feed you part of the data, so the submission might have a different index from the sample submission data.
3. Submissions and scores must be the result of actual model training and inference. Any form of cheating or fabrication (e.g., random or hard-coded outputs) is strictly prohibited and should lead to rejection.

Please respond with your feedback in the following JSON format and order
```json
4 changes: 1 addition & 3 deletions rdagent/scenarios/data_science/proposal/exp_gen/idea_pool.py
@@ -7,9 +7,7 @@
from rdagent.components.knowledge_management.graph import (
UndirectedNode, # TODO: add appendix attribute to node
)
from rdagent.components.knowledge_management.graph import (
UndirectedGraph,
)
from rdagent.components.knowledge_management.graph import UndirectedGraph
from rdagent.log import rdagent_logger as logger
from rdagent.oai.llm_utils import APIBackend
from rdagent.utils.agent.tpl import T
15 changes: 10 additions & 5 deletions rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml
@@ -206,7 +206,7 @@ hypothesis_gen:
- *Good Example (Efficiency)*: "To resolve the 'timeout during training' challenge, reduce `NUM_EPOCHS` from 5 to 2 and `N_SPLITS` for cross-validation from 5 to 3 in the main training loop, aiming to complete execution within the 1-hour limit while minimizing impact on the F1-score."
- *Poor Example*: "Tune the model for better results."
- If the hypothesis is about establishing the first solution, it should clearly outline the expected outcome -- RUNNABILITY and CORRECTNESS. Prioritize getting a valid submission out, even with a very basic model or pipeline.
- *Good Example*: "Implement a simple RandomForest classifier with default parameters, using 5-fold cross-validation for model evaluation. This will lead to a decent baseline model that can run to completion and generate a valid submission file."
- *Good Example*: "Implement a simple RandomForest classifier with default parameters, using 3-fold cross-validation for model evaluation. This will lead to a decent baseline model that can run to completion and generate a valid submission file."
3. **Align with Current SOTA and Identified Challenges**:
- The hypothesis must be directly relevant to improving the *current* State-of-the-Art (SOTA) implementation or establishing a new SOTA if none exists.
- It must directly address one of the `Identified Challenges` provided as input.
@@ -280,7 +280,7 @@ task_gen:

Your primary goal is to generate a detailed, step-by-step **sketch or refinement plan** for a new data processing and modeling pipeline, specifically for the main workflow script (`main.py`), that effectively implements the `Proposed Hypothesis`. This sketch will guide a developer to write the code correctly.

### BACKGROUND CONTEXT: Pipeline Implementation Standards & Constraints ###
# BACKGROUND CONTEXT: Pipeline Implementation Standards & Constraints

The `main.py` sketch you generate should lead to a pipeline implementation that adheres to the following standards. These are guiding principles for the final *outcome* of your sketch:

@@ -309,15 +309,13 @@ task_gen:
- Prevent data leakage from test/validation sets into any training stage.
7. **Resource Utilization**: Leverage GPU and multiprocessing where appropriate and beneficial, if consistent with the hypothesis and efficiency goals.
8. **Metric Calculation and Storage (`scores.csv`)**:
- Calculate the official competition metric on a proper validation set (e.g., K-fold CV, typically 3-5 folds unless efficiency dictates fewer). Save results to `scores.csv`.
- Calculate the official competition metric on a proper validation set. Save results to `scores.csv`.
- The sketch must ensure this step is included. A successful run should always produce scores.
- `scores.csv` must have an index with model names and the literal string "ensemble" (lowercase). Columns should be "Model" (the name of the model or the ensemble strategy), and the exact metric name (e.g., "AUC").
- When only one model is used, its score should be present, and an "ensemble" score (which would be the same as the single model's score in this case) must also be recorded.
- Ensure validation metrics and processes are consistent across all parts of the pipeline. Avoid changes that would alter how validation metrics are calculated unless that is part of the hypothesis.
9. **Submission File (`submission.csv`)**: Generate `submission.csv` in the **exact format** required (column names, order, data types), as detailed by `sample_submission.csv` in the `Competition Scenario Description`. This is a critical step.
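The `scores.csv` contract in item 8 could be sketched as a small helper. This is a hypothetical illustration: the function name, the metric name "AUC", and the mean-of-scores ensemble fallback are assumptions, not code from this PR:

```python
import pandas as pd


def save_scores(model_scores: dict, metric: str, path: str = "scores.csv") -> pd.DataFrame:
    """Write per-model validation scores plus the required 'ensemble' row.

    The index holds model names and the literal lowercase string 'ensemble';
    the single data column is named after the exact competition metric.
    """
    scores = dict(model_scores)
    if "ensemble" not in scores:
        if len(scores) == 1:
            # Single model: the ensemble score equals that model's score.
            scores["ensemble"] = next(iter(scores.values()))
        else:
            # Placeholder ensemble (simple mean); a real pipeline would
            # score its actual ensemble strategy here.
            scores["ensemble"] = sum(scores.values()) / len(scores)
    df = pd.DataFrame({"Model": list(scores), metric: list(scores.values())})
    df = df.set_index("Model")
    df.to_csv(path)
    return df
```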

### END OF BACKGROUND CONTEXT ###

# Guidelines for Sketching the `main.py` Workflow

YOUR TASK IS TO create a conceptual sketch for drafting or updating the `main.py` workflow. This is a plan, not code.
@@ -354,6 +352,13 @@ task_gen:
- Confirm no `tqdm` or other progress bars are in the final script.
- Double-check that validation scores are saved correctly to `scores.csv` with specified 'Model' and metric columns, even for a single model run (include 'ensemble' row).

# Hyperparameters Specification
The workflow will be implemented in the following runtime environment:
{{ runtime_environment }}

Choose hyperparameters to ensure strong performance while meeting resource and time constraints. Specify values only when clearly justified by evidence or strong rationale.
{{ hyperparameter_spec }}

{% if task_output_format is not none %}
## [Partial Response Format 1] Task Output Format:
{{ task_output_format }}
3 changes: 3 additions & 0 deletions rdagent/scenarios/data_science/proposal/exp_gen/proposal.py
@@ -724,11 +724,14 @@ def task_gen(
component_info = get_component("Pipeline")
else:
component_info = get_component(hypotheses[0].component)
runtime_environment = self.scen.get_runtime_environment()
data_folder_info = self.scen.processed_data_folder_description
sys_prompt = T(".prompts_v2:task_gen.system").r(
task_output_format=component_info["task_output_format"] if not self.support_function_calling else None,
# task_output_format=component_info["task_output_format"],
component_desc=component_desc,
runtime_environment=runtime_environment,
hyperparameter_spec=T("scenarios.data_science.share:spec.hyperparameter").r(),
workflow_check=not pipeline and hypotheses[0].component != "Workflow",
)
user_prompt = T(".prompts_v2:task_gen.user").r(
18 changes: 17 additions & 1 deletion rdagent/scenarios/data_science/share.yaml
@@ -291,13 +291,14 @@ component_spec:
- Handle missing values and outliers appropriately (e.g., impute, remove, or replace).
- Ensure consistency between feature data types and transformations.
- Prevent data leakage: Do not use information derived from the test set when transforming training data.
- NEVER sample a subset of data, even when memory is insufficient or the time limit is exceeded.

6. Notes:
- GPU and multiprocessing are available and are encouraged to use for accelerating transformations.

7. Metric Calculation and Storage:
- Calculate the metric (mentioned in the evaluation section of the competition information) for each model and ensemble strategy on the validation set, and save the results in `scores.csv`
- The evaluation should be based on 5-fold cross-validation but only if that's an appropriate evaluation for the task at hand. Store the mean validation score of 5-fold cross-validation in `scores.csv` on each model.
- The evaluation should be based on k-fold cross-validation but only if that's an appropriate evaluation for the task at hand. Store the mean validation score of k-fold cross-validation in `scores.csv` on each model. Refer to the hyperparameter specification for rules to set the CV folds.
- Even if only one model is present, compute the ensemble score and store it under `"ensemble"`.
- The index of `scores.csv` should include the model name and the "ensemble" strategy. "ensemble" should be exactly in the index with all lower case letters. Ensemble is the result from several models. If only one model is present, the ensemble score should be the same as the model score.
- The column names in `scores.csv` should be:
@@ -312,3 +313,18 @@
guidelines:
coding: |-
You might receive exploratory data analysis (EDA) details about the source data. Do not use this EDA information to create assertions or raise errors. We might generate sample data for quick coding (so your code may run on sample data which is part of the full-size data), but remember that the EDA details are based on the full-size data.

spec:
hyperparameter: |-
1. Hyperparameters Requiring Tuning (e.g., learning rate, weight decay, optimizer, etc.)
- Adjust conservatively to avoid instability.
- Apply a systematic hyperparameter tuning strategy to identify optimal values.
2. Hyperparameters Dependent on Empirical Estimation or Past Failures (e.g., epochs, CV folds, batch size, etc.)
- Estimate these parameters based on the runtime environment constraints and experiences from previous experiment failures.
3. Balancing Epochs and CV Folds
- When runtime permits, prioritize increasing the number of training epochs, but always implement early stopping to prevent overfitting and ensure the process completes within the allowed runtime.
- When runtime is constrained, first reduce the number of CV folds (provided that validation reliability remains acceptable) before lowering the number of epochs.
4. Early Stopping Strategy
- A sufficient number of epochs has been completed.
- A sufficiently low loss has been reached.
- The validation loss has become sufficiently stable.
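The early-stopping criteria above could be sketched as a minimal helper class. The class name, thresholds, and patience mechanism are illustrative assumptions, not values from this PR:

```python
class EarlyStopper:
    """Stop once enough epochs have run AND the validation loss
    has stopped improving for `patience` consecutive epochs."""

    def __init__(self, min_epochs: int = 5, patience: int = 3, min_delta: float = 1e-4):
        self.min_epochs = min_epochs    # never stop before this many epochs
        self.patience = patience        # non-improving epochs tolerated
        self.min_delta = min_delta      # improvement smaller than this counts as "stable"
        self.best = float("inf")
        self.bad_epochs = 0
        self.epoch = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        self.epoch += 1
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.epoch >= self.min_epochs and self.bad_epochs >= self.patience
```

In a training loop this would be called once per epoch, e.g. `if stopper.step(val_loss): break`.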