Commit 71440f6

feat: add hypo_critic and hypo_rewrite in proposal (#1106)
Adds `hypo_critique` and `hypo_rewrite` stages in `proposal.py`, controlled by `DS_RD_SETTING.enable_hypo_critique_rewrite` (default `True`).
1 parent bf1217d commit 71440f6

File tree: 4 files changed, +343 −5 lines


rdagent/app/data_science/conf.py

Lines changed: 4 additions & 0 deletions

@@ -119,5 +119,9 @@ class DataScienceBasePropSetting(KaggleBasePropSetting):
 
     model_architecture_suggestion_time_percent: float = 0.75
 
+    #### hypothesis critique and rewrite
+    enable_hypo_critique_rewrite: bool = True
+    """Enable hypothesis critique and rewrite stages for improving hypothesis quality"""
+
 
 DS_RD_SETTING = DataScienceBasePropSetting()
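The commit description says the new critique and rewrite stages in `proposal.py` are gated by this flag. A minimal, hypothetical sketch of such a flag-gated flow is below; the `Settings` dataclass and the `critique`/`rewrite`/`propose` helpers are illustrative stand-ins, not the actual rdagent code:

```python
# Hypothetical sketch of a flag-gated critique/rewrite step, mirroring
# DS_RD_SETTING.enable_hypo_critique_rewrite (default True). The helper
# names below are illustrative, not taken from proposal.py.
from dataclasses import dataclass


@dataclass
class Settings:
    enable_hypo_critique_rewrite: bool = True  # default True, as in the commit


def critique(hypothesis: str) -> str:
    # Stand-in for an LLM call using the hypothesis_critique prompt.
    return f"critique of: {hypothesis}"


def rewrite(hypothesis: str, critique_text: str) -> str:
    # Stand-in for an LLM call using the hypothesis_rewrite prompt;
    # the critique informs the rewrite but is not echoed in the result.
    return f"{hypothesis} (refined)"


def propose(hypothesis: str, settings: Settings) -> str:
    # When the flag is on, the hypothesis passes through both new stages;
    # when off, the original hypothesis is used unchanged.
    if settings.enable_hypo_critique_rewrite:
        return rewrite(hypothesis, critique(hypothesis))
    return hypothesis


print(propose("use weighted ensemble", Settings()))
```

Disabling the flag would skip both stages, which matches the description of a single boolean switch controlling the whole critique-and-rewrite pass.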

rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml

Lines changed: 163 additions & 1 deletion

@@ -155,7 +155,7 @@ hypothesis_gen:
     1. **Be Specific and Decisive**:
        - Clearly state the exact, unambiguous change(s) being proposed. Avoid vague goals like "improve the model" or "optimize the pipeline."
        - The hypothesis must propose a single, clear course of action. Do not suggest alternatives (e.g., "try method A or method B").
-       - The hypothesis statement must be direct and definitive, without phrases like "for example," "e.g.," or "might involve."
+       - The hypothesis statement must be direct and definitive, without phrases like "for example," "e.g.," "might involve," "consider," "try," or "explore."
        - The hypothesis must be more informative and decisive than the Challenge it addresses. It should not simply restate the Challenge or suggest a general approach without specifics.
     2. **Ensure Testability and Actionability**:
        - The hypothesis must describe an action or change that can be practically implemented and tested.
@@ -231,6 +231,138 @@ hypothesis_gen:
     # Identified Challenges{% if enable_idea_pool %} with Sampled Ideas{% endif %}
     {{ problems }}
 
+hypothesis_critique:
+  system: |-
+    {% include "scenarios.data_science.share:scen.role" %}
+    You are an expert critic evaluating machine learning hypotheses for Kaggle competition improvement.
+
+    For each hypothesis, provide a focused critique that identifies key issues and suggests improvements while preserving the experimental nature of hypotheses.
+
+    ## Three Core Evaluation Areas:
+
+    ### 1. Feasibility Assessment
+    - **Technical Risk**: Major implementation challenges or resource constraints that could cause failure
+    - **Integration Issues**: Conflicts with existing code or pipeline components
+    - **Constraint Violations**: Whether this respects competition time/memory limits based on historical patterns
+
+    ### 2. Alignment Check
+    - **Problem-Solution Fit**: Does this actually address the root cause of the identified challenge?
+    - **Metric Impact**: Will this meaningfully improve the competition's evaluation metric?
+    - **Historical Context**: Have similar approaches been tried? What are the key learnings from past attempts?
+    - **Innovation vs History Balance**: Distinguish between implementation failures (worth retrying with improvements) and fundamental approach failures (multiple attempts failed due to core unsuitability; should be avoided)
+
+    ### 3. Improvement Direction
+    - **Clarity Issues**: If vague, identify specific methods or strategies that address the core problem
+    - **Alternative Strategies**: If implementation is problematic, identify concrete alternative approaches within the current framework, such as switching from a simple to a weighted ensemble
+    - **Risk Mitigation**: Recommend specific validation strategies or safeguards for high-risk aspects
+    - **Competition Context**: This is a Kaggle competition where strong performance may come from novel approaches, but also from incremental improvements and careful optimization. Balance innovation with practical enhancements.
+
+    ## CRITICAL Guidance Rules
+
+    - Be specific about methods and strategies, but avoid over-specifying implementation parameters. Suggest clear approaches like "use weighted ensemble instead of simple averaging" rather than exact values like "set weights=[0.3, 0.7]".
+    - Focus on suggesting CLEAR METHODS and APPROACHES that lead to decisive hypotheses.
+    - Avoid Overfitting to History: Learn from past failures but don't over-constrain innovation. Distinguish between implementation failures (worth retrying with improvements) and fundamental approach failures (should be avoided).
+
+    ### Examples:
+
+    **Good Critiques:**
+    - "The hypothesis lacks specificity about which ensemble method to use. Consider weighted averaging based on validation performance rather than simple averaging, given the model performance disparities."
+    - "This hypothesis proposes LSTM for tabular data. History shows 3 consecutive failures with different LSTM implementations, and tabular data lacks sequential structure. Consider graph-based approaches instead to capture feature relationships."
+
+    **Poor Critiques:**
+    - "Set max_depth=10, learning_rate=0.05, and use 500 trees." (too specific)
+    - "This might not work." (too vague)
+    - "LSTM is innovative, let's try again with different hyperparameters." (ignores fundamental mismatch)
+
+    {% if critique_output_format is not none %}
+    ## Output Format
+    {{ critique_output_format }}
+    {% endif %}
+
+  user: |-
+    # Scenario Description
+    {{ scenario_desc }}
+
+    # Previous Experiments and Feedbacks
+    {{ exp_and_feedback_list_desc }}
+
+    # Current SOTA Implementation
+    {{ sota_exp_desc }}
+
+    # Hypotheses to Critique
+    {{ hypotheses_formatted }}
+
+hypothesis_rewrite:
+  system: |-
+    {% include "scenarios.data_science.share:scen.role" %}
+    You are an expert hypothesis rewriter specializing in iterative improvement of machine learning solutions for Kaggle competitions.
+
+    ## Task
+    Transform each **original hypothesis and its critique** into a **single, specific, testable technical hypothesis** that can be implemented immediately.
+
+    ## Core Principles
+    1. **Actionable Critique** – Apply insights from the critique, but the final text must stand alone with **no meta-discussion** of the critique itself.
+    2. **Standalone Justification** – Ground every technical decision in dataset characteristics, available compute budget, and competition constraints.
+    3. **Decisive Specificity** – Remove all ambiguity; propose one clear action.
+    4. **Innovation Preservation** – Maintain the innovative core of the original hypothesis while addressing implementation concerns. Avoid reverting to conventional approaches unless absolutely necessary.
+    5. **CRITICAL - Avoid Overfitting to Critique** – Apply critique insights thoughtfully without over-constraining innovation. Balance addressing identified issues with preserving the exploratory value of bold ideas.
+
+    ## Guidelines for Writing Rewritten Hypotheses
+
+    1. **Critique-Informed Specificity**:
+       - Address technical gaps identified in the critique and replace vague terms with specific algorithms, methods, or parameters.
+       - Transform general suggestions from the critique into concrete, implementable actions.
+       - If the critique highlighted feasibility issues, propose alternative approaches that maintain the hypothesis's core intent while being more practical.
+       - The rewritten hypothesis must be more specific than the original, incorporating the critique's guidance without explicitly referencing it.
+
+    2. **Standalone Technical Justification**:
+       - Ground every technical decision in observable dataset characteristics (e.g., data size, feature types, class distribution).
+       - Reference competition constraints (time limits, evaluation metrics, submission format) to justify approach choices.
+       - Ensure the hypothesis can be understood and implemented without needing to read the original hypothesis or critique.
+       - Include rationale for why the specific method/algorithm chosen is suitable for the current scenario.
+
+    3. **Enhanced Actionability and Precision**:
+       - Replace any remaining ambiguity with decisive technical choices (e.g., "ensemble method" → "weighted averaging based on validation performance").
+       - Specify validation strategies that will confirm the hypothesis's effectiveness.
+       - Define clear success criteria or expected outcomes that can be measured.
+       - If the original hypothesis bundled multiple ideas, focus on the most impactful one identified through the critique.
+
+    4. **Risk Mitigation and Implementation Clarity**:
+       - If the critique identified implementation risks, incorporate specific mitigation strategies into the rewritten hypothesis.
+       - Address resource constraint concerns by proposing efficient alternatives or optimizations.
+       - Ensure the hypothesis addresses root causes rather than symptoms, as guided by the critique analysis.
+       - Make the hypothesis robust against common failure modes identified in the critique.
+
+    5. **Pipeline Integration and Component Focus**:
+       - Clearly specify how the proposed changes integrate with existing SOTA components.
+       - Maintain focus on the primary component while ensuring compatibility with the overall pipeline.
+       - If the critique suggested coordination across multiple components, organize these as a unified technical approach rather than separate changes.
+       - Ensure the rewritten hypothesis preserves successful aspects of the current SOTA while addressing identified weaknesses.
+
+    6. **Innovation and Historical Learning**:
+       - Apply critique insights to enhance sound innovative ideas while avoiding repeated fundamental failures identified in the analysis.
+       - **Competition Context**: This is a Kaggle competition where strong performance may come from novel approaches or incremental improvements. Enhance both innovative ideas and practical optimizations based on the critique analysis.
+
+
+    {% if rewrite_output_format is not none %}
+    ## Output Format
+    {{ rewrite_output_format }}
+    {% endif %}
+
+  user: |-
+    # Scenario Description
+    {{ scenario_desc }}
+
+    # Previous Experiments and Feedbacks
+    {{ exp_and_feedback_list_desc }}
+
+    # Current SOTA Implementation
+    {{ sota_exp_desc }}
+
+    # Original Hypotheses and Their Critiques
+    {{ hypothesis_critique_pairs }}
+
 task_gen:
   system: |-
     {% include "scenarios.data_science.share:scen.role" %}
@@ -476,5 +608,35 @@ output_format:
       "problem name 2 (should be exactly same as the problem name provided)": 2, # The index which is same to the idea index provided in the input and must be integer.
     }
 
+  critique: |-
+    For each hypothesis, provide a comprehensive critique strictly following the JSON schema.
+    Your final output should be a dict containing critiques for all hypotheses without anything else.
+    {
+        "critiques": {
+            "problem name 1 (should match the hypothesis problem name exactly)": {
+                "critique": "A comprehensive critique covering: (1) Technical feasibility and potential issues, (2) Alignment with the scenario and competition requirements, (3) Specific improvement suggestions, (4) Overall assessment of the hypothesis quality and implementability. Be constructive and actionable."
+            },
+            "problem name 2": {
+                "critique": "..."
+            }
+        }
+    }
+  rewrite: |-
+    For each original hypothesis, rewrite it to address the critique feedback, strictly following the JSON schema below.
+    Your final output should be a dict containing all rewritten hypotheses without anything else.
+    {
+        "problem name 1 (should be exactly the same as the original problem name, without prefix or suffix)": {
+            "reason": "Independent justification for why this hypothesis makes sense given the current scenario, dataset characteristics, and competition requirements. DO NOT reference critique feedback or suggestions. Should be short, with no more than two sentences focusing on the fundamental problem context.",
+            "component": "The component tag of the hypothesis. Must be one of ('DataLoadSpec', 'FeatureEng', 'Model', 'Ensemble', 'Workflow').",
+            "hypothesis": "A concise, improved hypothesis statement that directly addresses critique concerns. Limit to one or two sentences that clearly specify the expected change or improvement. Should be more specific and actionable than the original.",
+            "evaluation": {
+                "alignment_score": "Score from 1 (lowest/worst) to 10 (highest/best). How directly and effectively does the hypothesis address the core issues of the identified problem it targets? A higher score means a stronger, more direct alignment.",
+                "impact_score": "Score from 1 (lowest/worst) to 10 (highest/best). What is the estimated magnitude of improvement (e.g., in the primary competition metric, efficiency, robustness, or successful execution) if this hypothesis is successfully implemented? Higher scores for greater positive impact.",
+                "novelty_score": "Score from 1 (lowest/worst) to 10 (highest/best). How innovative or original is this hypothesis when compared to the approaches and ideas evident in the previous SOTA experiments and previous failed experiments? Assign a score of 1 if the hypothesis is a repeat of, or substantially similar to, a previously attempted hypothesis (whether successful or failed), UNLESS the previous attempt clearly failed due to a trivial implementation bug and the current hypothesis proposes the correct implementation of the same core idea.",
+                "feasibility_score": "Score from 1 (lowest/worst) to 10 (highest/best). How easily and practically can this hypothesis be implemented and run to completion within the existing SOTA codebase and operational constraints (e.g., allowed time for training/inference, available compute resources, overall complexity)? Higher scores for easier implementation and higher likelihood of successful execution.",
+                "risk_reward_balance_score": "Score from 1 (lowest/worst) to 10 (highest/best). Considering the potential for significant improvement (reward) versus the probability of failure, negative side-effects, or excessive resource consumption (risk), how optimal is this balance? A high score indicates a favorable balance. If a hypothesis directly and credibly addresses a critical challenge that caused prior experiment failures (e.g., timeout, persistent data loading errors, incorrect submission format preventing any score), this should generally be scored highly (e.g., 8-10)."
+            }
+        }
+    }
 
 
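The `rewrite` output format above demands strict JSON with a fixed component vocabulary and five evaluation scores. A small sanity check of a response against that shape might look like the sketch below; the field names come from the schema above, while the sample payload and the `validate` helper are illustrative, not part of the commit:

```python
import json

# Illustrative response following the `rewrite` output schema above.
raw = json.dumps({
    "ensemble strategy": {
        "reason": "Model validation scores differ widely, so uniform averaging wastes the stronger model.",
        "component": "Ensemble",
        "hypothesis": "Replace simple averaging with weighted averaging based on validation performance.",
        "evaluation": {
            "alignment_score": 8,
            "impact_score": 7,
            "novelty_score": 5,
            "feasibility_score": 9,
            "risk_reward_balance_score": 8,
        },
    }
})

# Allowed component tags and score keys, as listed in the schema above.
VALID_COMPONENTS = {"DataLoadSpec", "FeatureEng", "Model", "Ensemble", "Workflow"}
SCORE_KEYS = {"alignment_score", "impact_score", "novelty_score",
              "feasibility_score", "risk_reward_balance_score"}


def validate(payload: str) -> dict:
    # Parse the JSON and check each rewritten hypothesis against the schema:
    # a valid component tag and exactly the five scores, each in 1..10.
    data = json.loads(payload)
    for name, hypo in data.items():
        assert hypo["component"] in VALID_COMPONENTS, name
        scores = hypo["evaluation"]
        assert set(scores) == SCORE_KEYS, name
        assert all(1 <= v <= 10 for v in scores.values()), name
    return data


print(len(validate(raw)))  # prints 1
```

Such a check would catch the most common schema drift in LLM output (a stray component name or a missing score key) before the scores are consumed downstream.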
