feat: refactor for general data science #498

Merged: 337 commits, Jan 17, 2025
Commits
12f1217
refine ds modal for more cases: eval and es
TPLin22 Dec 18, 2024
e86aa73
update model template
TPLin22 Dec 18, 2024
cf5e18c
prompts for model and ensemble
WinstonLiyt Dec 18, 2024
81b27b4
fix a bug
WinstonLiyt Dec 18, 2024
dc8f71c
fix a bug
WinstonLiyt Dec 18, 2024
b6acea3
init: ds workflow evovingstrategy
TPLin22 Dec 18, 2024
7f70ce2
Adding ensemble (#505)
xisen-w Dec 18, 2024
62dbcf5
data science loop changes
XianBW Dec 18, 2024
3e240f3
merge pull
XianBW Dec 18, 2024
13fae9a
data science loop base
XianBW Dec 18, 2024
999d133
ds loop feedback
XianBW Dec 19, 2024
b6241cd
fix
XianBW Dec 19, 2024
7e2874f
remove measure_time because it's duplicated (in LoopBase)
XianBW Dec 19, 2024
3335406
add the knowledge query for data_loader & feature
WinstonLiyt Dec 19, 2024
26da5c4
edit ds workflow evaluator
TPLin22 Dec 19, 2024
00ad54e
data_loader bug fix
XianBW Dec 19, 2024
35a1db9
stop evolving when all tasks completed
XianBW Dec 19, 2024
f96f9a2
llm app change
XianBW Dec 20, 2024
a0a3db5
fix break all complete strategy
peteryang1 Dec 20, 2024
74a2829
Adding queried knowledge (#508)
xisen-w Dec 20, 2024
737bdb9
fix loop bug
XianBW Dec 20, 2024
6234966
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 20, 2024
ab41352
ds workflow evaluator; test; refine prompts
TPLin22 Dec 20, 2024
c2ed6e1
workflow spec
WinstonLiyt Dec 20, 2024
02ddf81
fix ci
WinstonLiyt Dec 20, 2024
bfa455a
feature task changes
XianBW Dec 20, 2024
61f0cb8
ds loop change
XianBW Dec 23, 2024
251688b
fix a bug in feat
WinstonLiyt Dec 23, 2024
438a569
add query knowledge for model and workflow
WinstonLiyt Dec 23, 2024
6497957
llm_debug info(for show) using pickle instead of json
XianBW Dec 23, 2024
4b7f4f2
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 23, 2024
3920a5c
remove NextLoopException
peteryang1 Dec 23, 2024
e8a85a6
loop change
XianBW Dec 23, 2024
3114f78
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 23, 2024
5845173
coder raise CoderError when all sub_tasks failed
XianBW Dec 23, 2024
3db73f0
rename code_dict to file_dict in FBWorkspace
XianBW Dec 23, 2024
7f85fdf
add CoSTEER unittest
XianBW Dec 23, 2024
9009b73
now show self.version in Task.get_task_information(), simplify CoSTEE…
XianBW Dec 23, 2024
39abb25
remove some properties in ModelTask, add model_type in it.
XianBW Dec 23, 2024
a6505d1
fix llm app bug
XianBW Dec 24, 2024
87dea18
llm web app bug fix
XianBW Dec 24, 2024
d2d88d9
ds loop bug fix
XianBW Dec 24, 2024
e8c2d6c
fix: give component code to feature&ens eval
XianBW Dec 24, 2024
0722d77
loop catch error bug
XianBW Dec 25, 2024
b53e03e
rename load_from_raw_data to load_data
XianBW Dec 25, 2024
01ad2e9
feat: Add debug data creation functionality for data science scenarios
you-n-g Dec 25, 2024
db1455b
support local folder (#511)
qew21 Dec 25, 2024
12a27ec
update sample data script
qew21 Dec 25, 2024
de2825f
make sure frac < 1
qew21 Dec 25, 2024
9a4ba5f
fix a bug
WinstonLiyt Dec 25, 2024
75b40d8
feature spec changes
XianBW Dec 25, 2024
84f0fb8
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 25, 2024
e8f2410
fix
XianBW Dec 25, 2024
418c2ce
changeimport order
qew21 Dec 26, 2024
fe10da7
clear unnecessary std outputs
WinstonLiyt Dec 26, 2024
a4e3ced
fix a typo
WinstonLiyt Dec 26, 2024
e009fd7
create sample folder after unzip kaggle data
qew21 Dec 26, 2024
36d26ee
feature/model test script update
XianBW Dec 26, 2024
08df71a
Align the data types across modules.
WinstonLiyt Dec 26, 2024
c02fd79
fix a bug in model eval
WinstonLiyt Dec 26, 2024
d3e3f60
show line number
XianBW Dec 26, 2024
fa21c04
move sample entry point to app
qew21 Dec 26, 2024
6682711
spec & model prompt changes
XianBW Dec 27, 2024
f8113b2
Refine the competition specification to address the data type problem…
WinstonLiyt Dec 27, 2024
36b4191
fix some bugs
WinstonLiyt Dec 27, 2024
34aa750
add file filter in FBworkspace.code property
XianBW Dec 27, 2024
d30ff40
support non-binary prediction
qew21 Dec 27, 2024
72bfa90
avoid too much warnings
qew21 Dec 27, 2024
d8b5a4c
fix a bug in ensemble module
WinstonLiyt Dec 30, 2024
ea39d9f
filtered the knowledge query in all modules
WinstonLiyt Dec 30, 2024
ed305c1
delete RAG in idea proposal
WinstonLiyt Dec 30, 2024
d9d29b3
refine the code in ensemble
WinstonLiyt Dec 30, 2024
593854c
show exp workspace in llm_st
XianBW Dec 30, 2024
77d8b8b
exp_gen bug fix
XianBW Dec 30, 2024
6eb92ab
feedback bug fix
XianBW Dec 30, 2024
ab2aab4
use `feature` instead of `feat01`
XianBW Dec 30, 2024
7ef6a0e
Trace & method of judging if exp is completed change
XianBW Dec 31, 2024
5da62ac
fix a bug in package calling and execute ci
WinstonLiyt Jan 2, 2025
0d59c6c
fix code
qew21 Jan 2, 2025
716ba1d
bug fix
XianBW Jan 2, 2025
2e014f9
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 2, 2025
7c046cf
bug fix
XianBW Jan 2, 2025
f722e6c
fix a bug
WinstonLiyt Jan 2, 2025
d44942d
fix some bugs
WinstonLiyt Jan 2, 2025
4e9aff3
fix a bug
WinstonLiyt Jan 2, 2025
288777d
refactor: Enhance error handling and feedback in data science loop
you-n-g Jan 2, 2025
5e2adfa
support different use_azure on chat and embedding models
peteryang1 Jan 2, 2025
2abdb96
multi-model proposal logic
WinstonLiyt Jan 2, 2025
e4af411
fix a small syntax error
peteryang1 Jan 2, 2025
f97092f
loopBase and some changes
XianBW Jan 2, 2025
89b7bec
merge pull
XianBW Jan 2, 2025
386fffa
ensemble scores change
XianBW Jan 2, 2025
749c6ca
fbworkspace.code -> .all_codes
XianBW Jan 2, 2025
f159b11
use all model codes in workflow coder
XianBW Jan 2, 2025
3b22e7c
check scores.csv's keys(model_names)
XianBW Jan 2, 2025
f4b1dd2
model name changes
XianBW Jan 2, 2025
ff710d6
add a todo in ensemble test
XianBW Jan 2, 2025
aac5349
sota_exp changes
XianBW Jan 2, 2025
07a3ef7
give model info in exp gen
XianBW Jan 2, 2025
f6b55f6
add runner time limit
XianBW Jan 3, 2025
9f4c84d
config using debug data or not in evals
XianBW Jan 3, 2025
f51b35e
exp to feedback base
XianBW Jan 3, 2025
3b2f15c
add feature code when writing model task
XianBW Jan 3, 2025
9240820
small problem
XianBW Jan 3, 2025
82d0635
copying during sampling
qew21 Jan 3, 2025
19e9b4c
update
peteryang1 Jan 3, 2025
3e945d2
Merge branch 'xuyang1/several_small_code_update_to_ds_refactor' into …
peteryang1 Jan 3, 2025
1fb000f
refactor: Simplify code handling and improve workspace management
you-n-g Jan 3, 2025
615d3b5
model part output fix
XianBW Jan 3, 2025
86180bf
print model's execution time
qew21 Jan 3, 2025
cfda303
bug fix
XianBW Jan 6, 2025
2ddcb24
ensemble test fix
XianBW Jan 6, 2025
28576df
ens small change
XianBW Jan 6, 2025
271e5f1
ens_test bug fix
XianBW Jan 6, 2025
8238802
Refine partial expansion logic to display only a few subfolders when …
WinstonLiyt Jan 6, 2025
43f8c1f
several update on prompts
peteryang1 Jan 6, 2025
3fd376c
Merge branch 'xuyang1/several_update_on_prompts' into ds_refactor
peteryang1 Jan 6, 2025
89000dd
Merge branch 'ds_refactor' into MM
XianBW Jan 6, 2025
1f5ce9a
sample subfolders
qew21 Jan 6, 2025
9900495
Filter the stdout after code execution to remove irrelevant informati…
WinstonLiyt Jan 6, 2025
6c23e7d
Add some more prompts and comments
you-n-g Jan 6, 2025
9295094
several update on the first init rounds
peteryang1 Jan 6, 2025
488a7eb
Merge branch 'xuyang1/several_new_updates' into ds_refactor
peteryang1 Jan 6, 2025
edeb337
model timeout as error
qew21 Jan 7, 2025
5e4f544
fix pattern of getting model codes in workspace
XianBW Jan 7, 2025
d657834
small bux fix on model prompts
peteryang1 Jan 7, 2025
0e9f0e2
Merge branch 'xuyang1/small_update_on_model_prompts' into ds_refactor
peteryang1 Jan 7, 2025
eb89153
remove get_code_with_key since we have regex pattern
peteryang1 Jan 7, 2025
9d27fe7
fix: Correct tqdm progress bar update logic in LoopBase class
you-n-g Jan 7, 2025
0e671ab
feat: Add diff generation and enhance feedback mechanism in data scie…
you-n-g Jan 7, 2025
a300ae4
update some fix to model and workflow prompts
peteryang1 Jan 7, 2025
684ca66
Merge branch 'xuyang1/several_update_on_model_and_workflow_prompt' in…
peteryang1 Jan 7, 2025
84d4891
refine the logic of progress bar filter
WinstonLiyt Jan 7, 2025
3e58cb5
add last_successful_exp in exp_gen
peteryang1 Jan 7, 2025
301b0c0
fix a one line bug
peteryang1 Jan 7, 2025
29e7149
add a hint in prompt
peteryang1 Jan 7, 2025
7e3d774
fix data sample for bms
qew21 Jan 7, 2025
67fdceb
fix data sample for bms
qew21 Jan 7, 2025
b35d7fb
hypothesis small fix
qew21 Jan 7, 2025
c86bbd9
crawler readme update
XianBW Jan 8, 2025
2018bc8
fix component gen
qew21 Jan 8, 2025
40faf5a
fix bug
XianBW Jan 8, 2025
4c1d8b6
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 8, 2025
7cbf7a3
annotation change
XianBW Jan 8, 2025
3ce8242
load description.md if it exists
peteryang1 Jan 8, 2025
ae29d1d
refactor: Simplify SOTA description handling in feedback and prompts
you-n-g Jan 8, 2025
8492cd4
refactor: Use shared templates for feedback and experiment descriptions
you-n-g Jan 8, 2025
06515e8
change webapp for model codes changes
XianBW Jan 8, 2025
87cda02
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 8, 2025
71506af
update proposal
qew21 Jan 8, 2025
0f4073e
add timeout message for docker run output
XianBW Jan 8, 2025
b9b9c76
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 8, 2025
797acd5
fix
XianBW Jan 8, 2025
56d57ac
refine the code in docker time processing
WinstonLiyt Jan 8, 2025
6d8f476
use .shape instead of len() when do shape eval
XianBW Jan 8, 2025
02004b9
won't change size during iteration
qew21 Jan 8, 2025
ac145dc
support bson sample
qew21 Jan 8, 2025
c88a23d
sample support jsonl and bson
qew21 Jan 9, 2025
1895846
add former_code to coder prompts
peteryang1 Jan 9, 2025
0d17ad9
a little speed us in debug data creating
peteryang1 Jan 9, 2025
7adb539
filter progress bar when eval ens and main
XianBW Jan 9, 2025
cc2d18b
Merge commit 'af6af11' into HEAD
you-n-g Jan 9, 2025
262e242
avoid costeer makes no change to former code
peteryang1 Jan 9, 2025
462982a
fix several log error
peteryang1 Jan 9, 2025
10de120
add timeout judge threshold
XianBW Jan 9, 2025
e096bec
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 9, 2025
c1c9f93
fix some bugs in the evaluation of component output shapes
WinstonLiyt Jan 9, 2025
fdbb4b8
File structure for supporting litellm (#517)
YeewahChan Jan 9, 2025
0c919ed
ignore submission and show processing
qew21 Jan 9, 2025
3096cf5
ignore submission and show processing
qew21 Jan 9, 2025
c9ef301
add efficiency notice
peteryang1 Jan 9, 2025
376d840
refactor: Enhance error message with detailed feedback summary
you-n-g Jan 9, 2025
814b06e
refactor: Simplify component handling in DSExpGen class
you-n-g Jan 9, 2025
da60a21
refactor: Update code structure and add docstring for clarity
you-n-g Jan 9, 2025
c2b41fa
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 10, 2025
fe88b07
reserve one sample to each label in data sampling
peteryang1 Jan 10, 2025
a26b80e
add Evaluation info
qew21 Jan 10, 2025
091d687
refine costeer code to avoid giving same code twice
peteryang1 Jan 10, 2025
c58d5f6
use raw_description as plain text
peteryang1 Jan 10, 2025
9259839
add a prompt hint to avoid same dict key
peteryang1 Jan 10, 2025
cbadaa5
model task name bug in first model exp gen
XianBW Jan 10, 2025
dc01b8a
Merge branches 'ds_refactor' and 'ds_refactor' of github.com:microsof…
XianBW Jan 10, 2025
059f36a
fix a typo
peteryang1 Jan 10, 2025
13ae2ae
add some debug info in costeer tests
peteryang1 Jan 10, 2025
e516199
task init change
XianBW Jan 10, 2025
5cee384
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 10, 2025
07c5e40
enhance data sampling
qew21 Jan 13, 2025
f7d349a
refine the code in data_loader
WinstonLiyt Jan 13, 2025
189330f
more reasonable loop
you-n-g Jan 13, 2025
cc62e16
fix a bug in data folder description
WinstonLiyt Jan 13, 2025
39b5c25
add error msg & traceback to execution feedback
XianBW Jan 13, 2025
ad14b2a
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 13, 2025
7fab381
fix llm error msg detection
XianBW Jan 13, 2025
51c247e
add task information to costeer eval & add cache to docker run(use zi…
peteryang1 Jan 13, 2025
f496bfc
fix CI first round
peteryang1 Jan 13, 2025
938bcac
fix CI second round
peteryang1 Jan 13, 2025
e979d56
use txt to store test script to avoid pytest
peteryang1 Jan 13, 2025
016189b
remove zipfile in requirements
peteryang1 Jan 13, 2025
b6e6a0a
add azure.identity to requirements
peteryang1 Jan 13, 2025
031cb1d
ignore debug web page
peteryang1 Jan 14, 2025
edfe179
component test changes
XianBW Jan 14, 2025
01874ca
remove redundent task_desc in model coder
XianBW Jan 14, 2025
6f634a5
feat: Add APE module and prompts for automated prompt engineering
you-n-g Jan 13, 2025
5cff5fc
fix: Update .gitignore and improve text formatting in eval.py
you-n-g Jan 14, 2025
d68bc14
refactor: Update print output and improve code comments and imports
you-n-g Jan 14, 2025
0171b41
style: Fix string formatting and import order in ape.py and fmt.py
you-n-g Jan 14, 2025
e5e218f
exclude ape
you-n-g Jan 14, 2025
082efaf
add a data folder notice
peteryang1 Jan 14, 2025
c8cf383
reduce unnecessary output to stdout
WinstonLiyt Jan 14, 2025
7474c97
refine the code of describe_data_folder
WinstonLiyt Jan 14, 2025
cec7a91
fix ci
WinstonLiyt Jan 14, 2025
5a5d0cb
style: streamlit style update (#522)
qew21 Jan 15, 2025
1ede9d6
fix llm_st loop progress bar
XianBW Jan 15, 2025
b68c724
debugapp small change
XianBW Jan 15, 2025
a3d3e15
fix model str
qew21 Jan 15, 2025
eff7d22
refine some prompts
WinstonLiyt Jan 15, 2025
8b63176
fix model str
qew21 Jan 15, 2025
37de5a3
fix CI
peteryang1 Jan 15, 2025
58ecf57
refine the logic associated with the data_folder
WinstonLiyt Jan 15, 2025
1ccba61
fix ci
WinstonLiyt Jan 15, 2025
8968482
small change
XianBW Jan 15, 2025
1456331
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 15, 2025
1547cce
set filter_progress_bar as default in execute
peteryang1 Jan 15, 2025
6be3fc7
model proposal with workflow
qew21 Jan 15, 2025
c3ddfb0
add submission check in workflow eval
XianBW Jan 15, 2025
e93ad4c
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 15, 2025
0e88ee0
fix bug
XianBW Jan 15, 2025
b8852d1
small change
XianBW Jan 15, 2025
8086e39
fix CI
XianBW Jan 15, 2025
953a1d7
fix CI
XianBW Jan 15, 2025
ad47360
refactor: Move generate_diff to utils and update DSExpGen logic
you-n-g Jan 16, 2025
534c398
more reasonable prompt describing metric direction
peteryang1 Jan 16, 2025
97c9502
fix a minor jinja2 bug
peteryang1 Jan 16, 2025
bd402fa
quick fix exp_gen bugs
peteryang1 Jan 16, 2025
7f91db3
fix the following bug
peteryang1 Jan 16, 2025
432e2b5
fix
WinstonLiyt Jan 16, 2025
ba7db06
fix some bugs
WinstonLiyt Jan 16, 2025
7a6669f
remove workflow from model
qew21 Jan 16, 2025
f005938
add pending_tasks_list in data science to enable coding model and wor…
peteryang1 Jan 16, 2025
8516e48
refine the code for handling JSON-formatted data descriptions
WinstonLiyt Jan 16, 2025
d2959b0
assert with information
qew21 Jan 16, 2025
ea2ff15
ensure correct csv file name
qew21 Jan 16, 2025
2741121
add logging to help record the output
peteryang1 Jan 17, 2025
b816570
log competition
peteryang1 Jan 17, 2025
e572aa1
add log tag for debug llm app
XianBW Jan 17, 2025
ae0ec76
test: Test ds refactor ll (#523)
SunsetWolf Jan 17, 2025
626296c
fix CI
peteryang1 Jan 17, 2025
a826c2b
Update rdagent/app/data_science/loop.py
peteryang1 Jan 17, 2025
734700f
add samplecsv into spec prompts
TPLin22 Jan 17, 2025
792660b
fix CI
peteryang1 Jan 17, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -4,6 +4,7 @@
 Pipfile
 public
 release-notes.md
+typescript*
 
 # Byte-compiled / optimized / DLL files
 __pycache__/
@@ -170,3 +171,4 @@ mlruns/
 # shell script
 *.out
 *.sh
+.aider*
7 changes: 5 additions & 2 deletions pyproject.toml
@@ -61,6 +61,10 @@ explicit_package_bases = true
 warn_return_any = true
 warn_unused_ignores = true
 
+[[tool.mypy.overrides]]
+ignore_missing_imports = true
+module = "llama"
+
 [tool.pytest.ini_options]
 addopts = "-l -s --durations=0"
 log_cli = true
@@ -77,7 +81,6 @@ src = ["rdagent"]
 [tool.ruff.lint]
 ignore = [
   # https://docs.astral.sh/ruff/rules/#pydocstyle-d
-  "ANN101",
   "ANN401",
   "D",
   "ERA001",
@@ -88,7 +91,7 @@ ignore = [
   "S101",
   "S301",
   "T20",
-  "TCH003",
+  "TC003",
   "TD",
 ]
 select = ["ALL"]
2 changes: 1 addition & 1 deletion rdagent/app/data_mining/conf.py
@@ -23,7 +23,7 @@ class MedBasePropSetting(BasePropSetting):
     runner: str = "rdagent.scenarios.data_mining.developer.model_runner.DMModelRunner"
     """Runner class"""
 
-    summarizer: str = "rdagent.scenarios.data_mining.developer.feedback.DMModelHypothesisExperiment2Feedback"
+    summarizer: str = "rdagent.scenarios.data_mining.developer.feedback.DMModelExperiment2Feedback"
     """Summarizer class"""
 
     evolving_n: int = 10
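Note on the rename above: the `summarizer` setting is a dotted path stored as a string, so a stale class name is only noticed when the path is resolved at run time. A minimal sketch of that kind of lookup, assuming it works along the lines of the `import_class` helper imported elsewhere in this PR. `resolve_class` is a hypothetical name, and the standard-library target is used only so the snippet runs on its own:

```python
import importlib


def resolve_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object (illustrative stand-in)."""
    # Split off the final component as the attribute to fetch from the module.
    module_path, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    # getattr raises AttributeError if the class was renamed but the string was not.
    return getattr(module, class_name)


# A standard-library class stands in for a project class here.
ordered_dict_cls = resolve_class("collections.OrderedDict")
instance = ordered_dict_cls(a=1)
```

With a lookup like this, a string that still names the old class fails with `AttributeError` on first use, which is why the setting has to change together with the class name.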
49 changes: 49 additions & 0 deletions rdagent/app/data_science/conf.py
@@ -0,0 +1,49 @@
+from rdagent.app.kaggle.conf import KaggleBasePropSetting
+from rdagent.core.conf import ExtendedSettingsConfigDict
+
+
+class DataScienceBasePropSetting(KaggleBasePropSetting):
+    model_config = ExtendedSettingsConfigDict(env_prefix="DS_", protected_namespaces=())
+
+    # Main components
+    ## Scen
+    scen: str = "rdagent.scenarios.data_science.scen.KaggleScen"
+    """Scenario class for data mining model"""
+
+    ## proposal
+    exp_gen: str = "rdagent.scenarios.data_science.proposal.exp_gen.DSExpGen"
+    # exp_gen_init_kwargs: dict = {"max_trace_hist": 3} # TODO: to be configurable
+
+    # the two below should be used in ExpGen
+    # hypothesis_gen: str = "rdagent.scenarios.kaggle.proposal.proposal.KGHypothesisGen"
+    # """Hypothesis generation class"""
+    #
+    # hypothesis2experiment: str = "rdagent.scenarios.kaggle.proposal.proposal.KGHypothesis2Experiment"
+    # """Hypothesis to experiment class"""
+
+    ## dev/coder
+    data_loader_coder: str = "rdagent.components.coder.data_science.raw_data_loader.DataLoaderCoSTEER"
+    """Data Loader CoSTEER"""
+
+    # feature_coder: str = "rdagent.scenarios.kaggle.developer.coder.KGFactorCoSTEER"
+    # """Feature Coder class"""
+
+    # model_feature_selection_coder: str = "rdagent.scenarios.kaggle.developer.coder.KGModelFeatureSelectionCoder"
+    # """Model Feature Selection Coder class"""
+
+    # model_coder: str = "rdagent.scenarios.kaggle.developer.coder.KGModelCoSTEER"
+    # """Model Coder class"""
+
+    ## dev/runner
+    feature_runner: str = "rdagent.scenarios.kaggle.developer.runner.KGFactorRunner"
+    """Feature Runner class"""
+
+    model_runner: str = "rdagent.scenarios.kaggle.developer.runner.KGModelRunner"
+    """Model Runner class"""
+
+    ## feedback
+    summarizer: str = "rdagent.scenarios.kaggle.developer.feedback.KGExperiment2Feedback"
+    """Summarizer class"""
+
+
+DS_RD_SETTING = DataScienceBasePropSetting()
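Note on `env_prefix="DS_"`: in pydantic-settings style configuration, a prefix like this means each field is filled from an environment variable named prefix plus field name. The sketch below imitates that idea with the standard library only. It is not how `ExtendedSettingsConfigDict` is implemented; `DemoSettings` and `load_with_prefix` are hypothetical names, and the exact-uppercase matching is a simplification:

```python
from dataclasses import dataclass, fields
from typing import Dict


@dataclass
class DemoSettings:
    """Hypothetical stand-in for a settings class with two string fields."""

    competition: str = ""
    local_data_path: str = ""


def load_with_prefix(prefix: str, environ: Dict[str, str]) -> DemoSettings:
    """Fill each field from `<prefix><FIELD_NAME>` when that variable is present."""
    overrides = {}
    for field in fields(DemoSettings):
        key = f"{prefix}{field.name.upper()}"
        if key in environ:
            overrides[field.name] = environ[key]
    # Fields without a matching variable keep their declared defaults.
    return DemoSettings(**overrides)


# Only the DS_-prefixed variable is picked up; the KG_ one is ignored.
settings = load_with_prefix(
    "DS_",
    {"DS_COMPETITION": "titanic", "KG_LOCAL_DATA_PATH": "/tmp/kg"},
)
```

In practice the mapping would be `os.environ`; a plain dict is passed here so the example does not depend on the caller's environment.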
6 changes: 6 additions & 0 deletions rdagent/app/data_science/debug.py
@@ -0,0 +1,6 @@
+import fire
+
+from rdagent.scenarios.data_science.debug.data import create_debug_data
+
+if __name__ == "__main__":
+    fire.Fire(create_debug_data)
163 changes: 163 additions & 0 deletions rdagent/app/data_science/loop.py
@@ -0,0 +1,163 @@
+from pathlib import Path
+from typing import Any
+
+import fire
+
+from rdagent.app.data_science.conf import DS_RD_SETTING
+from rdagent.components.coder.data_science.ensemble import EnsembleCoSTEER
+from rdagent.components.coder.data_science.feature import FeatureCoSTEER
+from rdagent.components.coder.data_science.model import ModelCoSTEER
+from rdagent.components.coder.data_science.raw_data_loader import DataLoaderCoSTEER
+from rdagent.components.coder.data_science.workflow import WorkflowCoSTEER
+from rdagent.components.workflow.conf import BasePropSetting
+from rdagent.components.workflow.rd_loop import RDLoop
+from rdagent.core.exception import CoderError, RunnerError
+from rdagent.core.proposal import ExperimentFeedback, HypothesisFeedback
+from rdagent.core.scenario import Scenario
+from rdagent.core.utils import import_class
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.data_science.dev.feedback import DSExperiment2Feedback
+from rdagent.scenarios.data_science.dev.runner import DSRunner
+from rdagent.scenarios.data_science.experiment.experiment import DSExperiment
+from rdagent.scenarios.data_science.proposal.exp_gen import DSExpGen, DSTrace
+from rdagent.scenarios.kaggle.kaggle_crawler import download_data
+
+
+class DataScienceRDLoop(RDLoop):
+    skip_loop_error = (CoderError, RunnerError)
+
+    def __init__(self, PROP_SETTING: BasePropSetting):
+        logger.log_object(PROP_SETTING.competition, tag="competition")
+        scen: Scenario = import_class(PROP_SETTING.scen)(PROP_SETTING.competition)
+
+        ### shared components in the workflow # TODO: check if
+        knowledge_base = (
+            import_class(PROP_SETTING.knowledge_base)(PROP_SETTING.knowledge_base_path, scen)
+            if PROP_SETTING.knowledge_base != ""
+            else None
+        )
+
+        # 1) task generation from scratch
+        # self.scratch_gen: tuple[HypothesisGen, Hypothesis2Experiment] = DummyHypothesisGen(scen),
+
+        # 2) task generation from a complete solution
+        # self.exp_gen: ExpGen = import_class(PROP_SETTING.exp_gen)(scen)
+        self.exp_gen = DSExpGen(scen)
+        self.data_loader_coder = DataLoaderCoSTEER(scen)
+        self.feature_coder = FeatureCoSTEER(scen)
+        self.model_coder = ModelCoSTEER(scen)
+        self.ensemble_coder = EnsembleCoSTEER(scen)
+        self.workflow_coder = WorkflowCoSTEER(scen)
+
+        self.runner = DSRunner(scen)
+        # self.summarizer: Experiment2Feedback = import_class(PROP_SETTING.summarizer)(scen)
+        # logger.log_object(self.summarizer, tag="summarizer")
+
+        # self.trace = KGTrace(scen=scen, knowledge_base=knowledge_base)
+        self.trace = DSTrace(scen=scen)
+        self.summarizer = DSExperiment2Feedback(scen)
+        super(RDLoop, self).__init__()
+
+    def direct_exp_gen(self, prev_out: dict[str, Any]):
+        exp = self.exp_gen.gen(self.trace)
+        logger.log_object(exp, tag="direct_exp_gen")
+
+        # FIXME: this is for LLM debug webapp, remove this when the debugging is done.
+        logger.log_object(exp, tag="debug_exp_gen")
+        return exp
+
+    def coding(self, prev_out: dict[str, Any]):
+        exp = prev_out["direct_exp_gen"]
+        for tasks in exp.pending_tasks_list:
+            exp.sub_tasks = tasks
+            if exp.hypothesis.component == "DataLoadSpec":
+                exp = self.data_loader_coder.develop(exp)
+            elif exp.hypothesis.component == "FeatureEng":
+                exp = self.feature_coder.develop(exp)
+            elif exp.hypothesis.component == "Model":
+                exp = self.model_coder.develop(exp)
+            elif exp.hypothesis.component == "Ensemble":
+                exp = self.ensemble_coder.develop(exp)
+            elif exp.hypothesis.component == "Workflow":
+                exp = self.workflow_coder.develop(exp)
+            else:
+                raise NotImplementedError(f"Unsupported component in DataScienceRDLoop: {exp.hypothesis.component}")
+            exp.sub_tasks = []
+        logger.log_object(exp, tag="coding")
+        return exp
+
+    def running(self, prev_out: dict[str, Any]):
+        exp: DSExperiment = prev_out["coding"]
+        if exp.next_component_required() is None:
+            new_exp = self.runner.run(exp)
+            logger.log_object(new_exp, tag="running")
+            return new_exp
+        else:
+            return exp
+
+    def feedback(self, prev_out: dict[str, Any]) -> ExperimentFeedback:
+        exp: DSExperiment = prev_out["running"]
+        if exp.next_component_required() is None:
+            feedback = self.summarizer.generate_feedback(exp, self.trace)
+        else:
+            feedback = ExperimentFeedback(
+                reason=f"{exp.hypothesis.component} is completed.",
+                decision=True,
+            )
+        logger.log_object(feedback, tag="feedback")
+        return feedback
+
+    def record(self, prev_out: dict[str, Any]):
+        e = prev_out.get(self.EXCEPTION_KEY, None)
+        if e is None:
+            self.trace.hist.append((prev_out["running"], prev_out["feedback"]))
+        else:
+            self.trace.hist.append(
+                (
+                    prev_out["direct_exp_gen"] if isinstance(e, CoderError) else prev_out["coding"],
+                    ExperimentFeedback.from_exception(e),
+                )
+            )
+        logger.log_object(self.trace, tag="trace")
+        logger.log_object(self.trace.sota_experiment(), tag="SOTA experiment")
+
+
+def main(path=None, step_n=None, competition="bms-molecular-translation"):
+    """
+
+    Parameters
+    ----------
+    path :
+        path like `$LOG_PATH/__session__/1/0_propose`. It indicates that we restore the state that after finish the step 0 in loop1
+    step_n :
+        How many steps to run; if None, it will run forever until error or KeyboardInterrupt
+    competition :
+
+
+    Auto R&D Evolving loop for models in a Kaggle scenario.
+    You can continue running session by
+    .. code-block:: bash
+        dotenv run -- python rdagent/app/data_science/loop.py [--competition titanic] $LOG_PATH/__session__/1/0_propose --step_n 1 # `step_n` is a optional parameter
+        rdagent kaggle --competition playground-series-s4e8 # You are encouraged to use this one.
+    """
+    if competition is not None:
+        DS_RD_SETTING.competition = competition
+
+    if DS_RD_SETTING.competition:
+        if DS_RD_SETTING.scen.endswith("KaggleScen"):
+            download_data(competition=DS_RD_SETTING.competition, settings=DS_RD_SETTING)
+        else:
+            if not Path(f"{DS_RD_SETTING.local_data_path}/{competition}").exists():
+                logger.error(f"Please prepare data for competition {competition} first.")
+                return
+    else:
+        logger.error("Please specify competition name.")
+    if path is None:
+        kaggle_loop = DataScienceRDLoop(DS_RD_SETTING)
+    else:
+        kaggle_loop = DataScienceRDLoop.load(path)
+    kaggle_loop.run(step_n=step_n)
+
+
+if __name__ == "__main__":
+    fire.Fire(main)
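Note on `coding()` above: the `if`/`elif` chain maps a component name to the coder that develops it, and rejects anything else. The same routing can be expressed as a lookup table. This is only a sketch of the pattern, not a proposed change to the PR: the coders are replaced by hypothetical stand-in functions, and the experiment is modelled as a plain list so the snippet runs without rdagent installed:

```python
from typing import Callable, Dict, List


def develop_with(name: str) -> Callable[[List[str]], List[str]]:
    """Build a stand-in 'coder' that records which component handled the experiment."""

    def develop(exp: List[str]) -> List[str]:
        return exp + [name]

    return develop


# Component names are the ones checked in coding(); the handlers are placeholders.
CODERS: Dict[str, Callable[[List[str]], List[str]]] = {
    "DataLoadSpec": develop_with("data_loader"),
    "FeatureEng": develop_with("feature"),
    "Model": develop_with("model"),
    "Ensemble": develop_with("ensemble"),
    "Workflow": develop_with("workflow"),
}


def code_component(component: str, exp: List[str]) -> List[str]:
    """Route the experiment to the handler registered for `component`."""
    try:
        coder = CODERS[component]
    except KeyError:
        # Mirrors the final `else` branch: unknown components are rejected.
        raise NotImplementedError(f"Unsupported component: {component}") from None
    return coder(exp)


result = code_component("Model", [])
```

A table keeps the list of supported components in one place, at the cost of one level of indirection compared with the explicit chain.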
28 changes: 12 additions & 16 deletions rdagent/app/kaggle/conf.py
@@ -1,8 +1,7 @@
-from rdagent.components.workflow.conf import BasePropSetting
-from rdagent.core.conf import ExtendedSettingsConfigDict
+from rdagent.core.conf import ExtendedBaseSettings, ExtendedSettingsConfigDict
 
 
-class KaggleBasePropSetting(BasePropSetting):
+class KaggleBasePropSetting(ExtendedBaseSettings):
     model_config = ExtendedSettingsConfigDict(env_prefix="KG_", protected_namespaces=())
 
     # 1) overriding the default
@@ -30,7 +29,7 @@ class KaggleBasePropSetting(BasePropSetting):
     model_runner: str = "rdagent.scenarios.kaggle.developer.runner.KGModelRunner"
     """Model Runner class"""
 
-    summarizer: str = "rdagent.scenarios.kaggle.developer.feedback.KGHypothesisExperiment2Feedback"
+    summarizer: str = "rdagent.scenarios.kaggle.developer.feedback.KGExperiment2Feedback"
     """Summarizer class"""
 
     evolving_n: int = 10
@@ -45,12 +44,21 @@ class KaggleBasePropSetting(BasePropSetting):
     local_data_path: str = ""
     """Folder storing Kaggle competition data"""
 
+    if_using_mle_data: bool = False
+    auto_submit: bool = False
+    """Automatically upload and submit each experiment result to Kaggle platform"""
+    # Conditionally set the knowledge_base based on the use of graph RAG
+    knowledge_base: str = ""
+    """Knowledge base class, uses 'KGKnowledgeGraph' when advanced graph-based RAG is enabled, otherwise empty."""
     if_action_choosing_based_on_UCB: bool = False
     """Enable decision mechanism based on UCB algorithm"""
 
     domain_knowledge_path: str = "/data/userdata/share/kaggle/domain_knowledge"
     """Folder storing domain knowledge files in .case format"""
 
+    knowledge_base_path: str = "kg_graph.pkl"
+    """Advanced version of graph-based RAG"""
+
     rag_path: str = "git_ignore_folder/kaggle_vector_base.pkl"
     """Base version of vector-based RAG"""
 
@@ -60,20 +68,8 @@ class KaggleBasePropSetting(BasePropSetting):
     if_using_graph_rag: bool = False
     """Enable advanced graph-based RAG"""
 
-    # Conditionally set the knowledge_base based on the use of graph RAG
-    knowledge_base: str = ""
-    """Knowledge base class, uses 'KGKnowledgeGraph' when advanced graph-based RAG is enabled, otherwise empty."""
-
-    knowledge_base_path: str = "kg_graph.pkl"
-    """Advanced version of graph-based RAG"""
-
-    auto_submit: bool = False
-    """Automatically upload and submit each experiment result to Kaggle platform"""
-
     mini_case: bool = False
     """Enable mini-case study for experiments"""
 
-    if_using_mle_data: bool = False
-
 
 KAGGLE_IMPLEMENT_SETTING = KaggleBasePropSetting()