-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Draft] first version of DynamicAlgorithm class + demo algorithm to demonstrate #395
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't what I meant. The Algorithm
class is designed to have its LoadData
method be called once, and then for Execute
to be called multiple times. Thus, for dynamic algorithms I suggest executing the algorithm for the first time in LoadData
, then processing updates in Execute
. Because the names of methods no longer make sense, I then suggested that they should be renamed: LoadData
-> Initialize
, Execute
-> Process
, along with their Python bindings (.def("execute", ...)
-> .def("process", ...)
).
The workflow with the Python constructor doing the initialization is for the future. That's what Algorithm
's bindings should be doing. It would be nice for it to be implemented here, but if not, then that's fine.
So, again, here is what I think should be done for now: execute the dynamic algorithm in LoadData
, process updates in Execute
. Rename the methods and their bindings as above. Leave everything else as is.
With the changes above, the |
aec1696
to
d8b8264
Compare
insert_statements_.Clear(); | ||
delete_statements_.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be achieved if we just provide empty values as defaults for these options. The user has to set all options before executing the algorithm, so they would inevitably become empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fields will not be empty, because they are not options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, okay. They store the inserted and deleted lines, right? Isn't it wasteful to store them entirely in memory, though? I think these things should be managed by the algorithm.
if (data->GetNumberOfColumns() != input_table_->GetNumberOfColumns()) { | ||
throw config::ConfigurationError( | ||
"Invalid data received: the number of columns in the \ | ||
modification statements is different from the table."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Option
class allows validation and normalization of values, default values, and it is also possible to have options that are only shown if some condition, which is expected to depend on other option values, holds. The basic checks can be done with those methods. MetricVerifier
uses all the Option
's methods, so you can look there for an example. But option management shouldn't be in the base class anyway. In particular, some algorithms may not support deletion, yet the option would be visible here.
static int CreateId() { | ||
static int id = 1; | ||
return id++; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes row IDs unpredictable for the user. They could have several algorithms running, and this will make row numbers inconsistent.
Don't make row IDs static. Row IDs should be indices of the rows in the table that a particular algorithm object remembers rows of (perhaps without actually storing it in memory in its entirety). So I doubt the row itself needs to remember its own ID.
Also, using IDs like that will make some checks easier. An algorithm will only need to store the total number of rows to check if a row can be deleted, for example. And it will be possible to store which rows have been deleted in a bitset.
// configure update statements | ||
ValidateDeleteStatements(update_old_batch_); | ||
ValidateInsertStatements(update_new_batch_); | ||
if (insert_statements_.Size() != delete_statements_.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would make more sense to have an object that yields (id, new_data)
pairs instead as a single update
option in a manner similar to IDatasetStream
or just a vector of pairs. This option is one of the cases where checking its value (for example, checking that all the rows have the correct length) completely might not be feasible.
bool algos::DynamicAlgorithm::HasBatch() { | ||
bool result = false; | ||
for (const std::string_view& option_name : kCrudOptions) { | ||
result |= IsOptionSet(option_name); | ||
} | ||
return result; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already taken care of by the configuration system. Default values can be set by Option
to be empty, and we'd just have no rows/IDs/pairs in the relevant field.
} | ||
|
||
void algos::DynamicAlgorithm::MakeExecuteOptsAvailable() { | ||
if (is_initialized_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check does nothing, existing configuration only calls this method after LoadData
has been called.
for (const std::string_view& option_name : kCrudOptions) { | ||
previous_options.erase(option_name); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes these options invisible in a GetNeededOptions
call, but they should be visible.
for (const std::string_view& option_name : kCrudOptions) { | ||
SetOptionByName(algo, option_name, kwargs); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just bypassing the existing configuration infrastructure.
unsigned long long algos::DynamicAlgorithm::ExecuteInternal() { | ||
if (HasBatch()) { | ||
ConfigureBatch(); | ||
unsigned long long time_ms = ProcessBatch(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ProcessBatch
should just be ExecuteInternal
for every algorithm (unless there are a lot of common things happening, but that's for the future).
.def(py::init([](py::kwargs const& kwargs) { | ||
auto algo = std::make_unique<DynamicAlgorithmDemo>(); | ||
ConfigureAlgo(*algo, kwargs); | ||
algo->LoadData(); | ||
return algo; | ||
})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave the constructor as is for now. Use RegisterAlgorithm
No description provided.