Add: Import and export tool for csv/parquet/json #181
…tjson-arrow-datasets
…to 154-implement-importexport-tool-for-csvparquetjson-arrow-datasets
void import_parquet(ukv_graph_import_t& c, ukv_size_t max_batch_size) {

    arrow::Status status;
    arrow::MemoryPool* pool = arrow::default_memory_pool();
We can use our arenas, like in the client.
It's not working with Parquet.
tools/dataset.cpp
Outdated
void import_json(ukv_graph_import_t& c, ukv_size_t max_batch_size) {

    std::vector<edge_t> array;
Can't we preallocate a max-size vector?
For small data, this may be overkill.
These tools are only intended for large imports.
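A minimal sketch of the preallocation being suggested, assuming the batch size is known up front. The `edge_t` layout and both helper names are hypothetical stand-ins, not the tool's actual code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the edge type used by the import tool.
struct edge_t {
    long source;
    long target;
    long id;
};

// Reserve room for one full batch up front, so that push_back does not
// reallocate while the batch is being filled.
std::vector<edge_t> make_batch_buffer(std::size_t max_batch_size) {
    std::vector<edge_t> batch;
    batch.reserve(max_batch_size);
    return batch;
}

// Append an edge; returns true once the batch is full and should be flushed.
bool push_edge(std::vector<edge_t>& batch, edge_t edge, std::size_t max_batch_size) {
    batch.push_back(edge);
    return batch.size() >= max_batch_size;
}
```

After a flush, calling `batch.clear()` keeps the reserved capacity, so the same buffer can serve every batch of a large import.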
fix:/refactor: minor
… always converted to true
src/modality_docs.cpp
Outdated
@@ -1281,6 +1287,30 @@ void ukv_docs_write(ukv_docs_write_t* c_ptr) {
    linked_memory_lock_t arena = linked_memory(c.arena, c.options, c.error);
    return_on_error(c.error);

    std::vector<ukv_key_t> keys_vec;
No std::vector-s allowed... We have to use our own memory.
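To illustrate the idea of taking key storage from caller-owned memory instead of a heap-backed `std::vector`, here is a rough sketch. `bump_arena_t`, `doc_key_t` and `copy_keys` are hypothetical; the project's own arena (`linked_memory` in the hunk above) has its own interface, which is not shown in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical fixed-capacity bump arena; a stand-in for the project's arena.
// The buffer is assumed to be aligned for every type requested from it.
class bump_arena_t {
    std::byte* begin_;
    std::size_t capacity_;
    std::size_t used_ = 0;

  public:
    bump_arena_t(std::byte* buffer, std::size_t capacity) noexcept : begin_(buffer), capacity_(capacity) {}

    // Returns aligned, uninitialized storage for `count` objects of T,
    // or nullptr when the arena cannot fit them.
    template <typename T>
    T* alloc(std::size_t count) noexcept {
        static_assert(std::is_trivially_copyable_v<T>, "sketch handles plain data only");
        std::size_t aligned = (used_ + alignof(T) - 1) / alignof(T) * alignof(T);
        if (aligned > capacity_ || count > (capacity_ - aligned) / sizeof(T))
            return nullptr;
        used_ = aligned + count * sizeof(T);
        return reinterpret_cast<T*>(begin_ + aligned);
    }
};

using doc_key_t = std::int64_t; // hypothetical key type

// Copies keys into arena memory; no global heap allocation takes place.
doc_key_t* copy_keys(bump_arena_t& arena, doc_key_t const* keys, std::size_t count) noexcept {
    doc_key_t* copy = arena.alloc<doc_key_t>(count);
    if (copy)
        std::memcpy(copy, keys, count * sizeof(doc_key_t));
    return copy;
}
```

Exhaustion is reported with a null pointer rather than an exception, which fits an interface that reports failures through an error field.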
src/modality_docs.cpp
Outdated
strided_iterator_gt<ukv_length_t const> lens {c.lengths, c.lengths_stride};

for (size_t idx = 0; idx < c.tasks_count; ++idx, ++vals, ++lens) {
    simdjson::ondemand::parser parser;
Reuse the state...
tools/dataset.cpp
Outdated
        return arrow::Status(arrow::StatusCode::TypeError, "Not supported type");
    }
    arrow::Status Visit(arrow::BooleanArray const& arr) {
        json = fmt::format("{}{},", json, arr.Value(idx));
This is horrible.
You are reallocating a new string every time you want to append a boolean.
At least use fmt::format_to.
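The difference can be sketched with the standard library alone: append into the existing string instead of building a new one from the old contents on every call. `append_bool` is a hypothetical helper, not code from this PR; the comment shows the `fmt::format_to` form the reviewer refers to.

```cpp
#include <string>

// Appends one boolean cell followed by a comma directly into `json`.
// The existing contents are never copied into a fresh string, unlike
// json = fmt::format("{}{},", json, value).
// With the fmt library, the same append can be written as
// fmt::format_to(std::back_inserter(json), "{},", value).
void append_bool(std::string& json, bool value) {
    json.append(value ? "true" : "false");
    json.push_back(',');
}
```

Because the string grows in place, the cost of appending a cell is amortized constant instead of proportional to everything written so far.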
tools/dataset.cpp
Outdated
///////// Helpers /////////

class arrow_visitor {
Check styling guidelines.
tools/dataset.cpp
Outdated
    size_t idx = 0;
};

bool strcmp_(const char* lhs, const char* rhs) {
Style guidelines: East const
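For reference, the East const form of that signature would look roughly like this. The function body is not visible in the hunk, so the equality check here is an assumption based on the `bool` return type.

```cpp
#include <cstring>

// West const (as submitted): bool strcmp_(const char* lhs, const char* rhs)
// East const: the qualifier goes to the right of the type it applies to.
bool strcmp_(char const* lhs, char const* rhs) {
    // Assumed behavior: true when both C strings are equal.
    return std::strcmp(lhs, rhs) == 0;
}
```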
docs_vec.reserve(size);

if (c.fields) {
    std::vector<std::string> fields(c.fields_count);
Why do you need a std::vector or std::string?
tools/dataset.cpp
Outdated
char file_name[uuid_length];
make_uuid(file_name);
std::ofstream output(fmt::format("{}{}", file_name, c.paths_extension));
Please, never use ofstream or the rest of the old-school IO libraries from STL. Especially on the hot path.
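The comment does not name a replacement. One possible direction, assuming a POSIX target, is to write through file descriptors directly; `write_file` below is a hypothetical helper, not the approach the PR ended up with.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Writes the whole buffer to `path` through a POSIX file descriptor,
// looping over partial writes. Returns true on success.
bool write_file(char const* path, char const* data, std::size_t length) noexcept {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return false;
    std::size_t written = 0;
    while (written < length) {
        ssize_t result = ::write(fd, data + written, length - written);
        if (result < 0) {
            if (errno == EINTR)
                continue; // interrupted before any byte was written; retry
            ::close(fd);
            return false;
        }
        written += static_cast<std::size_t>(result);
    }
    return ::close(fd) == 0;
}
```

On the hot path the descriptor would normally be opened once and reused across batches rather than reopened for each write.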
…tjson-arrow-datasets
No description provided.