I searched the issues and found no similar issues.
Component
Transforms/Other
What happened + What you expected to happen
The web2parquet transform does not process the input files that are provided to it. Instead, its transform() method crawls the single URL passed in on the command line. For that to happen at all, there must be at least 1 input parquet file in the input folder. Worse, if there are N files in the input folder, the URL is crawled N times, producing N output parquet files.
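To make the failure mode concrete, here is a minimal sketch of the behavior described above. It is not the actual dpk_web2parquet code: `CrawlingTransform`, `crawl()`, and `run_orchestrator()` are hypothetical names, the crawl is faked, and the only assumption carried over from the report is that the runtime calls transform() once per input file while the transform ignores the table it receives.

```python
from dataclasses import dataclass


@dataclass
class CrawlingTransform:
    """Hypothetical stand-in for the reported behavior (not the real transform)."""

    urls: str
    crawl_count: int = 0

    def crawl(self) -> list[dict]:
        # Placeholder for a real web crawl: returns one fake record per call.
        self.crawl_count += 1
        return [{"url": self.urls, "contents": b"<html>...</html>"}]

    def transform(self, table: list[dict], file_name: str) -> list[dict]:
        # The input table and file name are ignored; only the configured URL is used.
        return self.crawl()


def run_orchestrator(transform: CrawlingTransform,
                     input_files: dict[str, list[dict]]) -> dict[str, list[dict]]:
    # A file-driven runtime invokes transform() once for every input file.
    return {name: transform.transform(table, name)
            for name, table in input_files.items()}


# Two unrelated input files, as in the reproduction.
inputs = {"test.parquet": [{"a": 1}], "test2.parquet": [{"b": 2}]}
t = CrawlingTransform(urls="https://thealliance.ai/")
outputs = run_orchestrator(t, inputs)
```

In this sketch, two input files trigger two crawls and two identical outputs, and an empty input folder triggers no crawl at all, which mirrors the two symptoms reported here.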
Reproduction script
The following produces 2 identical output parquet files (the same crawl was done twice, without regard to the contents of the input parquet files):
...
+ python -m dpk_web2parquet.python_runtime --data_local_config '{ '\''input_folder'\'': '\''input'\'', '\''output_folder'\'': '\''output'\''}' --web2parquet_urls https://thealliance.ai/ --web2parquet_depth 1 --web2parquet_downloads 1
13:19:17 INFO - Launching web2parquet transform
13:19:17 INFO - web2parquet parameters are : {'depth': 1, 'downloads': 1, 'folder': None, 'urls': 'https://thealliance.ai/'} at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/config.py:75"
13:19:17 INFO - pipeline id pipeline_id
13:19:17 INFO - code location None
13:19:17 INFO - data factory data_ is using local data access: input_folder - input output_folder - output
13:19:17 INFO - data factory data_ max_files -1, n_sample -1
13:19:17 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
13:19:17 INFO - orchestrator web2parquet started at 2025-01-07 13:19:17
13:19:17 INFO - Number of files is 2, source profile {'max_file_size': 0.00046253204345703125, 'min_file_size': 0.00046253204345703125, 'total_file_size': 0.0009250640869140625}
13:19:17 DEBUG - Received configuration: {'depth': 1, 'downloads': 1, 'folder': None, 'urls': 'https://thealliance.ai/', 'data_access': <data_processing.data_access.data_access_local.DataAccessLocal object at 0x121426d10>, 'statistics': <data_processing.transform.transform_statistics.TransformStatistics object at 0x1214266d0>} at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:46"
13:19:25 DEBUG - url: https://thealliance.ai/, filename: thealliance_ai__text.html, content_type: text/html; charset=utf-8 at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:71"
13:19:25 INFO - Crawling is completed in 0.40 seconds at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:104"
13:19:25 INFO - metadata = {'count': 1, 'requested_seeds': 1, 'requested_depth': 1, 'requested_downloads': 1} at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:105"
13:19:25 INFO - Completed 1 files (50.0%) in 0.128 min
13:19:25 DEBUG - url: https://thealliance.ai/, filename: thealliance_ai__text.html, content_type: text/html; charset=utf-8 at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:71"
13:19:25 INFO - Crawling is completed in 0.42 seconds at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:104"
13:19:25 INFO - metadata = {'count': 1, 'requested_seeds': 1, 'requested_depth': 1, 'requested_downloads': 1} at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:105"
13:19:25 INFO - Completed 2 files (100.0%) in 0.135 min
13:19:25 INFO - Done processing 2 files, waiting for flush() completion.
13:19:25 INFO - done flushing in 0.0 sec
13:19:25 INFO - Completed execution in 0.135 min, execution result 0
+ ls output
metadata.json test.parquet test2.parquet
+ diff output/test.parquet output/test2.parquet
dawood@davids-mbp:~/dpk$
OS
MacOS (limited support)
Python
3.11.x
Are you willing to submit a PR?
Yes I am willing to submit a PR!
daw3rd
changed the title
[Bug] web2parquet is not a compliantcat transform implementation
[Bug] web2parquet is not a compliant transform implementation
Jan 7, 2025
daw3rd
changed the title
[Bug] web2parquet is not a compliant transform implementation
[Bug] web2parquet is not a conforming transform implementation
Jan 7, 2025