[Bug] web2parquet is not a conforming transform implementation #920

Open
1 of 2 tasks
daw3rd opened this issue Jan 7, 2025 · 0 comments
Labels
bug Something isn't working


daw3rd commented Jan 7, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

What happened + What you expected to happen

The web2parquet transform does not process the input files that are provided to it. Instead, it uses the transform() method to crawl the URL passed in on the command line (--web2parquet_urls). For that crawl to run at all, the input folder must contain at least one parquet file. Worse, if there are N files in the input folder, the same URL is crawled N times and N output parquet files are produced.
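
To make the failure mode concrete, here is a minimal sketch of the pattern described above. It does not use the data-prep-kit API: `fake_crawl`, `IgnoresInputTransform`, and `run_per_file` are hypothetical names invented for illustration, and the loop only assumes that the runtime calls transform() once per input file, as the behavior in this report suggests.

```python
def fake_crawl(url: str) -> list[str]:
    """Hypothetical stand-in for a web crawl; no network access is made."""
    return [f"page-from:{url}"]


class IgnoresInputTransform:
    """Mimics the reported behavior: transform() ignores the file it is given."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.crawl_count = 0

    def transform(self, input_file: str) -> list[str]:
        # The input file name is never used; the configured URL is crawled instead.
        self.crawl_count += 1
        return fake_crawl(self.url)


def run_per_file(transform: IgnoresInputTransform, input_files: list[str]) -> dict[str, list[str]]:
    """Hypothetical orchestrator loop: one transform() call per input file."""
    return {name: transform.transform(name) for name in input_files}


t = IgnoresInputTransform("https://thealliance.ai/")
outputs = run_per_file(t, ["test.parquet", "test2.parquet"])
print("crawls performed:", t.crawl_count)
print("outputs identical:", outputs["test.parquet"] == outputs["test2.parquet"])
```

With zero input files the loop body never runs, which matches the observation that at least one input parquet file is needed before any crawl happens.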

Reproduction script

The following produces 2 identical output parquet files (the same crawl was done twice, without regard to the contents of the input parquet files):

rm -rf venv input
python -m venv venv
source venv/bin/activate

pip install --no-cache-dir wheel
pip install --no-cache-dir 'data-prep-toolkit-transforms[all]==0.2.3'

DPK_REPO_DIR=../git/data-prep-kit
mkdir input
cp "$DPK_REPO_DIR"/transforms/universal/web2parquet/test-data/input/* input
cp input/test.parquet input/test2.parquet
rm -rf output/*
python -m dpk_web2parquet.python_runtime  --data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}"   \
        --web2parquet_urls 'https://thealliance.ai/' --web2parquet_depth 1 --web2parquet_downloads 1
ls output
diff output/test.parquet output/test2.parquet

Anything else

The above script produces the following:

...
+ python -m dpk_web2parquet.python_runtime --data_local_config '{ '\''input_folder'\'': '\''input'\'', '\''output_folder'\'': '\''output'\''}' --web2parquet_urls https://thealliance.ai/ --web2parquet_depth 1 --web2parquet_downloads 1
13:19:17 INFO - Launching web2parquet transform
13:19:17 INFO - web2parquet parameters are : {'depth': 1, 'downloads': 1, 'folder': None, 'urls': 'https://thealliance.ai/'} at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/config.py:75"
13:19:17 INFO - pipeline id pipeline_id
13:19:17 INFO - code location None
13:19:17 INFO - data factory data_ is using local data access: input_folder - input output_folder - output
13:19:17 INFO - data factory data_ max_files -1, n_sample -1
13:19:17 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
13:19:17 INFO - orchestrator web2parquet started at 2025-01-07 13:19:17
13:19:17 INFO - Number of files is 2, source profile {'max_file_size': 0.00046253204345703125, 'min_file_size': 0.00046253204345703125, 'total_file_size': 0.0009250640869140625}
13:19:17 DEBUG - Received configuration: {'depth': 1, 'downloads': 1, 'folder': None, 'urls': 'https://thealliance.ai/', 'data_access': <data_processing.data_access.data_access_local.DataAccessLocal object at 0x121426d10>, 'statistics': <data_processing.transform.transform_statistics.TransformStatistics object at 0x1214266d0>} at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:46"
13:19:25 DEBUG - url: https://thealliance.ai/, filename: thealliance_ai__text.html, content_type: text/html; charset=utf-8 at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:71"
13:19:25 INFO - Crawling is completed in 0.40 seconds at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:104"
13:19:25 INFO - metadata = {'count': 1, 'requested_seeds': 1, 'requested_depth': 1, 'requested_downloads': 1} at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:105"
13:19:25 INFO - Completed 1 files (50.0%) in 0.128 min
13:19:25 DEBUG - url: https://thealliance.ai/, filename: thealliance_ai__text.html, content_type: text/html; charset=utf-8 at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:71"
13:19:25 INFO - Crawling is completed in 0.42 seconds at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:104"
13:19:25 INFO - metadata = {'count': 1, 'requested_seeds': 1, 'requested_depth': 1, 'requested_downloads': 1} at "/Users/dawood/dpk/venv/lib/python3.11/site-packages/dpk_web2parquet/transform.py:105"
13:19:25 INFO - Completed 2 files (100.0%) in 0.135 min
13:19:25 INFO - Done processing 2 files, waiting for flush() completion.
13:19:25 INFO - done flushing in 0.0 sec
13:19:25 INFO - Completed execution in 0.135 min, execution result 0
+ ls output
metadata.json   test.parquet    test2.parquet
+ diff output/test.parquet output/test2.parquet
dawood@davids-mbp:~/dpk$ 
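
The empty `diff` output above shows that the two files are byte-for-byte identical. For folders with more than two files, a small helper like the following could group identical outputs by content hash. This is a sketch using only the Python standard library; `file_digest` and `find_identical_files` are hypothetical helper names and are not part of data-prep-kit.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_identical_files(folder: str, pattern: str = "*.parquet") -> list[list[str]]:
    """Group files in `folder` matching `pattern` whose bytes are identical.

    Only groups with more than one member are returned.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for path in sorted(Path(folder).glob(pattern)):
        groups[file_digest(path)].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Run against the `output` folder after the reproduction script, `find_identical_files("output")` would be expected to report `test.parquet` and `test2.parquet` as one group, consistent with the empty `diff` result.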


OS

MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@daw3rd added the bug (Something isn't working) label on Jan 7, 2025
@daw3rd changed the title from "[Bug] web2parquet is not a compliantcat transform implementation" to "[Bug] web2parquet is not a compliant transform implementation" on Jan 7, 2025
@daw3rd changed the title from "[Bug] web2parquet is not a compliant transform implementation" to "[Bug] web2parquet is not a conforming transform implementation" on Jan 7, 2025