Introduce GoogleDrive Fetcher for tika-pipes #2077

Open
wants to merge 4 commits into
base: main
from


bartek

@bartek bartek commented Dec 6, 2024

This introduces a Google Drive fetcher into the tika-pipes project.

@bartek bartek mentioned this pull request Dec 6, 2024
@bartek
Author

bartek commented Dec 6, 2024

@tballison Cleaned up version of #2074 now that I sorted out what's going on with the tika-grpc-3x-features branch (it will require cleanup)

@THausherr
Contributor

THausherr commented Dec 7, 2024

The pom.xml has several properties that are not used, e.g. kiota, wiremock, nimbus, etc. You probably copied them from another (older) pom.xml. I think you only need the first one.

@bartek
Author

bartek commented Dec 9, 2024

> The pom.xml has several properties that are not used, e.g. kiota, wiremock, nimbus, etc. You probably copied them from another (older) pom.xml. I think you only need the first one.

Thanks, I was indeed copying and did not review the pom.xml too deeply. Cleaned up: 880a2ea

@THausherr
Contributor

There are still 3 that you don't need, one that isn't used and two that are defined in the parent.

@bartek bartek force-pushed the bartek/tika-fetcher-google branch from 880a2ea to 0884a58 Compare December 9, 2024 10:46
@bartek
Author

bartek commented Dec 9, 2024

> There are still 3 that you don't need, one that isn't used and two that are defined in the parent.

I think I got them in the final push. Is there a way to identify these during the build? I didn't notice anything about them in the output.

@THausherr THausherr changed the title from "Introduce GoogleDriver Fetcher for tika-pipes" to "Introduce GoogleDrive Fetcher for tika-pipes" Dec 9, 2024
@THausherr
Contributor

I got these by looking at the source code. This is just me, I like smaller pom.xml files that are easier to understand and maintain.
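
The manual check described here can be roughly approximated with a small script. The following is only a sketch, not something the Tika build provides: `unused_pom_properties` is a hypothetical helper that lists the property names declared in a pom.xml's `<properties>` block and never referenced as `${name}` in that same file. It assumes one property per line and a single `<properties>` block.

```shell
# Hypothetical helper, not part of the Tika repository.
# Prints each property declared in the <properties> block of the given
# pom.xml that is never referenced as ${name} anywhere in that same file.
unused_pom_properties() {
    local pom=$1
    local name
    # 1. Keep only the lines between <properties> and </properties>.
    # 2. Reduce each opening element to its bare name.
    # 3. Drop the <properties> element itself.
    sed -n '/<properties>/,/<\/properties>/p' "${pom}" \
        | sed -n 's/^[[:space:]]*<\([A-Za-z0-9_.-]*\)>.*/\1/p' \
        | grep -v '^properties$' \
        | while read -r name; do
            # -F treats the ${name} reference as a fixed string, not a regex.
            if ! grep -qF "\${${name}}" "${pom}"; then
                printf '%s\n' "${name}"
            fi
        done
}
```

The output is a list of candidates to review, not a verdict. Properties that plugins read implicitly (for example `maven.compiler.source`) or that only child modules reference would be reported even though they are in use, and a property that merely duplicates one from the parent pom.xml is not detected at all.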

Is it possible for you to create some sort of unit test, or is this impossible because one would need some google drive access?

@bartek
Author

bartek commented Dec 9, 2024

> I got these by looking at the source code. This is just me, I like smaller pom.xml files that are easier to understand and maintain.

Thanks for the commits and notes. I'm not too familiar with the project and am entering through tika-pipes and its fetcher requirements, so I appreciate your patience.

> Is it possible for you to create some sort of unit test, or is this impossible because one would need some google drive access?

I imagine we could mock the response from Google Drive, so at least we test happy/sad paths. Let me have a try at it.

@THausherr
Contributor

THausherr commented Dec 9, 2024

So I was able to fix the Google Drive fetcher pom.xml, but now the pipes gRPC server is failing with dependency convergence errors. 😬 I'll try to fix that too.

@THausherr
Contributor

I managed to do a complete build locally, mostly by moving the dependencyManagement stuff I introduced to the parent. I'll do another test locally and then add this here.

@bartek
Author

bartek commented Dec 9, 2024

@THausherr Looks like you got this building. Are you happy with your changes? If so I will squash them into a single commit.

@THausherr
Contributor

Yes!

@bartek
Author

bartek commented Dec 9, 2024

@THausherr Great. Btw, since these changes I am unable to build tika-pipes (that is all I am building, not the whole project). It looks like the pom.xml my build script expects is no longer at that path. Are you able to help?

Here's the error:

[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:33 min
[INFO] Finished at: 2024-12-09T12:47:12-04:00
[INFO] ------------------------------------------------------------------------
+ mvn dependency:copy-dependencies -f /Users/bartek/workspace/tika/tika-pipes/tika-grpc/example-dockerfile/../../../tika-pipes/tika-grpc
[INFO] Scanning for projects...
[ERROR] [ERROR] Some problems were encountered while processing the POMs:
[FATAL] Non-readable POM /Users/bartek/workspace/tika/tika-pipes/tika-grpc/example-dockerfile/../../../tika-pipes/tika-grpc/pom.xml: /Users/bartek/workspace/tika/tika-pipes/tika-grpc/example-dockerfile/../../../tika-pipes/tika-grpc/pom.xml (No such file or directory) @ 

And here's my build script:

set -x

TAG_NAME=$1

if [ -z "${TAG_NAME}" ]; then
    echo "Single command line argument is required which will be used as the -t parameter of the docker build command"
    exit 1
fi

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
TIKA_SRC_PATH=${SCRIPT_DIR}/../../..
OUT_DIR=${TIKA_SRC_PATH}/tika-pipes/tika-grpc/target/tika-docker

mvn clean install -Dossindex.skip -DskipTests=true -Denforcer.skip=true -Dossindex.skip=true -f "${TIKA_SRC_PATH}" || exit
mvn dependency:copy-dependencies -f "${TIKA_SRC_PATH}/tika-pipes/tika-grpc" || exit
rm -rf "${OUT_DIR}"
mkdir -p "${OUT_DIR}"

project_version=$(mvn help:evaluate -Dexpression=project.version -q -DforceStdout -f "${TIKA_SRC_PATH}")

cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/target/dependency" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-gcs/target/tika-fetcher-gcs-${project_version}.jar" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-az-blob/target/tika-fetcher-az-blob-${project_version}.jar" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-http/target/tika-fetcher-http-${project_version}.jar" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-microsoft-graph/target/tika-fetcher-microsoft-graph-${project_version}.jar" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-s3/target/tika-fetcher-s3-${project_version}.jar" "${OUT_DIR}/libs"

cp "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/target/tika-grpc-${project_version}.jar" "${OUT_DIR}/libs"
cp "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/src/test/resources/log4j2.xml" "${OUT_DIR}"
cp "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/src/test/resources/tika-pipes-test-config.xml" "${OUT_DIR}/tika-config.xml"
cp "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/example-dockerfile/Dockerfile" "${OUT_DIR}/Dockerfile"

cd "${OUT_DIR}" || exit

# build single arch
#docker build "${OUT_DIR}" -t "${TAG_NAME}"

# Or we can build multi-arch - https://www.docker.com/blog/multi-arch-images/
docker buildx create --name tikabuilder
# see https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
docker run --rm --privileged tonistiigi/binfmt --install amd64
docker run --rm --privileged tonistiigi/binfmt --install arm64
docker buildx build --builder=tikabuilder "${OUT_DIR}" -t "${TAG_NAME}" --platform linux/amd64,linux/arm64 --push
docker buildx stop tikabuilder

bartek and others added 3 commits December 9, 2024 12:51
This allows the fetching of items using files.get from Google Drive
@bartek bartek force-pushed the bartek/tika-fetcher-google branch from f5e9dbe to c8d3ea7 Compare December 9, 2024 16:52
@THausherr
Contributor

THausherr commented Dec 9, 2024

I didn't touch tika-grpc/pom.xml at all.

Your script uses "tika-pipes/tika-grpc", but "tika-grpc" is at the top level.

@bartek
Author

bartek commented Dec 10, 2024

@THausherr Thanks, I sorted that out. Looks like my paths are based on tika-grpc-3x-features branch paths.
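
Since the module location differs between branches, a build script like the one posted above could resolve the directory before calling Maven and fail with a clearer message when no pom.xml is found. This is a minimal sketch under that assumption: `find_module_dir` is a hypothetical helper, and the candidate paths are simply the two layouts mentioned in this thread.

```shell
# Hypothetical helper, not part of the Tika repository.
# Usage: find_module_dir <source-root> <candidate-dir>...
# Prints the first candidate (joined to the source root) that contains a
# pom.xml and returns 0; prints a message to stderr and returns 1 otherwise.
find_module_dir() {
    local src_root=$1
    shift
    local candidate
    for candidate in "$@"; do
        if [ -f "${src_root}/${candidate}/pom.xml" ]; then
            printf '%s\n' "${src_root}/${candidate}"
            return 0
        fi
    done
    echo "No pom.xml found under ${src_root} in any of: $*" >&2
    return 1
}
```

In the script above, the result could be captured with `GRPC_DIR=$(find_module_dir "${TIKA_SRC_PATH}" tika-grpc tika-pipes/tika-grpc) || exit 1` and then passed to the `-f` argument of the `mvn dependency:copy-dependencies` step, where `GRPC_DIR` is again an illustrative name.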

@nddipiazza
Contributor

I am starting to port this into https://github.com/nddipiazza/tika-pipes now.


3 participants