- JSON configuration-driven data movement - no Java/Scala knowledge needed
- Join and transform data among heterogeneous datastores (including NoSQL datastores) using ANSI SQL
- Deploys on Amazon AWS EMR and Fargate, but can run on any Spark cluster
- Picks up datastore credentials stored in HashiCorp Vault or AWS Secrets Manager
- Execution logs and migration history can be routed to Amazon CloudWatch and S3
- Use the built-in cron scheduler, or call the REST API from external schedulers
... and many more features, documented at https://homeaway.github.io/datapull/
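For a flavor of the configuration-driven approach, an input file pairs a source, optional ANSI SQL transformations, and a destination. The sketch below is hypothetical; the field names are illustrative only, not DataPull's actual schema, so consult the bundled samples under src/main/resources/Samples for working inputs.

```bash
# Illustrative only: these field names are hypothetical, NOT DataPull's
# actual input schema. See src/main/resources/Samples/ for real examples.
cat > my_input.json <<'EOF'
{
  "migrations": [
    {
      "source":      { "platform": "filesystem", "path": "SampleData/HelloWorld.csv", "format": "csv" },
      "sql":         "SELECT * FROM source",
      "destination": { "platform": "filesystem", "path": "SampleData_Json", "format": "json" }
    }
  ]
}
EOF
```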
Note: DataPull consists of two services: an API written in Java Spring Boot, and a Spark app written in Scala. Although Scala apps can run on JDK 11, the official Scala documentation recommends compiling Scala code with Java 8. The effort to upgrade to OpenJDK 11+ is tracked here
Pre-requisite: Docker Desktop
- Clone this repo locally and check out the master branch
```bash
git clone git@github.com:homeaway/datapull.git
```
- Build the Scala JAR from within the core folder
```bash
cd datapull/core
make build
```
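If the build succeeds, it should produce the fat JAR that the next step submits to Spark (the path below is taken from the spark-submit command that follows):

```bash
ls target/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar
```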
- Execute the sample JSON input file Input_Sample_filesystem-to-filesystem.json, which moves data from the CSV file HelloWorld.csv to a folder of JSON files named SampleData_Json
```bash
docker run -v $(pwd):/core -w /core -it --rm gettyimages/spark:2.2.1-hadoop-2.8 spark-submit --deploy-mode client --class core.DataPull target/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar src/main/resources/Samples/Input_Sample_filesystem-to-filesystem.json local
```
- Open the relative path target/classes/SampleData_Json to find the result of the DataPull, i.e. the data from target/classes/SampleData/HelloWorld.csv transformed into JSON.
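Spark writes the output as part files inside that folder; assuming Spark's usual naming convention, a quick way to inspect the converted records:

```bash
ls target/classes/SampleData_Json
# each part file holds one JSON record per line
head target/classes/SampleData_Json/part-*
```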
Pre-requisite: IntelliJ with Scala plugin configured. Check out this Help page if this plugin is not installed.
- Clone this repo locally and check out the master branch
- Open the folder core in IntelliJ IDE.
- When prompted, add this project as a maven project.
- By default, this source code is designed to execute a sample JSON input file Input_Sample_filesystem-to-filesystem.json that moves data from a CSV file HelloWorld.csv to a folder of JSON files named SampleData_Json.
- Go to File > Project Structure..., and choose 1.8 (Java version) as the Project SDK
- Go to Run > Edit Configurations..., and do the following
- Create an Application configuration (use the + sign on the top left corner of the modal window)
- Set the Name to Debug
- Set the Main Class as core.DataPull (matching the --class argument in the spark-submit command above)
- Use classpath of module Core.DataPull
- Set JRE to 1.8
- Click Apply and then OK
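The steps above don't mention program arguments; assuming the Debug configuration should mirror the spark-submit invocation from the Docker section (an assumption, not a documented step), its Program arguments field would contain the input file path and execution mode:

```
src/main/resources/Samples/Input_Sample_filesystem-to-filesystem.json local
```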
- Click Run > Debug 'Debug' to start the debug execution
- Open the relative path target/classes/SampleData_Json to find the result of the DataPull, i.e. the data from target/classes/SampleData/HelloWorld.csv transformed into JSON.
Deploying DataPull to Amazon AWS involves:
- installing the DataPull API and Spark JAR in AWS Fargate, using this runbook
- running DataPulls in AWS EMR, using this runbook
Please create an issue in this git repo, using the bug report or feature request templates.
DataPull documentation is available at https://homeaway.github.io/datapull/ . To update this documentation, please follow these steps:
- Create a Feature Request issue
- Please fill in the title and the body of the issue. Our suggested title is "Documentation for <what this documentation is for>"
- Fork the DataPull repo
- Install MkDocs and Material for MkDocs
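Both tools are available on PyPI, so assuming a working Python installation:

```bash
pip install mkdocs mkdocs-material
```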
- Clone your forked repo locally, and run `mkdocs serve` in a terminal from the `docs` folder of the repo
- Open http://127.0.0.1:8000 to see a preview of the documentation site. You can edit the documentation by following https://www.mkdocs.org/#getting-started
- Once you're done updating the documentation, please commit and push your local master branch to your fork. Also, run `mkdocs gh-deploy` at the terminal to update and push your `gh-pages` branch.
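A minimal sketch of that publish flow, assuming your fork is the origin remote and your documentation edits live under the docs folder:

```bash
git add docs
git commit -m "Update documentation"
git push origin master   # update the master branch on your fork
mkdocs gh-deploy         # build the site and push it to the gh-pages branch
```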
- Create 2 PRs (one for the `master` branch, one for the `gh-pages` branch) and we'll review and approve them.
- Thanks again for helping make DataPull better!