This tool is specifically designed for syncing static web sites managed via FTP to S3.
- POSIX only.
- Efficient:
  - Needs to work with millions of files (but not tens of millions).
  - Can't wait very long, which implies using multiple threads.
  - Computing MD5s isn't practical; known usage patterns mean we can simply use modification times.
- Addresses an insane requirement of emulating rewrites by making extra data copies. Whimper.
- An optional local index avoids listing buckets, which is important for large buckets.
- Retries failed AWS operations.
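The retry logic amounts to something like the following; the attempt count, delay, and the decision to retry on any exception are illustrative assumptions, not the tool's actual settings:

    import time

    def with_retries(operation, attempts=3, delay=1.0):
        # Run an AWS operation, retrying a fixed number of times with a
        # short sleep between attempts; re-raise once the attempts run out.
        for attempt in range(attempts):
            try:
                return operation()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)

An upload or delete would then be wrapped as, for example, with_retries(lambda: bucket.delete_key(name)) when using boto 2.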
On each synchronization:
- Get a {path->mtime} mapping for S3 and for the local file system. This is done in two threads. Optionally, we can keep an index based on the previous file scan and avoid scanning S3 at all.
- Compute the diff.
- Apply the diff to S3:
  - A configurable number of workers apply changes one by one.
  - The workers are fed from a queue.
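A minimal sketch of the worker pool, assuming Python's standard queue and threading modules; NUM_WORKERS and apply_change are hypothetical names, and the real change application (uploads and deletes) is stubbed out:

    import queue
    import threading

    NUM_WORKERS = 8  # configurable; the default shown here is an assumption

    def apply_change(change):
        # Placeholder for the real work: upload, replace, or delete one key.
        pass

    def worker(work):
        # Each worker pulls one change at a time off the shared queue.
        while True:
            change = work.get()
            if change is None:        # sentinel: the queue is drained
                return
            apply_change(change)

    def apply_diff(changes):
        work = queue.Queue()
        threads = [threading.Thread(target=worker, args=(work,))
                   for _ in range(NUM_WORKERS)]
        for t in threads:
            t.start()
        for change in changes:
            work.put(change)
        for _ in threads:
            work.put(None)            # one sentinel per worker
        for t in threads:
            t.join()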
As an optimization, to try to avoid building giant dicts: while constructing them, if a key is already in the other dict, we do the comparison right away.
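Roughly, the idea looks like this; the helper and callback names are hypothetical, whether common entries are dropped from both dicts is an assumption, and the locking the real two-threaded scan would need around the shared dicts is omitted:

    def add_entry(path, mtime, mine, theirs, handle_common):
        # While building one side's {path -> mtime} dict, check whether the
        # other side has already recorded this path.  If so, compare the two
        # entries immediately and drop the path from both dicts instead of
        # letting them grow.
        other_mtime = theirs.pop(path, None)
        if other_mtime is None:
            mine[path] = mtime
        else:
            handle_common(path, mtime, other_mtime)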
To decide whether files have changed, we compare file-system modification times with S3 object modification times. This is awkward because the two don't match up precisely. We end up adding a fudge factor to the file-system modification times to account for the fact that they don't line up, as well as for clock skew.
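The comparison boils down to something like this; the size of the fudge factor and the exact direction of the inequality are assumptions (erring on the side of re-uploading a file when in doubt):

    FUDGE_SECONDS = 120  # assumption: the real allowance may differ

    def needs_upload(local_mtime, s3_mtime):
        # The S3 modification time is the upload time, not the file's mtime,
        # so the two never line up exactly.  Padding the local mtime means
        # that recent edits and clock skew lead to a re-upload rather than a
        # silently skipped change.
        return local_mtime + FUDGE_SECONDS > s3_mtime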
- Note on AWS keys:
  - You pass keys via AWS instance roles (if running in AWS), .boto files, or a keyring.
- Added support for CloudFront invalidations.
- Fixed: directories with weird file names broke index generation.
- Fixed: index.html files included dot files.
Fixed: Content-Type wasn't set for generated index.html files. (Also tweaked the HTML layout to force new index.html files to be sent.)
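For illustration, generating and uploading such an index.html with boto 2 (which the notes above suggest the tool uses) looks roughly like this; the helper name, HTML layout, and bucket handling are assumptions:

    from boto.s3.connection import S3Connection

    def upload_index(bucket_name, prefix, entries):
        # Build a minimal directory listing, skipping dot files, and upload
        # it as index.html with Content-Type set explicitly (it is not
        # inferred when uploading from a string).
        rows = ''.join('<li><a href="%s">%s</a></li>' % (name, name)
                       for name in sorted(entries) if not name.startswith('.'))
        html = '<html><body><ul>%s</ul></body></html>' % rows
        bucket = S3Connection().get_bucket(bucket_name)
        key = bucket.new_key(prefix + 'index.html')
        key.set_contents_from_string(html, headers={'Content-Type': 'text/html'})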
- Added generation of index.html files in S3 for directories without them on the file system.
- Removed simple prefix rewrites. We didn't need them.
- Fixed: the restore script didn't remove extra files from the destination directory.
Added a simple restore script for restoring files from S3. It can restore an entire directory or update a directory, syncing with S3 based on file size.
Added missing retry on failed adds or deletes.
Added support for using a local index file to avoid lengthy bucket scans.
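A sketch of what the local index might look like; the file name and the choice of JSON as the on-disk format are assumptions:

    import json
    import os

    INDEX_PATH = '.s3sync-index.json'  # hypothetical location

    def load_index():
        # The index is just the {path -> mtime} mapping from the previous
        # scan; when it exists we can skip listing the bucket entirely.
        if os.path.exists(INDEX_PATH):
            with open(INDEX_PATH) as f:
                return json.load(f)
        return None

    def save_index(mapping):
        # Rewrite the index after a successful sync.
        with open(INDEX_PATH, 'w') as f:
            json.dump(mapping, f)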
Added lock-file support to avoid simultaneous syncs.
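The lock file can be as simple as the following POSIX-only sketch (the tool is POSIX-only anyway); the lock path and the decision to fail immediately rather than wait are assumptions:

    import fcntl

    def acquire_lock(path):
        # Take an exclusive, non-blocking lock so that a second sync started
        # while one is already running fails right away instead of racing it.
        f = open(path, 'w')
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            raise SystemExit('another sync appears to be running')
        return f  # keep the file object open for the duration of the sync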
Added support for bucket prefixes (mainly for secondary use cases).
Added a -D option to disable deleting keys.
Implemented simple prefix rewrites that duplicate keys matching certain prefixes to the same keys but with different prefixes.
- Fixed: needed to use encoded file names when reading data from the file system. (We were storing them decoded, and boto was using a different encoding when trying to read them.)
Decode file paths using the configured encoding, which defaults to latin-1.
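The path handling after these two changes amounts to roughly the following Python 3-style sketch (the scan helper is hypothetical; latin-1 is the documented default encoding):

    import os

    ENCODING = 'latin-1'  # configurable; latin-1 is the default

    def scan(root):
        # Walk the tree using byte paths, which is what the file system
        # actually stores.  The decoded path is only used as the key in the
        # {path -> mtime} mapping; the original byte path is kept for opening
        # the file later, so no re-encoding step can pick a different encoding.
        for dirpath, dirnames, filenames in os.walk(root.encode(ENCODING)):
            for name in filenames:
                raw = os.path.join(dirpath, name)
                yield raw.decode(ENCODING), raw, os.path.getmtime(raw)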
Refactored the way time stamps are compared. Iterating over S3 buckets doesn't return user-defined metadata (and it would be too expensive to fetch it on a case-by-case basis), so we can't capture the original mtimes (which had a race condition the way we did it anyway). Instead, we now compare file-system modification times with S3 object modification times, using a fudge factor to account for the fact that they're not computed the same way, and for clock skew.
Initial release.