Set the upper limit of WARC content length to half of Integer.MAX_VALUE #496
Mute the `java.lang.NegativeArraySizeException` thrown when the content length of a WARC record exceeds half of `Integer.MAX_VALUE`.
GitHub issue(s): #317, #494
What does this Pull Request do?
The current AUT `ArchiveRecord` implementation eagerly consumes the content of the WARC record into a byte array and a String object. The problem is that not all WARC records can fit inside a `java.lang.String`. TL;DR: attempting `new String(byteArray)` with a `byteArray` that is longer than half of `Integer.MAX_VALUE` will cause a `java.lang.NegativeArraySizeException` (for OpenJDK 11, UTF-8 charset), the reason being that `java.lang.String` creates an internal byte array that is double the size of the argument. And in Java, the maximum size of an array is `Integer.MAX_VALUE`.
- In `RecordLoader.loadArchives`, filter out WARCs whose content is longer than `MAX_ALLOWABLE_WARC_CONTENT_LENGTH`
- Set `MAX_ALLOWABLE_WARC_CONTENT_LENGTH` to `Integer.MAX_VALUE >> 1`
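To make the arithmetic behind the limit concrete, here is a minimal standalone sketch. It is not the AUT code: the class `WarcLengthLimit` and the helpers `doubledLength` and `fitsInString` are hypothetical names used only for illustration, and the doubling is modeled as a plain `int` shift rather than an actual `String` allocation.

```java
final class WarcLengthLimit {
    // Same value the PR describes: half of Integer.MAX_VALUE.
    static final int MAX_ALLOWABLE_WARC_CONTENT_LENGTH = Integer.MAX_VALUE >> 1;

    // Hypothetical helper: models the size of a buffer that is twice the
    // input length, computed in int arithmetic (so it can overflow).
    static int doubledLength(int byteLength) {
        return byteLength << 1;
    }

    // Hypothetical predicate: a record is kept only if its content length
    // does not exceed the limit; longer records would be filtered out.
    static boolean fitsInString(long contentLength) {
        return contentLength <= MAX_ALLOWABLE_WARC_CONTENT_LENGTH;
    }

    public static void main(String[] args) {
        int limit = MAX_ALLOWABLE_WARC_CONTENT_LENGTH;
        System.out.println(limit);                    // 1073741823
        System.out.println(doubledLength(limit));     // 2147483646, still a valid array size
        System.out.println(doubledLength(limit + 1)); // -2147483648, a negative array size
        System.out.println(fitsInString(limit));      // true
        System.out.println(fitsInString(limit + 1L)); // false
    }
}
```

One byte past the limit, the doubled length wraps around to a negative `int`, which is the kind of value that produces a `NegativeArraySizeException` when used as an array size.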
How should this be tested?
Test file: `ARCHIVEIT-10689-TEST-JOB727752-SEED1799564-20190110143759592-00000-h3.warc.gz`. Before this PR, any action invoked on this file results in a `NegativeArraySizeException`; with this PR, the large record is skipped.
Additional Notes:
As discussed in #317 (comment), this PR merely mutes the issue with large WARCs, but it might still be reasonable for the users to access the content of a large WARC, perhaps in the form of InputStreams. This is already noted in #494.
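As a rough illustration of the `InputStream` idea, the sketch below consumes a payload in fixed-size chunks so that no single array or `String` has to hold the whole record. It is not part of this PR or of the AUT API: `StreamingPayloadSketch` and `countBytes` are hypothetical names, and a `ByteArrayInputStream` stands in for a real WARC payload stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamingPayloadSketch {
    // Reads the stream chunk by chunk and returns the total number of bytes
    // seen. The total is a long, so it is not capped at Integer.MAX_VALUE.
    static long countBytes(InputStream in, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            // A real consumer would process buffer[0..read) here.
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payload; a real caller would pass the record's stream.
        try (InputStream in = new ByteArrayInputStream(new byte[10_000])) {
            System.out.println(countBytes(in, 4096)); // 10000
        }
    }
}
```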
@ruebot @ianmilligan1 @lintool