Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set the upper limit of WARC content length to half of Integer.MAX_VALUE #496

Closed

Conversation

adamyy
Copy link

@adamyy adamyy commented Sep 3, 2020

Mute the java.lang.NegativeArraySizeException issue thrown when the content length of a WARC record exceeds half of Integer.MAX_VALUE.

GitHub issue(s): #317, #494

What does this Pull Request do?

The current AUT ArchiveRecord implementation eagerly consumes the content of the WARC record into a byte array and a String object. Problem is that not all WARC records can fit inside of a java.lang.String. TL;DR attempting to new String(byteArray) with byteArray that is longer than half of Integer.MAX_VALUE will cause java.lang.NegativeArraySizeException (for OpenJDK 11, UTF-8 charset), with reason being that java.lang.String creates an internal byte array that is double the size of the argument. And in Java, the maximum size of array is Integer.MAX_VALUE.

  • In RecordLoader.loadArchives, filter out WARCs whose content is longer than MAX_ALLOWABLE_WARC_CONTENT_LENGTH
  • Set MAX_ALLOWABLE_WARC_CONTENT_LENGTH to Integer.MAX_VALUE >> 1

How should this be tested?

  • An example of a WARC record that is too large can be found inARCHIVEIT-10689-TEST-JOB727752-SEED1799564-20190110143759592-00000-h3.warc.gz. Before this PR, any action invoked on this file will result in NegativeArraySizeException, this PR will skip the large record
  • This PR should still pass all the current tests in CI

Additional Notes:

As discussed in #317 (comment), this PR merely mutes the issue with large WARCs, but it might still be reasonable for the users to access the content of a large WARC, perhaps in the form of InputStreams. This is already noted in #494.

@ruebot @ianmilligan1 @lintool

@codecov
Copy link

codecov bot commented Sep 3, 2020

Codecov Report

Merging #496 into main will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main     #496      +/-   ##
============================================
+ Coverage     88.85%   88.86%   +0.01%     
  Complexity       57       57              
============================================
  Files            43       43              
  Lines          1014     1015       +1     
  Branches         86       86              
============================================
+ Hits            901      902       +1     
  Misses           74       74              
  Partials         39       39              

Copy link
Member

@ruebot ruebot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruebot ruebot closed this Sep 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants