
UpdateDateTracker.java: Remove deprecated finalize method #3647

Merged 7 commits into vufind-org:dev on May 14, 2024

Conversation

demiankatz
Member

@demiankatz demiankatz commented May 8, 2024

The finalize() method has been deprecated and was never reliably called. This PR switches to the try-with-resources statement to ensure reliable freeing of allocated resources.

TODO

  • Update changelog when merging
  • Close VUFIND-1651 when merging

The finalize() method has been deprecated and was never reliably called. Given that this is a singleton, there should not be a need to clean up prepared statements, as they will live for the lifetime of the indexing process, and the JVM shuts down after that.
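The guarantee the PR relies on can be sketched as follows. Since a real PreparedStatement needs a live database connection, this runnable sketch uses a stand-in AutoCloseable (TrackedStatement is an invented name, not from the PR): try-with-resources always closes the resource when the block exits, which finalize() never promised.

```java
public class TryWithResourcesDemo {
    // Stand-in for java.sql.PreparedStatement, which would need a live DB connection.
    static class TrackedStatement implements AutoCloseable {
        boolean closed = false;
        @Override
        public void close() { closed = true; }
    }

    // Returns whether the resource was closed once the try block exited.
    static boolean resourceClosedAfterUse() {
        TrackedStatement stmt = new TrackedStatement();
        try (TrackedStatement s = stmt) {
            // ... use the statement here (executeQuery() etc. in the real code) ...
        }
        return stmt.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed after try: " + resourceClosedAfterUse());
    }
}
```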
@demiankatz demiankatz added this to the 10.0 milestone May 8, 2024
@demiankatz demiankatz requested a review from maccabeelevine May 8, 2024 20:33
@demiankatz demiankatz changed the title Java: Remove deprecated finalize method UpdateDateTracker.java: Remove deprecated finalize method May 9, 2024
Member

@maccabeelevine maccabeelevine left a comment


One suggestion below. Otherwise the code looks solid and fits my understanding after reviewing the Java docs.

try (ResultSet result = selectSql.executeQuery()) {
    // No results? Return false:
    if (!result.first()) {
        returnValue = false;
Member


It looks like you should still be safe just calling return false here; the close would still work. https://stackoverflow.com/questions/22947755/try-with-resources-and-return-statements-in-java
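The behavior described in the linked Stack Overflow answer can be demonstrated with a small stand-in resource (names invented for illustration): close() still runs even when the try block exits via an early return.

```java
public class EarlyReturnDemo {
    // Stand-in for java.sql.ResultSet; records whether close() was invoked.
    static class TrackedResult implements AutoCloseable {
        boolean closed = false;
        @Override
        public void close() { closed = true; }
    }

    static TrackedResult lastResult;

    // Returning from inside the try block still triggers close() on exit,
    // which is why "return false" is safe inside try-with-resources.
    static boolean readRow() {
        try (TrackedResult result = new TrackedResult()) {
            lastResult = result;
            return false; // stands in for the "No results? Return false" case
        }
    }

    public static void main(String[] args) {
        boolean value = readRow();
        System.out.println("returned " + value + ", closed=" + lastResult.closed);
    }
}
```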

Member Author


Thanks! I've simplified as you suggested and it still seems to be working as expected for me.

@demiankatz demiankatz requested a review from maccabeelevine May 9, 2024 17:04
Member

@maccabeelevine maccabeelevine left a comment


LGTM! I agree that some performance testing could be a good idea but I don't see any existing test capability in the index code, and a unit test wouldn't cut it since it needs the DB connection.

@demiankatz demiankatz requested a review from mtrojan-ub May 9, 2024 21:06
@demiankatz
Member Author

I'm trying to determine whether this change has significant performance implications, but my development environment does not seem ideal for making that determination, since the speed of indexing seems to vary wildly from run to run (probably because I work in a VM, and the VM performance is dependent on other activity on my computer). My initial experiments seem to suggest that this PR slows things down a bit, but I don't feel totally confident in that finding due to the inconsistencies in the data.

Here's what I did: I indexed a MARC file containing 35,300 records 10 times each on this PR's branch and on the dev branch. Here are the durations of the runs (in seconds):

dev:

16.85
15.49
14.87
15.32
14.7
18.37
15.79
17.02
21.13
16.99

(average = 16.653 seconds, min = 14.7, max = 21.13)

this branch:

14.16
15.73
19.91
20.55
23.86
17.89
20.4
18.07
15.97
18.04

(average = 18.458 seconds, min = 14.16, max = 23.86)

@demiankatz
Member Author

@mtrojan-ub, I think you may have more experience with Java optimization than I do. What do you think of the changes here? What do you think of my data above? Any ideas on a good way to profile this scientifically? :-)

@mtrojan-ub
Contributor

mtrojan-ub commented May 13, 2024

Alright... I am no real expert, but here are some thoughts and ideas:

  • Using prepared statements is fine, but I wonder whether it would make more sense to call db.prepareStatement for each statement only once in the constructor, store the prepared statement in a class variable, and clone it every time one of the functions is called.
  • As far as I understand the whole process, readRow() will be executed for every record (which could be millions of times). I wonder if it would be a lot faster to select the whole table once and generate some kind of cache (e.g. a Map structure) for faster lookups, which might drastically reduce the number of SELECT statements issued. Of course this would cost more RAM, but I think it might be worth it, since last_indexed is one of the fields that costs the most performance during the whole import process right now.
  • Similarly for createRow() / updateRow(): can we generate some kind of queue and do batch inserts/updates when a threshold is hit? (Of course it would be necessary to think about error scenarios, and about committing the rest when the whole process finishes without hitting the threshold.)
  • For profiling: how exactly are your run durations measured? Are you already considering the SolrMarc report? -Dsolrmarc.method.report=true, see https://github.com/solrmarc/solrmarc/wiki/Other-command-line-options
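The cache idea in the second bullet could be sketched roughly like this (all names are hypothetical, not from the PR; fetchFromDb stands in for the real SELECT in readRow()). The point is that computeIfAbsent memoizes each lookup, so the database is hit at most once per record id:

```java
import java.util.HashMap;
import java.util.Map;

public class RowCacheDemo {
    private final Map<String, String> cache = new HashMap<>();
    int dbHits = 0; // counts simulated SELECT statements for demonstration

    // Stand-in for the real database lookup; real code would run the SELECT here.
    private String fetchFromDb(String id) {
        dbHits++;
        return "row-for-" + id;
    }

    // Cached lookup: only the first call per id reaches the database.
    public String readRow(String id) {
        return cache.computeIfAbsent(id, this::fetchFromDb);
    }

    public static void main(String[] args) {
        RowCacheDemo c = new RowCacheDemo();
        c.readRow("rec1");
        c.readRow("rec1"); // served from cache, no second SELECT
        c.readRow("rec2");
        System.out.println("db hits: " + c.dbHits);
    }
}
```

As the discussion below notes, a cache of this shape trades RAM for speed, and with millions of records that trade-off is hard to predict.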

@demiankatz
Member Author

@mtrojan-ub, the original code created the prepared statements in the constructor and reused them, but the problem with this approach is that there is no straightforward way to consistently clean up the statements, since Java doesn't have destructors. We previously used the finalize method, but that apparently did not work consistently and has now been deprecated. The new Cleaner mechanism might help, but it seems very complicated.
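For reference, the java.lang.ref.Cleaner mechanism mentioned here looks roughly like this (JDK 9+). The sketch calls clean() explicitly so the effect is deterministic; in real use, the Cleaner's background thread runs the action some time after the registered object becomes unreachable, and the action must not capture the object itself:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

public class CleanerDemo {
    private static final Cleaner CLEANER = Cleaner.create();

    static AtomicBoolean demo() {
        AtomicBoolean released = new AtomicBoolean(false);
        Object tracked = new Object(); // stands in for an object owning statements
        // The cleanup action flips a flag; crucially, it does not reference `tracked`.
        Cleaner.Cleanable cleanable = CLEANER.register(tracked, () -> released.set(true));
        cleanable.clean(); // explicit, like close(); otherwise GC-triggered and not prompt
        return released;
    }

    public static void main(String[] args) {
        System.out.println("released: " + demo().get());
    }
}
```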

There's also the open question of whether it even matters if we close prepared statements or not; I found a lot of inconsistent advice and grew concerned that the consequences may vary across JDBC drivers. This led me to a solution where the statements are consistently closed.

The idea of some kind of cache seems potentially useful, but as you say, there could be millions of records, so the memory utilization is very unpredictable. I don't think pre-loading the whole change_tracker table in advance is feasible for this reason. A more dynamic cache might work, but that makes things more complicated and less predictable -- so there's a question of whether improved performance is worth the potential introduction of bugs or problems.

The queue with batch inserts also makes sense but has the same problem as the "prepared statements in the constructor" situation: I'm not sure how we detect the end of the indexing process to ensure that any outstanding work is completed/cleaned up. There's probably a way, but I'm not confident enough in my understanding of advanced Java mechanisms or the SolrMarc architecture to feel able to handle it reliably.
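In outline, the batch-queue idea and the end-of-run problem it raises might look like this (names and threshold invented for illustration; a real version would call PreparedStatement.executeBatch() and handle error scenarios):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchQueueDemo {
    private final List<String> pending = new ArrayList<>();
    private final int threshold;
    int batchesExecuted = 0; // counts simulated executeBatch() calls

    BatchQueueDemo(int threshold) { this.threshold = threshold; }

    // Buffer a row; flush automatically once the threshold is reached.
    void queueInsert(String row) {
        pending.add(row);
        if (pending.size() >= threshold) {
            flush();
        }
    }

    // Must ALSO be called explicitly after the last record -- this is exactly
    // the "how do we detect the end of indexing?" problem discussed above.
    void flush() {
        if (!pending.isEmpty()) {
            batchesExecuted++; // real code: PreparedStatement.executeBatch()
            pending.clear();
        }
    }

    public static void main(String[] args) {
        BatchQueueDemo q = new BatchQueueDemo(2);
        q.queueInsert("a");
        q.queueInsert("b"); // auto-flush #1
        q.queueInsert("c");
        q.queueInsert("d"); // auto-flush #2
        q.queueInsert("e");
        q.flush();          // explicit final flush for the leftover row
        System.out.println("batches executed: " + q.batchesExecuted);
    }
}
```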

I was not aware of (or had forgotten about) the SolrMarc method report, so that's definitely a great next step. I'll look at that soon.

And of course, I don't mean to dismiss any of your suggestions -- I think they're all good! I'm just nervous about making things more complicated this close to a release, so I'm hoping to keep things simple for now. I'll report back once I have more information about method timings.

@mtrojan-ub
Contributor

@demiankatz, thanks for your extensive answer!

The main purpose of this PR already seems to be fulfilled, unless we lose too much performance by using the try-with-resources approach (which you will most likely find out using the profiling). If performance is OK, the topic of speeding up the process might be a topic for an entirely different follow-up PR.

Here are some more thoughts:

  • If generating lots of prepared statements is really time-critical (which I'm not entirely sure about), we could also try a small stored procedure and test whether that's faster and/or avoids the autoclose problem (which I'm also not sure about).
  • If finalize is no longer available, we might try a different pattern, e.g. Cleaners. Or maybe SolrMarc could add an explicit non-standard interface with a custom finalize method that is explicitly called after processing the last record.
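The "explicit non-standard interface" idea might amount to something as small as this (the interface and method names here are invented for illustration, not a real SolrMarc API):

```java
public class ExplicitShutdownDemo {
    // Hypothetical hook the indexing driver would call after the last record,
    // instead of relying on finalize()/GC to release resources.
    interface IndexLifecycle {
        void shutdownIndexHook();
    }

    static class Tracker implements IndexLifecycle {
        boolean statementsClosed = false;
        @Override
        public void shutdownIndexHook() {
            statementsClosed = true; // real code would close prepared statements here
        }
    }

    static boolean run() {
        Tracker t = new Tracker();
        // ... process records ...
        t.shutdownIndexHook(); // driver invokes this exactly once at the end
        return t.statementsClosed;
    }

    public static void main(String[] args) {
        System.out.println("statements closed: " + run());
    }
}
```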

@demiankatz
Member Author

@mtrojan-ub and @maccabeelevine: I did some testing with the solrmarc.method.report setting, focusing on getFirstIndexed (because getLastIndexed uses only a trivial amount of time due to the data cached by getFirstIndexed). I still don't have particularly conclusive results, but at least I can't say that what I'm seeing alarms me.

First of all, turning on the method report setting causes a TREMENDOUS loss of performance. The data set that was previously taking 18 seconds to index took 14 minutes. That made it impractical to do many iterations with that test data, and it also raises questions about the accuracy of any data collected in this fashion.

Setting that concern aside, I decided to test with the journals.mrc file that ships with VuFind (containing 10 records) to get a small baseline data set. I indexed the file five times each in four different scenarios: dev branch with an empty change_tracker table, dev branch with existing change_tracker data for the ten records, and PR branch in each of those two change_tracker states. Here's the resulting timing data:

|         | new (dev) | existing (dev) | new (PR) | existing (PR) |
|---------|-----------|----------------|----------|---------------|
| run 1   | 0.771     | 0.452          | 0.494    | 0.405         |
| run 2   | 0.525     | 0.434          | 0.51     | 0.408         |
| run 3   | 0.542     | 0.449          | 0.55     | 0.515         |
| run 4   | 0.543     | 0.407          | 0.552    | 0.427         |
| run 5   | 0.525     | 0.469          | 0.499    | 0.528         |
| average | 0.5812    | 0.4422         | 0.521    | 0.4566        |
| min     | 0.525     | 0.407          | 0.494    | 0.405         |
| max     | 0.771     | 0.469          | 0.552    | 0.528         |

This seems to suggest that in the "new data" case, the code here is actually faster than the existing dev code, and that in the "existing data" case, the differences are negligible.

But of course, with the small sample size and unreliability of my test environment, this data may be meaningless.

Any votes on whether we should just merge this and see how it goes, or come up with a more robust scientific experiment to measure the impact? I unfortunately don't have much time to devote to further testing at this stage, but I don't want to be irresponsible in my merging behavior.

@mtrojan-ub
Contributor

I also think it is very hard to get representative test results, but if the differences are really that small (especially since the "existing" average, which will be by far the more common case in a typical daily import, differs by only ~3%), this should be fine!

@demiankatz demiankatz removed the request for review from mtrojan-ub May 14, 2024 12:47
@demiankatz
Member Author

Thank you both; I'm persuaded that we should merge this as-is, and we can look into future optimizations if anyone encounters a significant performance problem.

@demiankatz demiankatz merged commit a33a1c3 into vufind-org:dev May 14, 2024
7 checks passed
@demiankatz demiankatz deleted the vufind-1651 branch May 14, 2024 12:48
@damien-git
Contributor

Related PR

4 participants