
UpdateDateTracker.java: Remove deprecated finalize method #3647

Merged 7 commits into vufind-org:dev on May 14, 2024

Conversation

demiankatz
Member

@demiankatz demiankatz commented May 8, 2024

The finalize() method has been deprecated and was never reliably called. This PR switches to the try-with-resources statement to ensure reliable freeing of allocated resources.

TODO

  • Update changelog when merging
  • Close VUFIND-1651 when merging

The finalize() method has been deprecated and was never reliably called. Given that this is a singleton, there should not be a need to clean up prepared statements, as they will live for the lifetime of the indexing process, and the JVM shuts down after that.
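The guarantee the PR relies on can be sketched as follows. Since a real PreparedStatement needs a live database connection, this runnable sketch uses a stand-in AutoCloseable (TrackedStatement is an invented name, not from the PR): try-with-resources always closes the resource when the block exits, which finalize() never promised.

```java
public class TryWithResourcesDemo {
    // Stand-in for java.sql.PreparedStatement, which would need a live DB connection.
    static class TrackedStatement implements AutoCloseable {
        boolean closed = false;
        @Override
        public void close() { closed = true; }
    }

    // Returns whether the resource was closed once the try block exited.
    static boolean resourceClosedAfterUse() {
        TrackedStatement stmt = new TrackedStatement();
        try (TrackedStatement s = stmt) {
            // ... use the statement here (executeQuery() etc. in the real code) ...
        }
        return stmt.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed after try: " + resourceClosedAfterUse());
    }
}
```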
@demiankatz demiankatz added this to the 10.0 milestone May 8, 2024
@demiankatz demiankatz requested a review from maccabeelevine May 8, 2024 20:33
@demiankatz demiankatz changed the title Java: Remove deprecated finalize method UpdateDateTracker.java: Remove deprecated finalize method May 9, 2024
Member

@maccabeelevine maccabeelevine left a comment


One suggestion below. Otherwise the code looks solid and fits my understanding after reviewing the Java docs.

try (ResultSet result = selectSql.executeQuery()) {
    // No results? Return false:
    if (!result.first()) {
        returnValue = false;
Member


It looks like you should still be safe just calling return false here; the close would still work. https://stackoverflow.com/questions/22947755/try-with-resources-and-return-statements-in-java
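The behavior described in the linked Stack Overflow answer can be demonstrated with a small stand-in resource (names invented for illustration): close() still runs even when the try block exits via an early return.

```java
public class EarlyReturnDemo {
    // Stand-in for java.sql.ResultSet; records whether close() was invoked.
    static class TrackedResult implements AutoCloseable {
        boolean closed = false;
        @Override
        public void close() { closed = true; }
    }

    static TrackedResult lastResult;

    // Returning from inside the try block still triggers close() on exit,
    // which is why "return false" is safe inside try-with-resources.
    static boolean readRow() {
        try (TrackedResult result = new TrackedResult()) {
            lastResult = result;
            return false; // stands in for the "No results? Return false" case
        }
    }

    public static void main(String[] args) {
        boolean value = readRow();
        System.out.println("returned " + value + ", closed=" + lastResult.closed);
    }
}
```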

Member Author


Thanks! I've simplified as you suggested and it still seems to be working as expected for me.

@demiankatz demiankatz requested a review from maccabeelevine May 9, 2024 17:04
Member

@maccabeelevine maccabeelevine left a comment


LGTM! I agree that some performance testing could be a good idea but I don't see any existing test capability in the index code, and a unit test wouldn't cut it since it needs the DB connection.

@demiankatz demiankatz requested a review from mtrojan-ub May 9, 2024 21:06
@demiankatz
Member Author

I'm trying to determine whether this change has significant performance implications, but my development environment does not seem ideal for making that determination, since the speed of indexing seems to vary wildly from run to run (probably because I work in a VM, and the VM performance is dependent on other activity on my computer). My initial experiments seem to suggest that this PR slows things down a bit, but I don't feel totally confident in that finding due to the inconsistencies in the data.

Here's what I did: I indexed a MARC file containing 35,300 records 10 times each on this PR's branch and on the dev branch. Here are the durations of the runs (in seconds):

dev:

16.85
15.49
14.87
15.32
14.7
18.37
15.79
17.02
21.13
16.99

(average = 16.653 seconds, min = 14.7, max = 21.13)

this branch:

14.16
15.73
19.91
20.55
23.86
17.89
20.4
18.07
15.97
18.04

(average = 18.458 seconds, min = 14.16, max = 23.86)

@demiankatz
Member Author

@mtrojan-ub, I think you may have more experience with Java optimization than I do. What do you think of the changes here? What do you think of my data above? Any ideas on a good way to profile this scientifically? :-)

@mtrojan-ub
Contributor

mtrojan-ub commented May 13, 2024

Alright... I am no real expert, but here are some thoughts and ideas:

  • Using prepared statements is fine, but I wonder whether it would make more sense to call db.prepareStatement for each statement only once in the constructor, store the prepared statement in a class variable, and clone it every time one of the functions is called.
  • As far as I understand the whole process, readRow() will be executed for every record (which could be millions of times). I wonder if it would be a lot faster to select the whole table once and generate some kind of cache (e.g. a Map structure) for faster lookups, which might drastically reduce the number of SELECT statements issued. Of course this would cost more RAM, but I think it might be worth it, since last_indexed is one of the fields that costs the most performance during the whole import process right now.
  • Similarly for createRow() / updateRow(): can we generate some kind of queue and do batch inserts/updates when a threshold is hit? (Of course it would be necessary to think about error scenarios, and about committing the rest when the whole process finishes without hitting the threshold.)
  • For profiling: how exactly are your run durations measured? Are you already considering the SolrMarc report? -Dsolrmarc.method.report=true, see https://github.com/solrmarc/solrmarc/wiki/Other-command-line-options
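The cache idea in the second bullet could be sketched roughly like this (all names are hypothetical, not from the PR; fetchFromDb stands in for the real SELECT in readRow()). The point is that computeIfAbsent memoizes each lookup, so the database is hit at most once per record id:

```java
import java.util.HashMap;
import java.util.Map;

public class RowCacheDemo {
    private final Map<String, String> cache = new HashMap<>();
    int dbHits = 0; // counts simulated SELECT statements for demonstration

    // Stand-in for the real database lookup; real code would run the SELECT here.
    private String fetchFromDb(String id) {
        dbHits++;
        return "row-for-" + id;
    }

    // Cached lookup: only the first call per id reaches the database.
    public String readRow(String id) {
        return cache.computeIfAbsent(id, this::fetchFromDb);
    }

    public static void main(String[] args) {
        RowCacheDemo c = new RowCacheDemo();
        c.readRow("rec1");
        c.readRow("rec1"); // served from cache, no second SELECT
        c.readRow("rec2");
        System.out.println("db hits: " + c.dbHits);
    }
}
```

As the discussion below notes, a cache of this shape trades RAM for speed, and with millions of records that trade-off is hard to predict.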

@demiankatz
Member Author

@mtrojan-ub, the original code created the prepared statements in the constructor and reused them, but the problem with this approach is that there is no straightforward way to consistently clean up the statements, since Java doesn't have destructors. We previously used the finalize method, but that apparently did not work consistently and has now been deprecated. The new Cleaner mechanism might help, but it seems very complicated.
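For reference, the java.lang.ref.Cleaner mechanism mentioned here looks roughly like this (JDK 9+). The sketch calls clean() explicitly so the effect is deterministic; in real use, the Cleaner's background thread runs the action some time after the registered object becomes unreachable, and the action must not capture the object itself:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

public class CleanerDemo {
    private static final Cleaner CLEANER = Cleaner.create();

    static AtomicBoolean demo() {
        AtomicBoolean released = new AtomicBoolean(false);
        Object tracked = new Object(); // stands in for an object owning statements
        // The cleanup action flips a flag; crucially, it does not reference `tracked`.
        Cleaner.Cleanable cleanable = CLEANER.register(tracked, () -> released.set(true));
        cleanable.clean(); // explicit, like close(); otherwise GC-triggered and not prompt
        return released;
    }

    public static void main(String[] args) {
        System.out.println("released: " + demo().get());
    }
}
```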

There's also the open question of whether it even matters if we close prepared statements or not; I found a lot of inconsistent advice and grew concerned that the consequences may vary across JDBC drivers. This led me to a solution where the statements are consistently closed.

The idea of some kind of cache seems potentially useful, but as you say, there could be millions of records, so the memory utilization is very unpredictable. I don't think pre-loading the whole change_tracker table in advance is feasible for this reason. A more dynamic cache might work, but that makes things more complicated and less predictable -- so there's a question of whether improved performance is worth the potential introduction of bugs or problems.

The queue with batch inserts also makes sense but has the same problem as the "prepared statements in the constructor" situation: I'm not sure how we detect the end of the indexing process to ensure that any outstanding work is completed/cleaned up. There's probably a way, but I'm not confident enough in my understanding of advanced Java mechanisms or the SolrMarc architecture to feel able to handle it reliably.
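In outline, the batch-queue idea and the end-of-run problem it raises might look like this (names and threshold invented for illustration; a real version would call PreparedStatement.executeBatch() and handle error scenarios):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchQueueDemo {
    private final List<String> pending = new ArrayList<>();
    private final int threshold;
    int batchesExecuted = 0; // counts simulated executeBatch() calls

    BatchQueueDemo(int threshold) { this.threshold = threshold; }

    // Buffer a row; flush automatically once the threshold is reached.
    void queueInsert(String row) {
        pending.add(row);
        if (pending.size() >= threshold) {
            flush();
        }
    }

    // Must ALSO be called explicitly after the last record -- this is exactly
    // the "how do we detect the end of indexing?" problem discussed above.
    void flush() {
        if (!pending.isEmpty()) {
            batchesExecuted++; // real code: PreparedStatement.executeBatch()
            pending.clear();
        }
    }

    public static void main(String[] args) {
        BatchQueueDemo q = new BatchQueueDemo(2);
        q.queueInsert("a");
        q.queueInsert("b"); // auto-flush #1
        q.queueInsert("c");
        q.queueInsert("d"); // auto-flush #2
        q.queueInsert("e");
        q.flush();          // explicit final flush for the leftover row
        System.out.println("batches executed: " + q.batchesExecuted);
    }
}
```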

I was not aware of (or had forgotten about) the SolrMarc method report, so that's definitely a great next step. I'll look at that soon.

And of course, I don't mean to dismiss any of your suggestions -- I think they're all good! I'm just nervous about making things more complicated this close to a release, so I'm hoping to keep things simple for now. I'll report back once I have more information about method timings.

@mtrojan-ub
Contributor

@demiankatz, thanks for your extensive answer!

The main purpose of this PR already seems to be fulfilled, unless we lose too much performance by using the try-with-resources approach (which you will most likely find out using the profiling). If performance is OK, the topic of speeding up the process might be a topic for an entirely different follow-up PR.

Here are some more thoughts:

  • If generating lots of prepared statements is really time-critical (which I'm not entirely sure about), we could also try a small stored procedure and test whether that's faster and/or avoids the autoclose problem (which I'm also not sure about).
  • If finalize is no longer available, we might try a different pattern, e.g. Cleaners. Or maybe SolrMarc could add an explicit non-standard interface with a custom finalize method that is explicitly called after processing the last record.
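The "explicit non-standard interface" idea might amount to something as small as this (the interface and method names here are invented for illustration, not a real SolrMarc API):

```java
public class ExplicitShutdownDemo {
    // Hypothetical hook the indexing driver would call after the last record,
    // instead of relying on finalize()/GC to release resources.
    interface IndexLifecycle {
        void shutdownIndexHook();
    }

    static class Tracker implements IndexLifecycle {
        boolean statementsClosed = false;
        @Override
        public void shutdownIndexHook() {
            statementsClosed = true; // real code would close prepared statements here
        }
    }

    static boolean run() {
        Tracker t = new Tracker();
        // ... process records ...
        t.shutdownIndexHook(); // driver invokes this exactly once at the end
        return t.statementsClosed;
    }

    public static void main(String[] args) {
        System.out.println("statements closed: " + run());
    }
}
```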

@demiankatz
Member Author

@mtrojan-ub and @maccabeelevine: I did some testing with the solrmarc.method.report setting, focusing on getFirstIndexed (because getLastIndexed uses only a trivial amount of time due to the data cached by getFirstIndexed). I still don't have particularly conclusive results, but at least I can't say that what I'm seeing alarms me.

First of all, turning on the method report setting causes a TREMENDOUS loss of performance. The data set that was previously taking 18 seconds to index took 14 minutes. That made it impractical to do many iterations with that test data, and it also raises questions about the accuracy of any data collected in this fashion.

Setting that concern aside, I decided to test with the journals.mrc file that ships with VuFind (containing 10 records) to get a small baseline data set. I indexed the file five times each in four different scenarios: dev branch with an empty change_tracker table, dev branch with existing change_tracker data for the ten records, and PR branch in each of those two change_tracker states. Here's the resulting timing data:

|         | new (dev) | existing (dev) | new (PR) | existing (PR) |
|---------|-----------|----------------|----------|---------------|
| run 1   | 0.771     | 0.452          | 0.494    | 0.405         |
| run 2   | 0.525     | 0.434          | 0.51     | 0.408         |
| run 3   | 0.542     | 0.449          | 0.55     | 0.515         |
| run 4   | 0.543     | 0.407          | 0.552    | 0.427         |
| run 5   | 0.525     | 0.469          | 0.499    | 0.528         |
| average | 0.5812    | 0.4422         | 0.521    | 0.4566        |
| min     | 0.525     | 0.407          | 0.494    | 0.405         |
| max     | 0.771     | 0.469          | 0.552    | 0.528         |

This seems to suggest that in the "new data" case, the code here is actually faster than the existing dev code, and that in the "existing data" case, the differences are negligible.

But of course, with the small sample size and unreliability of my test environment, this data may be meaningless.

Any votes on whether we should just merge this and see how it goes, or come up with a more robust scientific experiment to measure the impact? I unfortunately don't have much time to devote to further testing at this stage, but I don't want to be irresponsible in my merging behavior.

@mtrojan-ub
Contributor

I also think it is very hard to get representative test results, but if the differences are really that small (especially since the "existing" average, which will be by far the more common case in a typical daily import, differs by only ~3%), this should be fine!

@demiankatz demiankatz removed the request for review from mtrojan-ub May 14, 2024 12:47
@demiankatz
Member Author

Thank you both; I'm persuaded that we should merge this as-is, and we can look into future optimizations if anyone encounters a significant performance problem.

@demiankatz demiankatz merged commit a33a1c3 into vufind-org:dev May 14, 2024
7 checks passed
@demiankatz demiankatz deleted the vufind-1651 branch May 14, 2024 12:48
@damien-git
Contributor

Related PR

4 participants