UpdateDateTracker.java: Remove deprecated finalize method #3647
Conversation
The finalize() method has been deprecated and was never reliably called. Given that this is a singleton, there should not be a need to clean up prepared statements, as they will live for the lifetime of the indexing process, and the JVM shuts down after that.
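A minimal sketch of the pattern this PR adopts, using a hypothetical stand-in resource rather than real VuFind/JDBC classes (a live DB connection is out of scope here): unlike `finalize()`, which the JVM may never call, try-with-resources closes the resource deterministically as soon as the block exits.

```java
// Hypothetical stand-in for a JDBC PreparedStatement/ResultSet, used only to
// make the close timing observable without a database connection.
class TrackedStatement implements AutoCloseable {
    TrackedStatement() {
        System.out.println("opened selectSql");
    }

    @Override
    public void close() {
        // Runs on every exit path from the try block -- no finalize() needed.
        System.out.println("closed selectSql");
    }
}

public class TryWithResourcesDemo {
    public static void main(String[] args) {
        try (TrackedStatement stmt = new TrackedStatement()) {
            System.out.println("using selectSql");
        }
        System.out.println("after block");
    }
}
```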
One suggestion below. Otherwise the code looks solid and fits my understanding after reviewing the Java docs.
try (ResultSet result = selectSql.executeQuery()) {
    // No results? Return false:
    if (!result.first()) {
        returnValue = false;
It looks like you should still be safe just calling `return false` here; the close would still work. https://stackoverflow.com/questions/22947755/try-with-resources-and-return-statements-in-java
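The behavior behind this suggestion can be verified with a small self-contained example (the `Probe` class below is a hypothetical stand-in for the real `ResultSet`): a `return` statement inside a try-with-resources block still triggers `close()` before the method exits.

```java
// Hypothetical stand-in resource that reports when it is closed.
class Probe implements AutoCloseable {
    @Override
    public void close() {
        System.out.println("close() ran");
    }
}

public class ReturnInTryDemo {
    static boolean exists() {
        try (Probe p = new Probe()) {
            System.out.println("returning false from inside try");
            return false; // close() still runs before the method returns
        }
    }

    public static void main(String[] args) {
        System.out.println("result = " + exists());
    }
}
```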
Thanks! I've simplified as you suggested and it still seems to be working as expected for me.
LGTM! I agree that some performance testing could be a good idea but I don't see any existing test capability in the index code, and a unit test wouldn't cut it since it needs the DB connection.
I'm trying to determine whether this change has significant performance implications, but my development environment does not seem ideal for making that determination: the speed of indexing varies wildly from run to run (probably because I work in a VM, and VM performance depends on other activity on my computer). My initial experiments seem to suggest that this PR slows things down a bit, but I don't feel totally confident in that finding due to the inconsistencies in the data.

Here's what I did: I indexed a MARC file containing 35,300 records 10 times each on this PR's branch and on the dev branch. Here are the durations of the runs (in seconds):

dev: 16.85 (average = 16.653 seconds, min = 14.7, max = 21.13)
this branch: 14.16 (average = 18.458 seconds, min = 14.16, max = 23.86)
@mtrojan-ub, I think you may have more experience with Java optimization than I do. What do you think of the changes here? What do you think of my data above? Any ideas on a good way to profile this scientifically? :-)
Alright... I am no real expert, but here are some thoughts/ideas:
@mtrojan-ub, the original code created the prepared statements in the constructor and reused them, but the problem with this approach is that there is no straightforward way to consistently clean up the statements, since Java doesn't have destructors. We previously used the finalize method, but that apparently did not work consistently and has now been deprecated. The new Cleaner mechanism might help, but it seems very complicated. There's also the open question of whether closing prepared statements even matters; I found a lot of inconsistent advice and grew concerned that the consequences may vary across JDBC drivers. This led me to a solution where the statements are consistently closed.

The idea of some kind of cache seems potentially useful, but as you say, there could be millions of records, so memory utilization is very unpredictable. I don't think pre-loading the whole change_tracker table in advance is feasible for this reason. A more dynamic cache might work, but that makes things more complicated and less predictable, so there's a question of whether improved performance is worth the potential introduction of bugs or problems.

The queue with batch inserts also makes sense, but it has the same problem as the "prepared statements in the constructor" approach: I'm not sure how we detect the end of the indexing process to ensure that any outstanding work is completed and cleaned up. There's probably a way, but I'm not confident enough in my understanding of advanced Java mechanisms or the SolrMarc architecture to handle it reliably.

I was not aware of (or had forgotten about) the SolrMarc method report, so that's definitely a great next step. I'll look at that soon. And of course, I don't mean to dismiss any of your suggestions; I think they're all good! I'm just nervous about making things more complicated this close to a release, so I'm hoping to keep things simple for now. I'll report back once I have more information about method timings.
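For reference, the Cleaner mechanism mentioned above (`java.lang.ref.Cleaner`, Java 9+) looks roughly like this minimal sketch; the `CloseStatements` action and the plain `Object` tracker are hypothetical stand-ins, not VuFind code. Part of what makes it complicated is that GC-triggered cleanup is non-deterministic, so only an explicit `clean()` call is reliable.

```java
import java.lang.ref.Cleaner;

public class CleanerDemo {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action must NOT hold a reference to the tracked object,
    // or the object can never become unreachable; a static nested class
    // with no outer reference avoids that trap.
    static class CloseStatements implements Runnable {
        @Override
        public void run() {
            System.out.println("cleanup ran (would close prepared statements here)");
        }
    }

    public static void main(String[] args) {
        Object tracker = new Object(); // stand-in for the singleton
        Cleaner.Cleanable cleanable = CLEANER.register(tracker, new CloseStatements());

        // Calling clean() explicitly runs the action exactly once,
        // deterministically; relying on GC to trigger it is not reliable.
        cleanable.clean();
        System.out.println("done");
    }
}
```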
@demiankatz, thanks for your extensive answer! The main purpose of this PR already seems to be fulfilled, unless we lose too much performance from the try-with-resources approach (which you will most likely find out through profiling). If performance is OK, speeding up the process could be a topic for an entirely different follow-up PR. Here are some more thoughts:
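The "queue with batch inserts" idea discussed above could be sketched like this. This is a hypothetical illustration, not VuFind code: with real JDBC, `flush()` would call `PreparedStatement.addBatch()` and `executeBatch()` against the change_tracker table, but here it just prints, since a live DB connection is out of scope. Note the final explicit `flush()` — the open question in the thread is where to hook that into SolrMarc's shutdown.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchQueueDemo {
    static final int BATCH_SIZE = 3;
    private final List<String> pending = new ArrayList<>();
    private int flushes = 0;

    // Buffer a record ID; flush automatically once the batch is full.
    void add(String recordId) {
        pending.add(recordId);
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // Must also be called once at end of indexing, or the last partial
    // batch is silently lost -- the same lifecycle problem as cleaning up
    // constructor-created prepared statements.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flushes++;
        System.out.println("flush #" + flushes + " wrote " + pending.size() + " record(s)");
        pending.clear();
    }

    public static void main(String[] args) {
        BatchQueueDemo queue = new BatchQueueDemo();
        for (int i = 1; i <= 7; i++) {
            queue.add("record-" + i);
        }
        queue.flush(); // final flush for the leftover record
    }
}
```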
@mtrojan-ub and @maccabeelevine: I did some testing with the method report. First of all, turning on the method report setting causes a TREMENDOUS loss of performance: the data set that was previously taking 18 seconds to index took 14 minutes. That made it impractical to do many iterations with that test data, and it also raises questions about the accuracy of any data collected in this fashion.

Setting that concern aside, I decided to test with the journals.mrc file that ships with VuFind (containing 10 records) to get a small baseline data set. I indexed the file five times each in four different scenarios: the dev branch with an empty change_tracker table, the dev branch with existing change_tracker data for the ten records, and the PR branch in each of those two change_tracker states. Here's the resulting timing data:
This seems to suggest that in the "new data" case, the code here is actually faster than the existing dev code, and that in the "existing data" case, the differences are negligible. But of course, with the small sample size and unreliability of my test environment, this data may be meaningless.

Any votes on whether we should just merge this and see how it goes, or come up with a more robust scientific experiment to measure the impact? I unfortunately don't have much time to devote to further testing at this stage, but I don't want to be irresponsible in my merging behavior.
I also think it is very hard to get representative test results, but if the differences are really that small (especially for the "existing" average, which will be by far the more common case in a typical daily import, at ~3%), this should be fine!
Thank you both; I'm persuaded that we should merge this as-is, and we can look into future optimizations if anyone encounters a significant performance problem.
The finalize() method has been deprecated and was never reliably called. This PR switches to use the try-with-resources method to ensure reliable freeing of allocated resources.