source/commentary alignment detection is incorrect #18

thatbudakguy · 2022-06-21T16:58:22Z

the implementation of #10 seems to be falsely attributing a substantial portion of annotations to the commentary, when really the annotation occurs first in the source (or vice versa). per @GDRom:

That's 23 of the 34 mismatches. Not sure what happened here -- two hypotheses:

This might be an issue with greediness of the search pattern?

This might be related to the algorithm not finding a first match and then going somewhat off the rails?

Other mismatches (the 11 remaining) would imho be harder to handle in the implementation, but given the ratio -- 11 out of 55 -- they're doable by hand afterwards.

see https://github.com/direct-phonology/jdsw/blob/main/out/manual_eval/lunyu/001.txt for a hand-annotated copy indicating where the algorithm was incorrect.

generate test cases from manually annotated copy
make algorithm tolerant of variants (probably want to use str.maketrans for this)
make algorithm tolerant of small differences in phrasing (probably want to use levenshtein for this)
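
A minimal sketch of the Levenshtein idea from the last task, not the project's implementation. Python's standard library has no Levenshtein function, so the distance is computed here with the usual dynamic-programming recurrence. The names `levenshtein` and `fuzzy_find` and the `max_dist` parameter are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def fuzzy_find(headword: str, line: str, max_dist: int = 1) -> int:
    """Index of the first window of len(headword) within max_dist, else -1."""
    n = len(headword)
    for start in range(len(line) - n + 1):
        if levenshtein(headword, line[start:start + n]) <= max_dist:
            return start
    return -1


print(levenshtein("千乗", "千乘"))       # 1
print(fuzzy_find("千乘", "子曰導千乗"))  # 3
```

Because the window has the same length as the headword, this mostly catches substitutions. A tolerance of 1 is also meaningless for single-character headwords (every character is within distance 1 of every other), so any real threshold would need to scale with headword length.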

thatbudakguy · 2022-07-27T19:01:42Z

@GDRom I think I may have tracked this bug down, thanks to your manually annotated alignment of the Lunyu. wanted to get your take on how to proceed. here's what I see...

we begin with the text "傳不" and corresponding LDM annotation "直專反注同鄭..." (quite long), which appear on line 8 of our JDSW digital copy of the first chapter of the Lunyu. your notes say we should find this in the source text in the SBCK edition, and we do find it on line 18 there: "與朋友交言而不信乎傳不習乎"
next we have the text "道" and corresponding LDM annotation "音導本或作導包云治也注及下同", which appear on line 9 of our JDSW. the single character "道" appears a total of 11 times in our SBCK edition. one of them is on line 12, and since it occurs before our previously found annotation (on line 18), we rightly discard it. then things get interesting — all other instances of "道" seem to be much further down the page, with the next one occurring all the way on line 55 of the SBCK edition, skipping over a sizable portion of text. this in itself isn't impossible, just unusual. your notes say we should find it in the source, and indeed it's in the source on line 55: "無改於父之道可謂孝矣(孔安國/曰孝子)"
next we have the text "千乗" and corresponding LDM annotation "繩證反注同千乘大國之賦也", which appear on line 10 of our JDSW. your notes say we ought to find it in the SBCK source, and we do find it four times: once on line 19, and three more times in a block of commentary that spans lines 23, 24, and 25. now we have a dilemma, however: we've already moved up to line 55 as a consequence of finding "道" there, and thus we can't consider any of the cases of "千乗", which are all "behind" us. it seems the correct approach would've been to find it on line 19: "(言九所傳之事得無/素不講習而傳乎)子曰導千乗"
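
The three steps above can be reproduced with a small forward-only search. This is a sketch under assumptions, not the actual alignment script: `align_forward` is a hypothetical name, the excerpt holds only the three SBCK lines quoted in this comment, and positions within a line are ignored.

```python
# Only the SBCK lines quoted in the comment, keyed by their line numbers.
SBCK_EXCERPT = {
    18: "與朋友交言而不信乎傳不習乎",
    19: "(言九所傳之事得無/素不講習而傳乎)子曰導千乗",
    55: "無改於父之道可謂孝矣(孔安國/曰孝子)",
}


def align_forward(headwords, lines):
    """Match each headword on the first line at or after the previous hit."""
    cursor = 0
    results = []
    for headword in headwords:
        hit = next(
            (n for n in sorted(lines) if n >= cursor and headword in lines[n]),
            None,
        )
        results.append((headword, hit))
        if hit is not None:
            cursor = hit  # never look behind an accepted match again
    return results


print(align_forward(["傳不", "道", "千乗"], SBCK_EXCERPT))
# [('傳不', 18), ('道', 55), ('千乗', None)]
```

Accepting "道" on line 55 moves the cursor past line 19, so "千乗" can no longer be found. One far-away match is enough to derail every headword that follows it.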

my thought is this: is it possible that we should've actually found "道" as a graphic variant much earlier than line 55? there are only a few characters separating "傳不" and "千乗" in the source; it reads: "傳不習乎子曰導千乗". is "導" the variant we were looking for? i might have missed this but i don't think it's in your notes; you helpfully noted other variants with "variant SBCK".

if my guess is right, then the fix is just what we imagined: matching on variants. my only worry is that, by being too lenient, we might eagerly apply an LDM annotation to a variant, when instead the annotation rightly applies to the actual character itself further down the page (i.e. if LDM had intended to annotate "道" on line 55 instead of the much nearer "導"). i don't know enough to know if this worry is a real concern at this point, but maybe further testing will show us the way.
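
A rough sketch of what matching on variants could look like with `str.maketrans`, assuming a hand-built variant table. The table contents, the choice of which form counts as canonical, and the names `fold` and `align_forward_folded` are all assumptions; the excerpt holds only the SBCK lines quoted above.

```python
# Hypothetical variant table: fold each variant onto one chosen form.
VARIANTS = str.maketrans({"導": "道", "乘": "乗"})

SBCK_EXCERPT = {
    18: "與朋友交言而不信乎傳不習乎",
    19: "(言九所傳之事得無/素不講習而傳乎)子曰導千乗",
    55: "無改於父之道可謂孝矣(孔安國/曰孝子)",
}


def fold(text: str) -> str:
    """Replace every known variant character with its chosen form."""
    return text.translate(VARIANTS)


def align_forward_folded(headwords, lines):
    """Forward-only search that compares folded headwords to folded lines."""
    folded = {n: fold(text) for n, text in lines.items()}
    cursor = 0
    results = []
    for headword in headwords:
        target = fold(headword)
        hit = next(
            (n for n in sorted(folded) if n >= cursor and target in folded[n]),
            None,
        )
        results.append((headword, hit))
        if hit is not None:
            cursor = hit
    return results


print(align_forward_folded(["傳不", "道", "千乗"], SBCK_EXCERPT))
# [('傳不', 18), ('道', 19), ('千乗', 19)]
```

The folding here is unconditional, so it has exactly the leniency described above: once "導" and "道" are folded together, the search can no longer tell whether LDM meant the nearer "導" or a later "道".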

GDRom · 2022-07-27T19:47:03Z

@thatbudakguy In short, you are correct.

I went through the passage in question to check whether your hypothesis is correct in these instances, and it's spot on.

So we have the following three glosses:
傳不(直專反...)
道(音導本或作導...)
千乗(繩證反...)

In the SBCK, they occur on lines 11 and 12;
in the kanripo/TLS version, they occur on lines 35-38.

As you noted, variants mess up the automatic alignment. In SBCK, 道 is written as 導; in kanripo/TLS, 千乗 is written as 千乘. So yes, the form we needed to find was 導, which a literal search for 道 obviously couldn't match.

Initial thoughts:
Tests sound like a good way to go.
Also, and fortunately, LDM notes as well that the 道 in question is sometimes written as 導 (本或作導); we might be able to draw from his comments when one-character sequences are problematic. For 千乗 vs. 千乘 -- given that two-character sequences are highly unlikely to occur in both ways in a single text, we could apply variant readings more liberally to them.
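
The two ideas above could be sketched roughly as follows. The only variant note attested in this thread is "本或作導", so the pattern covers just that phrasing. The function names and the `variant_table` argument are hypothetical, and LDM's glosses may well use other formulas that this does not handle.

```python
import re
from itertools import product

# Matches the one character following "本或作" ("some editions write ...").
GLOSS_VARIANT = re.compile(r"本或作(.)")


def variants_from_gloss(gloss: str) -> set:
    """Variant characters that the gloss itself names."""
    return set(GLOSS_VARIANT.findall(gloss))


def candidate_forms(headword: str, gloss: str, variant_table: dict) -> set:
    """Forms of the headword worth searching for in the target edition."""
    if len(headword) == 1:
        # Single characters: only trust variants named in the gloss.
        return {headword} | variants_from_gloss(gloss)
    # Longer headwords: apply the general variant table to every character.
    options = [{ch} | variant_table.get(ch, set()) for ch in headword]
    return {"".join(combo) for combo in product(*options)}


TABLE = {"乗": {"乘"}}  # hypothetical general-purpose variant table
print(candidate_forms("道", "音導本或作導包云治也注及下同", TABLE) == {"道", "導"})      # True
print(candidate_forms("千乗", "繩證反注同千乘大國之賦也", TABLE) == {"千乗", "千乘"})  # True
```

Keeping the single-character case strict and the multi-character case liberal follows the reasoning that a two-character sequence is unlikely to occur in both spellings within one text.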

thatbudakguy · 2022-11-07T19:31:44Z

closing for now since we're tossing out the strategy of attempting to align to an SBCK edition that includes commentary — instead, we align directly to the Zhengwen edition, keeping only the places where LDM's headwords match that text. this is the strategy outlined in #11. if it turns out to not work well, we can revisit.
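
A minimal sketch of the replacement strategy as described here, not the code from #11. It assumes the base text is available as one commentary-free string; the sample string is the source passage quoted earlier in this thread, standing in for the Zhengwen edition, and `align_to_base` is a hypothetical name.

```python
def align_to_base(glosses, base):
    """Keep only the glosses whose headword occurs in base, in order."""
    pos = 0
    kept = []
    for headword, annotation in glosses:
        idx = base.find(headword, pos)
        if idx == -1:
            continue  # headword absent: drop the gloss, leave the cursor alone
        kept.append((headword, annotation, idx))
        pos = idx + len(headword)
    return kept


GLOSSES = [
    ("傳不", "直專反注同鄭..."),
    ("道", "音導本或作導包云治也注及下同"),
    ("千乗", "繩證反注同千乘大國之賦也"),
]
BASE = "傳不習乎子曰導千乗"  # stand-in for the base text, taken from this thread

for headword, _, idx in align_to_base(GLOSSES, BASE):
    print(headword, idx)
# 傳不 0
# 千乗 7
```

An unmatched headword is skipped without advancing the cursor, so it cannot push later glosses off track. A headword that matches in the wrong place further along would still do so.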

thatbudakguy added the bug label Jun 21, 2022

thatbudakguy self-assigned this Jun 21, 2022

thatbudakguy added a commit that referenced this issue Jul 27, 2022

Add fixture and tests for alignment script

1d5d5c3

thatbudakguy added a commit that referenced this issue Aug 2, 2022

Add fixture and tests for alignment script

b2b9045

thatbudakguy closed this as completed Nov 7, 2022
