source/commentary alignment detection is incorrect #18

Closed
3 tasks done
thatbudakguy opened this issue Jun 21, 2022 · 7 comments
Labels: bug (Something isn't working)


thatbudakguy commented Jun 21, 2022

the implementation of #10 seems to be falsely attributing a substantial portion of annotations to the commentary, when really the annotation occurs first in the source (or vice versa). per @GDRom:

That's 23 of the 34 mismatches. Not sure what happened here -- two hypotheses:

  • This might be an issue with greediness of the search pattern?
  • This might be related to the algorithm not finding a first match and then going off the rails?

The other mismatches (the 11 remaining) would be harder imho to handle in the algorithm, but given the ratio (11 out of 55) they are doable by hand afterwards.

see https://github.com/direct-phonology/jdsw/blob/main/out/manual_eval/lunyu/001.txt for a hand-annotated copy indicating where the algorithm was incorrect.

  • generate test cases from manually annotated copy
  • make algorithm tolerant of variants (probably want to use str.maketrans for this)
  • make algorithm tolerant of small differences in phrasing (probably want to use levenshtein for this)
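
a rough sketch of what the last two tasks could look like, not the code that ended up in the repo. the variant table holds only the two pairs that come up in this thread, and `normalize`, `levenshtein`, and `fuzzy_find` are hypothetical names. folding 導 into 道 is lossy in general, so a real table would need review.

```python
# Hypothetical variant table: only these two pairs are attested in this thread.
VARIANTS = str.maketrans({"導": "道", "乗": "乘"})


def normalize(text: str) -> str:
    """Fold known graphic variants onto one form (1-to-1, so indices are preserved)."""
    return text.translate(VARIANTS)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def fuzzy_find(headword: str, text: str, start: int = 0, max_dist: int = 0) -> int:
    """Index of the first window at or after `start` whose normalized form is
    within `max_dist` edits of the normalized headword, or -1 if none."""
    hw = normalize(headword)
    norm = normalize(text)
    n = len(hw)
    for i in range(start, len(norm) - n + 1):
        if levenshtein(hw, norm[i:i + n]) <= max_dist:
            return i
    return -1
```

`max_dist` defaults to 0 on purpose: with one- or two-character headwords, allowing even a single edit would match almost anything, so fuzziness probably only makes sense for longer phrases.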
@thatbudakguy thatbudakguy added the bug Something isn't working label Jun 21, 2022
@thatbudakguy thatbudakguy self-assigned this Jun 21, 2022
thatbudakguy added a commit that referenced this issue Jul 27, 2022
@thatbudakguy

@GDRom I think I may have tracked this bug down, thanks to your manually annotated alignment of the Lunyu. wanted to get your take on how to proceed. here's what I see...

  1. we begin with the text "傳不" and corresponding LDM annotation "直專反注同鄭..." (quite long), which appear on line 8 of our JDSW digital copy of the first chapter of the lunyu. your notes say we should find this in the source text in the SBCK edition, and we do find it on line 18 there (emphasis mine): "與朋友交言而不信乎 傳不 習乎"
  2. next we have the text "道" and corresponding LDM annotation "音導本或作導包云治也注及下同", which appear on line 9 of our JDSW. the single character "道" appears a total of 11 times in our SBCK edition. one of them is on line 12, and since it occurs before our previously found annotation (on line 18), we rightly discard it. then things get interesting: all other instances of "道" seem to be much further down the page, with the next one occurring all the way on line 55 of the SBCK edition, skipping over a sizable portion of text. this in itself isn't impossible, just unusual. your notes say we should find it in the source, and indeed it's in the source on line 55: "無改於父之 道 可謂孝矣(孔安國/曰孝子)"
  3. next we have the text "千乗" and corresponding LDM annotation "繩證反注同千乘大國之賦也", which appear on line 10 of our JDSW. your notes say we ought to find it in the SBCK source, and we do find it four times: once on line 19, and three more times in a block of commentary that spans lines 23, 24, and 25. now we have a dilemma, however: we've already moved up to line 55 as a consequence of finding "道" there, and thus we can't consider any of the cases of "千乗", which are all "behind" us. it seems the correct approach would've been to find it on line 19: "(言九所傳之事得無/素不講習而傳乎)子曰導 千乗"
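
the three steps above can be reproduced on a toy string. this is a simplified sketch of a forward-only search, not the project's actual implementation: `align_forward` is a hypothetical name, the "〇" run is filler standing in for the intervening lines, and the variant table holds only the one pair relevant here.

```python
def align_forward(headwords, source, table=None):
    """Greedy forward-only alignment.

    Each headword must be found at or after the end of the previous match.
    Returns a list of (headword, index-or-None) pairs.
    """
    text = source.translate(table) if table else source
    pos = 0
    result = []
    for hw in headwords:
        key = hw.translate(table) if table else hw
        idx = text.find(key, pos)
        if idx == -1:
            result.append((hw, None))  # not found ahead of us: give up on it
        else:
            result.append((hw, idx))
            pos = idx + len(key)       # never look behind this point again
    return result


# Toy source: the nine characters quoted in this thread, filler, then a later 道.
source = "傳不習乎子曰導千乗" + "〇" * 5 + "無改於父之道"
headwords = ["傳不", "道", "千乗"]

# Exact matching: 道 jumps to the far occurrence, so 千乗 is stranded behind us.
exact = align_forward(headwords, source)
# → [("傳不", 0), ("道", 19), ("千乗", None)]

# Treating 導 as a variant of 道 keeps all three matches adjacent and in order.
folded = align_forward(headwords, source, str.maketrans({"導": "道"}))
# → [("傳不", 0), ("道", 6), ("千乗", 7)]
```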

my thought is this: is it possible that we should've actually found "道" as a graphic variant much earlier than line 55? there are only a few characters separating "傳不" and "千乗" in the source; it reads: "傳不習乎子曰導千乗". is "導" the variant we were looking for? i might have missed this but i don't think it's in your notes; you helpfully noted other variants with "variant SBCK".

if my guess is right, then the fix is just what we imagined: matching on variants. my only worry is that, by being too lenient, we might eagerly apply an LDM annotation to a variant, when instead the annotation rightly applies to the actual character itself further down the page (i.e. if LDM had intended to annotate "道" on line 55 instead of the much nearer "導"). i don't know enough to know if this worry is a real concern at this point, but maybe further testing will show us the way.

GDRom commented Jul 27, 2022

@thatbudakguy In short, you are correct.

I went through the passage in question to check whether your hypothesis is correct in these instances, and it's spot on.

So we have the following three glosses:
傳不(直專反...)
道(音導本或作導...)
千乗(繩證反...)

In the SBCK, they occur on lines 11 and 12;
in the kanripo/TLS version, they occur on lines 35-38.

As you noted, variants mess the automatic alignment up. In SBCK, 道 is written as 導; in kanripo/TLS, 千乗 is written as 千乘. So yes, we should have been looking for 導 (which obviously couldn't be matched by a search for 道).

Initial thoughts:
Tests sound like a good way to go.
Also, and fortunately, LDM notes as well that the 道 in question is sometimes written as 導 (本或作導); we might be able to draw from his comments when one-character sequences are problematic. For 千乗 vs. 千乘 -- given that two-character sequences are highly unlikely to occur in both ways in a single text, we could apply variant readings more liberally to them.
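
One way this two-tier idea could be sketched: multi-character headwords get variant substitution freely, while single-character headwords only accept alternates that LDM himself names. This is a hypothetical sketch, and it assumes that "本或作" is always followed by exactly one variant character, which may not hold for every annotation. `candidate_forms` and the one-entry variant table are made up for illustration.

```python
import re

# Assumption: LDM's "本或作X" names a single alternate character X.
ALT_READING = re.compile(r"本或作(.)")


def candidate_forms(headword: str, annotation: str, variants: dict) -> list:
    """Forms of a headword worth searching for, in priority order.

    - two or more characters: also try the headword with known variants swapped in
    - one character: only add alternates named in the annotation itself
    """
    forms = [headword]
    if len(headword) >= 2:
        swapped = "".join(variants.get(ch, ch) for ch in headword)
        if swapped != headword:
            forms.append(swapped)
    else:
        for alt in ALT_READING.findall(annotation):
            if alt not in forms:
                forms.append(alt)
    return forms


variants = {"乗": "乘"}  # hypothetical table with the one pair from this thread

dao = candidate_forms("道", "音導本或作導包云治也注及下同", variants)
# → ["道", "導"]
qiansheng = candidate_forms("千乗", "繩證反注同千乘大國之賦也", variants)
# → ["千乗", "千乘"]
```

Keeping the original form first means an exact match is still preferred when one is available nearby.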

thatbudakguy added a commit that referenced this issue Aug 2, 2022
@thatbudakguy

closing for now since we're tossing out the strategy of attempting to align to an SBCK edition that includes commentary — instead, we align directly to the Zhengwen edition, keeping only the places where LDM's headwords match that text. this is the strategy outlined in #11. if it turns out to not work well, we can revisit.
