source/commentary alignment detection is incorrect #18
Comments
This comment was marked as resolved.
@GDRom I think I may have tracked this bug down, thanks to your manually annotated alignment of the Lunyu. wanted to get your take on how to proceed. here's what I see...
my thought is this: is it possible that we should've actually found "道" as a graphic variant much earlier than line 55? there are only a few characters separating "傳不" and "千乗" in the source; it reads: "傳不習乎子曰導千乗". is "導" the variant we were looking for? i might have missed this but i don't think it's in your notes; you helpfully noted other variants with "variant SBCK". if my guess is right, then the fix is just what we imagined: matching on variants. my only worry is that, by being too lenient, we might eagerly apply an LDM annotation to a variant, when instead the annotation rightly applies to the actual character itself further down the page (i.e. if LDM had intended to annotate "道" on line 55 instead of the much nearer "導"). i don't know enough to know if this worry is a real concern at this point, but maybe further testing will show us the way.
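one way the "too lenient" worry could be handled, as a rough sketch only: look for the exact character first within a bounded window, and fall back to a variant match only if nothing exact turns up, flagging the fallback so it can be checked by hand. everything here is an assumption for illustration (the `locate` helper, the `window` size, and the one-entry variant table); none of it is taken from the jdsw code.

```python
# Hypothetical variant table: only the pair discussed in this thread.
VARIANTS = str.maketrans({"導": "道"})


def locate(headword: str, source: str, start: int = 0, window: int = 50):
    """Return (index, kind) where kind is "exact", "variant", or "none".

    An exact match anywhere inside the window wins over a nearer variant
    match; variant matches are labelled so they can be reviewed manually.
    """
    end = start + window
    exact = source.find(headword, start, end)
    if exact != -1:
        return exact, "exact"
    # The table maps single characters to single characters, so indices in
    # the normalized string line up with indices in the original string.
    folded = source.translate(VARIANTS).find(headword.translate(VARIANTS), start, end)
    if folded != -1:
        return folded, "variant"
    return -1, "none"
```

with a wide window the exact "道" further down is preferred; with a narrow one the nearer "導" is returned but marked as a variant match. whether any fixed window size is sensible for this corpus is an open question.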
@thatbudakguy In short, you are correct. I went through the passage in question to check whether your hypothesis holds in these instances, and it's spot on. So we have the following three glosses: In the SBCK, they occur on lines 11 and 12. As you noted, variants mess up the automatic alignment. In SBCK, 道 is written as 導; in kanripo/TLS, 千乗 is written as 千乘. So yes, we would have been looking for 導 (which obviously couldn't be matched). Initial thoughts:
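A minimal sketch of what matching on variants could look like, assuming a one-to-one character table built with `str.maketrans`. The table holds only the two pairs mentioned in this thread, and the helper names `normalize` and `find_headword` are made up for the example; a real table would need to come from the annotated variants.

```python
# Hypothetical variant table: only the two pairs mentioned in this thread.
# Each variant character is folded onto a single canonical form.
VARIANTS = str.maketrans({"導": "道", "乗": "乘"})


def normalize(text: str) -> str:
    """Fold known graphic variants onto one canonical character."""
    return text.translate(VARIANTS)


def find_headword(headword: str, source: str, start: int = 0) -> int:
    """Find a headword in the source, ignoring known variants.

    Because the table maps one character to one character, the returned
    index is valid in the original (un-normalized) source as well.
    Returns -1 if the headword is not found.
    """
    return normalize(source).find(normalize(headword), start)


source = "傳不習乎子曰導千乗"
print(find_headword("道", source))    # matches the variant 導
print(find_headword("千乘", source))  # matches the variant 千乗
```

Normalizing both sides means it doesn't matter which edition carries the variant form.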
closing for now since we're tossing out the strategy of attempting to align to an SBCK edition that includes commentary; instead, we align directly to the Zhengwen edition, keeping only the places where LDM's headwords match that text. this is the strategy outlined in #11. if it turns out to not work well, we can revisit.
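for reference, a rough sketch of the "keep only the headwords that match" idea, assuming headwords arrive in reading order and are matched left to right with a moving cursor. the `align` function and the sample headwords are illustrative assumptions, not the approach actually specified in #11 and not LDM's real annotations.

```python
def align(headwords: list[str], text: str) -> list[tuple[str, int]]:
    """Match headwords against the text in order, dropping any that fail.

    The cursor only moves forward, so each headword is searched for after
    the end of the previous successful match.
    """
    cursor = 0
    aligned = []
    for headword in headwords:
        pos = text.find(headword, cursor)
        if pos == -1:
            continue  # headword not present in this edition: skip it
        aligned.append((headword, pos))
        cursor = pos + len(headword)
    return aligned


# Illustrative input only: "導" does not occur in this text, so it is dropped.
print(align(["傳", "導", "千乘"], "傳不習乎子曰道千乘"))
```

one known limitation of a greedy cursor like this: a headword matched too far ahead would cause later headwords that sit before it to be skipped.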
the implementation of #10 seems to be falsely attributing a substantial portion of annotations to the commentary, when really the annotation occurs first in the source (or vice versa). per @GDRom:
see https://github.com/direct-phonology/jdsw/blob/main/out/manual_eval/lunyu/001.txt for a hand-annotated copy indicating where the algorithm was incorrect.
str.maketrans
for this)