automatically align jdsw with cleaned source texts #11

Open
thatbudakguy opened this issue May 27, 2022 · 6 comments
@thatbudakguy
Member

thatbudakguy commented May 27, 2022

@GDRom has already done some work to do this manually; we want to see if we can automate it.

#10 is a prerequisite for getting the JDSW in shape to align.
#9 is a prerequisite for getting 正文 versions to align to.

this uses a modified version of the algorithm from #10:

  1. Look through the cleaned JDSW (produced by #10, "clean annotations of commentaries from jdsw") and break it up into a key: value store, where each key is the unbroken sequence of characters preceding an annotation and each value is that annotation
  2. For each key: value pair...
    a. Look through the source text (same chapter) and find the first unbroken instance of the key that occurs after the previous annotation (annotations must be matched in order)
    b. If the key is found, take the JDSW annotation and insert it into the source text at that point
    c. If the key isn't found, skip it, since we'll already know about it from #10
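
A minimal sketch of the steps above, not the project's actual script. It assumes the key: value store is an ordered list of `(key, annotation)` pairs (so repeated keys survive), that the source chapter is a single string, and that the annotation goes immediately after the matched key, wrapped in square brackets. The function name `align_annotations` and the bracket markers are hypothetical.

```python
def align_annotations(pairs, source):
    """Insert each annotation after the first occurrence of its key that
    follows the previous match; collect keys that are not found."""
    pieces = []    # chunks of the annotated output text
    skipped = []   # keys with no match after the previous annotation
    cursor = 0     # search resumes here, so matches stay sequential

    for key, annotation in pairs:
        idx = source.find(key, cursor)
        if idx == -1:
            # Step 2c: not found, so skip it (it is already flagged by #10).
            skipped.append(key)
            continue
        end = idx + len(key)
        # Step 2b: copy the text up to the end of the key, then the annotation.
        pieces.append(source[cursor:end])
        pieces.append(f"[{annotation}]")  # hypothetical marker format
        cursor = end

    pieces.append(source[cursor:])  # remainder of the chapter
    return "".join(pieces), skipped


if __name__ == "__main__":
    text = "道可道非常道名可名非常名"
    store = [("道可道", "A"), ("無", "B"), ("名可名", "C")]
    print(align_annotations(text and store, text))
```

Because the cursor only moves forward, a key that appears earlier in the chapter than the previous match is reported as skipped, which is the behaviour step 2a asks for.
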
@thatbudakguy thatbudakguy self-assigned this May 27, 2022
@GDRom
Member

GDRom commented Jun 8, 2022

Following up on this, as I will soon approach 2.d) from #10, to look into what's going on with the keys that couldn't be assigned.

Ideally, I would identify an underlying logic to that issue.
If not, but I can still identify individual keys within the source text, should I go ahead and insert those into the relevant source texts in /out as well?

@thatbudakguy
Member Author

last night I ran the algorithm from #10 on everything in out/jdsw (except the laozi, which we don't have an sbck edition of). if you take a peek at those files, you should see in the third column a note about whether the jdsw annotation matched the source text (in the sbck edition), the commentary (in the sbck edition), or wasn't found. taking a close look to see if the algorithm seems to be correct (and why things aren't found) would be super helpful at this point.

after that, depending on what you find, I'll implement the logic in this issue (which should be pretty similar to #10) in another script. when that script runs, it'll copy any of the annotations from out/jdsw that have the "source" note in the third column and paste them into the MISC column for the corresponding token in a new CoNLL-U file, which will be taken from out/zhengwen (i haven't generated all of these yet but a few tests are there). the output from this will go into out/aligned (similar to the manual alignment that you already did, but in CoNLL-U form).
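
A rough sketch of that copy step, under stated assumptions: rows from out/jdsw are already parsed into lists whose third element is the note, the note for a source-text match is the literal string `source`, and each token line in the CoNLL-U file has the standard ten tab-separated columns with MISC last. The helper names and the `JDSW` key are hypothetical.

```python
def source_rows(rows):
    """Keep only rows whose third column carries the (assumed) 'source' note."""
    return [row for row in rows if len(row) >= 3 and row[2] == "source"]


def add_misc(conllu_line, key, value):
    """Return the token line with key=value appended to the MISC column."""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    entry = f"{key}={value}"
    # "_" marks an empty MISC field; otherwise entries are joined with "|".
    cols[9] = entry if cols[9] == "_" else f"{cols[9]}|{entry}"
    return "\t".join(cols)
```

Since `|` separates MISC entries, an annotation containing that character would need to be escaped or rejected before being written.
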

@GDRom
Member

GDRom commented Jun 10, 2022

Sounds perfect. I'll take the time this weekend and/or early next week to take a deep dive into this, and will keep you posted on how well that algorithm does.

Just following up on this: "except the laozi, which we don't have an sbck edition of" -- you might have overlooked this SBCK edition thereof?

@thatbudakguy
Member Author

oh — indeed I did! the script that converts it was looking for a file named something like 001.txt, so it skipped over the one we have called 1.txt. hence no cleaned version of the laozi. I'll fix that, thank you!

@GDRom
Member

GDRom commented Jun 10, 2022

That must have been my mistake -- sorry about the misnomer there!

@thatbudakguy
Member Author

note to self: it's worth trying the Needleman-Wunsch global alignment algorithm here, just to see how it performs vs. our homegrown one.
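
For reference, a plain textbook sketch of Needleman-Wunsch over two strings, treating each character as one unit. The scoring values (match +1, mismatch -1, gap -1) and the `-` gap symbol are placeholder assumptions, not tuned for these texts.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

The table needs time and memory proportional to the product of the two lengths, so running it chapter by chapter is likely more practical than aligning whole books at once.
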
