automatically align jdsw with cleaned source texts #11

thatbudakguy · 2022-05-27T19:56:26Z

@GDRom has already done some work to do this manually; we want to see if we can automate it.

#10 is a prerequisite for getting the JDSW in shape to align.
#9 is a prerequisite for getting 正文 versions to align to.

this uses a modified version of the algorithm from #10:

1. Look through the cleaned JDSW (from "clean annotations of commentaries from jdsw", #10) and break it up into a key: value store, where each key is the unbroken sequence of characters preceding an annotation and each value is that annotation
2. For each key: value pair...
   a. Look through the source text (same chapter) and find the first instance of the key (unbroken) that occurs after the previous annotation (annotations must be sequential)
   b. If that key is found, take the JDSW annotation and insert it into the source text at that point
   c. If that key isn't found, just skip it, since we'll already know about it from #10
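
A minimal sketch of the steps above, not the project's actual implementation. The function names, the `(key, annotation)` pair format, and the bracket markers are all hypothetical, and it assumes "insert it at that point" means immediately after the matched key:

```python
def align_annotations(pairs, source):
    """Match each (key, annotation) pair against the source text in order.

    pairs:  list of (key, annotation) tuples in JDSW order (assumed format).
    source: the cleaned source text of the same chapter, as one string.
    Returns (matches, missed), where matches holds (offset, annotation)
    tuples and offset is the index just past the matched key.
    """
    matches, missed = [], []
    cursor = 0  # nothing before this offset may be matched again
    for key, annotation in pairs:
        start = source.find(key, cursor)
        if start == -1:
            # Step c: not found, so skip it and keep the cursor where it was.
            missed.append((key, annotation))
            continue
        end = start + len(key)
        matches.append((end, annotation))
        cursor = end  # annotations must be sequential
    return matches, missed


def insert_annotations(source, matches, open_mark="[", close_mark="]"):
    """Return the source text with each annotation spliced in at its offset."""
    pieces, prev = [], 0
    for offset, annotation in matches:
        pieces.append(source[prev:offset])
        pieces.append(f"{open_mark}{annotation}{close_mark}")
        prev = offset
    pieces.append(source[prev:])
    return "".join(pieces)
```

Because the cursor only moves forward, a key that also occurs earlier in the chapter can't be matched out of order, but a single wrong early match will push every later search past its real position.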

GDRom · 2022-06-08T16:24:41Z

Following up on this, as I will soon approach 2.d) from #10, to look into what's going on with the keys that couldn't be assigned.

Ideally, I would identify an underlying logic to that issue.
If I can't, but I do manage to locate individual keys in the source text by hand, should I go ahead and insert those into the relevant source texts in /out as well?

thatbudakguy · 2022-06-08T16:40:33Z

last night I ran the algorithm from #10 on everything in out/jdsw (except the laozi, which we don't have an sbck edition of). if you take a peek at those files, you should see in the third column a note about whether the jdsw annotation matched the source text (in the sbck edition), the commentary (in the sbck edition), or wasn't found. taking a close look to see if the algorithm seems to be correct (and why things aren't found) would be super helpful at this point.

after that, depending on what you find, I'll implement the logic in this issue (which should be pretty similar to #10) in another script. when that script runs, it'll copy any of the annotations from out/jdsw that have the "source" note in the third column and paste them into the MISC column for the corresponding token in a new CoNLL-U file, which will be taken from out/zhengwen (i haven't generated all of these yet but a few tests are there). the output from this will go into out/aligned (similar to the manual alignment that you already did, but in CoNLL-U form).
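
A rough sketch of what the copy-into-MISC step could look like, assuming plain CoNLL-U text (ten tab-separated fields per token line, MISC last, `_` when empty). The `JDSW` attribute name, the function names, and the idea of addressing tokens by a 0-based index are assumptions for illustration only; the real script may work quite differently:

```python
def set_misc(line, key, value):
    """Append key=value to the MISC field (10th column) of one token line."""
    fields = line.split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 CoNLL-U fields, got {len(fields)}")
    misc = [] if fields[9] == "_" else fields[9].split("|")
    misc.append(f"{key}={value}")
    fields[9] = "|".join(misc)
    return "\t".join(fields)


def annotate_conllu(conllu_text, annotations, key="JDSW"):
    """Attach annotations to tokens in a CoNLL-U document.

    annotations: dict mapping a 0-based token index (counted across the
    whole file) to the annotation text for that token (assumed layout).
    """
    out = []
    token_index = 0
    for line in conllu_text.splitlines():
        # Comment lines and blank sentence separators pass through untouched.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        token_id = line.split("\t", 1)[0]
        # Multiword-token ranges (1-2) and empty nodes (1.1) are not counted.
        if "-" in token_id or "." in token_id:
            out.append(line)
            continue
        if token_index in annotations:
            line = set_misc(line, key, annotations[token_index])
        token_index += 1
        out.append(line)
    return "\n".join(out) + "\n"
```

This sketch doesn't escape `|` inside annotation text, which would need handling before writing real annotations into MISC.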

GDRom · 2022-06-10T17:34:44Z

Sounds perfect. I'll take the time this weekend and/or early next week to take a deep dive into this, and will keep you posted on how well that algorithm does.

Just following up on this: "except the laozi, which we don't have an sbck edition of" -- you might have overlooked this SBCK edition thereof?

thatbudakguy · 2022-06-10T18:06:25Z

oh — indeed I did! the script that converts it was looking for a file named something like 001.txt, so it skipped over the one we have called 1.txt. hence no cleaned version of the laozi. I'll fix that, thank you!

GDRom · 2022-06-10T18:13:30Z

That must have been my mistake -- sorry about the misnomer there!

thatbudakguy · 2022-08-05T23:10:53Z

note to self: it's worth trying the needleman-wunsch global alignment algorithm here, just to see how it performs vs our homegrown one.
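
For reference, a compact sketch of textbook Needleman-Wunsch global alignment over two strings. The scoring values (match +1, mismatch -1, gap -1) and the `-` gap character are arbitrary defaults chosen here, not anything decided in this issue:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_char="-"):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char)
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table is O(n·m) in both time and memory, so aligning whole chapters character by character may need chunking or a banded variant. The gap character should also be one that never occurs in the texts themselves.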

thatbudakguy self-assigned this May 27, 2022

thatbudakguy added a commit that referenced this issue Aug 2, 2022

Stub alignjdsw test until it is re-implemented as pipe

cc49484

thatbudakguy added a commit that referenced this issue Aug 2, 2022

Stub alignjdsw test until it is re-implemented as pipe

7241d22

thatbudakguy mentioned this issue Aug 6, 2022

implement pipeline pattern for data transformations #22

Closed

thatbudakguy added a commit that referenced this issue Sep 25, 2022

Implement and stub tests for Alignment.align_annotations

59ef682

thatbudakguy mentioned this issue Nov 7, 2022

source/commentary alignment detection is incorrect #18

Closed

3 tasks
