This repository has been archived by the owner on Jan 19, 2019. It is now read-only.

Migrate dataset reader code from (scala) DeepQA Experiments to DeepQA #328

Open
liyi193328 opened this issue Apr 29, 2017 · 9 comments

liyi193328 commented Apr 29, 2017

First of all, many thanks for this great project; it is exactly what I have been wanting to work on. I'll keep watching and using it, and I hope to contribute as well.

However, when I tried to run some pipelines from scratch, I found that the data preprocessing steps live in another project, https://github.com/allenai/deep_qa_experiments, and that project's code is in Scala.

I think having the preprocessing steps in a separate project makes things complicated for someone who wants to get started quickly.

@matt-gardner
Contributor

matt-gardner commented Apr 29, 2017

Yes, as you can see in the README, the data processing code is currently in the scala library. That is for historical reasons. When we write new data processing code, it will almost certainly be in the python library.

However, it's not very high priority for us to migrate the data processing code, because we already have all of the data processed, we know how to use the scala library easily enough, and we have a lot of other things on our plate. This is a great place where contributions would be much appreciated.

For anyone who wants to contribute to this, it's as simple as taking a (scala) DatasetReader from the DeepQA Experiments library and converting it to a python script in the dataset_readers module in DeepQA. Most of these dataset readers are pretty simple, so it shouldn't take that much work to do this. (The SquadSentenceSelectionReader script is complicated because it has fancy logic for mixing up the data in interesting ways; the corresponding reader for the standard SQuAD task is much simpler.)
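As a rough illustration of what such a conversion could look like, here is a minimal Python sketch that reads the public SQuAD v1.1 JSON layout and writes one tab-separated line per question. It is not a port of the actual scala reader: the function names `read_squad` and `write_instances` and the output column order (question, passage, answer span) are assumptions made for this example, and DeepQA's real instance format may differ.

```python
import json
from typing import Iterator, Tuple


def read_squad(path: str) -> Iterator[Tuple[str, str, str, int]]:
    """Yield (question, passage, answer_text, answer_start) tuples.

    Assumes the SQuAD v1.1 JSON layout: data -> paragraphs -> qas -> answers.
    """
    with open(path, encoding="utf-8") as squad_file:
        dataset = json.load(squad_file)
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            # Replace tabs and newlines one-for-one with spaces so that the
            # character offsets in `answer_start` stay valid.
            passage = paragraph["context"].replace("\n", " ").replace("\t", " ")
            for qa in paragraph["qas"]:
                if not qa.get("answers"):
                    continue  # skip questions without an annotated answer
                question = qa["question"].replace("\n", " ").replace("\t", " ").strip()
                answer = qa["answers"][0]  # keep only the first annotation
                yield question, passage, answer["text"], answer["answer_start"]


def write_instances(squad_path: str, output_path: str) -> int:
    """Write one line per question; the column order is a hypothetical choice."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for question, passage, text, start in read_squad(squad_path):
            span = "{},{}".format(start, start + len(text))
            out.write("\t".join([question, passage, span]) + "\n")
            count += 1
    return count
```

Calling `write_instances` with the path of a downloaded SQuAD file and an output path of your choice would produce the tab-separated file; the span column holds the start and exclusive end character offsets of the answer in the passage.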

If you just want to use the DeepQA Experiments library to get the data for you, the easiest way to do so is probably like this (steps shown for SQuAD, but are similar for other datasets):

  1. Download the dataset from wherever it lives (for SQuAD, that's here). Extract it if it's some kind of archive file.
  2. Modify the path in the experiment code to point to where you downloaded the files.
  3. Run the following from a terminal, in the base directory for DeepQA Experiments:
sbt console
scala> import com.mattg.util.FileUtil
scala> import org.allenai.deep_qa.pipeline.DatasetStep
scala> import org.allenai.deep_qa.experiments.datasets.SquadDatasets
scala> DatasetStep.create(SquadDatasets.trainDataset, new FileUtil).runPipeline()

And you can repeat that last step for the dev set, or for any other dataset you want to process.

@matt-gardner matt-gardner changed the title Data pre processing in Another project? Migrate dataset reader code from (scala) DeepQA Experiments to DeepQA Apr 29, 2017
@liyi193328
Author

@matt-gardner I'll try these steps. Thanks.

@liyi193328
Author

@matt-gardner When I run sbt console, it complains:

[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: com.clearnlp#clearnlp;2.0.3-allenai: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] com.clearnlp:clearnlp:2.0.3-allenai
[warn] +- org.allenai.openie:openie_2.11:4.2.6 (E:\active_project\deep_qa_experiments\build.sbt#L33-54)
[warn] +- org.allenai:deep-qa_2.11:0.2.5
[trace] Stack trace suppressed: run last :update for the full output.
[error] (:update) sbt.ResolveException: unresolved dependency: com.clearnlp#clearnlp;2.0.3-allenai: not found
[error] Total time: 48 s, completed 2017-5-2 14:10:47

So is something missing? Thanks.

@matt-gardner
Contributor

Oh, yeah, sorry about that. I forgot about that dependency. I just removed it, so it should work now. Can you update your repo and try again?

@liyi193328
Author

@matt-gardner After updating, the dependency problem is resolved, but now there is another issue:
[warn] Credentials file C:\Users\liyi1.bintray.credentials does not exist
[info] Compiling 1 protobuf files to E:\active_project\deep_qa_experiments\target\scala-2.11\src_managed\main\compiled_protobuf
[info] Compiling schema E:\active_project\deep_qa_experiments\src\main\protobuf\message.proto
protoc-jar: protoc version: 300, detected platform: windows 10/amd64
protoc-jar: executing: [C:\Users\liyi1\AppData\Local\Temp\protoc7978892183318687115.exe, --plugin=protoc-gen-scala=C:\Users\liyi1\AppData\Local\Temp\scalapbgen3603827797058091070.bat, -IE:\active_project\deep_qa_experiments\src\main\protobuf, -IE:\active_project\deep_qa_experiments\target\protobuf_external, --scala_out=grpc:E:\active_project\deep_qa_experiments\target\scala-2.11\src_managed\main\compiled_protobuf, E:\active_project\deep_qa_experiments\src\main\protobuf\message.proto]
Traceback (most recent call last):
[trace] Stack trace suppressed: run last protobuf:protobufGenerate for the full output.
[error] (protobuf:protobufGenerate) protoc returned exit code: 1
File "C:\Users\liyi1\AppData\Local\Temp\scalapbgen6943038223903542987.py", line 6, in
[error] Total time: 1 s, completed 2017-5-3 17:02:24
s.sendall(content)

TypeError: a bytes-like object is required, not 'str'

I'm not familiar with these errors. Does the project need Linux or macOS? Is Windows not supported? Thanks.

@matt-gardner
Contributor

Yeah, I have no idea what's going on there. I think the only thing I had to install to get the protobuf stuff to work was this: pip install grpcio grpcio-tools pyhocon, which shouldn't be affecting this step.
Can you get the full stack trace with last protobuf:protobufGenerate? Also try installing those python libraries, just to see if that fixes the issue.

This doesn't look like it's a windows issue to me, but even if we figure this out, I think the rest of the code has various places where / is hard-coded instead of using OS-independent paths, so I think you'll have a hard time. We know this runs on linux and macOS, but haven't run it on windows.
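For any reader that gets ported to python, the hard-coded separator problem can be avoided by building paths from components. The sketch below uses the standard `pathlib` module; the helper name `dataset_file` and the `data/squad` directory layout are hypothetical and only serve as an illustration.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def dataset_file(base_dir, *parts):
    """Join path components using the separator of the current OS."""
    return Path(base_dir).joinpath(*parts)


# The same components rendered with each platform's rules, independent of
# the machine this runs on (pure paths never touch the file system).
posix = PurePosixPath("data", "squad", "train-v1.1.json")
windows = PureWindowsPath("data", "squad", "train-v1.1.json")
print(posix)    # data/squad/train-v1.1.json
print(windows)  # data\squad\train-v1.1.json
```

Code written against `dataset_file` never mentions `/` or `\` directly, so it behaves the same on linux, macOS, and windows.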

Another thing to consider is that at this point, it probably is less work to translate the ~50 lines of scala code in the dataset reader into python than it is to figure out what's going on here.

@liyi193328
Author

@matt-gardner That's very kind of you. I was using Python 3.5 in my environment, which caused this error.
Now that it is fixed, I have another question: is com.mattg.util.FileUtil from another project, https://github.com/matt-gardner/util? If so, I can't import it without that dependency.
Thanks. I may also contribute some Python 3 code when I have time.
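The TypeError in the traceback above is the usual Python 3 behavior of `socket.sendall`, which accepts only bytes-like objects. The generated script itself is not shown in this thread, so the following is just a small reproduction of the same failure mode using a local socket pair, with a hypothetical helper `send_text` that encodes text before sending.

```python
import socket


def send_text(sock, content):
    """Send text over a socket, encoding str to bytes as Python 3 requires."""
    if isinstance(content, str):
        content = content.encode("utf-8")
    sock.sendall(content)


# A connected pair of local sockets, so no server is needed for the demo.
left, right = socket.socketpair()
try:
    try:
        left.sendall("hello")  # passing str: Python 3 raises TypeError here
    except TypeError as error:
        print("sendall rejected str:", error)
    send_text(left, "hello")   # encoded to bytes first, so this succeeds
    received = right.recv(1024)
finally:
    left.close()
    right.close()
```

Under Python 2 the same `sendall("hello")` call works because `str` is already a byte string there, which is consistent with the error appearing only when the Python on the PATH is 3.x.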

@matt-gardner
Contributor

The util library is a dependency in the DeepQA Experiments library, and it's grabbed automatically when you run sbt from within that project. If you run sbt console from the root directory of where you cloned DeepQA Experiments, you should be able to import com.mattg.util.FileUtil without a problem.

@liyi193328
Author

@matt-gardner It runs now. Thanks for all the help.
