This is a deep learning framework for operating on XML-structured data, implemented in PyTorch. The framework has modular, extensible components for training, debugging, inference, checkpointing, model schema migration, and more. XML is the first-class format for a large number of applications (HTML on the web, office documents, SVG, etc.).
In this release, we have implemented an equivalent of seq2seq. Given a set of input and output XMLs, the framework can automatically learn the transformations between them and then apply those transformations to novel XML inputs.
This is an alpha release. We appreciate any kind of feedback or contribution. In particular, we are looking for
- New Application scenarios in your domain of interest.
- Bug reports.
- Code contributions. If you would like to join the project, please contact us on the forum.
- Encoder decoder architecture.
- Encoder designed to capture hierarchical structure of an XML.
- Typical seq2seq models operate on relatively short sentences, and information flows linearly (unidirectionally or bidirectionally). XML data, in contrast, is hierarchical and requires information to flow along tree edges as well.
- Regular RNN for text and attributes of XML nodes.
- Inspired by GraphRNN for capturing the structure of an XML tree.
- Respects order of children of an XML element.
- Order not treated as important in XML attributes.
- Decoder is designed to generate output XML.
- Use of attention (1, 2) to find the appropriate character position, XML node, or node attribute to focus on.
- Use of pointer networks for learning to copy portions of text verbatim from the input XML.
- Custom GPU implementation of performance critical modules.
- Support for beam decoding during inference for better accuracy.
- Use of shortcut connections between layers in the network for a more stable convergence.
- Tensorboard integration(over pytorch tensors).
- Schema versioning: We keep tweaking our models, so we often need a way to migrate training done on an old model into a new schema. This can be called a kind of "self-transfer learning". It is supported via schema versioning.
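The two ordering rules in the list above (child order matters, attribute order does not) can be illustrated with a small preprocessing sketch. This is not the framework's encoder; it is a minimal, standard-library-only illustration, and the function name `flatten_xml` and the record fields are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET


def flatten_xml(xml_text):
    """Flatten an XML document into a list of node records in document order.

    Hypothetical helper for illustration only; not part of the framework.
    """
    root = ET.fromstring(xml_text)
    nodes = []

    def visit(element, parent_index):
        index = len(nodes)
        nodes.append({
            "tag": element.tag,
            "text": (element.text or "").strip(),
            # Attributes are sorted by name, so their original order is ignored.
            "attributes": sorted(element.attrib.items()),
            # Index of the parent record; None for the root.
            "parent": parent_index,
            # Child indices are appended in document order, so order is kept.
            "children": [],
        })
        for child in element:
            child_index = visit(child, index)
            nodes[index]["children"].append(child_index)
        return index

    visit(root, None)
    return nodes


nodes = flatten_xml('<a y="2" x="1"><b>hi</b><c/></a>')
for node in nodes:
    print(node)
```

The `parent` and `children` indices are the tree edges along which a hierarchical encoder could pass information, in addition to any sequential processing of text and attributes.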
This package supports Python 3.6. We recommend creating a new virtual environment for this project (using virtualenv or conda).
- Install Python and ninja. On macOS, the following commands install them using MacPorts:
$ sudo port install python36
$ sudo port install py36-pip
$ sudo port select --set pip pip36
$ sudo port select --set python python36
$ sudo port install ninja
- Check out the repository.
git clone https://github.com/nishantsharma/xml.ai
- Install all Python packages listed in requirements.txt.
$ sudo pip install -r requirements.txt
Currently, we are running on generated datasets. We support generating the following four toy datasets.
S.No | Dataset ID | Description | Input Example | Output Example |
---|---|---|---|---|
1. | toy0 | Inverts node.text | <toyrev>ldhmo</toyrev> | <toyrev>omhdl</toyrev> |
2. | toy1 | Swaps parent and child node tags | <tag1><tag2 /></tag1> | <tag2><tag1 /></tag2> |
3. | toy2 | Swaps shipping and billing address fields. | Generated data compliant with schema.xsd. | The two addresses swapped. |
4. | toy3 | Children order is reversed. Attribute list is rotated. Tail and text swapped. | <a><b p1="p1"></b> <c p2="p2"></c></a> | <a><c p1="p1"></c> <b p2="p2"></b></a> |
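To make the first two rows of the table concrete, here is a minimal sketch of the toy0 and toy1 transformations written with Python's standard `xml.etree.ElementTree`. It is only an illustration of what the model is expected to learn, not the project's generator; the function names are hypothetical, and the toy1 sketch assumes the simplest case of a root with a single child, as in the table's example.

```python
import xml.etree.ElementTree as ET


def toy0_reverse_text(xml_text):
    """Reverse the text of every node (toy0 in the table above)."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        if node.text:
            node.text = node.text[::-1]
    return ET.tostring(root, encoding="unicode")


def toy1_swap_tags(xml_text):
    """Swap the tags of the root and its first child (toy1 in the table above).

    Assumes the root has at least one child.
    """
    root = ET.fromstring(xml_text)
    child = root[0]
    root.tag, child.tag = child.tag, root.tag
    return ET.tostring(root, encoding="unicode")


print(toy0_reverse_text("<toyrev>ldhmo</toyrev>"))  # <toyrev>omhdl</toyrev>
print(toy1_swap_tags("<tag1><tag2 /></tag1>"))      # <tag2><tag1 /></tag2>
```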
Run the generation script to create a toy dataset for a given domain. By default, the generated data is stored in data/inputs/<domainId>.
./scripts/generate.sh --domain toy1
./scripts/generate.sh --domain toy2
To get help on generation parameters, run the following command.
./scripts/generate.sh --domain toy1 --help
./scripts/generate.sh --domain toy2 --help
To continue the last training run on the default domain.
./scripts/train.sh
To continue the last training run for a specific domain.
./scripts/train.sh --domain toy1
./scripts/train.sh --domain toy2
For help.
./scripts/train.sh -h
To evaluate the latest trained model of a domain.
./scripts/evaluate.sh --domain <domainId>
To evaluate on domain toy1.
./scripts/evaluate.sh --domain toy1
For help.
./scripts/evaluate.sh -h
To view TensorBoard logs, first make sure that TensorBoard is installed.
pip3 install tensorboard
Then, run the following command.
tensorboard --logdir ./data/training/runFolders/
Training checkpoints are organized by domainId, runNo, modelSchemaNo and function as shown in the following file structure.
data/
    +-- training/
        +-- runFolders/
            +-- run.<runNo>.<domainId>_<modelSchemaNo>/
            +-- run.00000.toy1_0/
                +-- Chk<epochNo>.<batchNo>/
                    +-- input_vocab*.pt
                    +-- output_vocab.pt
                    +-- model.pt
                    +-- modelArgs
                    +-- trainer_states.pt
    +-- testing/
        +-- runFolders/
            +-- run.00000.toy1_0/
                +-- Chk*/
            +-- run.<runNo>.<domainId>_<modelSchemaNo>/
                +-- Chk*/
    +-- inputs/
        +-- <domainId>/
            +-- dev/
                +-- dataIn*.xml
                +-- dataOut*.xml
            +-- test/
                +-- dataIn*.xml
                +-- dataOut*.xml
            +-- train/
                +-- dataIn*.xml
                +-- dataOut*.xml
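As an illustration of how the dataIn*.xml and dataOut*.xml files in one split folder might be consumed, the sketch below pairs them up. It assumes that an input and its output share the same file name suffix (for example, dataIn_0.xml with dataOut_0.xml); that pairing rule and the function name `list_xml_pairs` are assumptions made for this example, not something the layout above guarantees.

```python
import tempfile
from pathlib import Path


def list_xml_pairs(split_dir):
    """Return (input_path, output_path) pairs found in one split folder.

    Assumes dataIn<suffix> is paired with dataOut<suffix> (hypothetical rule).
    """
    split_dir = Path(split_dir)
    pairs = []
    for in_path in sorted(split_dir.glob("dataIn*.xml")):
        suffix = in_path.name[len("dataIn"):]
        out_path = split_dir / ("dataOut" + suffix)
        # Skip inputs that have no matching output file.
        if out_path.exists():
            pairs.append((in_path, out_path))
    return pairs


# Small self-contained demo on a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dataIn_0.xml", "dataOut_0.xml", "dataIn_1.xml"):
        (Path(tmp) / name).write_text("<a />")
    pair_names = [(i.name, o.name) for i, o in list_xml_pairs(tmp)]

print(pair_names)  # [('dataIn_0.xml', 'dataOut_0.xml')]
```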
The sample script by default saves checkpoints in the inputs/<domainId> folder of the root directory. Look at the usages of the sample code for more options, including resuming and loading from checkpoints.
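When resuming, one typically needs the most recent checkpoint folder of a run. The following sketch picks it from folders named Chk<epochNo>.<batchNo>, as in the layout shown earlier. It assumes that epochNo and batchNo are plain non-negative integers; that assumption and the helper name `latest_checkpoint` are ours, and the training script's own resume logic may differ.

```python
import re
import tempfile
from pathlib import Path

# Assumed folder name format: Chk<epochNo>.<batchNo> with integer fields.
CHECKPOINT_PATTERN = re.compile(r"^Chk(\d+)\.(\d+)$")


def latest_checkpoint(run_dir):
    """Return the checkpoint folder with the highest (epochNo, batchNo), or None."""
    best_key, best_path = None, None
    for path in Path(run_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if path.is_dir() and match:
            # Compare numerically, so Chk10.1 sorts after Chk2.5.
            key = (int(match.group(1)), int(match.group(2)))
            if best_key is None or key > best_key:
                best_key, best_path = key, path
    return best_path


# Small self-contained demo on a temporary run folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Chk0.100", "Chk2.5", "Chk10.1"):
        (Path(tmp) / name).mkdir()
    latest_name = latest_checkpoint(tmp).name

print(latest_name)  # Chk10.1
```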
The goal of this library is to facilitate the development of XML-to-XML transformation techniques and applications.
We plan to bring the following application scenarios to life.
- Given a few XMLs, propose an XML schema that best describes them. It may be a standard open schema.
- XSLT Extractor: Given an input and output XML, generate the simplest XSLT which translates one to the other. Something like what prose does.
- Learn aesthetics transformations for common XML formats like SVG, PPT, DOC.
- ...
We have the following on our roadmap.
- Currently, our decoder generates an output sequence, and the learning process forces it to be XML. We want to generate output XML directly.
- We are generating the complete training output as text. Instead, we want to generate XML transformations. Think XSLT to turn input XML to output XML.
- We currently operate at the supervised learning level, which may be good enough. But imagine a scenario where a human is editing an XML document (say, a resume) for aesthetics. In that case, we can interpret aesthetics as an objective function. We would like to use "Inverse Reinforcement Learning" to discover this underlying aesthetics objective function.
While constantly improving the performance and the quality of code and documentation, we will also focus on the following items:
- Identification and evaluation with benchmarks;
- Provide more flexible model options, improving the usability of the library;
- Support features in the new versions of PyTorch.
If you have any questions, bug reports, or feature requests, please open an issue on GitHub. For live discussions, please go to our Gitter lobby.
We appreciate any kind of feedback or contribution. Feel free to proceed with small issues like bug fixes and documentation improvements. For major contributions and new features, please discuss with the collaborators in the corresponding issues.