Implementation of "AtTGen: Attribute Tree Generation for Real-World Attribute Joint Extraction", ACL 2023.
A lightweight attribute extraction model that achieves an F1-score above 96% on the MEPAVE dataset ;)
Please install the dependencies first:

```bash
pip install -r requirements.txt
```

`main.py` accepts the following arguments:

```text
usage: main.py [-h] [--name NAME] [--do_train] [--do_eval]
               [--data_dir DATA_DIR] [--word_vocab WORD_VOCAB]
               [--ontology_vocab ONTOLOGY_VOCAB] [--tokenizer TOKENIZER]
               [--seed SEED] [--gpu_ids GPU_IDS] [--batch_size BATCH_SIZE]
               [--lr LR] [--epoch EPOCH] [--emb_dim EMB_DIM]
               [--encode_dim ENCODE_DIM] [--skip_subject SKIP_SUBJECT]

configuration

optional arguments:
  -h, --help            show this help message and exit
  --name NAME           Experiment name, for logging and saving models
  --do_train            Whether to run training.
  --do_eval             Whether to run eval on the test set.
  --data_dir DATA_DIR   The input data dir.
  --word_vocab WORD_VOCAB
                        The vocabulary file.
  --ontology_vocab ONTOLOGY_VOCAB
                        The ontology class file.
  --tokenizer TOKENIZER
                        The tokenizer type.
  --seed SEED           The random seed for initialization
  --gpu_ids GPU_IDS     The GPU ids
  --batch_size BATCH_SIZE
                        Total batch size for training.
  --lr LR               The initial learning rate for Adam.
  --epoch EPOCH         Total number of training epochs to perform.
  --emb_dim EMB_DIM     The dimension of the embedding
  --encode_dim ENCODE_DIM
                        The dimension of the encoding
  --skip_subject SKIP_SUBJECT
                        Whether to skip the subject
```
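As a quick reference, a typical training invocation combining several of these flags might look like the sketch below; the `--batch_size`, `--lr`, and `--epoch` values are illustrative placeholders, not tuned hyper-parameters:

```bash
# Sketch only: all flags are documented in the help text above,
# but the numeric values are placeholders, not recommended settings.
python3 main.py --do_train --gpu_ids=0 \
    --data_dir=./data/jave/ \
    --ontology_vocab=attribute_vocab.json \
    --tokenizer=char \
    --batch_size=32 --lr=1e-3 --epoch=20 \
    --name=jave_demo
```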
Download the dataset to the `raw_data` folder, and run `python3 preprocess.py --dataset=xxxx` to preprocess the data. Use the argument `--subject_guild True` to enable the subject guild function.
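For example, assuming `--subject_guild` is a flag of `preprocess.py` (as the surrounding text suggests), enabling it for the MEPAVE data would look like:

```bash
# Assumption: --subject_guild is accepted by preprocess.py;
# if the flag belongs to a different entry point, adjust accordingly.
python3 preprocess.py --dataset=jave --subject_guild True
```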
The pre-processed NYT dataset is included in the `data` folder and can be used directly.

Thanks to the parameter efficiency of this model, training, inference, and evaluation of trained model weights are all quick and convenient.

Trained model weights, obtained with the default hyper-parameters, are provided in `runs/jave_best`.
We use the sample data in MEPAVE to demonstrate the usage of AtTGen.
- You can check the samples in the `data/jave_sample` folder.
- You can try this demonstration by directly running `python3 playground.py`.
- Prepare the MEPAVE dataset

  Due to licensing restrictions, we cannot provide this dataset directly. Please apply for a license here, then download the whole dataset and put the `*.txt` files in the `raw_data/jave` folder (see the sketch below).
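After this step, the raw data directory should look roughly as follows; the exact `*.txt` filenames depend on the MEPAVE release, so the names shown here are placeholders:

```text
raw_data/
└── jave/
    ├── xxx.train.txt   # placeholder names: use the *.txt files
    ├── xxx.valid.txt   # exactly as shipped with MEPAVE
    └── xxx.test.txt
```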
- Preprocess the data

```bash
python3 preprocess.py --dataset=jave
```

- Train the model

```bash
python3 main.py --do_train --gpu_ids=0 --data_dir=./data/jave/ --ontology_vocab=attribute_vocab.json --tokenizer=char --name=jave
```

- Evaluate the model

```bash
python3 main.py --do_eval --gpu_ids=0 --data_dir=./data/jave/ --ontology_vocab=attribute_vocab.json --tokenizer=char --name=jave
```

To train on the CNShipNet dataset:

```bash
python3 main.py --gpu_ids=0 --data_dir=./data/CNShipNet/ --word_vocab=word_vocab.json --ontology_vocab=attribute_vocab.json --tokenizer=chn --do_train
```

To train on the NYT dataset:

```bash
python3 main.py --gpu_ids=0 --data_dir=./data/nyt/ --ontology_vocab=relation_vocab.json --tokenizer=base --do_train
```
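Evaluation for CNShipNet and NYT is not spelled out above; assuming it mirrors the MEPAVE example, it should suffice to swap `--do_train` for `--do_eval` while keeping the other flags, e.g.:

```bash
# Assumption: evaluation mirrors training, as in the MEPAVE example,
# by replacing --do_train with --do_eval.
python3 main.py --gpu_ids=0 --data_dir=./data/nyt/ --ontology_vocab=relation_vocab.json --tokenizer=base --do_eval
```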
If you found this work useful, please cite it as follows:

```bibtex
@inproceedings{li-etal-2023-attgen,
    title = "AtTGen: Attribute Tree Generation for Real-World Attribute Joint Extraction",
    author = "Li, Yanzeng and
      Xue, Bingcong and
      Zhang, Ruoyu and
      Zou, Lei",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2023",
    address = "Toronto, Canada"
}
```
