
Commit a7d2cfc

committed
V1.0.4
- Containerization of the WSA module
- Poetry support
- README improvements

1 parent a995a3d, commit a7d2cfc

File tree

16 files changed: +1716 −52 lines


README.md

Lines changed: 70 additions & 11 deletions
@@ -1,24 +1,38 @@
-# Wiki ES
+# Wiki Entity Summarization Pre-processing
 
 ## Overview
 
-## Getting Started
+This project focuses on the pre-processing steps required for the Wiki Entity Summarization (Wiki ES) project. It
+involves building the necessary databases and loading data from various sources to prepare for the entity summarization
+tasks.
 
-- Build [wikimapper](https://github.com/jcklie/wikimapper) database
+### Server Specifications
+
+For the pre-processing steps, we used an r5a.4xlarge instance on AWS with the following specifications:
+
+- vCPU: 16 (AMD EPYC 7571, 16 MiB cache, 2.5 GHz)
+- Memory: 128 GB (DDR4, 2667 MT/s)
+- Storage: 500 GB (EBS, 2880 Mbps max bandwidth)
+
+### Getting Started
+
+To get started with the pre-processing, follow these steps:
+
+1. Build the [wikimapper](https://github.com/jcklie/wikimapper) database:
 
 ```shell
 pip install wikimapper
 ```
 
-If you would like to download the latest version run the following
+If you would like to download the latest version, run the following:
 
 ```shell
 EN_WIKI_REDIRECT_AND_PAGES_PATH={your_files_path}
 wikimapper download enwiki-latest --dir $EN_WIKI_REDIRECT_AND_PAGES_PATH
 ```
 
 After having `enwiki-{VERSION}-page.sql.gz`, `enwiki-{VERSION}-redirect.sql.gz`,
-and `enwiki-{VERSION}-page_props.sql.gz` loaded under your data directory:
+and `enwiki-{VERSION}-page_props.sql.gz` loaded under your data directory, run the following commands:
 
 ```shell
 VERSION={VERSION}
@@ -27,8 +41,8 @@ INDEX_DB_PATH="`pwd`/data/index_enwiki-$VERSION.db"
 wikimapper create enwiki-$VERSION --dumpdir $EN_WIKI_REDIRECT_AND_PAGES_PATH --target $INDEX_DB_PATH
 ```
 
-- Now load the created db into our Postgres database,
-read [pgloader's document](https://pgloader.readthedocs.io/en/latest/install.html) for the installation
+2. Load the created database into the Postgres database;
+read [pgloader's documentation](https://pgloader.readthedocs.io/en/latest/install.html) for installation instructions.
 
 ```shell
 ./config-files-generator.sh
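For orientation, a minimal pgloader load file of the kind the generated `sqlite-to-page-migration.load` represents might look like the following. Every path, credential, and option here is a placeholder sketch, not the file produced by the repository's script:

```
load database
     from sqlite:///data/index_enwiki-latest.db
     into postgresql://wikies:password@wiki-es-pg:5432/wikies

with include drop, create tables, create indexes, reset sequences;
```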
@@ -44,7 +58,52 @@ EOT
 pgloader ./sqlite-to-page-migration.load
 ```
 
-After running the experiment we encountered some issues with the `wikimapper` library, so we developed the following script to correct the missing data:
-```shell
+3. Correct missing data: after running the experiments, some issues were encountered with the `wikimapper` library.
+To correct the missing data, run the following script:
+
+```shell
 python3 missing_data_correction.py
 ```
+
+## Data Sources
+
+The pre-processing steps involve loading data from the following sources:
+
+- **Wikidata**, [wikidatawiki latest version](https://dumps.wikimedia.org/wikidatawiki/latest/):
+First, download the latest version of the Wikidata dump. With the dump, you can run the following command to load the
+metadata of the Wikidata dataset into the Postgres database, and the relationships between the entities into the Neo4j
+database. This module is called `Wikidata Graph Builder (wdgp)`.
+```shell
+docker-compose up wdgp
+```
+- **Wikipedia**, [enwiki latest version](https://dumps.wikimedia.org/enwiki/latest/):
+The Wikipedia pages are used to extract the abstract and infobox of the corresponding Wikidata entity. The abstract
+and infobox are then used to annotate the summary in Wikidata. To provide this information, you need to load the
+latest version of the Wikipedia dump into the Postgres database. This module is called `Wikipedia Page Extractor (wppe)`.
+```shell
+docker-compose up wppe
+```
+
+## Summary Annotation
+
+When both datasets are loaded into the databases, we start processing all the available pages in the Wikipedia dataset
+to extract the abstract and infobox of the corresponding Wikidata entity. The extracted pages are then marked, and the
+edges touching the marked pages become summary candidates. Since Wikidata is a heterogeneous graph with multiple types
+of edges, we need to pick the most relevant edge as a summary between two entities for the summarization task. This
+module is called `Wiki Summary Annotator (wsa)`, and we
+use [DistilBERT](https://arxiv.org/abs/1910.01108) to filter the most relevant edge.
+
+```shell
+docker-compose up wsa
+```
+
+## Conclusion
+
+By running the above commands, you will have the necessary databases and data loaded to start the Wiki Entity
+Summarization project. The next steps involve providing a set of seed nodes based on your preference, along with other
+configuration parameters, to get a fully customized entity summarization dataset.
+
+## License
+
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
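The edge-filtering idea in the new Summary Annotation section can be illustrated with a toy scorer. Here, plain token overlap between an edge's label and the entity's abstract stands in for the DistilBERT relevance model, and all names and data are illustrative, not drawn from the repository:

```python
def relevance(edge_label: str, abstract: str) -> int:
    """Toy relevance score: how many label tokens occur in the abstract."""
    abstract_tokens = set(abstract.lower().split())
    return sum(token in abstract_tokens for token in edge_label.lower().split())

def most_relevant_edge(candidates: list[tuple[str, str]], abstract: str) -> tuple[str, str]:
    """Keep the single most relevant edge between two entities, as the
    annotator must, since Wikidata allows multiple edges per entity pair."""
    return max(candidates, key=lambda edge: relevance(edge[1], abstract))

candidates = [("P19", "place of birth"), ("P69", "educated at")]
abstract = "He was born in Honolulu; his place of birth shaped his early life."
print(most_relevant_edge(candidates, abstract))  # -> ('P19', 'place of birth')
```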

commons/storage.py

Lines changed: 0 additions & 6 deletions
@@ -5,7 +5,6 @@
 from functools import lru_cache
 from pathlib import Path
 from functools import wraps
-from typing import List
 
 from more_itertools import batched
 

@@ -311,11 +310,6 @@ def wrapped(*args, **kwargs):
 
     return wrapped
 
-
-# TODO create indexes if not exists
-# CREATE INDEX summary_summary_for_index FOR ()-[r:SUMMARY]->() ON (r.summary_for);
-# CREATE INDEX wiki_entity_entityName_index FOR (n:WikiEntity) ON (n.entityName);
-
 @lru_cache(maxsize=1024)
 @manage_neo4j_session
 def fetch_relations(subject_qid: str, object_qid: str, session) -> list[tuple[str, str, str]]:
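The `@lru_cache` / `@manage_neo4j_session` stacking visible around `fetch_relations` can be sketched as below. The decorator body and `FakeSession` are assumptions for illustration only; the real implementation of `manage_neo4j_session` is not shown in this diff:

```python
from functools import lru_cache, wraps

class FakeSession:
    """Stand-in for a neo4j session; the real code opens one per call."""
    def run(self, query, **params):
        # A real session would execute the Cypher query; a canned triple
        # keeps this sketch self-contained.
        return [("Q1", "P26", "Q2")]

def manage_neo4j_session(func):
    """Inject a session argument so callers pass only business arguments."""
    @wraps(func)
    def wrapped(*args, **kwargs):
        session = FakeSession()  # the real decorator would manage a driver session
        return func(*args, session=session, **kwargs)
    return wrapped

# Decorator order matters: lru_cache sits outermost, so results are cached
# on (subject_qid, object_qid) and the session never enters the cache key.
@lru_cache(maxsize=1024)
@manage_neo4j_session
def fetch_relations(subject_qid: str, object_qid: str, session=None):
    return list(session.run("MATCH ...", s=subject_qid, o=object_qid))
```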

compose.yaml

Lines changed: 21 additions & 0 deletions
@@ -6,7 +6,10 @@ services:
     container_name: wikidata_graph_builder
     env_file:
       - configs/wdgb.env
+
     environment:
+      - APP_DUMPFILES_pattern=*pages-articles*xml*.bz2
+      - APP_EXECUTORPOOL_COREPOOLSIZE=20
       - APP_DUMPFILES_DIR=${WIKIDATA_DUMPS_PATH}
       - SPRING_DATASOURCE_URL=jdbc:postgresql://wiki-es-pg:${DB_PORT}/${DB_NAME}
       - SPRING_DATASOURCE_USERNAME=${DB_USER}

@@ -33,6 +36,8 @@ services:
     env_file:
       - configs/wdgb.env
     environment:
+      - APP_DUMPFILES_pattern=*pages-articles*xml*.bz2
+      - APP_EXECUTORPOOL_COREPOOLSIZE=20
       - APP_DUMPFILES_DIR=${WIKIPEDIA_DUMPS_PATH}
       - SPRING_DATASOURCE_URL=jdbc:postgresql://wiki-es-pg:${DB_PORT}/${DB_NAME}
       - SPRING_DATASOURCE_USERNAME=${DB_USER}

@@ -48,6 +53,22 @@ services:
       - wikipedia-dumps:${WIKIPEDIA_DUMPS_PATH}
     networks:
       - wiki-es-network
+  wsa:
+    build:
+      context: .
+      dockerfile: wiki_summary/Dockerfile
+    container_name: wiki_summary_annotator
+    depends_on:
+      neo4j:
+        condition: service_healthy
+      postgres:
+        condition: service_healthy
+    links:
+      - postgres
+    volumes:
+      - ./.env:/app/.env
+    networks:
+      - wiki-es-network
   postgres:
     hostname: wiki-es-pg
     image: 'postgres:latest'

config-files-generator.sh

Lines changed: 8 additions & 4 deletions
@@ -7,8 +7,6 @@ POSTGRES_VOLUME_PATH=/data/pg-data
 NEO4J_VOLUME_PATH=/data/neo4j-data
 WIKIDATA_DUMPS_PATH=/data/wikidata/articles/
 WIKIPEDIA_DUMPS_PATH=/data/wikipedia/articlesdump/
-JUPYTER_VOLUME_PATH=/data/jupyter-data
-OUTPUT_VOLUME_PATH=/data/wiki-es-output
 "
 }
 

@@ -20,7 +18,6 @@ DB_USER=wikies
 DB_PASSWORD=password
 NEO4J_USER=neo4j
 NEO4J_PASSWORD=password
-JUPYTER_TOKEN=wikies
 "
 }
 

@@ -35,7 +32,6 @@ NEO4J_HOST=wiki-es-neo
 apps() {
 echo '
 ########## Application ##########
-MAX_DB_CONNECTION_POOL=25
 '
 }
 

@@ -100,6 +96,13 @@ SPRING_DATASOURCE_PASSWORD=\${DB_PASSWORD}
 "
 }
 
+wsa_env() {
+echo "# Wiki Summarization Annotator
+########################################################################
+MAX_DB_CONNECTION_POOL=25
+"
+}
+
 initialize_env() {
 echo "*************************************"
 local env_file=$1

@@ -128,6 +131,7 @@ initialize_env "./configs/pg.env" pg_env
 initialize_env "./configs/neo4j.env" neo4j_env
 initialize_env "./configs/wdgb.env" wdgb_env
 initialize_env "./configs/wppe.env" wpte_env
+initialize_env "./configs/wsa.env" wsa_env
 
 echo "*************************************"
 echo "Done."
