The project involves implementing an Entity Resolution Pipeline on citation networks from ACM and DBLP datasets.
The project begins with two large dataset text files. Please save these in the local data
folder.
- [DBLP-Citation-network V8]
- [ACM-Citation-network V8]
(They can be found here)
In addition to the Python packages listed in requirements.txt
, the following prerequisites are required:
- Spark version 3.5.0
- Scala version 2.12.x
- GraphFrames version 0.8.3-spark3.5-s_2.12
cd path-to-this-project
pip install -e .
from erp import part1, part2, part3
part1() # cleaned data stored in "data"
part2() # results of all methods stored in "method_results.csv"
part3() # scalability test
- Preparing Data: Run
erp.preparing.prepare_data("path_to_txt_file")
for both text files. This will clean and extract the relevant data (citations from 1995-2004 by "SIGMOD" or "VLDB" venues). The resulting CSV files will be stored in thedata
folder. - Running Pipeline:
- Local Version: Run
erp.ER_pipeline(databasefilename1, databasefilename2, ERconfiguration, baseline=False, cluster=True, matched_output="path-to-output-file", cluster_output="path-to-output-file", isdp=False)
(inerp/main.py
) - DP Version: Run
erp.ER_pipeline(databasefilename1, databasefilename2, ERconfiguration, baseline=False, cluster=True, matched_output=F"path-to-output-file", cluster_output="path-to-output-file", isdp=True)
(it callsER_pipeline_dp
inerp/dperp.py
)
- Local Version: Run
- Configuration Options:
blocking_method
(String): Methods to reduce execution time{“Year”, “TwoYear”, “numAuthors”, “FirstLetterTitle”, “LastLetterTitle”, "FirstOrLastLetterTitle", “authorLastName”, “commonAuthors”, “commonAndNumAuthors”}
.matching_method
(String): Algorithms for entity matching{"Jaccard", "Combined"}
.clustering_method
(String): Algorithm for clustering{"basic"}
.threshold
(float): A value between 0.0-1.0 for the matching similarity threshold.output_filename
(String): Path and filename of clustering results to be saved.
ERconfiguration:
{
"matching_method": "Combined",
"blocking_method": "FirstOrLastLetterTitle",
"clustering_method": "basic",
"threshold": 0.7,
"output_filename": "clustering_results_local.csv"
}