Entity Resolution Pipeline

This project implements an entity resolution pipeline for the ACM and DBLP citation-network datasets.

Data Source

The pipeline starts from the two large text files listed below. Save both in the local data folder.

[DBLP-Citation-network V8]
[ACM-Citation-network V8]

(They can be found here)

Installation

In addition to the Python packages listed in requirements.txt, the following prerequisites are required:

Spark version 3.5.0
Scala version 2.12.x
GraphFrames version 0.8.3-spark3.5-s_2.12

cd path-to-this-project
pip install -e .

Samples

part1/2/3

from erp import part1, part2, part3
part1() # cleaned data stored in "data"
part2() # results of all methods stored in "method_results.csv"
part3() # scalability test

Quick Start

Preparing Data: Run erp.preparing.prepare_data("path_to_txt_file") once for each of the two text files. It cleans the data and extracts the relevant records (citations from 1995 to 2004 published at "SIGMOD" or "VLDB" venues). The resulting CSV files are stored in the data folder.
Running Pipeline:
- Local Version: Run erp.ER_pipeline(databasefilename1, databasefilename2, ERconfiguration, baseline=False, cluster=True, matched_output="path-to-output-file", cluster_output="path-to-output-file", isdp=False) (in erp/main.py)
- DP Version: Run erp.ER_pipeline(databasefilename1, databasefilename2, ERconfiguration, baseline=False, cluster=True, matched_output="path-to-output-file", cluster_output="path-to-output-file", isdp=True) (it calls ER_pipeline_dp in erp/dperp.py)
Configuration Options:
- blocking_method (String): Methods to reduce execution time {"Year", "TwoYear", "numAuthors", "FirstLetterTitle", "LastLetterTitle", "FirstOrLastLetterTitle", "authorLastName", "commonAuthors", "commonAndNumAuthors"}.
- matching_method (String): Algorithms for entity matching {"Jaccard", "Combined"}.
- clustering_method (String): Algorithm for clustering {"basic"}.
- threshold (float): A value between 0.0 and 1.0 used as the matching similarity threshold.
- output_filename (String): Path and filename of clustering results to be saved.
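
To make the interplay of blocking_method, matching_method, and threshold concrete, here is a minimal, self-contained sketch. It is not the erp implementation: the record layout, the helper names (block_by_year, jaccard, match_pairs), the whitespace tokenization of titles, and the choice to keep pairs scoring at or above the threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase whitespace tokens of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def block_by_year(records):
    """Group records by publication year (an illustrative 'Year' blocking)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record["year"]].append(record)
    return blocks


def match_pairs(left, right, threshold=0.7):
    """Compare only records sharing a block; keep pairs scoring >= threshold."""
    left_blocks, right_blocks = block_by_year(left), block_by_year(right)
    matches = []
    for year in left_blocks.keys() & right_blocks.keys():
        for a, b in product(left_blocks[year], right_blocks[year]):
            score = jaccard(a["title"], b["title"])
            if score >= threshold:
                matches.append((a["id"], b["id"], score))
    return sorted(matches)


# Toy records; the field names are hypothetical.
acm = [
    {"id": "a1", "year": 1999, "title": "Query Optimization in Data Streams"},
    {"id": "a2", "year": 2001, "title": "Indexing Moving Objects"},
]
dblp = [
    {"id": "d1", "year": 1999, "title": "query optimization in data streams"},
    {"id": "d2", "year": 2000, "title": "Indexing Moving Objects"},
]

pairs = match_pairs(acm, dblp, threshold=0.7)
print(pairs)
```

In this toy data, a2 and d2 have identical titles but fall into different year blocks, so they are never compared. That is the recall cost that blocking trades for a smaller number of comparisons.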

Selected Configuration

ERconfiguration:

{
  "matching_method": "Combined",
  "blocking_method": "FirstOrLastLetterTitle",
  "clustering_method": "basic",
  "threshold": 0.7,
  "output_filename": "clustering_results_local.csv"
}
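
As a usage sketch, the configuration above can be written as a Python dictionary and checked against the values listed under Configuration Options before it is passed to erp.ER_pipeline. The validate_config helper is hypothetical and not part of erp. The commented call at the end assumes the package is installed and uses placeholder file names.

```python
# Allowed values, copied from the Configuration Options list.
BLOCKING_METHODS = {
    "Year", "TwoYear", "numAuthors", "FirstLetterTitle", "LastLetterTitle",
    "FirstOrLastLetterTitle", "authorLastName", "commonAuthors",
    "commonAndNumAuthors",
}
MATCHING_METHODS = {"Jaccard", "Combined"}
CLUSTERING_METHODS = {"basic"}


def validate_config(config: dict) -> dict:
    """Hypothetical helper: raise ValueError for undocumented option values."""
    checks = [
        ("blocking_method", BLOCKING_METHODS),
        ("matching_method", MATCHING_METHODS),
        ("clustering_method", CLUSTERING_METHODS),
    ]
    for key, allowed in checks:
        if config[key] not in allowed:
            raise ValueError(f"{key}={config[key]!r} is not one of {sorted(allowed)}")
    if not 0.0 <= float(config["threshold"]) <= 1.0:
        raise ValueError("threshold must lie between 0.0 and 1.0")
    return config


ERconfiguration = validate_config({
    "matching_method": "Combined",
    "blocking_method": "FirstOrLastLetterTitle",
    "clustering_method": "basic",
    "threshold": 0.7,
    "output_filename": "clustering_results_local.csv",
})

# With erp installed, the local run would then look like this (placeholder paths):
# import erp
# erp.ER_pipeline("path-to-csv-1", "path-to-csv-2", ERconfiguration,
#                 baseline=False, cluster=True,
#                 matched_output="path-to-output-file",
#                 cluster_output="path-to-output-file", isdp=False)
```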
