CoWWLa-EM: Confidence Weighting of Weakly Labelled data for Entity Matching

Tue 19.08 11:15 - 12:00

Abstract: Entity Matching (EM) is a key task in data integration that focuses on identifying records from different data sources that refer to the same real-world entity. Modern EM methods typically rely on supervised learning with large, high-quality labeled datasets, which are costly and labor-intensive to obtain. To overcome this, increased attention has been devoted to alternative supervision strategies. In this work, we propose leveraging weakly labeled record pairs, each accompanied by an explicit label-confidence score, for training a matcher model. Our approach introduces two simple yet effective modifications to the conventional supervised training pipeline, and also suggests a fully unsupervised pipeline for generating the required weak labels and confidence scores. The first adjustment, confidence-weighted sampling, resamples training data items based on their assigned confidence scores. The second adjustment is applied to the loss function, where we utilize the model's confidence to disregard possibly erroneous labels. We experimented with both simulated weak labels and our proposed labeling pipeline in a real-world setting, evaluating various configurations. Experimental results demonstrate that our approach consistently outperforms a baseline supervised approach, with performance gains becoming more pronounced as the quality of the weak labels decreases.
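
To make the first adjustment concrete, the following is a minimal sketch of what confidence-weighted sampling could look like. It is an illustration only, not the speaker's implementation: the abstract does not specify the sampling scheme, so drawing with replacement in proportion to the confidence score is an assumption, and the function name, record format, and parameters are hypothetical.

```python
import random


def confidence_weighted_resample(pairs, confidences, k, seed=0):
    """Draw k training items with replacement, proportionally to confidence.

    Assumption (not stated in the abstract): the probability of drawing an
    item is its confidence score divided by the sum of all scores.
    """
    if len(pairs) != len(confidences):
        raise ValueError("pairs and confidences must have the same length")
    if any(c < 0 for c in confidences) or sum(confidences) <= 0:
        raise ValueError("confidences must be non-negative with a positive sum")
    rng = random.Random(seed)  # local generator, so the draw is reproducible
    return rng.choices(pairs, weights=confidences, k=k)


# Hypothetical weakly labeled record pairs: (record_a, record_b, weak_label).
weak_pairs = [
    ("iPhone 13 128GB", "Apple iPhone 13 (128 GB)", 1),
    ("Galaxy S21", "Samsung Galaxy S21 5G", 1),
    ("ThinkPad X1", "MacBook Air M2", 0),
    ("Pixel 7", "Pixel 7 Pro case", 1),
]
# Hypothetical label-confidence scores; the last label is fully distrusted.
scores = [0.95, 0.80, 0.90, 0.0]

resampled = confidence_weighted_resample(weak_pairs, scores, k=1000)
```

In this sketch, a pair with confidence 0 is never drawn, and high-confidence pairs appear more often in the resampled training set than low-confidence ones.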

Speaker

Yair Zissu

Technion

  • Advisors: Avigdor Gal

  • Academic Degree: M.Sc.