Matching background

Data Matching

Advanced algorithms for record matching, record search and building a 360° view of the customer/citizen.

Why is it so hard?

Each domain has special challenges when trying to match records from different sources to the real-world entity or object.

Person naming

  • Spelling variations
  • Nicknames
  • Name change
  • Cultural differences

Address details

  • Spelling variations
  • Missing fields
  • Standardisation against official address lists

Date information

  • Approximate dates are often written as 01/MM/YY or 01/01/YY
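One way a comparison function might treat such placeholder dates is sketched below. The function name `date_similarity` and the scores 0.85 and 0.7 are illustrative assumptions, not values taken from Factil's system; the idea is only that a day of 01, or a day and month of 01/01, is read as "possibly unknown" rather than as a hard mismatch.

```python
from datetime import date

def date_similarity(a: date, b: date) -> float:
    """Compare two dates, tolerating 01/MM/YY and 01/01/YY placeholders.

    The partial scores (0.85, 0.7) are hypothetical, chosen for illustration.
    """
    if a == b:
        return 1.0
    if a.year != b.year:
        return 0.0
    # Same year: a date of 1 January may mean "only the year is known".
    a_year_only = a.month == 1 and a.day == 1
    b_year_only = b.month == 1 and b.day == 1
    if a_year_only or b_year_only:
        return 0.7
    # Same year and month: a day of 1 may mean "only the month is known".
    if a.month == b.month and (a.day == 1 or b.day == 1):
        return 0.85
    return 0.0

year_only = date_similarity(date(1980, 1, 1), date(1980, 6, 17))
month_only = date_similarity(date(1980, 6, 1), date(1980, 6, 17))
conflict = date_similarity(date(1980, 6, 2), date(1980, 6, 17))
```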

To meet these challenges, Factil has developed a highly accurate, machine-learning-powered data matching system.

The system uses purpose-built comparison algorithms across a wide vector of features derived from person, relationship and location information.
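As a rough illustration of what a vector of feature comparisons can look like, the sketch below compares two person records field by field. It is not Factil's algorithm: the field names are hypothetical, and the standard library's generic `difflib.SequenceMatcher` stands in for the purpose-built comparison algorithms mentioned above.

```python
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after case folding."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()

def compare_records(left: dict, right: dict) -> list:
    """Return one similarity value per feature for a pair of records."""
    return [
        string_similarity(left["given_name"], right["given_name"]),
        string_similarity(left["family_name"], right["family_name"]),
        string_similarity(left["address"], right["address"]),
        1.0 if left["birth_year"] == right["birth_year"] else 0.0,
    ]

# Hypothetical records showing spelling and abbreviation variations.
a = {"given_name": "Katherine", "family_name": "O'Brien",
     "address": "12 High Street", "birth_year": 1984}
b = {"given_name": "Kathryn", "family_name": "OBrien",
     "address": "12 High St", "birth_year": 1984}

features = compare_records(a, b)
```

Each element of `features` is one comparison result; a trained model then decides how much weight each comparison deserves.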

Machine Learning Data Matching

A machine learning data matching system has four main components:

Labelling

Carefully selected record pairs are labelled as 'match' or 'distinct' by human experts to create a training set for the system.

Training

Mathematical optimisation algorithms are used to compute the feature comparison weights that best reproduce the expert labels in the training set.
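The text does not say which model or optimiser is used, so the following is only a minimal sketch under an assumption: logistic regression fitted by plain gradient descent, with two made-up comparison features per labelled pair. The names `train` and `predict` and all data values are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x) -> float:
    """Probability that a pair with comparison features x is a match."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train(features, labels, lr=0.5, epochs=2000):
    """Fit weights by gradient descent on the logistic (log-loss) objective."""
    n = len(features[0])
    m = len(features)
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y
            for i in range(n):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias

# Hypothetical training set: [name similarity, address similarity] per pair,
# labelled 1 for 'match' and 0 for 'distinct'.
X = [[0.95, 0.90], [0.90, 1.00], [0.85, 0.80],
     [0.30, 0.20], [0.40, 0.10], [0.20, 0.35]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train(X, y)
```

After training, `predict(weights, bias, x)` returns a value near 1 for pairs that resemble the labelled matches and near 0 for pairs that resemble the labelled distinct pairs.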

Scoring

A large number of roughly similar record pairs are compared, and each pair is given a similarity score. Pairs with a similarity score above a threshold are marked as 'matched'.
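A minimal sketch of this step follows. It assumes that "roughly similar" pairs are found by grouping records on a shared key (here a hypothetical `postcode` field, a technique usually called blocking), and it uses a generic string ratio with an arbitrary threshold of 0.85; none of these choices come from the source.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records keyed by record id.
records = {
    1: {"name": "Jon Smith", "postcode": "2000"},
    2: {"name": "John Smith", "postcode": "2000"},
    3: {"name": "Mary Jones", "postcode": "2000"},
    4: {"name": "Peter Brown", "postcode": "3000"},
}

def candidate_pairs(recs):
    """Yield only pairs that share a postcode, instead of all possible pairs."""
    blocks = defaultdict(list)
    for rid, rec in recs.items():
        blocks[rec["postcode"]].append(rid)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

def similarity(a: dict, b: dict) -> float:
    """Stand-in score in [0, 1] based on the name field only."""
    return SequenceMatcher(None, a["name"].casefold(), b["name"].casefold()).ratio()

THRESHOLD = 0.85  # assumed value; in practice tuned against labelled data

scores = {(i, j): similarity(records[i], records[j])
          for i, j in candidate_pairs(records)}
matched = [pair for pair, score in scores.items() if score >= THRESHOLD]
```

Restricting comparison to candidate pairs keeps the number of comparisons manageable, at the cost of never scoring pairs that fall into different groups.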

Resolution

Each set of matched record pairs, clustered within a data source and linked between data sources, corresponds to a real-world entity or object and is given a unique entity id.
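One common way to turn matched pairs into entities is to treat them as edges of a graph and take its connected components. The sketch below does this with a union-find structure; the function name `resolve_entities`, the record id format and the sequential entity ids are assumptions made for illustration.

```python
def resolve_entities(record_ids, matched_pairs):
    """Assign one entity id to every group of records connected by matches."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    entity_of_root = {}
    result = {}
    for rid in record_ids:
        root = find(rid)
        if root not in entity_of_root:
            entity_of_root[root] = len(entity_of_root) + 1
        result[rid] = entity_of_root[root]
    return result

# Hypothetical ids in the form "source:record".
record_ids = ["crm:1", "crm:7", "billing:42", "billing:9"]
matched_pairs = [
    ("crm:1", "crm:7"),       # duplicate within one data source
    ("crm:7", "billing:42"),  # link between data sources
]

entity_ids = resolve_entities(record_ids, matched_pairs)
```

Here `crm:1`, `crm:7` and `billing:42` end up with the same entity id even though `crm:1` and `billing:42` were never matched directly, while `billing:9` gets an id of its own.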

Record Matching Accuracy

Measuring record matching accuracy is critical to success.

Two kinds of errors occur:

  • False positives: Two records that do not correspond to the same entity, but the system claims they are a match.
  • False negatives: Two records correspond to the same entity, but the system claims they are not a match.

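Precision and recall, each on a scale from 0 to 1, measure how well the system avoids false positives and false negatives, respectively. Precision is the fraction of claimed matches that are true matches; recall is the fraction of true matches that the system finds.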
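The two measures can be computed directly from a set of claimed matches and a set of known true matches. The sketch below uses small made-up sets of record id pairs; in practice the true matches would come from an expert-labelled sample.

```python
def precision_recall(predicted: set, actual: set):
    """Return (precision, recall) for predicted vs. true matching pairs."""
    true_pos = len(predicted & actual)    # claimed and correct
    false_pos = len(predicted - actual)   # claimed but wrong
    false_neg = len(actual - predicted)   # true matches the system missed
    precision = true_pos / (true_pos + false_pos) if predicted else 1.0
    recall = true_pos / (true_pos + false_neg) if actual else 1.0
    return precision, recall

# Hypothetical pairs of record ids.
predicted = {(1, 2), (3, 4), (5, 6), (7, 8)}
actual = {(1, 2), (3, 4), (5, 6), (9, 10), (11, 12)}

precision, recall = precision_recall(predicted, actual)
```

With these sets there are 3 true positives, 1 false positive and 2 false negatives, so precision is 3/4 and recall is 3/5.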

Factil applies advanced statistical sampling techniques to measure accuracy and guide record matching improvements.

Record search

Fast search for similar records

Given a search record, all matching records can be found through their shared unique entity id, and other similar records can be found from the scoring results, ranked by similarity score.
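A minimal sketch of such a lookup is shown below, assuming the resolution and scoring results are available as two in-memory dictionaries. The names `entity_of`, `pair_scores` and `search`, and all ids and scores, are hypothetical.

```python
# Hypothetical outputs of the resolution and scoring steps.
entity_of = {"A1": 1, "A2": 1, "B7": 1, "A3": 2, "B9": 3}
pair_scores = {
    ("A1", "A2"): 0.97,
    ("A2", "B7"): 0.93,
    ("A1", "A3"): 0.62,
    ("A1", "B9"): 0.41,
    ("A3", "B9"): 0.30,
}

def search(record_id: str):
    """Return (matching records, other similar records ranked by score)."""
    entity = entity_of[record_id]
    # Matches: every other record that shares the entity id.
    matches = sorted(r for r, e in entity_of.items()
                     if e == entity and r != record_id)
    # Similar: scored against the search record but resolved to another entity.
    similar = []
    for (a, b), score in pair_scores.items():
        if record_id in (a, b):
            other = b if a == record_id else a
            if entity_of[other] != entity:
                similar.append((other, score))
    similar.sort(key=lambda item: item[1], reverse=True)
    return matches, similar

matches, similar = search("A1")
```

A production system would use an index rather than scanning every scored pair, but the two-part result (exact entity matches, then ranked near-matches) is the same idea.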

Building a 360° view

Understand your customer/citizen better

Finding all duplicates within each data source and linking records between data sources provides the basis for creating a 360° view of the customer/citizen across your organisation.