libb.fuzzymerge

fuzzymerge(df1, df2, right_on, left_on, usedtype='uint8', scorer='WRatio', concat_value=True, **kwargs)

Merge two DataFrames using fuzzy matching on specified columns.

Rows are matched on the similarity of the values in the left_on and right_on columns rather than on exact equality, which is useful when the keys differ by small variations such as typos or abbreviations.

Parameters:
  • df1 (DataFrame) – First DataFrame to merge.

  • df2 (DataFrame) – Second DataFrame to merge.

  • right_on (str) – Column name in df2 for matching.

  • left_on (str) – Column name in df1 for matching.

  • usedtype – Data type for distance matrix (default: uint8).

  • scorer – Scoring function for fuzzy matching (default: WRatio).

  • concat_value (bool) – Add similarity scores column (default: True).

  • kwargs – Additional arguments for pandas.merge.
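
The usedtype default is easier to appreciate with numbers. The sketch below is not libb code: it assumes similarity scores are integers between 0 and 100 and that one score is stored per (left row, right row) pair, and it uses the standard-library array module as a stand-in for a NumPy dtype. The score values and the matrix dimensions are hypothetical.

```python
from array import array

# Assumption: similarity scores are integers in the range 0-100.
scores = [100, 96, 41, 0]

as_uint8 = array("B", scores)    # unsigned 8-bit integers, range 0-255
as_float64 = array("d", scores)  # 64-bit floating point

# Hypothetical score matrix: one entry per (left row, right row) pair.
n_left, n_right = 1_000, 1_000
bytes_uint8 = n_left * n_right * as_uint8.itemsize
bytes_float64 = n_left * n_right * as_float64.itemsize

print(f"uint8 matrix:   {bytes_uint8:,} bytes")
print(f"float64 matrix: {bytes_float64:,} bytes")
```

Under these assumptions an 8-bit type holds every possible score while using one eighth of the memory of 64-bit floats, which matters because the number of pairs grows with the product of the two row counts.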

Returns:

Merged DataFrame with fuzzy-matched rows.

Return type:

DataFrame

Example:

>>> from numpy import random, uint8
>>> from pandas import concat, read_csv
>>> from rapidfuzz.fuzz import partial_ratio  # assumed source of the scorer
>>> from libb import fuzzymerge
>>> df1 = read_csv(
...     "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
... )
>>> df2 = df1.copy()
>>> df2 = concat([df2 for _ in range(3)], ignore_index=True)
>>> # Append a random numeric suffix so the names no longer match exactly.
>>> df2.Name = df2.Name + random.uniform(1, 2000, len(df2)).astype("U")
>>> df1 = concat([df1 for _ in range(3)], ignore_index=True)
>>> df1.Name = df1.Name + random.uniform(1, 2000, len(df1)).astype("U")
>>> df3 = fuzzymerge(df1, df2, right_on='Name', left_on='Name', usedtype=uint8, scorer=partial_ratio,
...                  concat_value=True)
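
To make the idea of matching on similarity concrete, here is a minimal sketch using only the standard library. It is not how fuzzymerge is implemented: difflib.SequenceMatcher stands in for the scorer, plain lists stand in for the DataFrame columns, and the helper names similarity and fuzzy_pairs, the min_score threshold, and the sample names are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stand-in for a fuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def fuzzy_pairs(left, right, min_score=80):
    """Pair each left value with its best-scoring right value.

    Hypothetical helper: returns (left, right, score) tuples and drops
    pairs whose best score is below min_score.
    """
    if not right:
        return []
    pairs = []
    for l_val in left:
        # One row of the score matrix: l_val against every right value.
        scores = [similarity(l_val, r_val) for r_val in right]
        best = max(range(len(right)), key=scores.__getitem__)
        if scores[best] >= min_score:
            pairs.append((l_val, right[best], scores[best]))
    return pairs


left_names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]
right_names = [
    "Heikkinen, Miss Laina",
    "Braund, Mr Owen Haris",
    "Allen, Mr. William Henry",
]

matches = fuzzy_pairs(left_names, right_names)
for l_val, r_val, score in matches:
    print(f"{l_val!r} -> {r_val!r} (score {score})")
```

Each left value is compared with every right value, so the work grows with the product of the two lengths. The score kept with each pair plays the same role as the similarity column that concat_value=True is documented to add.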