libb.fuzzymerge
- fuzzymerge(df1, df2, right_on, left_on, usedtype='uint8', scorer='WRatio', concat_value=True, **kwargs)
Merge two DataFrames using fuzzy matching on specified columns.
Useful for matching records whose key values differ by small variations such as typos or abbreviations.
- Parameters:
df1 (DataFrame) – First DataFrame to merge.
df2 (DataFrame) – Second DataFrame to merge.
right_on (str) – Column name in df2 for matching.
left_on (str) – Column name in df1 for matching.
usedtype – Data type for distance matrix (default: uint8).
scorer – Scoring function for fuzzy matching (default: WRatio).
concat_value (bool) – Add similarity scores column (default: True).
kwargs – Additional arguments for pandas.merge.
- Returns:
Merged DataFrame with fuzzy-matched rows.
- Return type:
DataFrame
Example:
>>> df1 = read_csv(
...     "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
... )
>>> df2 = df1.copy()
>>> df2 = concat([df2 for x in range(3)], ignore_index=True)
>>> df2.Name = (df2.Name + random.uniform(1, 2000, len(df2)).astype("U"))
>>> df1 = concat([df1 for x in range(3)], ignore_index=True)
>>> df1.Name = (df1.Name + random.uniform(1, 2000, len(df1)).astype("U"))
>>> df3 = fuzzymerge(df1, df2, right_on='Name', left_on='Name', usedtype=uint8, scorer=partial_ratio,
...                  concat_value=True)
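To make the idea of a fuzzy merge concrete, here is a minimal standard-library sketch of the general technique: score each left value against every right value, keep the best match, and attach the score. It is not libb's implementation. The names similarity, fuzzy_merge_sketch and min_score are hypothetical, the records are plain dicts rather than DataFrames, and difflib's ratio stands in for WRatio, so the scores will differ from what fuzzymerge reports.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (difflib ratio, an assumed stand-in for WRatio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def fuzzy_merge_sketch(left, right, left_on, right_on, min_score=0):
    """Pair each left record with its best-scoring right record.

    Hypothetical helper for illustration only; not libb's implementation.
    """
    merged = []
    for lrow in left:
        # Score this left value against every right value and keep the best one.
        best = max(right, key=lambda rrow: similarity(lrow[left_on], rrow[right_on]))
        score = similarity(lrow[left_on], best[right_on])
        if score >= min_score:
            # Suffix keys with _x / _y so columns from both sides stay distinct.
            row = {f"{k}_x": v for k, v in lrow.items()}
            row.update({f"{k}_y": v for k, v in best.items()})
            row["score"] = score
            merged.append(row)
    return merged


# Made-up sample records with small spelling differences between the two sides.
left = [{"Name": "Braund, Mr. Owen Harris"}, {"Name": "Heikkinen, Miss Laina"}]
right = [
    {"Name": "Heikkinen, Miss. Laina", "Fare": 7.925},
    {"Name": "Braund, Mr Owen Haris", "Fare": 7.25},
]

result = fuzzy_merge_sketch(left, right, left_on="Name", right_on="Name")
for row in result:
    print(row["Name_x"], "->", row["Name_y"], row["score"])
```

Because every left value is compared with every right value, the work grows with the product of the two input sizes. That is one reason a compact score type such as uint8, which comfortably holds scores in the 0-100 range, can matter for large inputs.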