flexmatcher package

Submodules

flexmatcher.flexmatcher module

Implement FlexMatcher.

This module is the main module of the FlexMatcher package and implements the FlexMatcher class.

Todo:
  • Extend the module to work with and without data or column names.
  • Allow users to add/remove classifiers.
  • Combine modules (i.e., create_training_data and training functions).
class flexmatcher.flexmatcher.FlexMatcher(dataframes, mappings, sample_size=300)

Bases: object

Match a given schema to the mediated schema.

The FlexMatcher learns to match an input schema to a mediated schema. The class treats pandas dataframes as databases and their column names as the schema. FlexMatcher learns to do schema matching by training on instances of dataframes and on how their columns are matched against the mediated schema.

Attributes:
  train_data (dataframe): Dataframe with 3 columns: the name of the column in the schema, the value under that column, and the name of the column in the mediated schema it was mapped to.
  col_train_data (dataframe): Dataframe with 2 columns: the name of the column in the schema and the name of the column in the mediated schema it was mapped to.
  data_src_num (int): Store the number of available data sources.
  classifier_list (list): List of classifiers used in the training.
  classifier_type (string): List containing the type of each classifier. Possible values are ‘column’ and ‘value’ classifiers.
  prediction_list (list): List of predictions on the training data produced by each classifier.
  weights (ndarray): A matrix where cell (i,j) captures how good the j-th classifier is at predicting if a column should match the i-th column (where columns are sorted by name) in the mediated schema.
  columns (list): The sorted list of column names in the mediated schema.

create_training_data(dataframes, mappings, sample_size)

Transform dataframes and mappings into training data.

The method uses the names of columns as well as the data under each column as its training data. It also replaces missing values with ‘NA’.

Args:
  dataframes (list): List of dataframes to train on.
  mappings (list): List of dictionaries mapping columns of dataframes to columns in the mediated schema.
  sample_size (int): The number of rows sampled from each dataframe for training.
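
The transformation described above can be sketched in plain Python. This is only an illustration, not the package's implementation: build_training_rows is a hypothetical helper, a dict of lists stands in for a pandas dataframe, and sampling without replacement (capped at the table size) is an assumption, since the documentation does not say how rows are sampled.

```python
import random


def build_training_rows(tables, mappings, sample_size, seed=0):
    """Flatten tables into training rows (hypothetical helper).

    `tables` is a list of dicts mapping column name -> list of values,
    a plain-Python stand-in for pandas dataframes.
    """
    rng = random.Random(seed)
    value_rows = []   # analogous to train_data (3 columns)
    column_rows = []  # analogous to col_train_data (2 columns)
    for table, mapping in zip(tables, mappings):
        n_rows = len(next(iter(table.values())))
        # Assumption: sample at most sample_size rows, without replacement.
        picked = rng.sample(range(n_rows), min(sample_size, n_rows))
        for name, values in table.items():
            target = mapping[name]
            column_rows.append((name, target))
            for i in picked:
                value = values[i]
                # Missing values are replaced with the string 'NA'.
                text = "NA" if value is None else str(value)
                value_rows.append((name, text, target))
    return value_rows, column_rows


# Made-up example data: two sources mapped to the same mediated schema.
tables = [
    {"rt": [8.5, None, 6.0], "title": ["Alien", "Heat", "Up"]},
    {"score": [7.1, 9.0], "movie": ["Jaws", None]},
]
mappings = [
    {"rt": "rating", "title": "movie_name"},
    {"score": "rating", "movie": "movie_name"},
]
value_rows, column_rows = build_training_rows(tables, mappings, sample_size=2)
```

Each source contributes one row per column to column_rows and one row per sampled cell to value_rows, which mirrors the 2-column and 3-column layouts listed under Attributes.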
make_prediction(data)

Map the schema of a given dataframe to the columns of the mediated schema.

The procedure runs each classifier and then uses the weights (learned by the meta-trainer) to combine the predictions of the classifiers.
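
One plausible reading of this combination step is a weighted sum per mediated column, followed by picking the highest score. The sketch below assumes that reading; combine_predictions and all the numbers are hypothetical, and the package may combine predictions differently (for example, by passing the weighted sum through another function).

```python
def combine_predictions(class_names, classifier_probs, weights):
    """Return the mediated column with the highest weighted score.

    classifier_probs[j][i]: probability classifier j gives to class i.
    weights[i][j]: how much classifier j is trusted for class i.
    """
    scores = []
    for i in range(len(class_names)):
        score = sum(
            weights[i][j] * classifier_probs[j][i]
            for j in range(len(classifier_probs))
        )
        scores.append(score)
    best = max(range(len(class_names)), key=scores.__getitem__)
    return class_names[best], scores


columns = ["movie_name", "rating"]  # mediated schema, sorted by name
classifier_probs = [
    [0.6, 0.4],  # made-up output of a column-name classifier
    [0.2, 0.8],  # made-up output of a value classifier
]
weights = [
    [0.9, 0.1],  # for movie_name, trust the name classifier more
    [0.3, 0.7],  # for rating, trust the value classifier more
]
predicted, scores = combine_predictions(columns, classifier_probs, weights)
```

Here the name classifier leans toward movie_name, but the value classifier is weighted more heavily for rating, so the combined prediction is rating.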

save_model(name)

Serialize the FlexMatcher object into a model file using Python’s pickle library.
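
For readers unfamiliar with pickle, the round trip looks roughly like the sketch below. It uses a SimpleNamespace as a hypothetical stand-in for a trained matcher, and the file name is made up; how save_model derives the actual file name from name is not stated in this documentation.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

# Hypothetical stand-in for a trained matcher object.
matcher = SimpleNamespace(
    columns=["movie_name", "rating"],
    weights=[[0.9, 0.1], [0.3, 0.7]],
)

# Made-up file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "matcher.model")

# Serialize the object to disk.
with open(path, "wb") as f:
    pickle.dump(matcher, f)

# Load it back; only unpickle files from sources you trust.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The restored object is a separate copy carrying the same attributes, so a saved model can be reused without retraining.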

train()

Train each classifier and the meta-classifier.

train_meta_learner()

Train the meta-classifier.

The meta-classifier is trained on the probabilities with which each classifier assigns each point to each column (or class). The learned weights suggest how good each classifier is at predicting a particular class.
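
To make the idea concrete, the sketch below fits weights for a single class with a small logistic regression trained by gradient descent. The choice of logistic regression, the helper fit_class_weights, and the data are all assumptions made for illustration; this documentation does not specify which model the meta-learner uses.

```python
import math


def fit_class_weights(features, labels, steps=2000, lr=0.5):
    """Fit per-classifier weights for one class (hypothetical helper).

    features[n][j]: probability classifier j gave to this class for point n.
    labels[n]: 1 if point n truly belongs to this class, else 0.
    """
    n_classifiers = len(features[0])
    w = [0.0] * n_classifiers
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_classifiers
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            # Gradient of the log loss with respect to z.
            error = 1.0 / (1.0 + math.exp(-z)) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        # Average the gradients and take one descent step.
        w = [wj - lr * g / len(features) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(features)
    return w


# Made-up data: classifier 0 separates the class well, classifier 1 does not.
features = [
    [0.9, 0.5], [0.8, 0.4], [0.7, 0.6],  # points that belong to the class
    [0.2, 0.5], [0.3, 0.6], [0.1, 0.4],  # points that do not
]
labels = [1, 1, 1, 0, 0, 0]
w = fit_class_weights(features, labels)
```

The informative classifier ends up with a much larger weight than the uninformative one. Repeating the fit for every mediated column would give one row of weights per column, matching the (i,j) layout of the weights attribute.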

Module contents