flexmatcher.classify package

Submodules

flexmatcher.classify.charDistClassifier module

class flexmatcher.classify.charDistClassifier.CharDistClassifier

Bases: flexmatcher.classify.classifier.Classifier

Classify the data-point using counts of character types in the data.

The CharDistClassifier extracts seven simple features: the number of whitespace, digit, and alphabetic characters, the percentage of each, and the total number of characters. It then trains a logistic regression model on these features.
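The seven features described above could be computed roughly as follows. This is a minimal sketch, not the library's code: the function name `char_dist_features`, the feature order, and the use of fractions in [0, 1] rather than percentages are all assumptions made for illustration.

```python
def char_dist_features(value):
    """Return seven character-distribution features for one string value.

    Hypothetical helper: name and feature order are illustrative only.
    """
    text = str(value)
    total = len(text)
    n_space = sum(ch.isspace() for ch in text)
    n_digit = sum(ch.isdigit() for ch in text)
    n_alpha = sum(ch.isalpha() for ch in text)
    # Guard against empty strings so the fractions stay defined.
    denom = total if total > 0 else 1
    return [
        n_space, n_digit, n_alpha,                          # raw counts
        n_space / denom, n_digit / denom, n_alpha / denom,  # shares of the string
        total,                                              # overall length
    ]


features = char_dist_features("Room 42")
print(features[:3], features[6])  # [1, 2, 4] 7
```

Each value to be matched would yield one such row, and the rows stacked together would form the feature matrix that the logistic regression is trained on.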

Attributes:
    labels (ndarray): Vector storing the label of each data point.
    features (ndarray): Matrix storing the extracted features.
    clf (LogisticRegression): The classifier instance.
    num_classes (int): Number of classes/columns to match to.
    all_classes (ndarray): Sorted array of all possible classes.

fit(data)

Extracts features and labels from the data and fits a model.

Args:
    data (dataframe): Training data (values and their correct column).

predict(data)

Predict the class for new data.

Args:
    data (dataframe): Dataframe of values to predict the column for.

predict_proba_ordered(probs, classes)

Fills out the probability matrix with classes that were missing.

Args:
    probs (list): list of probabilities, output of predict_proba
    classes_ (ndarray): list of classes from clf.classes_
    all_classes (ndarray): list of all possible classes
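One way to picture this step: a model fitted on data that lacks some classes returns probability rows with fewer columns than there are possible classes, so the missing columns have to be filled in. The sketch below uses plain Python lists and a hypothetical function name, `fill_missing_classes`; filling unseen classes with probability 0 is an assumption, not something the documentation above states.

```python
def fill_missing_classes(probs, classes, all_classes):
    """Expand each probability row to cover every class in all_classes.

    probs: list of rows, one probability per entry of `classes`.
    classes: classes the fitted model actually saw (e.g. clf.classes_).
    all_classes: sorted list of every possible class.
    """
    column_of = {label: i for i, label in enumerate(classes)}
    ordered = []
    for row in probs:
        # Assumption: a class the model never saw gets probability 0.
        ordered.append([
            row[column_of[label]] if label in column_of else 0.0
            for label in all_classes
        ])
    return ordered


probs = [[0.9, 0.1], [0.2, 0.8]]
result = fill_missing_classes(probs, ["age", "name"], ["age", "city", "name"])
print(result)  # [[0.9, 0.0, 0.1], [0.2, 0.0, 0.8]]
```

The output rows line up column by column with `all_classes`, so probability matrices from models trained on different subsets of the data can be compared or stacked directly.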

predict_training(folds=5)

Do cross-validation and return probabilities for each data-point.

Args:
    folds (int): Number of folds used for prediction on training data.
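The idea of predicting on training data through cross-validation can be sketched as below: every data point receives probabilities from a model that did not see it during training. The helper names `kfold_slices` and `out_of_fold_predictions` are hypothetical, the folds are contiguous and unshuffled by assumption, and a simple class-frequency estimate stands in for the real logistic regression.

```python
from collections import Counter


def kfold_slices(n_samples, folds=5):
    """Yield (start, stop) bounds of contiguous, unshuffled test folds."""
    base, extra = divmod(n_samples, folds)
    start = 0
    for i in range(folds):
        # The first `extra` folds take one additional sample.
        stop = start + base + (1 if i < extra else 0)
        yield start, stop
        start = stop


def out_of_fold_predictions(labels, folds=5):
    """Give each point probabilities from a model that never saw it.

    Stand-in model: the class frequencies of the training part.
    """
    classes = sorted(set(labels))
    predictions = [None] * len(labels)
    for start, stop in kfold_slices(len(labels), folds):
        train = labels[:start] + labels[stop:]
        counts = Counter(train)
        size = max(len(train), 1)  # avoid division by zero on tiny inputs
        row = [counts[c] / size for c in classes]
        for i in range(start, stop):
            predictions[i] = row
    return classes, predictions


classes, preds = out_of_fold_predictions(["a", "a", "b", "b"], folds=2)
print(classes)  # ['a', 'b']
print(preds)    # [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
```

In this toy run the labels are sorted, so each training part contains only one class and every held-out point is predicted wrongly. That is the kind of situation the shuffling item in the module's Todo list is concerned with.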

flexmatcher.classify.classifier module

Implement classifier for FlexMatcher.

This module defines an interface for classifiers.

Todo:

Implement more relevant classifiers.
Implement simple rules (e.g., does data match a phone number?).
Shuffle data before k-fold cutting in predict_training.

class flexmatcher.classify.classifier.Classifier(data)

Bases: object

Define classifier interface for FlexMatcher.

fit(data)

Train based on the input training data.

predict(data)

Predict for unseen data.

predict_training(folds)

Predict the training data (using k-fold cross validation).
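To illustrate how a concrete classifier might fill in these three methods, here is a toy sketch. It does not use the real `Classifier` base class: `ClassifierSketch` and `MajorityClassifier` are hypothetical names, the constructor argument is omitted, and a list of `(value, label)` pairs stands in for the dataframe the real interface expects.

```python
from collections import Counter


class ClassifierSketch(object):
    """Hypothetical stand-in mirroring the three documented methods."""

    def fit(self, data):
        raise NotImplementedError

    def predict(self, data):
        raise NotImplementedError

    def predict_training(self, folds):
        raise NotImplementedError


class MajorityClassifier(ClassifierSketch):
    """Toy subclass: always predicts the most common training label."""

    def fit(self, data):
        # data: list of (value, label) pairs, standing in for a dataframe.
        self.data = list(data)
        counts = Counter(label for _, label in self.data)
        self.majority = counts.most_common(1)[0][0]
        return self

    def predict(self, data):
        return [self.majority for _ in data]

    def predict_training(self, folds=2):
        # Refit on everything except the held-out fold, then predict that fold.
        predictions = []
        n = len(self.data)
        for k in range(folds):
            start, stop = k * n // folds, (k + 1) * n // folds
            rest = self.data[:start] + self.data[stop:]
            model = MajorityClassifier().fit(rest)
            predictions.extend(model.predict(self.data[start:stop]))
        return predictions


training = [("1", "num"), ("2", "num"), ("a", "txt"), ("3", "num")]
clf = MajorityClassifier().fit(training)
print(clf.predict(["x", "y"]))  # ['num', 'num']
```

Keeping all classifiers behind the same three methods means calling code can train them, query them, and collect cross-validated training predictions without knowing which concrete classifier it is working with.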

flexmatcher.classify.nGramClassifier module

class flexmatcher.classify.nGramClassifier.NGramClassifier(ngram_range=(1, 1), analyzer='word', count=True, n_features=200)