flexmatcher.classify package

Submodules

flexmatcher.classify.charDistClassifier module

class flexmatcher.classify.charDistClassifier.CharDistClassifier

Bases: flexmatcher.classify.classifier.Classifier

Classify the data-point using counts of character types in the data.

The CharDistClassifier extracts 7 simple features: the counts of whitespace, digit, and alphabetic characters, the percentage of each, and the total number of characters. It then trains a logistic regression model on these features.
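
The seven features described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the function name char_dist_features, the feature order, and the handling of empty strings are assumptions.

```python
def char_dist_features(value):
    """Return 7 character-distribution features for one value.

    Assumed order (not taken from the library): three counts,
    three percentages, then the total number of characters.
    """
    text = str(value)
    total = len(text)
    n_space = sum(ch.isspace() for ch in text)
    n_digit = sum(ch.isdigit() for ch in text)
    n_alpha = sum(ch.isalpha() for ch in text)
    # Assumption: an empty string yields zero percentages
    # instead of raising a division error.
    denom = max(total, 1)
    return [
        n_space, n_digit, n_alpha,
        n_space / denom, n_digit / denom, n_alpha / denom,
        total,
    ]
```

A value such as "ab 12" has one whitespace character, two digits, and two letters out of five characters, so the three percentages are 0.2, 0.4, and 0.4.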

Attributes:
labels (ndarray): Vector storing the labels of each data-point.
features (ndarray): Matrix storing the extracted features.
clf (LogisticRegression): The classifier instance.
num_classes (int): Number of classes/columns to match to.
all_classes (ndarray): Sorted array of all possible classes.
fit(data)

Extracts features and labels from the data and fits a model.

Args:
data (dataframe): Training data (values and their correct column).
predict(data)

Predict the class for new data.

Args:
data (dataframe): Dataframe of values to predict the column for.
predict_proba_ordered(probs, classes)

Fills out the probability matrix with classes that were missing.

Args:
probs (list): List of probabilities, output of predict_proba.
classes_ (ndarray): List of classes from clf.classes_.
all_classes (ndarray): List of all possible classes.
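
One way to picture the filling step: when the model was trained on only some of the classes, its probability rows have fewer columns than all_classes, and the missing columns need a value. The sketch below is a plain-Python illustration under assumptions, not the library's code. The function name fill_missing_classes is hypothetical, and using 0.0 for unseen classes is an assumption.

```python
def fill_missing_classes(probs, classes_, all_classes):
    """Expand probability rows so they have one column per class.

    probs: rows of probabilities, one column per entry of classes_.
    classes_: classes the fitted model actually saw.
    all_classes: every possible class, in the desired column order.
    """
    column_of = {label: i for i, label in enumerate(all_classes)}
    ordered = []
    for row in probs:
        # Assumption: classes the model never saw get probability 0.0.
        full_row = [0.0] * len(all_classes)
        for label, p in zip(classes_, row):
            full_row[column_of[label]] = p
        ordered.append(full_row)
    return ordered
```

For example, with all_classes ["a", "b", "c"] and a model that only saw "a" and "c", the row [0.3, 0.7] becomes [0.3, 0.0, 0.7].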
predict_training(folds=5)

Do cross-validation and return probabilities for each data-point.

Args:
folds (int): Number of folds used for prediction on training data.
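
The idea behind predicting on training data with cross-validation is that each data-point is predicted by a model that never saw it. A minimal sketch, assuming contiguous (unshuffled) folds and a hypothetical fit_predict callable that trains on one set of rows and returns one prediction per held-out row; neither name comes from FlexMatcher.

```python
def kfold_slices(n_samples, folds=5):
    """Return (start, stop) bounds of contiguous, near-equal test folds."""
    base, extra = divmod(n_samples, folds)
    bounds, start = [], 0
    for k in range(folds):
        # The first `extra` folds take one additional sample.
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def out_of_fold(rows, folds, fit_predict):
    """Predict every row using a model trained on the other folds.

    fit_predict(train_rows, test_rows) is a hypothetical callable that
    returns one prediction per test row.
    """
    predictions = [None] * len(rows)
    for start, stop in kfold_slices(len(rows), folds):
        train = rows[:start] + rows[stop:]
        test = rows[start:stop]
        predictions[start:stop] = fit_predict(train, test)
    return predictions
```

Because the folds here are contiguous, data sorted by class can leave a fold with classes missing from its training portion, which is the situation predict_proba_ordered compensates for.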

flexmatcher.classify.classifier module

Implement classifier for FlexMatcher.

This module defines an interface for classifiers.

Todo:
  • Implement more relevant classifiers.
  • Implement simple rules (e.g., does data match a phone number?).
  • Shuffle data before k-fold cutting in predict_training.
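
The last Todo item could be approached along these lines. This is a hedged sketch of shuffling indices before cutting folds, not planned or existing FlexMatcher code; the function name shuffled_folds and the seed parameter are assumptions.

```python
import random


def shuffled_folds(n_samples, folds=5, seed=0):
    """Shuffle sample indices, then deal them into `folds` groups."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list gives near-equal fold sizes.
    return [indices[k::folds] for k in range(folds)]
```

Each index lands in exactly one fold, so every data-point is still predicted exactly once during cross-validation.
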
class flexmatcher.classify.classifier.Classifier(data)

Bases: object

Define classifier interface for FlexMatcher.

fit(data)

Train based on the input training data.

predict(data)

Predict for unseen data.

predict_training(folds)

Predict the training data (using k-fold cross validation).
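
To show what implementing this interface might look like, the sketch below defines a stand-in base class with the same three methods and a trivial subclass. It does not import FlexMatcher. The stand-in omits the data constructor argument, takes a list of (value, label) pairs instead of a dataframe, and MajorityClassifier is a hypothetical name; all of these are simplifying assumptions.

```python
from collections import Counter


class Classifier:
    """Stand-in mirroring the three documented methods (assumed shape)."""

    def fit(self, data):
        raise NotImplementedError

    def predict(self, data):
        raise NotImplementedError

    def predict_training(self, folds):
        raise NotImplementedError


class MajorityClassifier(Classifier):
    """Hypothetical classifier that always predicts the most common label."""

    def fit(self, data):
        # Assumption: data is a list of (value, label) pairs.
        self.rows = list(data)
        counts = Counter(label for _, label in self.rows)
        self.majority = counts.most_common(1)[0][0]
        return self

    def predict(self, data):
        return [self.majority for _ in data]

    def predict_training(self, folds):
        n = len(self.rows)
        out = [None] * n
        for k in range(folds):
            held_out = set(range(k, n, folds))
            train = [r for i, r in enumerate(self.rows) if i not in held_out]
            # Refit on the remaining rows so held-out rows stay unseen.
            model = MajorityClassifier().fit(train)
            for i in held_out:
                out[i] = model.majority
        return out
```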

flexmatcher.classify.nGramClassifier module

class flexmatcher.classify.nGramClassifier.NGramClassifier(ngram_range=(1, 1), analyzer='word', count=True, n_features=200)

Bases: flexmatcher.classify.classifier.Classifier

Classify data-points using counts of word or character n-grams.

The NGramClassifier uses n-grams of words or characters (based on user preference) and extracts count features or binary features (based on user preference) to train a classifier. It uses a LogisticRegression classifier as its training model.
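
The difference between word and character n-grams, and between count and binary features, can be illustrated without any library. The sketch below is a simplification under assumptions: the function name ngram_features is hypothetical, the parameters only imitate the constructor arguments above, and whitespace splitting stands in for whatever tokenization the real vectorizer performs.

```python
from collections import Counter


def ngram_features(text, ngram_range=(1, 1), analyzer="word", count=True):
    """Map each n-gram in `text` to its count, or to 1 if count is False."""
    # Assumption: words are split on whitespace; chars are taken as-is.
    tokens = text.split() if analyzer == "word" else list(text)
    joiner = " " if analyzer == "word" else ""
    low, high = ngram_range
    grams = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams[joiner.join(tokens[i:i + n])] += 1
    if not count:
        # Binary features record presence only.
        return {gram: 1 for gram in grams}
    return dict(grams)
```

With ngram_range=(1, 2) the text "new york new" yields the unigrams "new" (twice) and "york" plus the bigrams "new york" and "york new". With count=False the repeated "new" is recorded as 1.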

Attributes:
labels (ndarray): Vector storing the labels of each data-point.
features (ndarray): Matrix storing the extracted features.
vectorizer (object): Vectorizer for transforming text to features. It is either a CountVectorizer or a HashingVectorizer.
clf (LogisticRegression): The classifier instance.
num_classes (int): Number of classes/columns to match to.
all_classes (ndarray): Sorted array of all possible classes.
fit(data)
Args:
data (dataframe): Training data (values and their correct column).
predict(data)

Predict the class for new data.

Args:
data (dataframe): Dataframe of values to predict the column for.
predict_proba_ordered(probs, classes)

Fills out the probability matrix with classes that were missing.

Args:
probs (list): List of probabilities, output of predict_proba.
classes_ (ndarray): List of classes from clf.classes_.
all_classes (ndarray): List of all possible classes.
predict_training(folds=5)

Do cross-validation and return probabilities for each data-point.

Args:
folds (int): Number of folds used for prediction on training data.

Module contents