flexmatcher.classify package
Submodules
flexmatcher.classify.charDistClassifier module
class flexmatcher.classify.charDistClassifier.CharDistClassifier
Bases: flexmatcher.classify.classifier.Classifier
Classifies data points using counts of character types in the data.
The CharDistClassifier extracts 7 simple features: the number of whitespace, digit, and alphabetical characters, the percentage of each, and the total number of characters. It then trains a logistic regression model on these features.
- Attributes:
- labels (ndarray): Vector storing the label of each data point.
- features (ndarray): Matrix storing the extracted features.
- clf (LogisticRegression): The classifier instance.
- num_classes (int): Number of classes/columns to match to.
- all_classes (ndarray): Sorted array of all possible classes.
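The seven features described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the function name char_dist_features, the order of the features, and the choice to express percentages as fractions between 0 and 1 are all assumptions.

```python
def char_dist_features(value):
    """Return 7 character-distribution features for a single value.

    Hypothetical helper: the feature order and the use of fractions
    (rather than 0-100 percentages) are assumptions for illustration.
    """
    text = str(value)
    total = len(text)
    n_space = sum(ch.isspace() for ch in text)
    n_digit = sum(ch.isdigit() for ch in text)
    n_alpha = sum(ch.isalpha() for ch in text)
    # Avoid division by zero for empty values.
    denom = total if total > 0 else 1
    return [
        n_space, n_digit, n_alpha,                          # raw counts
        n_space / denom, n_digit / denom, n_alpha / denom,  # shares
        total,                                              # total length
    ]


print(char_dist_features("ab 12"))  # [1, 2, 2, 0.2, 0.4, 0.4, 5]
```

Each value in a column would produce one such row, and the rows together would form the feature matrix that the logistic regression is trained on.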
fit(data)
Extracts features and labels from the data and fits a model.
- Args:
- data (dataframe): Training data (values and their correct column).
predict(data)
Predicts the class for new data.
- Args:
- data (dataframe): Dataframe of values to predict the column for.
predict_proba_ordered(probs, classes)
Fills out the probability matrix with classes that were missing.
- Args:
- probs (list): List of probabilities, the output of predict_proba.
- classes (ndarray): List of classes from clf.classes_.
- all_classes (ndarray): List of all possible classes.
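The idea of filling out a probability matrix can be sketched as follows. This is an illustration only: the function fill_missing_classes is hypothetical, it works on plain lists rather than ndarrays, and it assumes that classes the fitted model never saw receive a probability of zero.

```python
def fill_missing_classes(probs, classes, all_classes):
    """Expand each probability row to one column per entry in all_classes.

    Hypothetical sketch: classes absent from `classes` are assumed to
    get probability 0.0.
    """
    # Map each possible class label to its column in the full matrix.
    column_of = {label: i for i, label in enumerate(all_classes)}
    ordered = []
    for row in probs:
        full_row = [0.0] * len(all_classes)
        # Copy each known probability into the column of its class.
        for label, p in zip(classes, row):
            full_row[column_of[label]] = p
        ordered.append(full_row)
    return ordered


probs = [[0.7, 0.3]]                     # model only saw two classes
classes = ["age", "name"]                # e.g. the contents of clf.classes_
all_classes = ["age", "city", "name"]    # sorted list of every class
print(fill_missing_classes(probs, classes, all_classes))  # [[0.7, 0.0, 0.3]]
```

Aligning every classifier's output to the same column order in this way makes the probability matrices of different classifiers directly comparable.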
flexmatcher.classify.classifier module
Implement classifier for FlexMatcher.
This module defines an interface for classifiers.
- Todo:
- Implement more relevant classifiers.
- Implement simple rules (e.g., does data match a phone number?).
- Shuffle data before k-fold cutting in predict_training.
flexmatcher.classify.nGramClassifier module
class flexmatcher.classify.nGramClassifier.NGramClassifier(ngram_range=(1, 1), analyzer='word', count=True, n_features=200)
Bases: flexmatcher.classify.classifier.Classifier
Classifies data points using counts of word or character n-grams.
The NGramClassifier extracts n-grams of words or characters (chosen by the user) and turns them into either count features or binary features (also chosen by the user). It then trains a LogisticRegression classifier on these features.
- Attributes:
- labels (ndarray): Vector storing the label of each data point.
- features (ndarray): Matrix storing the extracted features.
- vectorizer (object): Vectorizer for transforming text to features. It is of type either CountVectorizer or HashingVectorizer.
- clf (LogisticRegression): The classifier instance.
- num_classes (int): Number of classes/columns to match to.
- all_classes (ndarray): Sorted array of all possible classes.
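The difference between word and character n-grams, and between count and binary features, can be sketched without any vectorizer. The function ngram_features below is hypothetical; its parameter names mirror the constructor signature, but the whitespace tokenization and the exact meaning given to each flag are assumptions, and n_features is not modelled.

```python
from collections import Counter


def ngram_features(text, ngram_range=(1, 1), analyzer="word", count=True):
    """Return a {ngram: value} mapping for one piece of text.

    Hypothetical sketch: real vectorizers tokenize and normalize text
    differently; this version only splits on whitespace or characters.
    """
    if analyzer == "word":
        tokens, joiner = text.split(), " "
    else:
        tokens, joiner = list(text), ""
    low, high = ngram_range
    grams = Counter()
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sequence.
        for start in range(len(tokens) - n + 1):
            grams[joiner.join(tokens[start:start + n])] += 1
    if count:
        return dict(grams)               # how often each n-gram occurs
    return {gram: 1 for gram in grams}   # only whether it occurs


print(ngram_features("new york new", ngram_range=(1, 2)))
# {'new': 2, 'york': 1, 'new york': 1, 'york new': 1}
print(ngram_features("aab", ngram_range=(2, 2), analyzer="char"))
# {'aa': 1, 'ab': 1}
```

Character n-grams tend to suit short, code-like values such as identifiers or phone numbers, while word n-grams suit free-text columns.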
predict(data)
Predicts the class for new data.
- Args:
- data (dataframe): Dataframe of values to predict the column for.
predict_proba_ordered(probs, classes)
Fills out the probability matrix with classes that were missing.
- Args:
- probs (list): List of probabilities, the output of predict_proba.
- classes (ndarray): List of classes from clf.classes_.
- all_classes (ndarray): List of all possible classes.