revscoring.datasources.meta

Meta-datasources are classes that extend Datasource and implement common operations on other Datasource instances.

dicts

These meta-datasources operate on revscoring.Datasource instances that return dicts.

class revscoring.datasources.meta.dicts.keys(dict_datasource, name=None)

Generates a set of dict keys

Parameters:
dict_datasource : revscoring.Datasource

A datasource that generates a dict

name : str

A name for the new datasource.

class revscoring.datasources.meta.dicts.values(dict_datasource, name=None)

Generates a list of dict values

Parameters:
dict_datasource : revscoring.Datasource

A datasource that generates a dict

name : str

A name for the new datasource.
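As a rough illustration of the values these two meta-datasources are documented to generate, here is a plain-Python sketch. It does not use revscoring, and the function names dict_keys and dict_values are hypothetical.

```python
def dict_keys(d):
    """Return the keys of a dict as a set."""
    return set(d.keys())


def dict_values(d):
    """Return the values of a dict as a list."""
    return list(d.values())


table = {"a": 1, "b": 2}
print(dict_keys(table))
print(dict_values(table))
```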

extractors

These meta-datasources operate on revscoring.Datasource instances that return a str or a list of str and extract information from them.

class revscoring.datasources.meta.extractors.regex(regexes, text_datasource, regex_flags=re.IGNORECASE, wrapping=('\b', '\b'), exclusions=None, name=None)

Generates a list of strings that match any of a set of provided regexes

Parameters:
regexes : list ( str )

A list of regexes to find in the text

text_datasource : revscoring.Datasource

A datasource that returns a str or a list of str

regex_flags : int

A set of regex flags to use in matching

wrapping : ( str, str )

Wrap all regexes with these values. This is useful for languages that have word boundaries.

name : str

A name for the new datasource
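The effect of the wrapping and regex_flags parameters can be sketched with the standard re module. This is a minimal sketch, not revscoring's implementation: the function name extract_matches is hypothetical, and joining the regexes into a single alternation is an assumption about how they are combined.

```python
import re


def extract_matches(regexes, text, regex_flags=re.IGNORECASE,
                    wrapping=(r"\b", r"\b")):
    """Find every substring that matches any of `regexes`."""
    start, end = wrapping if wrapping is not None else ("", "")
    # Assumption: the alternatives are joined into one non-capturing group
    # and the wrapping patterns are applied around the whole group.
    pattern = re.compile(start + "(?:" + "|".join(regexes) + ")" + end,
                         regex_flags)
    # Accept either a single str or a list of str.
    texts = [text] if isinstance(text, str) else text
    return [m.group(0) for t in texts for m in pattern.finditer(t)]


print(extract_matches([r"foo+", r"bar"], "Foo bar foobar BAR"))
```

With the default word-boundary wrapping, "foobar" yields no match because neither alternative is delimited by a word boundary on both sides.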

filters

These meta-datasources operate on revscoring.Datasource instances that return lists and produce sub-lists.

class revscoring.datasources.meta.filters.filter(include, items_datasource, inverse=False, name=None)

Generates a filtered list of items

Parameters:
include : func

A function that returns True when an item should be included

items_datasource : revscoring.Datasource

A datasource that generates a list of items

name : str

A name for the datasource.
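The filtering step can be sketched in plain Python as follows. The names filter_items and is_even are hypothetical, and the reading of inverse (keep the items that include rejects) is an assumption, since the parameter is not described above.

```python
def filter_items(include, items, inverse=False):
    """Keep the items for which include(item) is True."""
    # Assumption: inverse=True flips the test, so the items that
    # include() rejects are kept instead.
    return [item for item in items if bool(include(item)) != inverse]


def is_even(n):
    return n % 2 == 0


print(filter_items(is_even, [1, 2, 3, 4]))
print(filter_items(is_even, [1, 2, 3, 4], inverse=True))
```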

class revscoring.datasources.meta.filters.regex_matching(regex, strs_datasource, name=None)

Generates a filtered list of the strings that match a regular expression

Parameters:
regex : str | compiled re

A regular expression to match (case-insensitive if a str is provided)

strs_datasource : revscoring.Datasource

A datasource that generates a list of str

name : str

A name for the datasource.
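The "case-insensitive if a str is provided" rule can be sketched like this. The function name matching_strs is hypothetical, and anchoring the pattern at the start of each string with re.match (rather than searching anywhere with re.search) is an assumption.

```python
import re


def matching_strs(regex, strs):
    """Keep the strings that match `regex`."""
    # A plain str pattern is compiled case-insensitively; an already
    # compiled pattern is used exactly as given.
    if isinstance(regex, str):
        regex = re.compile(regex, re.IGNORECASE)
    # Assumption: the match is anchored at the start of each string.
    return [s for s in strs if regex.match(s)]


print(matching_strs(r"foo", ["Foo", "bar", "FOOD"]))
print(matching_strs(re.compile(r"foo"), ["Foo", "foo"]))
```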

class revscoring.datasources.meta.filters.positive(numbers_datasource, name=None)

Generates a filtered list of positive numbers from a list of numbers.

Parameters:
numbers_datasource : revscoring.Datasource

A datasource that generates a list of numbers, from which the positive ones are kept

name : str

A name for the datasource.

class revscoring.datasources.meta.filters.negative(numbers_datasource, name=None)

Generates a filtered list of negative numbers from a list of numbers.

Parameters:
numbers_datasource : revscoring.Datasource

A datasource that generates a list of numbers, from which the negative ones are kept

name : str

A name for the datasource.

frequencies

These meta-datasources operate on revscoring.Datasource instances that return lists of items and produce frequency tables.

class revscoring.datasources.meta.frequencies.table(items_datasource, name=None)

Generates a frequency table for a list of items generated by another datasource.

Parameters:
items_datasource : revscoring.Datasource

A datasource that generates a list of some hashable item

name : str

A name for the datasource.

class revscoring.datasources.meta.frequencies.delta(old_ft_datasource, new_ft_datasource, name=None)

Generates a frequency table diff by comparing two frequency tables.

Parameters:
old_ft_datasource : revscoring.Datasource

A frequency table datasource

new_ft_datasource : revscoring.Datasource

A frequency table datasource

name : str

A name for the datasource.
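A frequency table and the diff between two tables can be sketched in plain Python as below. The function names are hypothetical, and leaving unchanged items out of the diff is an assumption rather than documented behavior.

```python
from collections import Counter


def frequency_table(items):
    """Count how often each hashable item occurs."""
    return dict(Counter(items))


def frequency_delta(old_ft, new_ft):
    """Return new count minus old count for each item that changed."""
    delta = {}
    for item in old_ft.keys() | new_ft.keys():
        change = new_ft.get(item, 0) - old_ft.get(item, 0)
        # Assumption: items whose count is unchanged are omitted.
        if change != 0:
            delta[item] = change
    return delta


old = frequency_table(["a", "b", "b"])
new = frequency_table(["b", "c", "c"])
print(frequency_delta(old, new))
```

Here "a" was removed once, "b" was removed once, and "c" was added twice.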

class revscoring.datasources.meta.frequencies.prop_delta(old_ft_datasource, delta_datasource, name=None)

Generates a proportional frequency table diff by comparing a frequency table diff with an old frequency table.

Parameters:
old_ft_datasource : revscoring.Datasource

A frequency table datasource

delta_datasource : revscoring.Datasource

A frequency table diff datasource

name : str

A name for the datasource.
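Following the signature, which takes an old frequency table and a frequency table diff, a proportional diff might be computed as sketched here. The function name proportional_delta is hypothetical, and dividing by 1 for items absent from the old table is an assumption made to avoid division by zero.

```python
def proportional_delta(old_ft, delta):
    """Scale each change by the item's count in the old table."""
    # Assumption: an item that is new (absent from old_ft) is divided by 1.
    return {item: change / old_ft.get(item, 1)
            for item, change in delta.items()}


old_ft = {"a": 1, "b": 2}
delta = {"a": -1, "b": -1, "c": 2}
print(proportional_delta(old_ft, delta))
```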

class revscoring.datasources.meta.frequencies.positive(table_datasource, name=None)

Filters a table (counts, delta, prop_delta, etc.) for positive values.

Parameters:
table_datasource : revscoring.Datasource

A frequency table datasource

name : str

A name for the datasource.

class revscoring.datasources.meta.frequencies.negative(table_datasource, absolute=False, name=None)

Filters a table (counts, delta, prop_delta, etc.) for negative values.

Parameters:
table_datasource : revscoring.Datasource

A frequency table datasource

absolute : bool

Make negative values positive

name : str

A name for the datasource.
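The two table filters, including the absolute option, can be sketched as dict comprehensions. The function names positive_values and negative_values are hypothetical; this is an illustration, not the library's code.

```python
def positive_values(table):
    """Keep the entries whose value is greater than zero."""
    return {key: value for key, value in table.items() if value > 0}


def negative_values(table, absolute=False):
    """Keep the entries whose value is less than zero.

    With absolute=True the kept values are made positive.
    """
    return {key: (abs(value) if absolute else value)
            for key, value in table.items() if value < 0}


delta = {"a": -1, "b": 0, "c": 2}
print(positive_values(delta))
print(negative_values(delta))
print(negative_values(delta, absolute=True))
```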

gramming

These meta-datasources operate on revscoring.Datasource instances that return a list of strings (i.e. “tokens”) and produce a list of ngram/skipgram sequences.

class revscoring.datasources.meta.gramming.gram(items_datasource, grams=[(0,)], name=None)

Converts a sequence of items into ngrams.

Parameters:
items_datasource : revscoring.Datasource

A datasource that generates a list of some item

grams : list ( tuple ( int ) )

A list of ngram and/or skipgram sequences to produce

name : str

A name for the datasource.
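Each tuple in grams can be read as a set of offsets from the current position: (0, 1) is a bigram and (0, 2) is a skipgram that skips one item. The sketch below illustrates that reading. The function name gramify is hypothetical, and emitting unigrams as 1-tuples is an assumption (the library may return the bare item instead).

```python
def gramify(items, grams=((0,),)):
    """Build one tuple per start position and offset pattern that fits."""
    result = []
    for i in range(len(items)):
        for offsets in grams:
            # Skip patterns that would run past the end of the sequence.
            if i + max(offsets) < len(items):
                result.append(tuple(items[i + o] for o in offsets))
    return result


tokens = ["a", "b", "c"]
print(gramify(tokens))
print(gramify(tokens, grams=[(0, 1), (0, 2)]))
```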

hashing

These meta-datasources operate on revscoring.Datasource instances that return a list of items and produce a list of hashes (int).

class revscoring.datasources.meta.hashing.hash(items_datasource, n=1048576, name=None)

Converts a sequence of items into a sequence of portable hashes (int) based on the result of applying str(). E.g. str(["foo"]) = "['foo']"

Parameters:
items_datasource : revscoring.Datasource

A datasource that generates a list of items to be hashed

n : int

The number of potential hashes that can be produced

name : str

A name for the datasource.
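"Portable" matters here because Python's built-in hash() is randomized per process for strings, so its output cannot be used as a stable feature index. The sketch below swaps in MD5 from hashlib purely for illustration; the hash function revscoring actually uses is not stated above, and the names portable_hash and hash_items are hypothetical.

```python
import hashlib


def portable_hash(item, n=2 ** 20):
    """Map str(item) to a stable integer in the range [0, n)."""
    # Assumption: MD5 stands in for whatever portable hash the library uses.
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def hash_items(items, n=2 ** 20):
    """Hash every item in a sequence."""
    return [portable_hash(item, n) for item in items]


print(hash_items(["foo", ("a", "b")]))
```

The default of 2 ** 20 equals the n=1048576 shown in the signature; a smaller n makes collisions between distinct items more likely.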

indexable

These meta-datasources operate on revscoring.Datasource instances that return lists and tuples.

class revscoring.datasources.meta.indexable.index(i, datasources, default=NotImplemented, name=None)

Generates a datasource that returns the value that appears at i

Parameters:
i : int

The index of a value to return

default : mixed

The value to return if no value exists at i. If not specified, an IndexError will be raised

name : str

A name for the new datasource.
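The default=NotImplemented convention can be sketched as follows: NotImplemented acts as a sentinel for "no default given", which leaves None available as a real default value. The function name value_at is hypothetical.

```python
def value_at(i, values, default=NotImplemented):
    """Return values[i], or `default` when the index does not exist."""
    try:
        return values[i]
    except IndexError:
        # No default was supplied, so let the IndexError propagate.
        if default is NotImplemented:
            raise
        return default


print(value_at(1, ["a", "b"]))
print(value_at(5, ["a", "b"], default=None))
```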

mappers

These meta-datasources operate on revscoring.Datasource instances that return lists and apply a specific function to each item.

class revscoring.datasources.meta.mappers.map(apply, items_datasource, name=None)

Returns a revscoring.Datasource that applies a function over a set of items generated by another datasource.

Parameters:
apply : func

A function to apply to each item generated by items_datasource

items_datasource : revscoring.Datasource

A datasource that generates a list of some item

name : str

A name for the datasource.

class revscoring.datasources.meta.mappers.lower_case(strs_datasource, name=None)

Returns a revscoring.Datasource that lower cases a list of str returned by another datasource.

Parameters:
strs_datasource : revscoring.Datasource

A datasource that generates a list of str

name : str

A name for the datasource.

class revscoring.datasources.meta.mappers.derepeat(strs_datasource, name=None)

Returns a revscoring.Datasource that collapses each run of a repeated character to a single character in every str of a list (e.g. “foo” -> “fo”).

Parameters:
strs_datasource : revscoring.Datasource

A datasource that generates a list of str

name : str

A name for the datasource.
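Removing repeated characters can be sketched with a backreference regex. This is an illustration under stated assumptions, not the library's code: the names REPEATED and derepeat_strs are hypothetical, and the comparison is assumed to be case-sensitive.

```python
import re

# One character followed by one or more copies of itself.
REPEATED = re.compile(r"(.)\1+")


def derepeat_strs(strs):
    """Collapse each run of a repeated character to a single character."""
    return [REPEATED.sub(r"\1", s) for s in strs]


print(derepeat_strs(["foo", "heeellooo", "abc"]))
```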

class revscoring.datasources.meta.mappers.abs(numbers_datasource, name=None)

Returns a revscoring.Datasource that converts a list of numeric values into a list of absolute numeric values.

Parameters:
numbers_datasource : revscoring.Datasource

A datasource that generates a list of numeric values

name : str

A name for the datasource.

selectors

These meta-datasources operate on revscoring.Datasource instances that return a flat dict of key-value pairs (aka a “table”) and filter (“select”) keys and/or weight values.

class revscoring.datasources.meta.selectors.tfidf(table_datasource, max_terms=None, weight=True, boolean=False, name=None)

Selects a subset of a frequency table based on term utility and applies TF-IDF weighting.

Parameters:
table_datasource : revscoring.Datasource

A datasource that generates a dict of term frequency counts

max_terms : int

The maximum number of terms that will be selected. The terms with the highest proportional representation in a label class are selected.

weight : bool

Should TF-IDF weighting be applied to output counts?

boolean : bool

Normalize counts to 0 (not in document) and 1 (in document). Note that negative frequencies will be converted to -1.

name : str

A name for the datasource.
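The boolean option amounts to taking the sign of each count, which is why negative frequencies become -1. The sketch below covers only that normalization step, not term selection or weighting, and the function name booleanize is hypothetical.

```python
def booleanize(table):
    """Replace each count with its sign: 1, 0, or -1."""
    # (count > 0) - (count < 0) evaluates to the sign as an int.
    return {term: (count > 0) - (count < 0)
            for term, count in table.items()}


print(booleanize({"a": 3, "b": 0, "c": -2}))
```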

class revscoring.datasources.meta.selectors.filter_keys(table_datasource, keys, name=None)

Selects a subset of features (key/values) based on a set of keys.

Parameters:
table_datasource : revscoring.Datasource

A datasource that generates a table, from which only the specified keys are kept

keys : iterable ( hashable )

The keys to select from the table

name : str

A name for the datasource.
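Selecting keys from a table can be sketched as a dict comprehension. The function name select_keys is hypothetical, and skipping requested keys that are absent from the table (rather than filling them with 0) is an assumption.

```python
def select_keys(table, keys):
    """Keep only the entries of `table` whose key is in `keys`."""
    keys = set(keys)
    # Assumption: requested keys missing from the table are skipped.
    return {key: value for key, value in table.items() if key in keys}


print(select_keys({"a": 1, "b": 2, "c": 3}, ["a", "c", "z"]))
```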

timestamp

These meta-datasources operate on revscoring.Datasource instances that return timestamp strings and produce mwtypes.Timestamp values.

class revscoring.datasources.meta.timestamp.Timestamp(timestamp_str, name=None)

Generates an mwtypes.Timestamp from the given string

Parameters:
timestamp_str : str

Timestamp string in ISO format.

name : str

A name for the datasource.