revscoring.utilities
This module implements a set of utilities for extracting features and
training/testing revscoring.Model from the command line. When the
revscoring Python package is installed, a revscoring utility should be
available from the command line. Run revscoring -h for more
information:
cv_train
revscoring cv_train -h
Performs a cross-validation of a scorer model strategy across folds of
a dataset and then trains a final model on the entire set of data. Note
that either --labels or --pop-rate must be specified for classifiers.
Usage:
cv_train -h | --help
cv_train <scoring-model> <features> <label>
[--labels=<labels>]
[--labels-config=<lc>]
[-p=<kv>]... [-s=<kv>]...
[-w=<lw>]... [-r=<lp>]...
[-o=<p>]...
[--version=<vers>]
[--observations=<path>]
[--model-file=<path>]
[--folds=<num>]
[--workers=<num>]
[--center]
[--scale]
[--multilabel]
[--debug]
Options:
-h --help Prints this documentation
<scoring-model> Classpath to a ScorerModel to construct
and train
<features> Classpath to a list of features to use when
constructing the model
<label> The name of the field to be predicted
--labels=<labels> A comma-separated sequence of labels that will
be used for ordering labels statistics and
other presentations of the model.
--labels-config=<lc> Path to a file containing labels and their
configurations, such as population rates and
weights
-w --label-weight=<lw> A label-weight pair that adjusts the
cost of getting a specific label prediction
wrong.
-r --pop-rate=<lp> A label-proportion pair that rescales metrics
based on the rate that the label appears in the
population. If not provided, sample rates will
be assumed to reflect population rates.
-p --parameter=<kv> A key-value argument pair to use when
constructing the <scoring-model>.
--version=<vers> A version to associate with the model
--observations=<path> Path to a file containing observations
that include a 'cache' [default: <stdin>]
--model-file=<path> Path to write a model file to
[default: <stdout>]
--folds=<num> The number of folds that should be used when
cross-validating. If set to 1, testing will be
skipped and a model will just be trained on
all observations [default: 10]
--workers=<num> The number of workers that should be used when
cross-validating
--center Features should be centered on a common axis
--scale Features should be scaled to a common range
--multilabel Whether to perform multilabel classification
--debug Print debug logging.
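As a rough illustration of how the options above combine, the following Python sketch assembles a cv_train invocation as an argument list without running it. The classpaths, label name, label values and file names are hypothetical placeholders, not values taken from revscoring; only flags listed in the usage above are used.

```python
import shlex


def build_cv_train_command(scoring_model, features, label, labels,
                           observations, model_file, folds=10,
                           center=False, scale=False):
    """Assemble an argv list for `revscoring cv_train` (not executed here)."""
    argv = [
        "revscoring", "cv_train",
        scoring_model, features, label,
        # --labels takes a comma-separated sequence of labels
        "--labels=" + ",".join(labels),
        "--observations=" + observations,
        "--model-file=" + model_file,
        "--folds=" + str(folds),
    ]
    if center:
        argv.append("--center")
    if scale:
        argv.append("--scale")
    return argv


# All names below are hypothetical and only illustrate the shape of a call.
argv = build_cv_train_command(
    scoring_model="myproject.models.classifier",
    features="myproject.feature_lists.damaging",
    label="damaging",
    labels=["true", "false"],
    observations="observations.with_cache.json",
    model_file="damaging.model",
    folds=5,
    center=True,
    scale=True,
)

# Print a shell-quoted version that could be pasted into a terminal.
print(shlex.join(argv))
```

Building the command as a list keeps each argument separate, so it can be handed to subprocess.run without shell quoting concerns.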
extract
revscoring extract -h
Extracts a list of dependents for a set of revisions.
Reads file containing revision observations, extracts dependents
(Features and Datasources), and writes extended observations out for future
use.
Usage:
extract -h | --help
extract <dependent>... --host=<url> [--input=<path>]
[--output=<path>]
[--extractors=<num>]
[--batch-size=<num>]
[--login]
[--profile=<path>]
[--verbose] [--debug]
Options:
-h --help Print this documentation
<dependent> Classpath to a dependent or a list of
dependents
--host=<url> The url pointing to a MediaWiki API to use
for extracting features
--input=<path> Path to a file containing rev_id-label pairs
[default: <stdin>]
--output=<path> Path to a file to write extracted data to
[default: <stdout>]
--extractors=<num> The number of extractors to run in parallel
[default: <cpu count>]
--batch-size=<num> The number of rev_ids to batch together per
request to the API [default: 50]
--login If set, prompt for username and password
--profile=<path> Path to a file to write extraction profiling
output
--verbose Print dots and stuff
--debug Print debug logging
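The sketch below shows one way an input file of rev_id-label pairs could be prepared, and what grouping rev_ids into batches of 50 looks like. It assumes the input is one JSON object per line; that format, the label name "damaging", and the helper names are assumptions made for illustration and are not stated in the documentation above.

```python
import io
import json


def write_observations(pairs, label_name, out):
    """Write one JSON object per line (assumed input format)."""
    for rev_id, label in pairs:
        out.write(json.dumps({"rev_id": rev_id, label_name: label}) + "\n")


def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical rev_ids and labels, for illustration only.
pairs = [(1000 + i, i % 2 == 0) for i in range(120)]

buffer = io.StringIO()
write_observations(pairs, "damaging", buffer)
lines = buffer.getvalue().splitlines()

rev_ids = [json.loads(line)["rev_id"] for line in lines]
batch_sizes = [len(batch) for batch in batches(rev_ids, size=50)]
print(batch_sizes)
```

With 120 rev_ids and a batch size of 50, the last batch is smaller than the others, which is why the number of API requests is the rev_id count divided by the batch size, rounded up.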
fetch_text
revscoring fetch_text -h
Gets the text for a set of observations. Will create a new field called
"text" with the content corresponding to the "rev_id".
Usage:
fetch_text --host=<url>
[--deleted-1st]
[--input=<path>] [--output=<path>]
[--threads=<num>]
[--verbose] [--debug]
Options:
-h --help Print this documentation
--host=<url> The host URL of a MediaWiki installation to extract
text from
--deleted-1st Try to look up text in "deletedrevisions" first.
This is more performant when looking up text that
will be (mostly) deleted, but it will have no effect
on output.
--input=<path> Path to a file containing observations
[default: <stdin>]
--output=<path> Path to a file to write extended observations
[default: <stdout>]
--threads=<num> The number of parallel requests to submit to the MW
api [default: <cpu-count>]
--verbose Print dots and stuff to note progress
--debug Print debug logging
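The effect described above, adding a "text" field keyed by "rev_id", can be sketched in plain Python. The lookup function here is a stand-in backed by a dictionary; the real utility queries a MediaWiki API, and the names add_text and fake_lookup are hypothetical.

```python
def add_text(observations, lookup_text):
    """Yield copies of each observation extended with a "text" field."""
    for observation in observations:
        extended = dict(observation)  # leave the input observation untouched
        extended["text"] = lookup_text(observation["rev_id"])
        yield extended


# Stand-in for an API lookup; the revision contents are made up.
fake_texts = {101: "First revision text.", 102: "Second revision text."}


def fake_lookup(rev_id):
    return fake_texts[rev_id]


observations = [
    {"rev_id": 101, "damaging": False},
    {"rev_id": 102, "damaging": True},
]

extended = list(add_text(observations, fake_lookup))
for observation in extended:
    print(observation)
```

Existing fields are carried through unchanged, so the output can be fed to later steps that still need the label.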
fit
revscoring fit -h
Fits a dependent (an extractable value like a Datasource or Feature) to
observed data. These are often used along with bag-of-words
methods to reduce the feature space prior to training and testing a model
or to train a sub-model.
Usage:
fit -h | --help
fit <dependent> <label>
[--input=<path>]
[--datasource-file=<path>]
[--debug]
Options:
-h --help Prints this documentation
<dependent> The classpath to a `Dependent`
that can be fit to observations
<label> The label that should be predicted
--input=<path> Path to a file containing observations
[default: <stdin>]
--datasource-file=<path> Path to a file for writing out the trained
datasource [default: <stdout>]
--debug Print debug logging.
model_info
revscoring model_info -h
Prints formatted information about a model file.
Usage:
model_info -h | --help
model_info <model-file> [<path>...] [--formatting=<type>]
Options:
-h --help Prints this documentation
<model-file> Path to a model file
<path> A model information path. If no path is provided,
all default fields will be in the output.
--formatting=<type> What format to output the information? "str" or
"json" [default: str]
score
revscoring score -h
Scores a set of revisions.
Usage:
score (-h | --help)
score <model-file> --host=<url> [<rev_id>...]
[--rev-ids=<path>] [--cache=<json>] [--caches=<json>]
[--batch-size=<num>] [--io-workers=<num>] [--cpu-workers=<num>]
[--debug] [--verbose]
Options:
-h --help Print this documentation
<model-file> Path to a model file
--host=<url> The url pointing to a MediaWiki API to use for
extracting features
<rev_id> A revision identifier to score.
--rev-ids=<path> The path to a file containing revision identifiers
to score (expects a column called 'rev_id'). If
any <rev_id> are provided, this argument is
ignored. [default: <stdin>]
--cache=<json> A JSON blob of cache values to use during
extraction for every call.
--caches=<json> A JSON blob of rev_id-->cache value pairs to use
during extraction
--batch-size=<num> The number of revisions to batch when requesting
data from the API [default: 50]
--io-workers=<num> The number of worker processes to use for
requesting data from the API [default: <auto>]
--cpu-workers=<num> The number of worker processes to use for
extraction and scoring [default: <cpu-count>]
--debug Print debug logging
--verbose Print feature extraction debug logging
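To make the difference between --cache and --caches concrete, the sketch below serializes a shared cache and a per-revision cache to JSON and places them in a score argument list. The cache key "example.datasource", the model file name, the rev_ids and the helper name are hypothetical; which keys a real model accepts depends on its features, and the host is only an example of a MediaWiki URL.

```python
import json

# One set of cache values applied to every revision (--cache).
shared_cache = {"example.datasource": "some precomputed value"}

# Cache values keyed by rev_id (--caches).
per_revision_caches = {
    12345: {"example.datasource": "value for 12345"},
    67890: {"example.datasource": "value for 67890"},
}


def build_score_command(model_file, host, rev_ids, cache=None, caches=None):
    """Assemble an argv list for `revscoring score` (not executed here)."""
    argv = ["revscoring", "score", model_file, "--host=" + host]
    argv.extend(str(rev_id) for rev_id in rev_ids)
    if cache is not None:
        argv.append("--cache=" + json.dumps(cache))
    if caches is not None:
        # JSON object keys are always strings, so integer rev_ids
        # are serialized as "12345" and "67890".
        argv.append("--caches=" + json.dumps(caches))
    return argv


argv = build_score_command(
    "damaging.model",
    "https://en.wikipedia.org",
    [12345, 67890],
    cache=shared_cache,
    caches=per_revision_caches,
)
print(argv)
```

Passing the result as a list to subprocess.run avoids having to shell-escape the quotes inside the JSON blobs.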
test_model
revscoring test_model -h
Tests a scorer model. This utility expects to get a file of
tab-separated feature values and labels from which to test a model.
Usage:
test_model -h | --help
test_model <scorer_model> <label>
[--observations=<path>]
[--model-file=<path>]
[--debug]
Options:
-h --help Prints this documentation
<scorer_model> Path to a model file that has already been trained.
<label> The name of the field to be predicted
--observations=<path> Path to a file containing observations
that include a 'cache' [default: <stdin>]
--model-file=<path> Path to write a model file to
--debug Print debug logging.