revscoring.utilities

This module implements a set of utilities for extracting features and training/testing revscoring.Model from the command line. When the revscoring Python package is installed, a revscoring utility should be available from the command line. Run revscoring -h for more information.

cv_train

revscoring cv_train -h

Performs a cross-validation of a scorer model strategy across folds of
a dataset and then trains a final model on the entire set of data.  Note
that either --labels or --pop-rate must be specified for classifiers.

Usage:
    cv_train -h | --help
    cv_train <scoring-model> <features> <label>
             [--labels=<labels>]
             [--labels-config=<lc>]
             [-p=<kv>]... [-s=<kv>]...
             [-w=<lw>]... [-r=<lp>]...
             [-o=<p>]...
             [--version=<vers>]
             [--observations=<path>]
             [--model-file=<path>]
             [--folds=<num>]
             [--workers=<num>]
             [--center]
             [--scale]
             [--multilabel]
             [--debug]

Options:
    -h --help               Prints this documentation
    <scoring-model>         Classpath to a ScorerModel to construct
                            and train
    <features>              Classpath to a list of features to use when
                            constructing the model
    <label>                 The name of the field to be predicted
    --labels=<labels>       A comma-separated sequence of labels that will
                            be used for ordering labels statistics and
                            other presentations of the model.
    --labels-config=<lc>    Path to a file containing labels and their
                            configurations, such as population rates and
                            weights
    -w --label-weight=<lw>  A label-weight pair that adjusts the cost of
                            getting a specific label prediction wrong.
    -r --pop-rate=<lp>      A label-proportion pair that rescales metrics
                            based on the rate that the label appears in the
                            population.  If not provided, sample rates will
                            be assumed to reflect population rates.
    -p --parameter=<kv>     A key-value argument pair to use when
                            constructing the <scoring-model>.
    --version=<vers>        A version to associate with the model
    --observations=<path>   Path to a file containing observations
                            containing a 'cache' [default: <stdin>]
    --model-file=<path>     Path to write a model file to
                            [default: <stdout>]
    --folds=<num>           The number of folds that should be used when
                            cross-validating. If set to 1, testing will be
                            skipped and a model will just be trained on
                            all observations [default: 10]
    --workers=<num>         The number of workers that should be used when
                            cross-validating
    --center                Features should be centered on a common axis
    --scale                 Features should be scaled to a common range
    --multilabel            Whether to perform multilabel classification
    --debug                 Print debug logging.
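
The effect of --folds can be pictured with a small, self-contained sketch. It is not revscoring's implementation; the function name fold_splits and the round-robin assignment of observations to folds are assumptions made for illustration. It only shows the idea of holding out one fold at a time for testing, and of skipping testing entirely when a single fold is requested.

```python
def fold_splits(n_observations, folds):
    """Yield (train, test) index lists, one pair per fold.

    Hypothetical helper: with folds == 1 nothing is yielded, mirroring
    the documented behavior that testing is skipped in that case.
    """
    if folds < 1:
        raise ValueError("folds must be at least 1")
    if folds == 1:
        return  # no held-out data; the caller trains on everything
    indices = list(range(n_observations))
    for i in range(folds):
        test = indices[i::folds]  # every folds-th observation, offset by i
        held_out = set(test)
        train = [j for j in indices if j not in held_out]
        yield train, test


splits = list(fold_splits(10, 5))
for train, test in splits:
    print(len(train), "training /", len(test), "testing")
```

After the per-fold statistics are gathered, the final model is trained on all observations, which is why the held-out folds matter only for the reported metrics.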

extract

revscoring extract -h

Extracts a list of dependents for a set of revisions.

Reads a file containing revision observations, extracts dependents
(Features and Datasources), and writes extended observations out for future
use.

Usage:
    extract -h | --help
    extract <dependent>... --host=<url> [--input=<path>]
                                        [--output=<path>]
                                        [--extractors=<num>]
                                        [--batch-size=<num>]
                                        [--login]
                                        [--profile=<path>]
                                        [--verbose] [--debug]

Options:
    -h --help               Print this documentation
    <dependent>             Classpath to a dependent or a list of
                            dependents
    --host=<url>            The url pointing to a MediaWiki API to use
                            for extracting features
    --input=<path>          Path to a file containing rev_id-label pairs
                            [default: <stdin>]
    --output=<path>         Path to a file to write extracted data to
                            [default: <stdout>]
    --extractors=<num>      The number of extractors to run in parallel
                            [default: <cpu count>]
    --batch-size=<num>      The number of rev_ids to batch together per
                            request to the API [default: 50]
    --login                 If set, prompt for username and password
    --profile=<path>        Path to a file to write extraction profiling
                            output
    --verbose               Print dots and stuff to note progress
    --debug                 Print debug logging
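
The --batch-size option groups rev_ids so that each API request covers several revisions. The sketch below shows one plausible way to form such groups; the helper name batches is hypothetical and the code is not taken from revscoring.

```python
from itertools import islice


def batches(rev_ids, batch_size=50):
    """Yield lists of at most batch_size rev_ids, in input order."""
    iterator = iter(rev_ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


# 120 made-up revision IDs split into request-sized groups.
sizes = [len(batch) for batch in batches(range(1, 121))]
print(sizes)
```

With the default of 50, the last batch simply holds whatever is left over, so no padding or special casing is needed.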

fetch_text

revscoring fetch_text -h

Gets the text for a set of observations.  Will create a new field called
"text" with the content corresponding to the "rev_id".

Usage:
    fetch_text --host=<url>
               [--deleted-1st]
               [--input=<path>] [--output=<path>]
               [--threads=<num>]
               [--verbose] [--debug]

Options:
    -h --help        Print this documentation
    --host=<url>     The host URL of a MediaWiki installation to extract
                     text from
    --deleted-1st    Try to look up text in "deletedrevisions" first.
                     This is more performant when looking up text that
                     will be (mostly) deleted, but it will have no effect
                     on output.
    --input=<path>   Path to a file containing observations
                     [default: <stdin>]
    --output=<path>  Path to a file to write extended observations
                     [default: <stdout>]
    --threads=<num>  The number of parallel requests to submit to the MW
                     api [default: <cpu-count>]
    --verbose        Print dots and stuff to note progress
    --debug          Print debug logging
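
As a rough illustration of what the utility produces, the sketch below adds a "text" field to each observation based on its "rev_id". It assumes observations are stored one JSON object per line, and it replaces the MediaWiki lookup with a hypothetical in-memory dictionary; neither detail is confirmed by the usage text above.

```python
import json


def add_text(observation_lines, lookup_text):
    """Yield each observation re-serialized with an added "text" field."""
    for line in observation_lines:
        observation = json.loads(line)
        # lookup_text stands in for a request to the MediaWiki API.
        observation["text"] = lookup_text(observation["rev_id"])
        yield json.dumps(observation)


# Made-up revisions and content, for illustration only.
fake_wiki = {101: "First revision text", 102: "Second revision text"}
lines = ['{"rev_id": 101, "damaging": false}', '{"rev_id": 102, "damaging": true}']

extended = [json.loads(out) for out in add_text(lines, fake_wiki.get)]
for observation in extended:
    print(observation["rev_id"], repr(observation["text"]))
```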

fit

revscoring fit -h

Fits a dependent (an extractable value like a Datasource or Feature) to
observed data.  These are often used along with bag-of-words
methods to reduce the feature space prior to training and testing a model
or to train a sub-model.

Usage:
    fit -h | --help
    fit <dependent> <label>
        [--input=<path>]
        [--datasource-file=<path>]
        [--debug]

Options:
    -h --help                 Prints this documentation
    <dependent>               The classpath to `Dependent`
                              that can be fit to observations
    <label>                   The label that should be predicted
    --input=<path>            Path to a file containing observations
                              [default: <stdin>]
    --datasource-file=<path>  Path to a file for writing out the trained
                              datasource [default: <stdout>]
    --debug                   Print debug logging.

model_info

revscoring model_info -h

Prints formatted information about a model file.


Usage:
    model_info -h | --help
    model_info <model-file> [<path>...] [--formatting=<type>]

Options:
    -h --help            Prints this documentation
    <model-file>         Path to a model file
    <path>               A model information path.  If no path is provided,
                         all default fields will be in the output.
    --formatting=<type>  Format in which to output the information: "str"
                         or "json" [default: str]

score

revscoring score -h

Scores a set of revisions.

Usage:
    score (-h | --help)
    score <model-file> --host=<url> [<rev_id>...]
          [--rev-ids=<path>] [--cache=<json>] [--caches=<json>]
          [--batch-size=<num>] [--io-workers=<num>] [--cpu-workers=<num>]
          [--debug] [--verbose]

Options:
    -h --help           Print this documentation
    <model-file>        Path to a model file
    --host=<url>        The url pointing to a MediaWiki API to use for
                        extracting features
    <rev_id>            A revision identifier to score.
    --rev-ids=<path>    The path to a file containing revision identifiers
                        to score (expects a column called 'rev_id').  If
                        any <rev_id> are provided, this argument is
                        ignored. [default: <stdin>]
    --cache=<json>      A JSON blob of cache values to use during
                        extraction for every call.
    --caches=<json>     A JSON blob of rev_id-->cache value pairs to use
                        during extraction
    --batch-size=<num>  The number of revisions to batch when requesting
                        data from the API [default: 50]
    --io-workers=<num>  The number of worker processes to use for
                        requesting data from the API [default: <auto>]
    --cpu-workers=<num>  The number of worker processes to use for
                         extraction and scoring [default: <cpu-count>]
    --debug             Print debug logging
    --verbose           Print feature extraction debug logging
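
The difference between --cache and --caches is easiest to see from the shape of the JSON each one takes: the first is a single object applied to every revision, the second maps each rev_id to its own object. The sketch below combines the two for one revision. The function name, the cache keys, and the choice to let per-revision values override shared ones are all assumptions for illustration, not documented revscoring behavior.

```python
import json


def effective_cache(rev_id, cache_json=None, caches_json=None):
    """Combine shared and per-revision cache values for one rev_id."""
    shared = json.loads(cache_json) if cache_json else {}
    per_rev = json.loads(caches_json) if caches_json else {}
    merged = dict(shared)
    # JSON object keys are always strings, so look up str(rev_id).
    merged.update(per_rev.get(str(rev_id), {}))
    return merged


# Hypothetical cache keys and values.
cache = '{"example.shared_value": 1}'
caches = '{"123": {"example.per_revision_value": 2}}'

with_extra = effective_cache(123, cache, caches)
shared_only = effective_cache(456, cache, caches)
print(with_extra)
print(shared_only)
```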

test_model

revscoring test_model -h

Tests a scorer model.  This utility expects to get a file of
tab-separated feature values and labels from which to test a model.

Usage:
    test_model -h | --help
    test_model <scorer_model> <label>
               [--observations=<path>]
               [--model-file=<path>]
               [--debug]

Options:
    -h --help               Prints this documentation
    <scorer_model>          Path to a model file that has already been trained.
    <label>                 The name of the field to be predicted
    --observations=<path>   Path to a file containing observations
                            containing a 'cache' [default: <stdin>]
    --model-file=<path>     Path to write a model file to
    --debug                 Print debug logging.

tune