Example Usage

Dependencies

Like text_explainability, text_sensitivity uses instances and machine learning models wrapped with the InstanceLib library.

Dataset and model

We manually create a TextEnvironment, that holds both our ground-truth labels (.labels) and our instances (.dataset). Next, we fit a simple sklearn model that predicts whether the instances (sentence-length strings) contain punctuation or not.

# Create a simple dataset (classify whether strings contain punctuation or not)
from instancelib.environment.text import TextEnvironment

instances = ['This is his example instance, not HERS!',
             'An example sentence for you?!',
             'She has her own sentence.',
             'Provide him with something without any punctuation',
             'RANDOM UPPERCASESTRING3']
labels = ['punctuation', 'punctuation', 'punctuation', 'no_punctuation', 'no_punctuation']

env = TextEnvironment.from_data(indices=list(range(len(instances))),
                                data=instances,
                                target_labels=list(set(labels)),
                                ground_truth=[[label] for label in labels],
                                vectors=[])

# Create sklearn model with pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

p = Pipeline([('vect', CountVectorizer()),
              ('rf', MultinomialNB())])

# Wrap sklearn model
from text_explainability import import_model
import_model(p, env)

Using Text Sensitivity

Text Sensitivity is used for robustness testing (verifying if a model can handle all types of string data and whether its predictions are invariant to minor changes) and fairness testing (comparing model performance on subgroups).

Robustness

A robust text model should be able to handle different types of input strings (e.g. ASCII, emojis) and be invariant to minor changes in inputs (e.g. converting a string to uppercase, adding an unrelated string or users making typos).

Generating random data

Random strings can be used for testing if a model is able to handle ass sorts of inputs:

from text_sensitivity import (RandomData, RandomDigits, RandomAscii, RandomEmojis,
                              RandomWhitespace, RandomCyrillic, combine_generators)

# Generate 10 instances with all printable characters
RandomData().generate_list(n=10, min_length=5, max_length=50)

# Generate 5 instances containing only digits
RandomDigits(seed=1).generate_list(n=5)

# Generate 15 instances, combining emojis, whitespace characters and ASCII characters
random_generator = combine_generators(RandomAscii(), RandomEmojis(), RandomWhitespace())
random_generator.generate_list(n=15)

# Generate 20 instances with random ASCII characters, whitespace and Russian (Cyrillic) characters
ascii_cyrillic_generator = combine_generators(RandomAscii(), RandomWhitespace(), RandomCyrillic(languages='ru'))
ascii_cyrillic_generator.generate_list(n=20)

Invariance testing

A very simple method for invariance testing, is assessing whether the model performs the same on a metric (e.g. accuracy, precision or recall) before and after applying a perturbation. For example, let us compare whether the model retains the same performance when converting all instances to lowercase:

from text_sensitivity.test import compare_accuracy
from text_sensitivity.perturbation.sentences import to_lower

compare_accuracy(env, model, to_lower)

Similarly, we can check whether precision scores are the same if we add an unrelated string after each sentence:

from text_sensitivity.test import compare_precision
from text_sensitivity.perturbation.base import OneToOnePerturbation

perturbation_fn = OneToOnePerturbation.from_string(suffix='This should not affect scores')
compare_precision(env, model, perturbation_fn)

Under the hood, text_sensitivity.test uses text_sensitivity.perturbation to perturb instances (instancelib.instances.text.TextInstance or str), and generates the new instances and labels for the original instance (e.g. ‘not_upper’) and the new instance(s) (e.g. ‘upper’).

from text_sensitivity.perturbation.sentences import to_upper, repeat_k_times
from text_sensitivity.perturbation.characters import random_case_swap, random_spaces, swap_random, add_typos

sample = 'This is his example string, made especially for HER!'

# Convert the sample string to all upper
list(to_upper()(sample))

# Repeat the string 'test' n times
list(repeat_k_times(n=3)('test'))
list(repeat_k_times(n=7, connector='\n')('test'))

# Randomly swap the character case (lower to upper or vice versa) in sample
list(random_case_swap()(sample))

# Add random spaces to words within a sentence, or swap characters randomly within a word (excluding stopwords and uppercase words) to sample
list(random_spaces(n=5)(sample))
list(swap_random(n=10, stopwords=['the' , 'is', 'of'], include_upper_case=False)(sample))

# Add typos (based on QWERTY keyboard) to sample
list(add_typos(n=10, stopwords=['the' , 'is', 'of'], include_numeric=False, include_special_char=False)(sample))

Fairness

TODO: Write up fairness.

Generating random data

Data for entities can be generated in the same manner as random strings:

from text_sensitivity import (RandomCity, RandomCountry, RandomName)

# Generates data for the current locale, e.g. if it is 'nl' it generates country names in Dutch and cities in the Netherlands
RandomCity().generate_list(n=10)

# If you specify the locale, it can generate the entity (e.g. country) for multiple languages
RandomCountry(languages=['nl', 'de', 'fr', 'jp']).generate_list(n=15)

Unlike random strings, random entities can also output the corresponding attribute labels for the generated data

# For example, generated Dutch and Russian male and female names, and output which language and sex they are
generator = RandomName(languages=['nl', 'ru'], sex=['male', 'female'], seed=5)
generator.generate_list(n=10, attributes=True)

# The same data can also be captured in an instancelib.InstanceProvider and instancelib.LabelProviders
generator.generate(n=10, attributes=True)

Other random entities that can be generated are dates, street addresses, emails, phone numbers, price tags and crypto names:

# Dates
from text_sensitivity import RandomYear, RandomMonth, RandomDay, RandomDayOfWeek

print(RandomYear().generate_list(n=3))
print(RandomMonth(languages=['nl', 'en']).upper().generate_list(n=6))  # use .upper() to generate all uppercase or .lower() for all lower
print(RandomDay().generate_list(n=3))
print(RandomDayOfWeek().sentence().generate_list(n=3))  # use .sentence() for all sentencecase or .title() for titlecase

# Street addresses, emails, phone numbers, price tags and crypto names
from text_sensitivity import RandomAddress, RandomEmail, RandomPhoneNumber, RandomPriceTag, RandomCryptoCurrency

print(RandomAddress(sep=', ').generate_list(n=5))
print(RandomEmail(languages=['es', 'pt']).generate_list(n=10, attributes=True))
print(RandomPhoneNumber().generate_list(n=5))
print(RandomPriceTag(languages=['ru', 'de', 'it', 'br']).generate_list(n=10))
print(RandomCryptoCurrency().generate_list(n=3))

Generating data from patterns

These entities, or your own lists, can be used to generate strings for locally testing model robustness/fairness. Text within curly braces ({}) is replaced, and attribute are added to each perturbed instance. The text outside of the curly braces remains the same. Examples of patterns that can be put between curly braces are:

{a|b|c} generates a list with elements a, b and c.
{city} uses RandomCity() (in current locale) to generate n random cities. For a full list of default patterns see from text_sensitivity import default_patterns; default_patterns().
{custom_entity_name} with keyword argument custom_entity_name=['this', 'is', 'cool] generates a list with elements this, is, cool.

from text_sensitivity import from_pattern

# Generate a list ['This is his house', 'This was his house', 'This is his car', 'This was his car', ...]:
from_pattern('This {is|was} his {house|car|boat}')

# Generate a list ['His home town is Eindhoven.', 'Her home town is Eindhoven.',  'His home town is Meerssen.', ...]. By default uses `RandomCity()` to generate the city name.
from_pattern('{His|Her} home town is {city}.')

# Override the 'city' default with your own list ['Amsterdam', 'Rotterdam', 'Utrecht']:
from_pattern('{His|Her} home town is {city}.', city=['Amsterdam', 'Rotterdam', 'Utrecht'])

In addition, modifiers can be added before a semicolon (:) within a curly brace to modify the generated data. Example modifiers are:

{lower:address} generates addresses (RandomAddress() for current locale) in all-lowercase
{upper:name} generates full name (RandomName() for current locale) in all-uppercase
{sentence:day_of_week} generates day of week (RandomDayOfWeek() for current locale) in sentencecase.
{title:country} generates country names (RandomCountry() in locale language) in titlecase.

# Apply lower case to the first argument and uppercase to the last, getting ['Vandaag, donderdag heeft Sanne COLIN gebeld op +31612351983!', ..., 'Vandaag, maandag heeft Nora SEPP gebeld op +31612351983!', ...]
from_pattern('Vandaag, {lower:day_of_week}, heeft {first_name} {upper:first_name} gebeld op {phone_number}!', n=5)