Example Usage
Dependencies
Like text_explainability
, text_sensitivity
uses instances and machine learning models wrapped with the InstanceLib library.
Dataset and model
We manually create a TextEnvironment
, that holds both our ground-truth labels (.labels
) and our instances (.dataset
). Next, we fit a simple sklearn
model that predicts whether the instances (sentence-length strings) contain punctuation or not.
# Create a simple dataset (classify whether strings contain punctuation or not)
from instancelib.environment.text import TextEnvironment
instances = ['This is his example instance, not HERS!',
'An example sentence for you?!',
'She has her own sentence.',
'Provide him with something without any punctuation',
'RANDOM UPPERCASESTRING3']
labels = ['punctuation', 'punctuation', 'punctuation', 'no_punctuation', 'no_punctuation']
env = TextEnvironment.from_data(indices=list(range(len(instances))),
data=instances,
target_labels=list(set(labels)),
ground_truth=[[label] for label in labels],
vectors=[])
# Create sklearn model with pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
p = Pipeline([('vect', CountVectorizer()),
('rf', MultinomialNB())])
# Wrap sklearn model
from text_explainability import import_model
import_model(p, env)
Using Text Sensitivity
Text Sensitivity is used for robustness testing (verifying if a model can handle all types of string data and whether its predictions are invariant to minor changes) and fairness testing (comparing model performance on subgroups).
Robustness
A robust text model should be able to handle different types of input strings (e.g. ASCII, emojis) and be invariant to minor changes in inputs (e.g. converting a string to uppercase, adding an unrelated string or users making typos).
Generating random data
Random strings can be used for testing if a model is able to handle ass sorts of inputs:
from text_sensitivity import (RandomData, RandomDigits, RandomAscii, RandomEmojis,
RandomWhitespace, RandomCyrillic, combine_generators)
# Generate 10 instances with all printable characters
RandomData().generate_list(n=10, min_length=5, max_length=50)
# Generate 5 instances containing only digits
RandomDigits(seed=1).generate_list(n=5)
# Generate 15 instances, combining emojis, whitespace characters and ASCII characters
random_generator = combine_generators(RandomAscii(), RandomEmojis(), RandomWhitespace())
random_generator.generate_list(n=15)
# Generate 20 instances with random ASCII characters, whitespace and Russian (Cyrillic) characters
ascii_cyrillic_generator = combine_generators(RandomAscii(), RandomWhitespace(), RandomCyrillic(languages='ru'))
ascii_cyrillic_generator.generate_list(n=20)
Invariance testing
A very simple method for invariance testing, is assessing whether the model performs the same on a metric (e.g. accuracy, precision or recall) before and after applying a perturbation. For example, let us compare whether the model retains the same performance when converting all instances to lowercase:
from text_sensitivity.test import compare_accuracy
from text_sensitivity.perturbation.sentences import to_lower
compare_accuracy(env, model, to_lower)
Similarly, we can check whether precision scores are the same if we add an unrelated string after each sentence:
from text_sensitivity.test import compare_precision
from text_sensitivity.perturbation.base import OneToOnePerturbation
perturbation_fn = OneToOnePerturbation.from_string(suffix='This should not affect scores')
compare_precision(env, model, perturbation_fn)
Under the hood, text_sensitivity.test
uses text_sensitivity.perturbation
to perturb instances (instancelib.instances.text.TextInstance
or str
), and generates the new instances and labels for the original instance (e.g. ‘not_upper’) and the new instance(s) (e.g. ‘upper’).
from text_sensitivity.perturbation.sentences import to_upper, repeat_k_times
from text_sensitivity.perturbation.characters import random_case_swap, random_spaces, swap_random, add_typos
sample = 'This is his example string, made especially for HER!'
# Convert the sample string to all upper
list(to_upper()(sample))
# Repeat the string 'test' n times
list(repeat_k_times(n=3)('test'))
list(repeat_k_times(n=7, connector='\n')('test'))
# Randomly swap the character case (lower to upper or vice versa) in sample
list(random_case_swap()(sample))
# Add random spaces to words within a sentence, or swap characters randomly within a word (excluding stopwords and uppercase words) to sample
list(random_spaces(n=5)(sample))
list(swap_random(n=10, stopwords=['the' , 'is', 'of'], include_upper_case=False)(sample))
# Add typos (based on QWERTY keyboard) to sample
list(add_typos(n=10, stopwords=['the' , 'is', 'of'], include_numeric=False, include_special_char=False)(sample))
Fairness
TODO: Write up fairness.
Generating random data
Data for entities can be generated in the same manner as random strings:
from text_sensitivity import (RandomCity, RandomCountry, RandomName)
# Generates data for the current locale, e.g. if it is 'nl' it generates country names in Dutch and cities in the Netherlands
RandomCity().generate_list(n=10)
# If you specify the locale, it can generate the entity (e.g. country) for multiple languages
RandomCountry(languages=['nl', 'de', 'fr', 'jp']).generate_list(n=15)
Unlike random strings, random entities can also output the corresponding attribute labels for the generated data
# For example, generated Dutch and Russian male and female names, and output which language and sex they are
generator = RandomName(languages=['nl', 'ru'], sex=['male', 'female'], seed=5)
generator.generate_list(n=10, attributes=True)
# The same data can also be captured in an instancelib.InstanceProvider and instancelib.LabelProviders
generator.generate(n=10, attributes=True)
Other random entities that can be generated are dates, street addresses, emails, phone numbers, price tags and crypto names:
# Dates
from text_sensitivity import RandomYear, RandomMonth, RandomDay, RandomDayOfWeek
print(RandomYear().generate_list(n=3))
print(RandomMonth(languages=['nl', 'en']).upper().generate_list(n=6)) # use .upper() to generate all uppercase or .lower() for all lower
print(RandomDay().generate_list(n=3))
print(RandomDayOfWeek().sentence().generate_list(n=3)) # use .sentence() for all sentencecase or .title() for titlecase
# Street addresses, emails, phone numbers, price tags and crypto names
from text_sensitivity import RandomAddress, RandomEmail, RandomPhoneNumber, RandomPriceTag, RandomCryptoCurrency
print(RandomAddress(sep=', ').generate_list(n=5))
print(RandomEmail(languages=['es', 'pt']).generate_list(n=10, attributes=True))
print(RandomPhoneNumber().generate_list(n=5))
print(RandomPriceTag(languages=['ru', 'de', 'it', 'br']).generate_list(n=10))
print(RandomCryptoCurrency().generate_list(n=3))
Generating data from patterns
These entities, or your own lists, can be used to generate strings for locally testing model robustness/fairness. Text
within curly braces ({}
) is replaced, and attribute are added to each perturbed instance. The text outside of the curly
braces remains the same. Examples of patterns that can be put between curly braces are:
{a|b|c}
generates a list with elementsa
,b
andc
.{city}
usesRandomCity()
(in current locale) to generaten
random cities. For a full list of default patterns seefrom text_sensitivity import default_patterns; default_patterns()
.{custom_entity_name}
with keyword argumentcustom_entity_name=['this', 'is', 'cool]
generates a list with elementsthis
,is
,cool
.
from text_sensitivity import from_pattern
# Generate a list ['This is his house', 'This was his house', 'This is his car', 'This was his car', ...]:
from_pattern('This {is|was} his {house|car|boat}')
# Generate a list ['His home town is Eindhoven.', 'Her home town is Eindhoven.', 'His home town is Meerssen.', ...]. By default uses `RandomCity()` to generate the city name.
from_pattern('{His|Her} home town is {city}.')
# Override the 'city' default with your own list ['Amsterdam', 'Rotterdam', 'Utrecht']:
from_pattern('{His|Her} home town is {city}.', city=['Amsterdam', 'Rotterdam', 'Utrecht'])
In addition, modifiers can be added before a semicolon (:
) within a curly brace to modify the generated data. Example
modifiers are:
{lower:address}
generates addresses (RandomAddress()
for current locale) in all-lowercase{upper:name}
generates full name (RandomName()
for current locale) in all-uppercase{sentence:day_of_week}
generates day of week (RandomDayOfWeek()
for current locale) in sentencecase.{title:country}
generates country names (RandomCountry()
in locale language) in titlecase.
# Apply lower case to the first argument and uppercase to the last, getting ['Vandaag, donderdag heeft Sanne COLIN gebeld op +31612351983!', ..., 'Vandaag, maandag heeft Nora SEPP gebeld op +31612351983!', ...]
from_pattern('Vandaag, {lower:day_of_week}, heeft {first_name} {upper:first_name} gebeld op {phone_number}!', n=5)