Example Usage
=============

Dependencies
------------

Like ``text_explainability``, ``text_sensitivity`` uses instances and machine learning models wrapped with the InstanceLib library.

Dataset and model
-----------------

We manually create a ``TextEnvironment`` that holds both our ground-truth labels (``.labels``) and our instances (``.dataset``). Next, we fit a simple ``sklearn`` model that predicts whether the instances (sentence-length strings) contain punctuation or not.

.. code-block:: python

    # Create a simple dataset (classify whether strings contain punctuation or not)
    from instancelib.environment.text import TextEnvironment

    instances = ['This is his example instance, not HERS!',
                 'An example sentence for you?!',
                 'She has her own sentence.',
                 'Provide him with something without any punctuation',
                 'RANDOM UPPERCASESTRING3']
    labels = ['punctuation', 'punctuation', 'punctuation', 'no_punctuation', 'no_punctuation']

    env = TextEnvironment.from_data(indices=list(range(len(instances))),
                                    data=instances,
                                    target_labels=list(set(labels)),
                                    ground_truth=[[label] for label in labels],
                                    vectors=[])

    # Create sklearn model with pipeline
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    p = Pipeline([('vect', CountVectorizer()),
                  ('nb', MultinomialNB())])

    # Wrap sklearn model; keep the result, since the examples below refer to it as `model`
    from text_explainability import import_model
    model = import_model(p, env)

Using Text Sensitivity
----------------------

Text Sensitivity is used for *robustness testing* (verifying whether a model can handle all types of string data and whether its predictions are invariant to minor changes) and *fairness testing* (comparing model performance on subgroups).

Robustness
^^^^^^^^^^

A robust text model should be able to handle different types of input strings (e.g. ASCII, emojis) and be invariant to minor changes in inputs (e.g. converting a string to uppercase, adding an unrelated string or users making typos).

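
The notion of invariance can be illustrated without the library. The sketch below is not the ``text_sensitivity`` API: ``toy_predict`` and ``invariance_rate`` are hypothetical names, and the rule-based classifier merely stands in for a fitted model. It measures the fraction of instances whose predicted label stays the same after a perturbation.

.. code-block:: python

    import string
    from typing import Callable, List


    def toy_predict(text: str) -> str:
        """Hypothetical stand-in model: does the string contain punctuation?"""
        has_punctuation = any(ch in string.punctuation for ch in text)
        return 'punctuation' if has_punctuation else 'no_punctuation'


    def invariance_rate(predict: Callable[[str], str],
                        perturb: Callable[[str], str],
                        instances: List[str]) -> float:
        """Fraction of instances whose predicted label is unchanged after perturbing."""
        unchanged = sum(predict(x) == predict(perturb(x)) for x in instances)
        return unchanged / len(instances)


    instances = ['This is his example instance, not HERS!',
                 'Provide him with something without any punctuation',
                 'RANDOM UPPERCASESTRING3']

    # Uppercasing and adding a punctuation-free suffix should not change any label
    upper_rate = invariance_rate(toy_predict, str.upper, instances)
    suffix_rate = invariance_rate(toy_predict, lambda s: s + ' This should not affect scores', instances)

    # Appending '?' flips the label of the two instances without punctuation
    question_rate = invariance_rate(toy_predict, lambda s: s + '?', instances)

    print(upper_rate, suffix_rate, question_rate)

A rate below 1.0 means the perturbation changes some predictions. Whether that counts as a robustness problem depends on whether the perturbation should matter for the task.
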
Generating random data
~~~~~~~~~~~~~~~~~~~~~~

Random strings can be used for testing if a model is able to handle all sorts of inputs:

.. code-block:: python

    from text_sensitivity import (RandomData, RandomDigits, RandomAscii, RandomEmojis,
                                  RandomWhitespace, RandomCyrillic, combine_generators)

    # Generate 10 instances with all printable characters
    RandomData().generate_list(n=10, min_length=5, max_length=50)

    # Generate 5 instances containing only digits
    RandomDigits(seed=1).generate_list(n=5)

    # Generate 15 instances, combining emojis, whitespace characters and ASCII characters
    random_generator = combine_generators(RandomAscii(), RandomEmojis(), RandomWhitespace())
    random_generator.generate_list(n=15)

    # Generate 20 instances with random ASCII characters, whitespace and Russian (Cyrillic) characters
    ascii_cyrillic_generator = combine_generators(RandomAscii(), RandomWhitespace(), RandomCyrillic(languages='ru'))
    ascii_cyrillic_generator.generate_list(n=20)

Invariance testing
~~~~~~~~~~~~~~~~~~

A very simple method for invariance testing is to assess whether the model scores the same on a metric (e.g. accuracy, precision or recall) before and after applying a perturbation. For example, let us check whether the model retains the same accuracy when all instances are converted to lowercase:

.. code-block:: python

    from text_sensitivity.test import compare_accuracy
    from text_sensitivity.perturbation.sentences import to_lower

    compare_accuracy(env, model, to_lower)

Similarly, we can check whether precision scores are the same if we add an unrelated string after each sentence:

.. code-block:: python

    from text_sensitivity.test import compare_precision
    from text_sensitivity.perturbation.base import OneToOnePerturbation

    perturbation_fn = OneToOnePerturbation.from_string(suffix='This should not affect scores')
    compare_precision(env, model, perturbation_fn)

Under the hood, ``text_sensitivity.test`` uses ``text_sensitivity.perturbation`` to perturb instances (``instancelib.instances.text.TextInstance`` or ``str``). It generates the new instances, together with labels for the original instance (e.g. 'not_upper') and the new instance(s) (e.g. 'upper').

.. code-block:: python

    from text_sensitivity.perturbation.sentences import to_upper, repeat_k_times
    from text_sensitivity.perturbation.characters import random_case_swap, random_spaces, swap_random, add_typos

    sample = 'This is his example string, made especially for HER!'

    # Convert the sample string to all upper
    list(to_upper()(sample))

    # Repeat the string 'test' n times
    list(repeat_k_times(n=3)('test'))
    list(repeat_k_times(n=7, connector='\n')('test'))

    # Randomly swap the character case (lower to upper or vice versa) in sample
    list(random_case_swap()(sample))

    # Add random spaces to words within sample
    list(random_spaces(n=5)(sample))

    # Swap characters randomly within words of sample (excluding stopwords and uppercase words)
    list(swap_random(n=10, stopwords=['the', 'is', 'of'], include_upper_case=False)(sample))

    # Add typos (based on QWERTY keyboard) to sample
    list(add_typos(n=10, stopwords=['the', 'is', 'of'], include_numeric=False, include_special_char=False)(sample))

Fairness
^^^^^^^^

*TODO*: Write up fairness.

Generating random data
~~~~~~~~~~~~~~~~~~~~~~

Data for entities can be generated in the same manner as random strings:

.. code-block:: python

    from text_sensitivity import (RandomCity, RandomCountry, RandomName)

    # Generates data for the current locale, e.g. if it is 'nl' it generates
    # country names in Dutch and cities in the Netherlands
    RandomCity().generate_list(n=10)

    # If you specify the locale, it can generate the entity (e.g. country) for multiple languages
    RandomCountry(languages=['nl', 'de', 'fr', 'jp']).generate_list(n=15)

Unlike random strings, random entities can also output the corresponding attribute labels for the generated data:

.. code-block:: python

    # For example, generate Dutch and Russian male and female names, and output which language and sex they are
    generator = RandomName(languages=['nl', 'ru'], sex=['male', 'female'], seed=5)
    generator.generate_list(n=10, attributes=True)

    # The same data can also be captured in an instancelib.InstanceProvider and instancelib.LabelProviders
    generator.generate(n=10, attributes=True)

Other random entities that can be generated are dates, street addresses, emails, phone numbers, price tags and crypto names:

.. code-block:: python

    # Dates
    from text_sensitivity import RandomYear, RandomMonth, RandomDay, RandomDayOfWeek

    print(RandomYear().generate_list(n=3))
    # Use .upper() to generate all uppercase or .lower() for all lowercase
    print(RandomMonth(languages=['nl', 'en']).upper().generate_list(n=6))
    print(RandomDay().generate_list(n=3))
    # Use .sentence() for sentencecase or .title() for titlecase
    print(RandomDayOfWeek().sentence().generate_list(n=3))

    # Street addresses, emails, phone numbers, price tags and crypto names
    from text_sensitivity import RandomAddress, RandomEmail, RandomPhoneNumber, RandomPriceTag, RandomCryptoCurrency

    print(RandomAddress(sep=', ').generate_list(n=5))
    print(RandomEmail(languages=['es', 'pt']).generate_list(n=10, attributes=True))
    print(RandomPhoneNumber().generate_list(n=5))
    print(RandomPriceTag(languages=['ru', 'de', 'it', 'br']).generate_list(n=10))
    print(RandomCryptoCurrency().generate_list(n=3))

Generating data from patterns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These entities, or your own lists, can be used to generate strings for locally testing model robustness/fairness. Text within curly braces (``{}``) is replaced, and attributes are added to each perturbed instance. The text outside of the curly braces remains the same. Examples of patterns that can be put between curly braces are:

* ``{a|b|c}`` generates a list with elements ``a``, ``b`` and ``c``.
* ``{city}`` uses ``RandomCity()`` (in current locale) to generate ``n`` random cities. For a full list of default patterns see ``from text_sensitivity import default_patterns; default_patterns()``.
* ``{custom_entity_name}`` with keyword argument ``custom_entity_name=['this', 'is', 'cool']`` generates a list with elements ``this``, ``is``, ``cool``.

.. code-block:: python

    from text_sensitivity import from_pattern

    # Generate a list ['This is his house', 'This was his house', 'This is his car', 'This was his car', ...]:
    from_pattern('This {is|was} his {house|car|boat}')

    # Generate a list ['His home town is Eindhoven.', 'Her home town is Eindhoven.', 'His home town is Meerssen.', ...].
    # By default uses `RandomCity()` to generate the city name.
    from_pattern('{His|Her} home town is {city}.')

    # Override the 'city' default with your own list ['Amsterdam', 'Rotterdam', 'Utrecht']:
    from_pattern('{His|Her} home town is {city}.', city=['Amsterdam', 'Rotterdam', 'Utrecht'])

In addition, modifiers can be added before a colon (``:``) within a curly brace to modify the generated data. Example modifiers are:

* ``{lower:address}`` generates addresses (``RandomAddress()`` for current locale) in all-lowercase.
* ``{upper:name}`` generates full names (``RandomName()`` for current locale) in all-uppercase.
* ``{sentence:day_of_week}`` generates days of the week (``RandomDayOfWeek()`` for current locale) in sentencecase.
* ``{title:country}`` generates country names (``RandomCountry()`` in locale language) in titlecase.

.. code-block:: python

    # Lowercase the day of the week and uppercase the second first name, getting
    # ['Vandaag, donderdag, heeft Sanne COLIN gebeld op +31612351983!', ...,
    #  'Vandaag, maandag, heeft Nora SEPP gebeld op +31612351983!', ...]
    from_pattern('Vandaag, {lower:day_of_week}, heeft {first_name} {upper:first_name} gebeld op {phone_number}!', n=5)
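
To make the substitution rules concrete, here is a minimal, library-independent sketch of how such patterns could be expanded. It is not how ``from_pattern`` is implemented: ``expand_pattern`` and ``MODIFIERS`` are hypothetical names, only explicit ``a|b|c`` lists and keyword-supplied lists are supported (there are no random defaults such as ``{city}``), no attribute labels are produced, and the order of the results is not meant to match the library.

.. code-block:: python

    import itertools
    import re
    from typing import List

    # Hypothetical modifier table covering a few of the modifiers listed above
    MODIFIERS = {'lower': str.lower, 'upper': str.upper, 'title': str.title}


    def expand_pattern(pattern: str, **entities: List[str]) -> List[str]:
        """Expand each {...} slot in `pattern` into every combination of its options."""
        # With a capture group, re.split alternates literal text (even indices)
        # and slot contents (odd indices)
        parts = re.split(r'\{([^{}]*)\}', pattern)
        choices = []
        for i, part in enumerate(parts):
            if i % 2 == 0:
                choices.append([part])  # literal text stays as-is
                continue
            # 'upper:city' -> modifier 'upper', body 'city'; 'is|was' -> no modifier
            modifier, _, body = part.rpartition(':')
            options = entities[body] if body in entities else body.split('|')
            transform = MODIFIERS[modifier] if modifier else str
            choices.append([transform(option) for option in options])
        return [''.join(combo) for combo in itertools.product(*choices)]


    houses = expand_pattern('This {is|was} his {house|car|boat}')
    towns = expand_pattern('{His|Her} home town is {upper:city}.', city=['Amsterdam', 'Utrecht'])
    print(houses)
    print(towns)

Taking the Cartesian product of the slots is what makes the number of generated strings grow multiplicatively with each added slot, which is worth keeping in mind when writing patterns with many alternatives.
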