| 1 | +--- |
| 2 | +sidebar_label: lit_explorer |
| 3 | +title: formfyxer.lit_explorer |
| 4 | +--- |
| 5 | + |
| 6 | +#### recursive\_get\_id |
| 7 | + |
| 8 | +Pull ID values out of the LIST/NSMI results from Spot. |
| 9 | + |
| 10 | +#### spot |
| 11 | + |
| 12 | +Call the Spot API (https://spot.suffolklitlab.org) to classify the text of a PDF using |
| 13 | +the NSMIv2/LIST taxonomy (https://taxonomy.legal/), and return only the IDs of the issues found in the text. |
| 14 | + |
| 15 | +#### re\_case |
| 16 | + |
| 17 | +Capture PascalCase, snake_case, and kebab-case terms and add spaces to separate the joined words. |
| 18 | + |
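A minimal sketch of the kind of re-casing described above, assuming plain regular expressions are enough. The function name `re_case_sketch` is hypothetical and is not the library's implementation.

```python
import re


def re_case_sketch(text: str) -> str:
    """Hypothetical helper: add spaces between words joined by case, underscores, or hyphens."""
    # Insert a space wherever a lowercase letter or digit is followed by a capital,
    # e.g. "UserFirstName" -> "User First Name".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Treat runs of underscores or hyphens as word separators.
    spaced = re.sub(r"[_\-]+", " ", spaced)
    return spaced.strip()
```

This sketch does not try to split acronyms (for example, `PDFName` stays as one word); the real function may handle more cases.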
| 19 | +#### regex\_norm\_field |
| 20 | + |
| 21 | +Apply some heuristics to a field name to see if we can get it to match AssemblyLine conventions. |
| 22 | +See: https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/document_variables |
| 23 | + |
| 24 | +#### reformat\_field |
| 25 | + |
| 26 | +Transform a string of text into a snake_case variable name close in length to `max_length` by |
| 27 | +summarizing the string and stitching the summary together in snake_case. |
| 28 | +h/t https://towardsdatascience.com/nlp-building-a-summariser-68e0c19e3a93 |
| 29 | + |
| 30 | +#### norm |
| 31 | + |
| 32 | +Normalize a word vector. |
| 33 | + |
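As an illustration, normalizing a vector usually means scaling it to unit length. The sketch below assumes L2 (Euclidean) normalization, which the docstring does not confirm, and uses the hypothetical name `norm_sketch`.

```python
import math
from typing import List, Sequence


def norm_sketch(vector: Sequence[float]) -> List[float]:
    """Hypothetical helper: scale a vector to unit Euclidean length (assumed L2 norm)."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        # A zero vector has no direction; return it unchanged rather than divide by zero.
        return list(vector)
    return [x / length for x in vector]
```

Unit-length vectors make cosine similarity reduce to a plain dot product, which is one common reason to normalize word vectors before comparing them.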
| 34 | +#### vectorize |
| 35 | + |
| 36 | +Vectorize a string of text. |
| 37 | + |
| 38 | +**Arguments**: |
| 39 | + |
| 40 | +- `text` - a string of multiple words to vectorize |
| 41 | +- `tools_token` - the token for tools.suffolklitlab.org, used for the micro-service |
| 42 | +  that reduces the amount of memory you need on your machine. If |
| 43 | +  not passed, you need to have `en_core_web_lg` installed. NOTE: this |
| 44 | +  last bit is no longer correct; you have to use the micro-service, |
| 45 | +  as we have had to remove spaCy due to a breaking change. |
| 46 | + |
| 47 | +#### normalize\_name |
| 48 | + |
| 49 | +Normalize a field name, if possible, to the Assembly Line conventions and, if |
| 50 | +not, to a snake_case variable name of appropriate length. |
| 51 | + |
| 52 | +HACK: temporarily, all we do is re-case it and normalize it using regex rules. |
| 53 | +Will be replaced with a call to an LLM soon. |
| 54 | + |
| 55 | +#### cluster\_screens |
| 56 | + |
| 57 | +Groups the given fields into screens based on how much they are related. |
| 58 | + |
| 59 | +**Arguments**: |
| 60 | + |
| 61 | +- `fields` - a list of field names |
| 62 | +- `damping` - a value >= 0.5 and < 1. Tunes how related screens should be |
| 63 | +- `tools_token` - the token for tools.suffolklitlab.org, needed for |
| 64 | +  micro-service vectorization |
| 65 | + |
| 66 | +- `Returns` - a suggested screen grouping, each screen name mapped to the list of fields on it |
| 67 | + |
| 68 | +## InputType Objects |
| 69 | + |
| 70 | +```python |
| 71 | +class InputType(Enum) |
| 72 | +``` |
| 73 | + |
| 74 | +Input type maps onto the type of input the PDF author chose for the field. We only |
| 75 | +handle text, checkbox, and signature fields. |
| 76 | + |
| 77 | +#### field\_types\_and\_sizes |
| 78 | + |
| 79 | +Transform the fields provided by get_existing_pdf_fields into a summary format. |
| 80 | +Result will look like: |
| 81 | +[ |
| 82 | +\{ |
| 83 | +"var_name": var_name, |
| 84 | +"type": "text | checkbox | signature", |
| 85 | +"max_length": n |
| 86 | +\} |
| 87 | +] |
| 88 | + |
| 89 | +## AnswerType Objects |
| 90 | + |
| 91 | +```python |
| 92 | +class AnswerType(Enum) |
| 93 | +``` |
| 94 | + |
| 95 | +Answer type describes the effort the user answering the form will require. |
| 96 | +"Slot-in" answers are a matter of almost instantaneous recall, e.g., name, address, etc. |
| 97 | +"Gathered" answers require looking around one's desk for, e.g., a health insurance number. |
| 98 | +"Third party" answers require picking up the phone to call someone else who is the keeper |
| 99 | +of the information. |
| 100 | +"Created" answers don't exist before the user is presented with the question. They may include |
| 101 | +a choice, creating a narrative, or even applying legal reasoning. "Affidavits" are a special |
| 102 | +form of created answers. |
| 103 | +See Jarrett and Gaffney, Forms That Work (2008) |
| 104 | + |
| 105 | +#### classify\_field |
| 106 | + |
| 107 | +Apply heuristics to the field's original and "normalized" name to classify |
| 108 | +it as either a "slot-in", "gathered", "third party" or "created" field type. |
| 109 | + |
| 110 | +#### get\_adjusted\_character\_count |
| 111 | + |
| 112 | +Determines the bracketed length of an input field based on its max_length attribute, |
| 113 | +returning a float representing the approximate length of the field content. |
| 114 | + |
| 115 | +The function chunks the answers into 5 different lengths (checkboxes, 2 words, short, medium, and long) |
| 116 | +instead of directly using the character count, as forms can allocate different spaces |
| 117 | +for the same data without considering the space the user actually needs. |
| 118 | + |
| 119 | +**Arguments**: |
| 120 | + |
| 121 | +- `field` _FieldInfo_ - An object containing information about the input field, |
| 122 | + including the "max_length" attribute. |
| 123 | + |
| 124 | + |
| 125 | +**Returns**: |
| 126 | + |
| 127 | +- `float` - The approximate length of the field content, categorized into checkboxes, 2 words, short, |
| 128 | + medium, or long based on the max_length attribute. |
| 129 | + |
| 130 | + |
| 131 | +**Examples**: |
| 132 | + |
| 133 | +  >>> get_adjusted_character_count(\{"type": InputType.CHECKBOX\}) |
| 134 | + 4.7 |
| 135 | + >>> get_adjusted_character_count(\{"max_length": 100\}) |
| 136 | + 9.4 |
| 137 | + >>> get_adjusted_character_count(\{"max_length": 300\}) |
| 138 | + 230 |
| 139 | + >>> get_adjusted_character_count(\{"max_length": 600\}) |
| 140 | + 115 |
| 141 | + >>> get_adjusted_character_count(\{"max_length": 1200\}) |
| 142 | + 1150 |
| 143 | + |
| 144 | +#### time\_to\_answer\_field |
| 145 | + |
| 146 | +Apply a heuristic for the time it takes to answer the given field, in minutes. |
| 147 | +It is hand-written for now. |
| 148 | +It will factor in the input type, the answer type (slot-in, gathered, third party, or created), and the |
| 149 | +amount of input text allowed in the field. |
| 150 | +The return value is a function that can return N samples of how long it will take to answer the field (in minutes). |
| 151 | + |
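The "function that returns N samples" pattern described above can be sketched as a closure. Everything here is an assumption made for illustration: the name `make_time_sampler`, the normal distribution, and the mean and spread parameters are hypothetical and are not taken from the library.

```python
import random
from typing import Callable, List, Optional


def make_time_sampler(
    mean_minutes: float, spread_minutes: float, seed: Optional[int] = None
) -> Callable[[int], List[float]]:
    """Hypothetical helper: build a function that draws answer-time samples in minutes."""
    rng = random.Random(seed)

    def sample(n: int) -> List[float]:
        # Assumed distribution: normal around the mean, clamped so times are never negative.
        return [max(0.0, rng.gauss(mean_minutes, spread_minutes)) for _ in range(n)]

    return sample
```

Returning a sampler instead of a single number lets a caller draw many samples per field and sum them across a form to get a spread of total times, not just a point estimate.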
| 152 | +#### time\_to\_answer\_form |
| 153 | + |
| 154 | +Provide an estimate of how long it would take an average user to respond to the questions |
| 155 | +on the provided form. |
| 156 | +We use signals such as the field type, name, and space provided for the response to come up with a |
| 157 | +rough estimate, based on whether the field is: |
| 158 | +1. fill in the blank |
| 159 | +2. gathered - e.g., an id number, case number, etc. |
| 160 | +3. third party: need to actually ask someone else for the information - e.g., the income of someone other than the user, anything else? |
| 161 | +4. created: |
| 162 | +a. short created (3 lines or so?) |
| 163 | +b. long created (anything over 3 lines) |
| 164 | + |
| 165 | +#### cleanup\_text |
| 166 | + |
| 167 | +Apply cleanup routines to text to provide more accurate readability statistics. |
| 168 | + |
| 169 | +#### text\_complete |
| 170 | + |
| 171 | +Run a prompt via OpenAI's API and return the result. |
| 172 | + |
| 173 | +**Arguments**: |
| 174 | + |
| 175 | +- `prompt` _str_ - The prompt to send to the API. |
| 176 | +- `max_tokens` _int, optional_ - The number of tokens to generate. Defaults to 500. |
| 177 | +- `creds` _Optional[OpenAiCreds], optional_ - The credentials to use. Defaults to None. |
| 178 | +- `temperature` _float, optional_ - The temperature to use. Defaults to 0. |
| 179 | + |
| 180 | +#### complete\_with\_command |
| 181 | + |
| 182 | +Combine some text with a command to send to OpenAI. |
| 183 | + |
| 184 | +#### needs\_calculations |
| 185 | + |
| 186 | +A conservative guess at whether a given form needs the filler to make math calculations, |
| 187 | +something that should be avoided. If |
| 188 | + |
| 189 | +#### tools\_passive |
| 190 | + |
| 191 | +Ping the passive voice API for a list of sentences that use the passive voice. |
| 192 | + |
| 193 | +#### get\_passive\_sentences |
| 194 | + |
| 195 | +Return a list of tuples, where each tuple represents a |
| 196 | +sentence in which passive voice was detected along with a list of the |
| 197 | +starting and ending position of each fragment that is phrased in the passive voice. |
| 198 | +The combination of the two can be used in the PDFStats frontend to highlight the |
| 199 | +passive text in an individual sentence. |
| 200 | + |
| 201 | +Text can either be a string or a list of strings. |
| 202 | +If provided a single string, it will be tokenized with NLTK and |
| 203 | +sentences containing fewer than 2 words will be ignored. |
| 204 | + |
| 205 | +#### get\_citations |
| 206 | + |
| 207 | +Get citations and some extra surrounding context (the full sentence) if the citation is |
| 208 | +fewer than 5 characters (often eyecite only captures a section symbol |
| 209 | +for state-level short citation formats). |
| 210 | + |
| 211 | +#### get\_sensitive\_data\_types |
| 212 | + |
| 213 | +Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive |
| 214 | +fields grouped by type. A list of the old field names can also be provided. These fields should be in the same |
| 215 | +order. Passing the old field names allows the sensitive field algorithm to match more accurately. The return value |
| 216 | +will not contain the old field name, only the corresponding field name from the first parameter. |
| 217 | + |
| 218 | +The sensitive data types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security |
| 219 | +Number. |
| 220 | + |
| 221 | +#### substitute\_phrases |
| 222 | + |
| 223 | +Substitute phrases in the input string and return the new string and positions of substituted phrases. |
| 224 | + |
| 225 | +**Arguments**: |
| 226 | + |
| 227 | +- `input_string` _str_ - The input string containing phrases to be replaced. |
| 228 | +- `substitution_phrases` _Dict[str, str]_ - A dictionary mapping original phrases to their replacement phrases. |
| 229 | + |
| 230 | + |
| 231 | +**Returns**: |
| 232 | + |
| 233 | + Tuple[str, List[Tuple[int, int]]]: A tuple containing the new string with substituted phrases and a list of |
| 234 | + tuples, each containing the start and end positions of the substituted |
| 235 | + phrases in the new string. |
| 236 | + |
| 237 | + |
| 238 | +**Example**: |
| 239 | + |
| 240 | + >>> input_string = "The quick brown fox jumped over the lazy dog." |
| 241 | + >>> substitution_phrases = \{"quick brown": "swift reddish", "lazy dog": "sleepy canine"\} |
| 242 | + >>> new_string, positions = substitute_phrases(input_string, substitution_phrases) |
| 243 | + >>> print(new_string) |
| 244 | + "The swift reddish fox jumped over the sleepy canine." |
| 245 | + >>> print(positions) |
| 246 | +  [(4, 17), (38, 51)] |
| 247 | + |
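One way to track replacement positions in the new string is to walk the matches left to right while keeping a running length of the output. The sketch below is a hypothetical re-implementation (`substitute_phrases_sketch`), not the library's code. It assumes case-sensitive, literal matching with no word-boundary checks.

```python
import re
from typing import Dict, List, Tuple


def substitute_phrases_sketch(
    input_string: str, substitution_phrases: Dict[str, str]
) -> Tuple[str, List[Tuple[int, int]]]:
    """Hypothetical helper: replace phrases and record where each replacement lands."""
    if not substitution_phrases:
        return input_string, []
    # Try longer phrases first so overlapping keys prefer the longest match.
    keys = sorted(substitution_phrases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    pieces: List[str] = []
    positions: List[Tuple[int, int]] = []
    cursor = 0   # index into input_string
    new_len = 0  # length of the output built so far
    for match in pattern.finditer(input_string):
        untouched = input_string[cursor:match.start()]
        replacement = substitution_phrases[match.group(0)]
        pieces.append(untouched)
        new_len += len(untouched)
        # Positions are measured in the new string, not the original.
        positions.append((new_len, new_len + len(replacement)))
        pieces.append(replacement)
        new_len += len(replacement)
        cursor = match.end()
    pieces.append(input_string[cursor:])
    return "".join(pieces), positions
```

Because each position is computed from the output length so far, earlier replacements that change the string's length automatically shift the positions of later ones.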
| 248 | +#### substitute\_neutral\_gender |
| 249 | + |
| 250 | +Substitute gendered phrases with neutral phrases in the input string. |
| 251 | +Primary source is https://github.com/joelparkerhenderson/inclusive-language |
| 252 | + |
| 253 | +#### substitute\_plain\_language |
| 254 | + |
| 255 | +Substitute complex phrases with simpler alternatives. |
| 256 | +Terms are drawn from https://www.plainlanguage.gov/guidelines/words/ |
| 257 | + |
| 258 | +#### transformed\_sentences |
| 259 | + |
| 260 | +Apply a function to a list of sentences and return only the sentences with changed terms. |
| 261 | +The result is a tuple of the original sentence, new sentence, and the starting and ending position |
| 262 | +of each changed fragment in the sentence. |
| 263 | + |
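The filtering behavior described above can be sketched as follows. Both function names are hypothetical, and the toy transformation exists only to exercise the sketch. The sketch assumes the transformation returns a `(new_sentence, positions)` pair.

```python
from typing import Callable, List, Tuple

Positions = List[Tuple[int, int]]


def transformed_sentences_sketch(
    sentences: List[str],
    func: Callable[[str], Tuple[str, Positions]],
) -> List[Tuple[str, str, Positions]]:
    """Hypothetical helper: keep only the sentences that the transformation changed."""
    changed = []
    for sentence in sentences:
        new_sentence, positions = func(sentence)
        if new_sentence != sentence:
            changed.append((sentence, new_sentence, positions))
    return changed


def replace_utilize(sentence: str) -> Tuple[str, Positions]:
    """Toy transformation (hypothetical): replace the first 'utilize' with 'use'."""
    start = sentence.find("utilize")
    if start == -1:
        return sentence, []
    new_sentence = sentence[:start] + "use" + sentence[start + len("utilize"):]
    return new_sentence, [(start, start + len("use"))]
```

For example, passing `["Please utilize the form.", "Sign here."]` with `replace_utilize` keeps only the first sentence, paired with its rewritten form and the span of the replacement.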
| 264 | +#### parse\_form |
| 265 | + |
| 266 | +Read in a PDF, pull out basic stats, attempt to normalize its form fields, and re-write the |
| 267 | +in_file with the new fields (if `rewrite=1`). If you pass a Spot token, we will guess the |
| 268 | +NSMI code. If you pass OpenAI credentials, we will give suggestions for the title and description. |
| 269 | + |
| 270 | +#### form\_complexity |
| 271 | + |
| 272 | +Get a single number representing how hard the form is to complete. Higher is harder. |
| 273 | + |