Commit f32bb15

Initial analysis and setup for impounded addresses documentation
Co-authored-by: nonprofittechy <7645641+nonprofittechy@users.noreply.github.com>
1 parent a88365c commit f32bb15

3 files changed: +702 -0 lines changed
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
---
sidebar_label: docx_wrangling
title: formfyxer.docx_wrangling
---

#### update\_docx

Update the document with the modified runs.

Note: OpenAI is probabilistic, so the modified run indices may not be correct.
When the index of a run or paragraph is out of range, a new paragraph
will be inserted at the end of the document, or a new run at the end of the
paragraph's runs.

Take a careful look at the output document to make sure it is still correct.

**Arguments**:

- `document` - the docx.Document object, or the path to the DOCX file
- `modified_runs` - a tuple of paragraph number, run number, the modified text, a question (not used), and whether a new paragraph should be inserted (for conditional text)

**Returns**:

The modified document.

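As a rough illustration only, the sketch below shows one way `update_docx` could be called, assuming it is importable from `formfyxer.docx_wrangling` and that each entry in `modified_runs` follows the five-element tuple documented above; the file names are hypothetical.

```python
from docx import Document
from formfyxer.docx_wrangling import update_docx

# Each entry: (paragraph index, run index, new text, question (unused), new-paragraph flag)
modified_runs = [
    (0, 1, "Dear {{ other_parties[0] }}:", "", 0),
]

# `document` may be a docx.Document object or a path to the DOCX file
doc = update_docx(Document("letter.docx"), modified_runs)
doc.save("letter_labeled.docx")
```

Because the run indices come from a probabilistic model upstream, reviewing the saved output is still recommended.
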
#### get\_docx\_repr

Return a JSON representation of the paragraphs and runs in the DOCX file.

**Arguments**:

- `docx_path` - path to the DOCX file

**Returns**:

A JSON representation of the paragraphs and runs in the DOCX file.

#### get\_labeled\_docx\_runs

Scan the DOCX and return a list of modified text with Jinja2 variable names inserted.

**Arguments**:

- `docx_path` - path to the DOCX file
- `docx_repr` - a string representation of the paragraphs and runs in the DOCX file, if docx_path is not provided. This might be useful if you want to reuse the output of `get_docx_repr` instead of re-reading the file.
- `custom_people_names` - a tuple of custom names and descriptions to use in addition to the default ones. Like: ("clients", "the person benefiting from the form")

**Returns**:

A list of tuples, each containing a paragraph number, run number, and the modified text of the run.

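As a sketch of how this fits together, assuming `get_labeled_docx_runs` accepts a path as its first argument and the keyword argument documented above (the file name is hypothetical):

```python
from formfyxer.docx_wrangling import get_labeled_docx_runs

# Ask the model to label runs with Jinja2 variable names
labeled_runs = get_labeled_docx_runs(
    "eviction_answer.docx",
    custom_people_names=("clients", "the person benefiting from the form"),
)

# Each item pairs paragraph/run indices with the rewritten text of that run
for paragraph, run, new_text, *rest in labeled_runs:
    print(paragraph, run, new_text)
```

This is essentially the combination that `modify_docx_with_openai_guesses` (documented below) appears to wrap, together with `update_docx`.
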
#### get\_modified\_docx\_runs

Use GPT to rewrite the contents of a DOCX file paragraph by paragraph. Does not handle tables, footers, or
other structures yet.

This is a light wrapper that provides the structure of DOCX paragraphs and runs to your prompt
to OpenAI to facilitate the rewriting of the document without disrupting formatting.

For example, this could be used to:
* Remove any passive voice
* Replace placeholder text with variable names
* Rewrite to a 6th grade reading level
* Do an advanced search and replace, without requiring you to use a regex

By default, the example prompt includes a sample like this:

[
[0, 0, "Dear "],
[0, 1, "John Smith:"],
[1, 0, "I hope this letter finds you well."],
]

Your custom instructions should include an example of how the sample will be modified, like the one below:

Example reply, indicating paragraph, run, the new text, and a number indicating if this changes the
current paragraph, adds one before, or adds one after (-1, 0, 1):

\{"results":
[
[0, 1, "Dear \{\{ other_parties[0] \}\}:", 0],
[2, 0, "\{%p if is_tenant %\}", -1],
[3, 0, "\{%p endif %\}", 1],
]
\}

You may also want to customize the input example to better match your use case.

**Arguments**:

- `docx_path` _str_ - path to the DOCX file
- `docx_repr` _str_ - a string representation of the paragraphs and runs in the DOCX file, if docx_path is not provided.
- `custom_example` _Optional[str]_ - a string containing the purpose and overview of the task
- `instructions` _str_ - a string containing specific instructions for the task
- `openai_client` _Optional[OpenAI]_ - an OpenAI client object. If not provided, a new one will be created.
- `api_key` _Optional[str]_ - an OpenAI API key. If not provided, it will be obtained from the environment
- `temperature` _float_ - the temperature to use when generating text. Lower temperatures are more conservative.

**Returns**:

A list of tuples, each containing a paragraph number, run number, and the modified text of the run.

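As an illustration only, a call might look like the following; the keyword names mirror the argument list above, while the prompt text and file name are hypothetical:

```python
from formfyxer.docx_wrangling import get_modified_docx_runs

# Rewrite every run at a 6th grade reading level; the API key is assumed to be
# available in the environment, as documented above
rewritten_runs = get_modified_docx_runs(
    docx_path="demand_letter.docx",
    custom_example='[[0, 0, "Dear John Smith:"]]',
    instructions="Rewrite each run at a 6th grade reading level without changing its meaning.",
    temperature=0.1,
)

# First few (paragraph, run, new text) tuples
print(rewritten_runs[:3])
```
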
#### make\_docx\_plain\_language

Convert a DOCX file to plain language with the help of OpenAI.

#### modify\_docx\_with\_openai\_guesses

Uses OpenAI to guess the variable names for a document and then modifies the document with the guesses.

**Arguments**:

- `docx_path` _str_ - Path to the DOCX file to modify.

**Returns**:

- `docx.Document` - The modified document, ready to be saved to the same or a new path

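Putting the module together, a minimal end-to-end sketch; the file names are hypothetical, both helpers are assumed to accept a path, and only `modify_docx_with_openai_guesses` is documented here to return a `docx.Document`:

```python
from formfyxer.docx_wrangling import (
    make_docx_plain_language,
    modify_docx_with_openai_guesses,
)

# Let OpenAI guess Jinja2 variable names, then save the labeled copy
labeled = modify_docx_with_openai_guesses("hardship_letter.docx")
labeled.save("hardship_letter_labeled.docx")

# Separately, produce a plain-language rewrite of the same file
# (its return value is not documented in this module)
make_docx_plain_language("hardship_letter.docx")
```
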
Lines changed: 273 additions & 0 deletions
@@ -0,0 +1,273 @@
---
sidebar_label: lit_explorer
title: formfyxer.lit_explorer
---

#### recursive\_get\_id

Pull ID values out of the LIST/NSMI results from Spot.

#### spot

Call the Spot API (https://spot.suffolklitlab.org) to classify the text of a PDF using
the NSMIv2/LIST taxonomy (https://taxonomy.legal/), but return only the IDs of issues found in the text.

#### re\_case

Capture PascalCase, snake_case, and kebab-case terms and add spaces to separate the joined words.

#### regex\_norm\_field

Apply some heuristics to a field name to see if we can get it to match AssemblyLine conventions.
See: https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/document_variables

#### reformat\_field

Transform a string of text into a snake_case variable name close in length to `max_length` by
summarizing the string and stitching the summary together in snake_case.
h/t https://towardsdatascience.com/nlp-building-a-summariser-68e0c19e3a93

#### norm

Normalize a word vector.

#### vectorize

Vectorize a string of text.

**Arguments**:

- `text` - a string of multiple words to vectorize
- `tools_token` - the token to tools.suffolklitlab.org, used for the micro-service
  to reduce the amount of memory you need on your machine. NOTE: the token is now
  required; the fallback to a locally installed `en_core_web_lg` model no longer
  works, because spaCy had to be removed due to a breaking change, so you have to
  use the micro-service.

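A hypothetical call, assuming the `TOOLS_TOKEN` environment variable holds a token for tools.suffolklitlab.org; the argument names follow the list above:

```python
import os

from formfyxer.lit_explorer import vectorize

# Vectorize a short phrase using the tools.suffolklitlab.org micro-service
vector = vectorize(
    "full name of the person signing the form",
    tools_token=os.environ["TOOLS_TOKEN"],
)
print(len(vector))
```
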
#### normalize\_name

Normalize a field name, if possible to the Assembly Line conventions, and if
not, to a snake_case variable name of appropriate length.

HACK: temporarily all we do is re-case it and normalize it using regex rules.
Will be replaced with a call to an LLM soon.

#### cluster\_screens

Groups the given fields into screens based on how closely they are related.

**Arguments**:

- `fields` - a list of field names
- `damping` - a value >= 0.5 and < 1. Tunes how related screens should be
- `tools_token` - the token to tools.suffolklitlab.org, needed if doing
  micro-service vectorization

- `Returns` - a suggested screen grouping, each screen name mapped to the list of fields on it

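A minimal sketch under the same assumption about the token; the field names are hypothetical:

```python
import os

from formfyxer.lit_explorer import cluster_screens

fields = [
    "users1_name",
    "users1_birthdate",
    "users1_address_line_one",
    "other_parties1_name",
]

# damping must be >= 0.5 and < 1; higher values tend toward broader groupings
screens = cluster_screens(fields, damping=0.7, tools_token=os.environ["TOOLS_TOKEN"])

# The documented return value maps each suggested screen name to its fields
for screen_name, screen_fields in screens.items():
    print(screen_name, screen_fields)
```
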
## InputType Objects

```python
class InputType(Enum)
```

Input type maps onto the type of input the PDF author chose for the field. We only
handle text, checkbox, and signature fields.

#### field\_types\_and\_sizes

Transform the fields provided by get_existing_pdf_fields into a summary format.
Result will look like:

[
\{
"var_name": var_name,
"type": "text | checkbox | signature",
"max_length": n
\}
]

## AnswerType Objects

```python
class AnswerType(Enum)
```

Answer type describes the effort the user answering the form will require.
"Slot-in" answers are a matter of almost instantaneous recall, e.g., name, address, etc.
"Gathered" answers require looking around one's desk, e.g., for a health insurance number.
"Third party" answers require picking up the phone to call someone else who is the keeper
of the information.
"Created" answers don't exist before the user is presented with the question. They may include
a choice, creating a narrative, or even applying legal reasoning. "Affidavits" are a special
form of created answers.
See Jarrett and Gaffney, Forms That Work (2008)

#### classify\_field

Apply heuristics to the field's original and "normalized" name to classify
it as either a "slot-in", "gathered", "third party" or "created" field type.

#### get\_adjusted\_character\_count

Determines the bracketed length of an input field based on its max_length attribute,
returning a float representing the approximate length of the field content.

The function chunks the answers into 5 different lengths (checkboxes, 2 words, short, medium, and long)
instead of directly using the character count, as forms can allocate different spaces
for the same data without considering the space the user actually needs.

**Arguments**:

- `field` _FieldInfo_ - An object containing information about the input field,
  including the "max_length" attribute.

**Returns**:

- `float` - The approximate length of the field content, categorized into checkboxes, 2 words, short,
  medium, or long based on the max_length attribute.

**Examples**:

>>> get_adjusted_character_count(\{"type": InputType.CHECKBOX\})
4.7
>>> get_adjusted_character_count(\{"max_length": 100\})
9.4
>>> get_adjusted_character_count(\{"max_length": 300\})
230
>>> get_adjusted_character_count(\{"max_length": 600\})
115
>>> get_adjusted_character_count(\{"max_length": 1200\})
1150

#### time\_to\_answer\_field

Apply a heuristic for the time it takes to answer the given field, in minutes.
It is hand-written for now.
It factors in the input type, the answer type (slot-in, gathered, third party, or created), and the
amount of input text allowed in the field.
The return value is a function that can return N samples of how long it will take to answer the field (in minutes).

#### time\_to\_answer\_form

Provide an estimate of how long it would take an average user to respond to the questions
on the provided form.
We use signals such as the field type, name, and space provided for the response to come up with a
rough estimate, based on whether the field is:
1. fill in the blank
2. gathered - e.g., an ID number, case number, etc.
3. third party: need to actually ask someone else for the information - e.g., the income of someone other than the user
4. created:
   a. short created (3 lines or so)
   b. long created (anything over 3 lines)

#### cleanup\_text

Apply cleanup routines to text to provide more accurate readability statistics.

#### text\_complete

Run a prompt via OpenAI's API and return the result.

**Arguments**:

- `prompt` _str_ - The prompt to send to the API.
- `max_tokens` _int, optional_ - The number of tokens to generate. Defaults to 500.
- `creds` _Optional[OpenAiCreds], optional_ - The credentials to use. Defaults to None.
- `temperature` _float, optional_ - The temperature to use. Defaults to 0.

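For illustration, a call using the documented arguments might look like this; the prompt text is made up, and credentials are assumed to be available to the library since `creds` defaults to None:

```python
from formfyxer.lit_explorer import text_complete

# Ask for a short completion at temperature 0 (deterministic-ish output)
summary = text_complete(
    "Summarize this court form in one sentence: a motion to waive filing fees.",
    max_tokens=200,
    temperature=0,
)
print(summary)
```
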
#### complete\_with\_command

Combines some text with a command to send to OpenAI.

#### needs\_calculations

A conservative guess at whether a given form requires the filler to make math calculations,
something that should be avoided.

#### tools\_passive

Ping the passive voice API for a list of sentences that use the passive voice.

#### get\_passive\_sentences

Return a list of tuples, where each tuple represents a
sentence in which passive voice was detected along with a list of the
starting and ending position of each fragment that is phrased in the passive voice.
The combination of the two can be used in the PDFStats frontend to highlight the
passive text in an individual sentence.

Text can either be a string or a list of strings.
If provided a single string, it will be tokenized with NLTK and
sentences containing fewer than 2 words will be ignored.

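As a rough sketch (the sample text is made up; the shape of the returned tuples is described above):

```python
from formfyxer.lit_explorer import get_passive_sentences

text = "The form was signed by the tenant. The landlord mailed the notice."

# Each result pairs a sentence with the character spans of its passive fragments
for sentence, spans in get_passive_sentences(text):
    print(sentence, spans)
```
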
#### get\_citations

Get citations and some extra surrounding context (the full sentence), if the citation is
fewer than 5 characters (often eyecite only captures a section symbol
for state-level short citation formats)

#### get\_sensitive\_data\_types

Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive
fields grouped by type. A list of the old field names can also be provided. These fields should be in the same
order. Passing the old field names allows the sensitive field algorithm to match more accurately. The return value
will not contain the old field name, only the corresponding field name from the first parameter.

The sensitive data types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security
Number.

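A sketch of a call, assuming the two documented lists are passed positionally; the field names are hypothetical:

```python
from formfyxer.lit_explorer import get_sensitive_data_types

new_fields = ["users1_ssn", "users1_bank_account_number", "users1_name"]
old_fields = ["SSN", "Bank Acct No", "Name"]

# The old names are optional but, per the description above, improve matching;
# the result groups matching names from new_fields by sensitive data type
sensitive = get_sensitive_data_types(new_fields, old_fields)
print(sensitive)
```
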
#### substitute\_phrases

Substitute phrases in the input string and return the new string and positions of substituted phrases.

**Arguments**:

- `input_string` _str_ - The input string containing phrases to be replaced.
- `substitution_phrases` _Dict[str, str]_ - A dictionary mapping original phrases to their replacement phrases.

**Returns**:

Tuple[str, List[Tuple[int, int]]]: A tuple containing the new string with substituted phrases and a list of
tuples, each containing the start and end positions of the substituted
phrases in the new string.

**Example**:

>>> input_string = "The quick brown fox jumped over the lazy dog."
>>> substitution_phrases = \{"quick brown": "swift reddish", "lazy dog": "sleepy canine"\}
>>> new_string, positions = substitute_phrases(input_string, substitution_phrases)
>>> print(new_string)
"The swift reddish fox jumped over the sleepy canine."
>>> print(positions)
[(4, 17), (35, 48)]

#### substitute\_neutral\_gender

Substitute gendered phrases with neutral phrases in the input string.
Primary source is https://github.com/joelparkerhenderson/inclusive-language

#### substitute\_plain\_language

Substitute complex phrases with simpler alternatives.
Source of terms is drawn from https://www.plainlanguage.gov/guidelines/words/

#### transformed\_sentences

Apply a function to a list of sentences and return only the sentences with changed terms.
The result is a tuple of the original sentence, new sentence, and the starting and ending position
of each changed fragment in the sentence.

#### parse\_form

Read in a PDF, pull out basic stats, attempt to normalize its form fields, and re-write the
in_file with the new fields (if `rewrite=1`). If you pass a Spot token, we will guess the
NSMI code. If you pass OpenAI creds, we will give suggestions for the title and description.

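A very rough sketch of a call: only the pieces mentioned above (the input file, `rewrite`, and a Spot token) are shown, the keyword names are assumptions based on that description, and the file name and token are placeholders:

```python
import os

from formfyxer.lit_explorer import parse_form

# Normalize the form fields, re-write the input file, and ask Spot for an NSMI guess;
# the keyword names here are assumptions, not the definitive signature
stats = parse_form(
    "motion_to_dismiss.pdf",
    rewrite=1,
    spot_token=os.environ["SPOT_TOKEN"],
)
print(stats)
```
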
#### form\_complexity

Returns a single number indicating how hard the form is to complete. Higher is harder.
