Commit f32bb15

Initial analysis and setup for impounded addresses documentation
Co-authored-by: nonprofittechy <7645641+nonprofittechy@users.noreply.github.com>
1 parent a88365c commit f32bb15

3 files changed: +702 -0 lines changed
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
---
sidebar_label: docx_wrangling
title: formfyxer.docx_wrangling
---

#### update\_docx

Update the document with the modified runs.

Note: OpenAI is probabilistic, so the modified run indices may not be correct.
When the index of a run or paragraph is out of range, a new paragraph
will be inserted at the end of the document, or a new run at the end of the
paragraph's runs.

Take a careful look at the output document to make sure it is still correct.

**Arguments**:

- `document` - the docx.Document object, or the path to the DOCX file
- `modified_runs` - a tuple of paragraph number, run number, the modified text, a question (not used), and whether a new paragraph should be inserted (for conditional text)

**Returns**:

The modified document.

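As a rough illustration only, the sketch below shows one way `update_docx` could be called, assuming it is importable from `formfyxer.docx_wrangling` and that each entry in `modified_runs` follows the five-element tuple documented above; the file names are hypothetical.

```python
from docx import Document
from formfyxer.docx_wrangling import update_docx

# Each entry: (paragraph index, run index, new text, question (unused), new-paragraph flag)
modified_runs = [
    (0, 1, "Dear {{ other_parties[0] }}:", "", 0),
]

# `document` may be a docx.Document object or a path to the DOCX file
doc = update_docx(Document("letter.docx"), modified_runs)
doc.save("letter_labeled.docx")
```

Because the run indices come from a probabilistic model upstream, reviewing the saved output is still recommended.
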
#### get\_docx\_repr

Return a JSON representation of the paragraphs and runs in the DOCX file.

**Arguments**:

- `docx_path` - path to the DOCX file

**Returns**:

A JSON representation of the paragraphs and runs in the DOCX file.

#### get\_labeled\_docx\_runs

Scan the DOCX and return a list of modified text with Jinja2 variable names inserted.

**Arguments**:

- `docx_path` - path to the DOCX file
- `docx_repr` - a string representation of the paragraphs and runs in the DOCX file, if docx_path is not provided. This might be useful if you want to reuse the output of `get_docx_repr` instead of re-reading the file.
- `custom_people_names` - a tuple of custom names and descriptions to use in addition to the default ones. Like: ("clients", "the person benefiting from the form")

**Returns**:

A list of tuples, each containing a paragraph number, run number, and the modified text of the run.

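As a sketch of how this fits together, assuming `get_labeled_docx_runs` accepts a path as its first argument and the keyword argument documented above (the file name is hypothetical):

```python
from formfyxer.docx_wrangling import get_labeled_docx_runs

# Ask the model to label runs with Jinja2 variable names
labeled_runs = get_labeled_docx_runs(
    "eviction_answer.docx",
    custom_people_names=("clients", "the person benefiting from the form"),
)

# Each item pairs paragraph/run indices with the rewritten text of that run
for paragraph, run, new_text, *rest in labeled_runs:
    print(paragraph, run, new_text)
```

This is essentially the combination that `modify_docx_with_openai_guesses` (documented below) appears to wrap, together with `update_docx`.
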
#### get\_modified\_docx\_runs

Use GPT to rewrite the contents of a DOCX file paragraph by paragraph. Does not handle tables, footers, or
other structures yet.

This is a light wrapper that provides the structure of DOCX paragraphs and runs to your prompt
to OpenAI to facilitate the rewriting of the document without disrupting formatting.

For example, this could be used to:
* Remove any passive voice
* Replace placeholder text with variable names
* Rewrite to a 6th grade reading level
* Do an advanced search and replace, without requiring you to use a regex

By default, the example prompt includes a sample like this:

[
[0, 0, "Dear "],
[0, 1, "John Smith:"],
[1, 0, "I hope this letter finds you well."],
]

Your custom instructions should include an example of how the sample will be modified, like the one below:

Example reply, indicating paragraph, run, the new text, and a number indicating if this changes the
current paragraph, adds one before, or adds one after (-1, 0, 1):

\{"results":
[
[0, 1, "Dear \{\{ other_parties[0] \}\}:", 0],
[2, 0, "\{%p if is_tenant %\}", -1],
[3, 0, "\{%p endif %\}", 1],
]
\}

You may also want to customize the input example to better match your use case.

**Arguments**:

- `docx_path` _str_ - path to the DOCX file
- `docx_repr` _str_ - a string representation of the paragraphs and runs in the DOCX file, if docx_path is not provided.
- `custom_example` _Optional[str]_ - a string containing the purpose and overview of the task
- `instructions` _str_ - a string containing specific instructions for the task
- `openai_client` _Optional[OpenAI]_ - an OpenAI client object. If not provided, a new one will be created.
- `api_key` _Optional[str]_ - an OpenAI API key. If not provided, it will be obtained from the environment
- `temperature` _float_ - the temperature to use when generating text. Lower temperatures are more conservative.

**Returns**:

A list of tuples, each containing a paragraph number, run number, and the modified text of the run.

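As an illustration only, a call might look like the following; the keyword names mirror the argument list above, while the prompt text and file name are hypothetical:

```python
from formfyxer.docx_wrangling import get_modified_docx_runs

# Rewrite every run at a 6th grade reading level; the API key is assumed to be
# available in the environment, as documented above
rewritten_runs = get_modified_docx_runs(
    docx_path="demand_letter.docx",
    custom_example='[[0, 0, "Dear John Smith:"]]',
    instructions="Rewrite each run at a 6th grade reading level without changing its meaning.",
    temperature=0.1,
)

# First few (paragraph, run, new text) tuples
print(rewritten_runs[:3])
```
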
#### make\_docx\_plain\_language

Convert a DOCX file to plain language with the help of OpenAI.

#### modify\_docx\_with\_openai\_guesses

Uses OpenAI to guess the variable names for a document and then modifies the document with the guesses.

**Arguments**:

- `docx_path` _str_ - Path to the DOCX file to modify.

**Returns**:

- `docx.Document` - The modified document, ready to be saved to the same or a new path

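Putting the module together, a minimal end-to-end sketch; the file names are hypothetical, both helpers are assumed to accept a path, and only `modify_docx_with_openai_guesses` is documented here to return a `docx.Document`:

```python
from formfyxer.docx_wrangling import (
    make_docx_plain_language,
    modify_docx_with_openai_guesses,
)

# Let OpenAI guess Jinja2 variable names, then save the labeled copy
labeled = modify_docx_with_openai_guesses("hardship_letter.docx")
labeled.save("hardship_letter_labeled.docx")

# Separately, produce a plain-language rewrite of the same file
# (its return value is not documented in this module)
make_docx_plain_language("hardship_letter.docx")
```
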
Lines changed: 273 additions & 0 deletions
@@ -0,0 +1,273 @@
---
sidebar_label: lit_explorer
title: formfyxer.lit_explorer
---

#### recursive\_get\_id

Pull ID values out of the LIST/NSMI results from Spot.

#### spot

Call the Spot API (https://spot.suffolklitlab.org) to classify the text of a PDF using
the NSMIv2/LIST taxonomy (https://taxonomy.legal/), but return only the IDs of issues found in the text.

#### re\_case

Capture PascalCase, snake_case, and kebab-case terms and add spaces to separate the joined words.

#### regex\_norm\_field

Apply some heuristics to a field name to see if we can get it to match AssemblyLine conventions.
See: https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/document_variables

#### reformat\_field

Transform a string of text into a snake_case variable name close in length to `max_length` by
summarizing the string and stitching the summary together in snake_case.
h/t https://towardsdatascience.com/nlp-building-a-summariser-68e0c19e3a93

#### norm

Normalize a word vector.

#### vectorize

Vectorize a string of text.

**Arguments**:

- `text` - a string of multiple words to vectorize
- `tools_token` - the token to tools.suffolklitlab.org, used for the micro-service
  to reduce the amount of memory you need on your machine. NOTE: the token is now
  required; the fallback to a locally installed `en_core_web_lg` model no longer
  works, because spaCy had to be removed due to a breaking change, so you have to
  use the micro-service.

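A hypothetical call, assuming the `TOOLS_TOKEN` environment variable holds a token for tools.suffolklitlab.org; the argument names follow the list above:

```python
import os

from formfyxer.lit_explorer import vectorize

# Vectorize a short phrase using the tools.suffolklitlab.org micro-service
vector = vectorize(
    "full name of the person signing the form",
    tools_token=os.environ["TOOLS_TOKEN"],
)
print(len(vector))
```
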
#### normalize\_name

Normalize a field name, if possible to the Assembly Line conventions, and if
not, to a snake_case variable name of appropriate length.

HACK: temporarily all we do is re-case it and normalize it using regex rules.
Will be replaced with a call to an LLM soon.

#### cluster\_screens

Groups the given fields into screens based on how closely they are related.

**Arguments**:

- `fields` - a list of field names
- `damping` - a value >= 0.5 and < 1. Tunes how related screens should be
- `tools_token` - the token to tools.suffolklitlab.org, needed if doing
  micro-service vectorization

- `Returns` - a suggested screen grouping, each screen name mapped to the list of fields on it

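A minimal sketch under the same assumption about the token; the field names are hypothetical:

```python
import os

from formfyxer.lit_explorer import cluster_screens

fields = [
    "users1_name",
    "users1_birthdate",
    "users1_address_line_one",
    "other_parties1_name",
]

# damping must be >= 0.5 and < 1; higher values tend toward broader groupings
screens = cluster_screens(fields, damping=0.7, tools_token=os.environ["TOOLS_TOKEN"])

# The documented return value maps each suggested screen name to its fields
for screen_name, screen_fields in screens.items():
    print(screen_name, screen_fields)
```
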
## InputType Objects

```python
class InputType(Enum)
```

Input type maps onto the type of input the PDF author chose for the field. We only
handle text, checkbox, and signature fields.

#### field\_types\_and\_sizes

Transform the fields provided by get_existing_pdf_fields into a summary format.
Result will look like:

[
\{
"var_name": var_name,
"type": "text | checkbox | signature",
"max_length": n
\}
]

## AnswerType Objects

```python
class AnswerType(Enum)
```

Answer type describes the effort the user answering the form will require.
"Slot-in" answers are a matter of almost instantaneous recall, e.g., name, address, etc.
"Gathered" answers require looking around one's desk, e.g., for a health insurance number.
"Third party" answers require picking up the phone to call someone else who is the keeper
of the information.
"Created" answers don't exist before the user is presented with the question. They may include
a choice, creating a narrative, or even applying legal reasoning. "Affidavits" are a special
form of created answers.
See Jarrett and Gaffney, Forms That Work (2008)

#### classify\_field

Apply heuristics to the field's original and "normalized" name to classify
it as either a "slot-in", "gathered", "third party" or "created" field type.

#### get\_adjusted\_character\_count

Determines the bracketed length of an input field based on its max_length attribute,
returning a float representing the approximate length of the field content.

The function chunks the answers into 5 different lengths (checkboxes, 2 words, short, medium, and long)
instead of directly using the character count, as forms can allocate different spaces
for the same data without considering the space the user actually needs.

**Arguments**:

- `field` _FieldInfo_ - An object containing information about the input field,
  including the "max_length" attribute.

**Returns**:

- `float` - The approximate length of the field content, categorized into checkboxes, 2 words, short,
  medium, or long based on the max_length attribute.

**Examples**:

>>> get_adjusted_character_count(\{"type": InputType.CHECKBOX\})
4.7
>>> get_adjusted_character_count(\{"max_length": 100\})
9.4
>>> get_adjusted_character_count(\{"max_length": 300\})
230
>>> get_adjusted_character_count(\{"max_length": 600\})
115
>>> get_adjusted_character_count(\{"max_length": 1200\})
1150

#### time\_to\_answer\_field

Apply a heuristic for the time it takes to answer the given field, in minutes.
It is hand-written for now.
It factors in the input type, the answer type (slot-in, gathered, third party, or created), and the
amount of input text allowed in the field.
The return value is a function that can return N samples of how long it will take to answer the field (in minutes).

#### time\_to\_answer\_form

Provide an estimate of how long it would take an average user to respond to the questions
on the provided form.
We use signals such as the field type, name, and space provided for the response to come up with a
rough estimate, based on whether the field is:
1. fill in the blank
2. gathered - e.g., an ID number, case number, etc.
3. third party: need to actually ask someone else for the information - e.g., the income of someone other than the user
4. created:
   a. short created (3 lines or so)
   b. long created (anything over 3 lines)

#### cleanup\_text

Apply cleanup routines to text to provide more accurate readability statistics.

#### text\_complete

Run a prompt via OpenAI's API and return the result.

**Arguments**:

- `prompt` _str_ - The prompt to send to the API.
- `max_tokens` _int, optional_ - The number of tokens to generate. Defaults to 500.
- `creds` _Optional[OpenAiCreds], optional_ - The credentials to use. Defaults to None.
- `temperature` _float, optional_ - The temperature to use. Defaults to 0.

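For illustration, a call using the documented arguments might look like this; the prompt text is made up, and credentials are assumed to be available to the library since `creds` defaults to None:

```python
from formfyxer.lit_explorer import text_complete

# Ask for a short completion at temperature 0 (deterministic-ish output)
summary = text_complete(
    "Summarize this court form in one sentence: a motion to waive filing fees.",
    max_tokens=200,
    temperature=0,
)
print(summary)
```
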
#### complete\_with\_command

Combines some text with a command to send to OpenAI.

#### needs\_calculations

A conservative guess at whether a given form requires the filler to make math calculations,
something that should be avoided.

#### tools\_passive

Ping the passive voice API for a list of sentences that use the passive voice.

#### get\_passive\_sentences

Return a list of tuples, where each tuple represents a
sentence in which passive voice was detected along with a list of the
starting and ending position of each fragment that is phrased in the passive voice.
The combination of the two can be used in the PDFStats frontend to highlight the
passive text in an individual sentence.

Text can either be a string or a list of strings.
If provided a single string, it will be tokenized with NLTK and
sentences containing fewer than 2 words will be ignored.

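As a rough sketch (the sample text is made up; the shape of the returned tuples is described above):

```python
from formfyxer.lit_explorer import get_passive_sentences

text = "The form was signed by the tenant. The landlord mailed the notice."

# Each result pairs a sentence with the character spans of its passive fragments
for sentence, spans in get_passive_sentences(text):
    print(sentence, spans)
```
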
#### get\_citations

Get citations and some extra surrounding context (the full sentence), if the citation is
fewer than 5 characters (often eyecite only captures a section symbol
for state-level short citation formats)

#### get\_sensitive\_data\_types

Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive
fields grouped by type. A list of the old field names can also be provided. These fields should be in the same
order. Passing the old field names allows the sensitive field algorithm to match more accurately. The return value
will not contain the old field name, only the corresponding field name from the first parameter.

The sensitive data types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security
Number.

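A sketch of a call, assuming the two documented lists are passed positionally; the field names are hypothetical:

```python
from formfyxer.lit_explorer import get_sensitive_data_types

new_fields = ["users1_ssn", "users1_bank_account_number", "users1_name"]
old_fields = ["SSN", "Bank Acct No", "Name"]

# The old names are optional but, per the description above, improve matching;
# the result groups matching names from new_fields by sensitive data type
sensitive = get_sensitive_data_types(new_fields, old_fields)
print(sensitive)
```
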
#### substitute\_phrases

Substitute phrases in the input string and return the new string and positions of substituted phrases.

**Arguments**:

- `input_string` _str_ - The input string containing phrases to be replaced.
- `substitution_phrases` _Dict[str, str]_ - A dictionary mapping original phrases to their replacement phrases.

**Returns**:

Tuple[str, List[Tuple[int, int]]]: A tuple containing the new string with substituted phrases and a list of
tuples, each containing the start and end positions of the substituted
phrases in the new string.

**Example**:

>>> input_string = "The quick brown fox jumped over the lazy dog."
>>> substitution_phrases = \{"quick brown": "swift reddish", "lazy dog": "sleepy canine"\}
>>> new_string, positions = substitute_phrases(input_string, substitution_phrases)
>>> print(new_string)
"The swift reddish fox jumped over the sleepy canine."
>>> print(positions)
[(4, 17), (35, 48)]

#### substitute\_neutral\_gender

Substitute gendered phrases with neutral phrases in the input string.
Primary source is https://github.com/joelparkerhenderson/inclusive-language

#### substitute\_plain\_language

Substitute complex phrases with simpler alternatives.
Source of terms is drawn from https://www.plainlanguage.gov/guidelines/words/

#### transformed\_sentences

Apply a function to a list of sentences and return only the sentences with changed terms.
The result is a tuple of the original sentence, new sentence, and the starting and ending position
of each changed fragment in the sentence.

#### parse\_form

Read in a PDF, pull out basic stats, attempt to normalize its form fields, and re-write the
in_file with the new fields (if `rewrite=1`). If you pass a Spot token, we will guess the
NSMI code. If you pass OpenAI creds, we will give suggestions for the title and description.

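A very rough sketch of a call: only the pieces mentioned above (the input file, `rewrite`, and a Spot token) are shown, the keyword names are assumptions based on that description, and the file name and token are placeholders:

```python
import os

from formfyxer.lit_explorer import parse_form

# Normalize the form fields, re-write the input file, and ask Spot for an NSMI guess;
# the keyword names here are assumptions, not the definitive signature
stats = parse_form(
    "motion_to_dismiss.pdf",
    rewrite=1,
    spot_token=os.environ["SPOT_TOKEN"],
)
print(stats)
```
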
#### form\_complexity

Returns a single number indicating how hard the form is to complete. Higher is harder.
