diff --git a/docs/components/FormFyxer/docx_wrangling.md b/docs/components/FormFyxer/docx_wrangling.md index 47f745009..a444a2602 100644 --- a/docs/components/FormFyxer/docx_wrangling.md +++ b/docs/components/FormFyxer/docx_wrangling.md @@ -1,9 +1,28 @@ +# Table of Contents + +* [formfyxer.docx\_wrangling](#formfyxer.docx_wrangling) + * [update\_docx](#formfyxer.docx_wrangling.update_docx) + * [get\_docx\_repr](#formfyxer.docx_wrangling.get_docx_repr) + * [get\_labeled\_docx\_runs](#formfyxer.docx_wrangling.get_labeled_docx_runs) + * [get\_modified\_docx\_runs](#formfyxer.docx_wrangling.get_modified_docx_runs) + * [make\_docx\_plain\_language](#formfyxer.docx_wrangling.make_docx_plain_language) + * [modify\_docx\_with\_openai\_guesses](#formfyxer.docx_wrangling.modify_docx_with_openai_guesses) + --- sidebar_label: docx_wrangling title: formfyxer.docx_wrangling --- -#### update\_docx + + +#### update\_docx(document: Union[docx.document.Document, str], modified\_runs: List[Tuple[int, int, str, int]]) + +```python +def update_docx( + document: Union[docx.document.Document, str], + modified_runs: List[Tuple[int, int, str, + int]]) -> docx.document.Document +``` Update the document with the modified runs. @@ -24,7 +43,15 @@ Take a careful look at the output document to make sure it is still correct. The modified document. -#### get\_docx\_repr + + +#### get\_docx\_repr(docx\_path: str, paragraph\_start: int = 0, paragraph\_end: Optional[int] = None) + +```python +def get_docx_repr(docx_path: str, + paragraph_start: int = 0, + paragraph_end: Optional[int] = None) +``` Return a JSON representation of the paragraphs and runs in the DOCX file. @@ -37,7 +64,18 @@ Return a JSON representation of the paragraphs and runs in the DOCX file. A JSON representation of the paragraphs and runs in the DOCX file. -#### get\_labeled\_docx\_runs + + +#### get\_labeled\_docx\_runs(docx\_path: Optional[str] = None, docx\_repr=Optional[str], custom\_people\_names: Optional[Tuple[str, str]] = None, openai\_client: Optional[OpenAI] = None, api\_key: Optional[str] = None) + +```python +def get_labeled_docx_runs( + docx_path: Optional[str] = None, + docx_repr=Optional[str], + custom_people_names: Optional[Tuple[str, str]] = None, + openai_client: Optional[OpenAI] = None, + api_key: Optional[str] = None) -> List[Tuple[int, int, str, int]] +``` Scan the DOCX and return a list of modified text with Jinja2 variable names inserted. @@ -52,7 +90,19 @@ Scan the DOCX and return a list of modified text with Jinja2 variable names inse A list of tuples, each containing a paragraph number, run number, and the modified text of the run. -#### get\_modified\_docx\_runs + + +#### get\_modified\_docx\_runs(docx\_path: Optional[str] = None, docx\_repr: Optional[str] = None, custom\_example: str = "", instructions: str = "", openai\_client: Optional[OpenAI] = None, api\_key: Optional[str] = None, temperature=0.5) + +```python +def get_modified_docx_runs(docx_path: Optional[str] = None, + docx_repr: Optional[str] = None, + custom_example: str = "", + instructions: str = "", + openai_client: Optional[OpenAI] = None, + api_key: Optional[str] = None, + temperature=0.5) -> List[Tuple[int, int, str, int]] +``` Use GPT to rewrite the contents of a DOCX file paragraph by paragraph. Does not handle tables, footers, or other structures yet. @@ -104,11 +154,23 @@ You may also want to customize the input example to better match your use case. A list of tuples, each containing a paragraph number, run number, and the modified text of the run. 
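+
+A minimal usage sketch based only on the signatures above (the file name, instructions, and the
+OpenAI key assumed to be in the environment are illustrative, not part of the documented API):
+
+```python
+from formfyxer.docx_wrangling import get_modified_docx_runs, update_docx
+
+# Hypothetical input file; assumes an OpenAI API key is available in the environment
+edits = get_modified_docx_runs(
+    docx_path="letter.docx",
+    instructions="Rewrite each run at about a 6th grade reading level.",
+)
+new_doc = update_docx("letter.docx", edits)  # returns a docx.document.Document
+new_doc.save("letter_plain.docx")
+```
+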
-#### make\_docx\_plain\_language + + +#### make\_docx\_plain\_language(docx\_path: str) + +```python +def make_docx_plain_language(docx_path: str) -> docx.document.Document +``` Convert a DOCX file to plain language with the help of OpenAI. -#### modify\_docx\_with\_openai\_guesses + + +#### modify\_docx\_with\_openai\_guesses(docx\_path: str) + +```python +def modify_docx_with_openai_guesses(docx_path: str) -> docx.document.Document +``` Uses OpenAI to guess the variable names for a document and then modifies the document with the guesses. diff --git a/docs/components/FormFyxer/lit_explorer.md b/docs/components/FormFyxer/lit_explorer.md index 9263ed35d..bd2548754 100644 --- a/docs/components/FormFyxer/lit_explorer.md +++ b/docs/components/FormFyxer/lit_explorer.md @@ -1,37 +1,121 @@ +# Table of Contents + +* [formfyxer.lit\_explorer](#formfyxer.lit_explorer) + * [recursive\_get\_id](#formfyxer.lit_explorer.recursive_get_id) + * [spot](#formfyxer.lit_explorer.spot) + * [re\_case](#formfyxer.lit_explorer.re_case) + * [regex\_norm\_field](#formfyxer.lit_explorer.regex_norm_field) + * [reformat\_field](#formfyxer.lit_explorer.reformat_field) + * [norm](#formfyxer.lit_explorer.norm) + * [vectorize](#formfyxer.lit_explorer.vectorize) + * [normalize\_name](#formfyxer.lit_explorer.normalize_name) + * [cluster\_screens](#formfyxer.lit_explorer.cluster_screens) + * [InputType](#formfyxer.lit_explorer.InputType) + * [field\_types\_and\_sizes](#formfyxer.lit_explorer.field_types_and_sizes) + * [AnswerType](#formfyxer.lit_explorer.AnswerType) + * [classify\_field](#formfyxer.lit_explorer.classify_field) + * [get\_adjusted\_character\_count](#formfyxer.lit_explorer.get_adjusted_character_count) + * [time\_to\_answer\_field](#formfyxer.lit_explorer.time_to_answer_field) + * [time\_to\_answer\_form](#formfyxer.lit_explorer.time_to_answer_form) + * [cleanup\_text](#formfyxer.lit_explorer.cleanup_text) + * [text\_complete](#formfyxer.lit_explorer.text_complete) + * [complete\_with\_command](#formfyxer.lit_explorer.complete_with_command) + * [needs\_calculations](#formfyxer.lit_explorer.needs_calculations) + * [tools\_passive](#formfyxer.lit_explorer.tools_passive) + * [get\_passive\_sentences](#formfyxer.lit_explorer.get_passive_sentences) + * [get\_citations](#formfyxer.lit_explorer.get_citations) + * [get\_sensitive\_data\_types](#formfyxer.lit_explorer.get_sensitive_data_types) + * [substitute\_phrases](#formfyxer.lit_explorer.substitute_phrases) + * [substitute\_neutral\_gender](#formfyxer.lit_explorer.substitute_neutral_gender) + * [substitute\_plain\_language](#formfyxer.lit_explorer.substitute_plain_language) + * [transformed\_sentences](#formfyxer.lit_explorer.transformed_sentences) + * [parse\_form](#formfyxer.lit_explorer.parse_form) + * [form\_complexity](#formfyxer.lit_explorer.form_complexity) + --- sidebar_label: lit_explorer title: formfyxer.lit_explorer --- -#### recursive\_get\_id + + +#### recursive\_get\_id(values\_to\_unpack: Union[dict, list], tmpl: Optional[set] = None) + +```python +def recursive_get_id(values_to_unpack: Union[dict, list], + tmpl: Optional[set] = None) +``` Pull ID values out of the LIST/NSMI results from Spot. 
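+
+For illustration, a hedged sketch of how this helper might be called. The nested structure below is
+made up (the real Spot/LIST payload may be shaped differently), and the idea that a set of IDs comes
+back is inferred from the description above rather than a documented return type:
+
+```python
+from formfyxer.lit_explorer import recursive_get_id
+
+# Made-up, Spot-like nested labels with "id" keys at several levels
+labels = [
+    \{"id": "HO-00-00-00-00", "children": [\{"id": "HO-02-00-00-00"\}]\},
+]
+print(recursive_get_id(labels))  # assumption: collects the nested "id" values (e.g., a set of LIST codes)
+```
+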
-#### spot
+
+
+#### spot(text: str, lower: float = 0.25, pred: float = 0.5, upper: float = 0.6, verbose: float = 0, token: str = "")
+
+```python
+def spot(text: str,
+         lower: float = 0.25,
+         pred: float = 0.5,
+         upper: float = 0.6,
+         verbose: float = 0,
+         token: str = "")
+```

Call the Spot API (https://spot.suffolklitlab.org) to classify the text of a PDF using
the NSMIv2/LIST taxonomy (https://taxonomy.legal/), returning only the IDs of issues found in the text.

-#### re\_case
+
+
+#### re\_case(text: str)
+
+```python
+def re_case(text: str) -> str
+```

Capture PascalCase, snake_case and kebab-case terms and add spaces to separate the joined words

-#### regex\_norm\_field
+
+
+#### regex\_norm\_field(text: str)
+
+```python
+def regex_norm_field(text: str)
+```

Apply some heuristics to a field name to see if we can get it to match AssemblyLine conventions.
-See: https://assemblyline.suffolklitlab.org/docs/authoring/label_variables#fields-labels-and-variables
+See: https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/document_variables
+
+
+
+#### reformat\_field(text: str, max\_length: int = 30, tools\_token: Optional[str] = None)

-#### reformat\_field
+```python
+def reformat_field(text: str,
+                   max_length: int = 30,
+                   tools_token: Optional[str] = None)
+```

Transforms a string of text into a snake_case variable name close in length to `max_length` by
summarizing the string and stitching the summary together in snake_case.
-h/t https://medium.com/data-science/nlp-building-a-summariser-68e0c19e3a93
+h/t https://towardsdatascience.com/nlp-building-a-summariser-68e0c19e3a93

-#### norm
+
+
+#### norm(row)
+
+```python
+def norm(row)
+```

Normalize a word vector.

-#### vectorize
+
+
+#### vectorize(text: Union[List[str], str], tools\_token: Optional[str] = None)
+
+```python
+def vectorize(text: Union[List[str], str], tools_token: Optional[str] = None)
+```

Vectorize a string of text.

@@ -40,9 +124,23 @@ Vectorize a string of text.

**Arguments**:

- `text` - a string of multiple words to vectorize
- `tools_token` - the token to tools.suffolklitlab.org, used for micro-service
  to reduce the amount of memory you need on your machine. If
-  not passed, you need to have `en_core_web_lg` installed
+  not passed, you need to have `en_core_web_lg` installed. NOTE: this
+  last bit is no longer correct; you have to use the micro-service,
+  as we have had to remove spaCy due to a breaking change.
+
+
-#### normalize\_name
+#### normalize\_name(jur: str, group: str, n: int, per, last\_field: str, this\_field: str, tools\_token: Optional[str] = None)
+
+```python
+def normalize_name(jur: str,
+                   group: str,
+                   n: int,
+                   per,
+                   last_field: str,
+                   this_field: str,
+                   tools_token: Optional[str] = None) -> Tuple[str, float]
+```

Normalize a field name, if possible to the Assembly Line conventions, and if
not, to a snake_case variable name of appropriate length.

HACK: temporarily all we do is re-case it and normalize it using regex rules.
Will be replaced with call to LLM soon.

-#### cluster\_screens
+
+
+#### cluster\_screens(fields: List[str] = [], damping: float = 0.7, tools\_token: Optional[str] = None)
+
+```python
+def cluster_screens(fields: List[str] = [],
+                    damping: float = 0.7,
+                    tools_token: Optional[str] = None) -> Dict[str, List[str]]
+```

Groups the given fields into screens based on how much they are related.

**Arguments**:

- `fields` - a list of field names
- `damping` - a value >= 0.5 and < 1. Tunes how related screens should be
- `tools_token` - the token to tools.suffolklitlab.org, needed if doing
  micro-service vectorization

- `Returns` - a suggested screen grouping, each screen name mapped to the list of fields on it + + ## InputType Objects ```python @@ -72,7 +180,14 @@ class InputType(Enum) Input type maps onto the type of input the PDF author chose for the field. We only handle text, checkbox, and signature fields. -#### field\_types\_and\_sizes + + +#### field\_types\_and\_sizes(fields: Optional[Iterable[FormField]]) + +```python +def field_types_and_sizes( + fields: Optional[Iterable[FormField]]) -> List[FieldInfo] +``` Transform the fields provided by get_existing_pdf_fields into a summary format. Result will look like: @@ -84,6 +199,8 @@ Result will look like: \} ] + + ## AnswerType Objects ```python @@ -100,12 +217,24 @@ a choice, creating a narrative, or even applying legal reasoning. "Affidavi form of created answers. See Jarret and Gaffney, Forms That Work (2008) -#### classify\_field + + +#### classify\_field(field: FieldInfo, new\_name: str) + +```python +def classify_field(field: FieldInfo, new_name: str) -> AnswerType +``` Apply heuristics to the field's original and "normalized" name to classify it as either a "slot-in", "gathered", "third party" or "created" field type. -#### get\_adjusted\_character\_count + + +#### get\_adjusted\_character\_count(field: FieldInfo) + +```python +def get_adjusted_character_count(field: FieldInfo) -> float +``` Determines the bracketed length of an input field based on its max_length attribute, returning a float representing the approximate length of the field content. @@ -139,7 +268,16 @@ for the same data without considering the space the user actually needs. >>> get_adjusted_character_count(\{"max_length": 1200\}) 1150 -#### time\_to\_answer\_field + + +#### time\_to\_answer\_field(field: FieldInfo, new\_name: str, cpm: int = 40, cpm\_std\_dev: int = 17) + +```python +def time_to_answer_field(field: FieldInfo, + new_name: str, + cpm: int = 40, + cpm_std_dev: int = 17) -> Callable[[int], np.ndarray] +``` Apply a heuristic for the time it takes to answer the given field, in minutes. It is hand-written for now. @@ -147,7 +285,14 @@ It will factor in the input type, the answer type (slot in, gathered, third part amount of input text allowed in the field. The return value is a function that can return N samples of how long it will take to answer the field (in minutes) -#### time\_to\_answer\_form + + +#### time\_to\_answer\_form(processed\_fields, normalized\_fields) + +```python +def time_to_answer_form(processed_fields, + normalized_fields) -> Tuple[float, float] +``` Provide an estimate of how long it would take an average user to respond to the questions on the provided form. @@ -160,11 +305,26 @@ rough estimate, based on whether the field is: a. short created (3 lines or so?) b. long created (anything over 3 lines) -#### cleanup\_text + + +#### cleanup\_text(text: str, fields\_to\_sentences: bool = False) + +```python +def cleanup_text(text: str, fields_to_sentences: bool = False) -> str +``` Apply cleanup routines to text to provide more accurate readability statistics. -#### text\_complete + + +#### text\_complete(prompt: str, max\_tokens: int = 500, creds: Optional[OpenAiCreds] = None, temperature: float = 0) + +```python +def text_complete(prompt: str, + max_tokens: int = 500, + creds: Optional[OpenAiCreds] = None, + temperature: float = 0) -> str +``` Run a prompt via openAI's API and return the result. @@ -175,16 +335,51 @@ Run a prompt via openAI's API and return the result. - `creds` _Optional[OpenAiCreds], optional_ - The credentials to use. 
  Defaults to None.
- `temperature` _float, optional_ - The temperature to use. Defaults to 0.

-#### complete\_with\_command
+
+
+#### complete\_with\_command(text, command, tokens, creds: Optional[OpenAiCreds] = None)
+
+```python
+def complete_with_command(text,
+                          command,
+                          tokens,
+                          creds: Optional[OpenAiCreds] = None) -> str
+```

Combines some text with a command to send to OpenAI.

-#### needs\_calculations
+
+
+#### needs\_calculations(text: Union[str])
+
+```python
+def needs_calculations(text: Union[str]) -> bool
+```

A conservative guess at whether a given form needs the filler to make math calculations,
something that should be avoided.

-#### get\_passive\_sentences
+
+
+#### tools\_passive(input: Union[List[str], str], tools\_token: Optional[str] = None)
+
+```python
+def tools_passive(input: Union[List[str], str],
+                  tools_token: Optional[str] = None)
+```
+
+Ping the passive voice API for a list of sentences using the passive voice
+
+
+
+#### get\_passive\_sentences(text: Union[List, str], tools\_token: Optional[str] = None)
+
+```python
+def get_passive_sentences(
+    text: Union[List, str],
+    tools_token: Optional[str] = None
+) -> List[Tuple[str, List[Tuple[int, int]]]]
+```

Return a list of tuples, where each tuple represents a
sentence in which passive voice was detected along with a list of the
@@ -196,13 +391,46 @@
starting and ending position of each fragment that is phrased in the passive voice.
The combination of the two can be used in the PDFStats frontend to highlight the
passive text in an individual sentence.

Text can either be a string or a list of strings.
If provided a single string, it will be tokenized with NLTK and
sentences containing fewer than 2 words will be ignored.

-#### get\_citations
+
+
+#### get\_citations(text: str, tokenized\_sentences: List[str])
+
+```python
+def get_citations(text: str, tokenized_sentences: List[str]) -> List[str]
+```

Get citations and some extra surrounding context (the full sentence), if the citation is
fewer than 5 characters (often eyecite only captures a section symbol
for state-level short citation formats)

-#### substitute\_phrases
+
+
+#### get\_sensitive\_data\_types(fields: List[str], fields\_old: Optional[List[str]] = None)
+
+```python
+def get_sensitive_data_types(
+        fields: List[str],
+        fields_old: Optional[List[str]] = None) -> Dict[str, List[str]]
+```
+
+Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive
+fields grouped by type. A list of the old field names can also be provided. These fields should be in the same
+order. Passing the old field names allows the sensitive field algorithm to match more accurately. The return value
+will not contain the old field name, only the corresponding field name from the first parameter.
+
+The sensitive data types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security
+Number.
+
+
+
+#### substitute\_phrases(input\_string: str, substitution\_phrases: Dict[str, str])
+
+```python
+def substitute_phrases(
+        input_string: str,
+        substitution_phrases: Dict[str,
+                                   str]) -> Tuple[str, List[Tuple[int, int]]]
+```

Substitute phrases in the input string and return the new string and positions of substituted phrases.

**Arguments**:

- `input_string` _str_ - The input string containing phrases to be replaced.
- `substitution_phrases` _Dict[str, str]_ - A dictionary mapping original phrases to their replacement phrases.


**Returns**:

  Tuple[str, List[Tuple[int, int]]]: A tuple containing the new string with substituted phrases and a list of
  tuples, each containing the start and end positions of the substituted
  phrases in the new string.


**Example**:

  >>> input_string = "The quick brown fox jumped over the lazy dog."
  >>> substitution_phrases = \{"quick brown": "swift reddish", "lazy dog": "sleepy canine"\}
  >>> new_string, positions = substitute_phrases(input_string, substitution_phrases)
  >>> print(new_string)
  "The swift reddish fox jumped over the sleepy canine."
@@ -229,29 +457,72 @@
  >>> print(positions)
  [(4, 17), (35, 48)]

-#### substitute\_neutral\_gender
+
+
+#### substitute\_neutral\_gender(input\_string: str)
+
+```python
+def substitute_neutral_gender(
+        input_string: str) -> Tuple[str, List[Tuple[int, int]]]
+```

Substitute gendered phrases with neutral phrases in the input string.
Primary source is https://github.com/joelparkerhenderson/inclusive-language -#### substitute\_plain\_language + + +#### substitute\_plain\_language(input\_string: str) + +```python +def substitute_plain_language( + input_string: str) -> Tuple[str, List[Tuple[int, int]]] +``` Substitute complex phrases with simpler alternatives. Source of terms is drawn from https://www.plainlanguage.gov/guidelines/words/ -#### transformed\_sentences + + +#### transformed\_sentences(sentence\_list: List[str], fun: Callable) + +```python +def transformed_sentences( + sentence_list: List[str], + fun: Callable) -> List[Tuple[str, str, List[Tuple[int, int]]]] +``` Apply a function to a list of sentences and return only the sentences with changed terms. The result is a tuple of the original sentence, new sentence, and the starting and ending position of each changed fragment in the sentence. -#### parse\_form + + +#### parse\_form(in\_file: str, title: Optional[str] = None, jur: Optional[str] = None, cat: Optional[str] = None, normalize: bool = True, spot\_token: Optional[str] = None, tools\_token: Optional[str] = None, openai\_creds: Optional[OpenAiCreds] = None, rewrite: bool = False, debug: bool = False) + +```python +def parse_form(in_file: str, + title: Optional[str] = None, + jur: Optional[str] = None, + cat: Optional[str] = None, + normalize: bool = True, + spot_token: Optional[str] = None, + tools_token: Optional[str] = None, + openai_creds: Optional[OpenAiCreds] = None, + rewrite: bool = False, + debug: bool = False) +``` Read in a pdf, pull out basic stats, attempt to normalize its form fields, and re-write the in_file with the new fields (if `rewrite=1`). If you pass a spot token, we will guess the NSMI code. If you pass openai creds, we will give suggestions for the title and description. -#### form\_complexity + + +#### form\_complexity(stats) + +```python +def form_complexity(stats) +``` Gets a single number of how hard the form is to complete. Higher is harder. 
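+
+Taken together, a hedged end-to-end sketch (the file name and tokens are placeholders, and the
+assumption that `form_complexity` consumes the stats returned by `parse_form` is not stated above):
+
+```python
+from formfyxer.lit_explorer import parse_form, form_complexity
+
+# Pull stats from a PDF and normalize its field names (placeholders throughout)
+stats = parse_form(
+    "eviction_answer.pdf",
+    jur="MA",
+    normalize=True,
+    spot_token="YOUR_SPOT_TOKEN",    # optional: enables NSMI code guessing
+    tools_token="YOUR_TOOLS_TOKEN",  # optional: enables micro-service vectorization
+)
+print(form_complexity(stats))  # assumption: parse_form's stats are what form_complexity expects
+```
+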
diff --git a/docs/components/FormFyxer/pdf_wrangling.md b/docs/components/FormFyxer/pdf_wrangling.md index b5dca6cec..7268a80d7 100644 --- a/docs/components/FormFyxer/pdf_wrangling.md +++ b/docs/components/FormFyxer/pdf_wrangling.md @@ -1,30 +1,75 @@ +# Table of Contents + +* [formfyxer.pdf\_wrangling](#formfyxer.pdf_wrangling) + * [FieldType](#formfyxer.pdf_wrangling.FieldType) + * [TEXT](#formfyxer.pdf_wrangling.FieldType.TEXT) + * [AREA](#formfyxer.pdf_wrangling.FieldType.AREA) + * [LIST\_BOX](#formfyxer.pdf_wrangling.FieldType.LIST_BOX) + * [CHOICE](#formfyxer.pdf_wrangling.FieldType.CHOICE) + * [FormField](#formfyxer.pdf_wrangling.FormField) + * [\_\_init\_\_](#formfyxer.pdf_wrangling.FormField.__init__) + * [set\_fields](#formfyxer.pdf_wrangling.set_fields) + * [rename\_pdf\_fields](#formfyxer.pdf_wrangling.rename_pdf_fields) + * [unlock\_pdf\_in\_place](#formfyxer.pdf_wrangling.unlock_pdf_in_place) + * [has\_fields](#formfyxer.pdf_wrangling.has_fields) + * [get\_existing\_pdf\_fields](#formfyxer.pdf_wrangling.get_existing_pdf_fields) + * [swap\_pdf\_page](#formfyxer.pdf_wrangling.swap_pdf_page) + * [copy\_pdf\_fields](#formfyxer.pdf_wrangling.copy_pdf_fields) + * [get\_textboxes\_in\_pdf](#formfyxer.pdf_wrangling.get_textboxes_in_pdf) + * [get\_bracket\_chars\_in\_pdf](#formfyxer.pdf_wrangling.get_bracket_chars_in_pdf) + * [intersect\_bbox](#formfyxer.pdf_wrangling.intersect_bbox) + * [intersect\_bboxs](#formfyxer.pdf_wrangling.intersect_bboxs) + * [contain\_boxes](#formfyxer.pdf_wrangling.contain_boxes) + * [get\_dist\_sq](#formfyxer.pdf_wrangling.get_dist_sq) + * [get\_dist](#formfyxer.pdf_wrangling.get_dist) + * [get\_connected\_edges](#formfyxer.pdf_wrangling.get_connected_edges) + * [bbox\_distance](#formfyxer.pdf_wrangling.bbox_distance) + * [get\_possible\_fields](#formfyxer.pdf_wrangling.get_possible_fields) + * [get\_possible\_checkboxes](#formfyxer.pdf_wrangling.get_possible_checkboxes) + * [get\_possible\_radios](#formfyxer.pdf_wrangling.get_possible_radios) + * [get\_possible\_text\_fields](#formfyxer.pdf_wrangling.get_possible_text_fields) + * [auto\_add\_fields](#formfyxer.pdf_wrangling.auto_add_fields) + * [is\_tagged](#formfyxer.pdf_wrangling.is_tagged) + --- sidebar_label: pdf_wrangling title: formfyxer.pdf_wrangling --- + + ## FieldType Objects ```python class FieldType(Enum) ``` + + #### TEXT Text input Field + + #### AREA Text input Field, but an area + + #### LIST\_BOX allows multiple selection + + #### CHOICE allows only one selection + + ## FormField Objects ```python @@ -33,7 +78,19 @@ class FormField() A data holding class, used to easily specify how a PDF form field should be created. -#### \_\_init\_\_ + + +#### \_\_init\_\_(field\_name: str, type\_name: Union[FieldType, str], x: int, y: int, font\_size: Optional[int] = None, tooltip: str = "", configs: Optional[Dict[str, Any]] = None) + +```python +def __init__(field_name: str, + type_name: Union[FieldType, str], + x: int, + y: int, + font_size: Optional[int] = None, + tooltip: str = "", + configs: Optional[Dict[str, Any]] = None) +``` Constructor @@ -50,7 +107,17 @@ Constructor [reportlab User Guide](https://www.reportlab.com/docs/reportlab-userguide.pdf) - `field_name` - the name of the field, exposed to via most APIs. 
  Not the tooltip, but `users1_name__0`

-#### set\_fields
+
+
+#### set\_fields(in\_file: Union[str, Path, BinaryIO], out\_file: Union[str, Path, BinaryIO], fields\_per\_page: Iterable[Iterable[FormField]], \*, overwrite=False)
+
+```python
+def set_fields(in_file: Union[str, Path, BinaryIO],
+               out_file: Union[str, Path, BinaryIO],
+               fields_per_page: Iterable[Iterable[FormField]],
+               *,
+               overwrite=False)
+```

Adds fields per page to the in_file PDF, writing the new PDF to a new file.

Example usage:

```python
set_fields('no_fields.pdf', 'four_fields_on_second_page.pdf',
  [
    [], # nothing on the first page
    [ # Second page
      FormField('new_field', 'text', 110, 105, configs=\{'width': 200, 'height': 30\}),
      # Choice needs value to be one of the possible options, and options to be a list of strings or tuples
      FormField('new_choices', 'choice', 110, 400, configs=\{'value': 'Option 1', 'options': ['Option 1', 'Option 2']\}),
      # Radios need to have the same name, with different values
      FormField('new_radio1', 'radio', 110, 600, configs=\{'value': 'option a'\}),
      FormField('new_radio1', 'radio', 110, 500, configs=\{'value': 'option b'\})
    ]
  ]
)
```

**Arguments**:

- `in_file` - the input file name or path of a PDF that we're adding the fields to
- `out_file` - the output file name or path where the new version of in_file will
  be written. Doesn't need to exist.
- `fields_per_page` - for each page, a series of fields that should be added to that
  page.
- `overwrite` - if the input file already has some fields (AcroForm fields specifically)
  and this value is true, it will erase those existing fields and just add
  `fields_per_page`. If not true and the input file has fields, this won't generate
  a PDF, since there isn't currently a way to merge AcroForm fields from
  different PDFs.


**Returns**:

@@ -90,7 +157,15 @@

  Nothing.

-#### rename\_pdf\_fields
+
+
+#### rename\_pdf\_fields(in\_file: Union[str, Path, BinaryIO], out\_file: Union[str, Path, BinaryIO], mapping: Mapping[str, str])
+
+```python
+def rename_pdf_fields(in_file: Union[str, Path, BinaryIO],
+                      out_file: Union[str, Path, BinaryIO],
+                      mapping: Mapping[str, str]) -> None
+```

Given a dictionary that maps old to new field names, rename the AcroForm
field with a matching key to the specified value.

**Example**:

```python
rename_pdf_fields('current.pdf', 'new_field_names.pdf',
    \{'abc123': 'user1_name', 'abc124': 'user1_address_city'\})
```

Args:
  in_file: the filename of an input file
  out_file: the filename of the output file. Doesn't need to exist,
      will be overwritten if it does exist.
  mapping: the python dict that maps from a current field name to the desired name

Returns:
@@ -110,11 +185,23 @@
  Nothing

-#### unlock\_pdf\_in\_place
+
+
+#### unlock\_pdf\_in\_place(in\_file: Union[str, Path, BinaryIO])
+
+```python
+def unlock_pdf_in_place(in_file: Union[str, Path, BinaryIO]) -> None
+```

Try using pikePDF to unlock the PDF if it is locked. This won't work if it has a non-zero length password.

-#### has\_fields
+
+
+#### has\_fields(pdf\_file: str)
+
+```python
+def has_fields(pdf_file: str) -> bool
+```

Check if a PDF has at least one form field using PikePDF.

**Arguments**:

- `pdf_file` _str_ - The path to the PDF file.


**Returns**:

@@ -127,17 +214,46 @@
- `bool` - True if the PDF has at least one form field, False otherwise.

-#### get\_existing\_pdf\_fields
+
+
+#### get\_existing\_pdf\_fields(in\_file: Union[str, Path, BinaryIO, Pdf])
+
+```python
+def get_existing_pdf_fields(
+        in_file: Union[str, Path, BinaryIO, Pdf]) -> List[List[FormField]]
+```

Use PikePDF to get fields from the PDF

-#### swap\_pdf\_page
+
+
+#### swap\_pdf\_page(\*, source\_pdf: Union[str, Path, Pdf], destination\_pdf: Union[str, Path, Pdf], source\_offset: int = 0, destination\_offset: int = 0, append\_fields: bool = False)
+
+```python
+def swap_pdf_page(*,
+                  source_pdf: Union[str, Path, Pdf],
+                  destination_pdf: Union[str, Path, Pdf],
+                  source_offset: int = 0,
+                  destination_offset: int = 0,
+                  append_fields: bool = False) -> Pdf
+```

(DEPRECATED: use copy_pdf_fields) Copies the AcroForm fields from one PDF to another blank PDF form. Optionally, choose a starting page for both
the source and destination PDFs. By default, it will remove any existing annotations (which include form fields)
in the destination PDF. If you wish to append annotations instead, specify `append_fields = True`

-#### copy\_pdf\_fields
+
+
+#### copy\_pdf\_fields(\*, source\_pdf: Union[str, Path, Pdf], destination\_pdf: Union[str, Path, Pdf], source\_offset: int = 0, destination\_offset: int = 0, append\_fields: bool = False)
+
+```python
+def copy_pdf_fields(*,
+                    source_pdf: Union[str, Path, Pdf],
+                    destination_pdf: Union[str, Path, Pdf],
+                    source_offset: int = 0,
+                    destination_offset: int = 0,
+                    append_fields: bool = False) -> Pdf
+```

Copies the AcroForm fields from one PDF to another blank PDF form (without AcroForm fields).
Useful for getting started with an updated PDF form, where the old fields are pretty close to where
they should go on the new document.

Optionally, you can choose a starting page for both
the source and destination PDFs. By default, it will remove any existing annotations (which include form fields)
in the destination PDF. If you wish to append annotations instead, specify `append_fields = True`

**Example**:

```python
new_pdf_with_fields = copy_pdf_fields(
    source_pdf="old_pdf.pdf",
    destination_pdf="new_pdf_with_no_fields.pdf")
new_pdf_with_fields.save("new_pdf_with_fields.pdf")
```

@@ -172,47 +288,106 @@

**Arguments**:

- `source_pdf` - a file name or path to a PDF that has AcroForm fields
- `destination_pdf` - a file name or path to a PDF without AcroForm fields. Existing fields will be removed.
- `source_offset` - the starting page that fields will be copied from. Defaults to 0.
- `destination_offset` - the starting page that fields will be copied to. Defaults to 0.
- `append_fields` - controls whether formfyxer will try to append form fields instead of
  overwriting. Defaults to false; when enabled may lead to undefined behavior.


**Returns**:

  A pikepdf.Pdf object with new fields. If `destination_pdf` was a pikepdf.Pdf object, the
  same object is returned.
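+
+Several of the helpers above compose naturally; for example, a small hedged sketch that only
+renames fields when the PDF actually has AcroForm fields (the file and field names are made up):
+
+```python
+from formfyxer.pdf_wrangling import has_fields, rename_pdf_fields
+
+if has_fields("complaint.pdf"):
+    rename_pdf_fields(
+        "complaint.pdf",
+        "complaint_renamed.pdf",
+        \{"abc123": "users1_name__0"\},  # made-up old and new field names
+    )
+```
+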
-#### get\_original\_text\_with\_fields + -Gets the original text of the document, with the names of the fields in jinja format (\{\{field_name\}\}) +#### get\_textboxes\_in\_pdf(in\_file: Union[str, Path, BinaryIO], line\_margin=0.02, char\_margin=2.0) -#### get\_textboxes\_in\_pdf +```python +def get_textboxes_in_pdf(in_file: Union[str, Path, BinaryIO], + line_margin=0.02, + char_margin=2.0) -> List[List[Textbox]] +``` Gets all of the text boxes found by pdfminer in a PDF, as well as their bounding boxes -#### get\_bracket\_chars\_in\_pdf + + +#### get\_bracket\_chars\_in\_pdf(in\_file: Union[str, Path, BinaryIO], line\_margin=0.02, char\_margin=0.0) + +```python +def get_bracket_chars_in_pdf(in_file: Union[str, Path, BinaryIO], + line_margin=0.02, + char_margin=0.0) -> List +``` Gets all of the bracket characters ('[' and ']') found by pdfminer in a PDF, as well as their bounding boxes TODO: Will eventually be used to find [ ] as checkboxes, but right now we can't tell the difference between [ ] and [i]. This simply gets all of the brackets, and the characters of [hi] in a PDF and [ ] are the exact same distance apart. Currently going with just "[hi]" doesn't happen, let's hope that assumption holds. -#### intersect\_bbox + + +#### intersect\_bbox(bbox\_a, bbox\_b, vert\_dilation=2, horiz\_dilation=2) + +```python +def intersect_bbox(bbox_a, bbox_b, vert_dilation=2, horiz_dilation=2) -> bool +``` bboxes are [left edge, bottom edge, horizontal length, vertical length] -#### intersect\_bboxs + + +#### intersect\_bboxs(bbox\_a, bboxes, vert\_dilation=2, horiz\_dilation=2) + +```python +def intersect_bboxs(bbox_a, + bboxes, + vert_dilation=2, + horiz_dilation=2) -> Iterable[bool] +``` Returns an iterable of booleans, one of each of the input bboxes, true if it collides with bbox_a -#### contain\_boxes + + +#### contain\_boxes(bbox\_a: BoundingBoxF, bbox\_b: BoundingBoxF) + +```python +def contain_boxes(bbox_a: BoundingBoxF, bbox_b: BoundingBoxF) -> BoundingBoxF +``` Given two bounding boxes, return a single bounding box that contains both of them. -#### get\_dist\_sq + + +#### get\_dist\_sq(point\_a: XYPair, point\_b: XYPair) + +```python +def get_dist_sq(point_a: XYPair, point_b: XYPair) -> float +``` returns the distance squared between two points. Faster than the true euclidean dist -#### get\_dist + + +#### get\_dist(point\_a: XYPair, point\_b: XYPair) + +```python +def get_dist(point_a: XYPair, point_b: XYPair) -> float +``` euclidean (L^2 norm) distance between two points -#### get\_connected\_edges + + +#### get\_connected\_edges(point: XYPair, point\_list: Sequence) + +```python +def get_connected_edges(point: XYPair, point_list: Sequence) +``` point list is always ordered clockwise from the bottom left, i.e. bottom left, top left, top right, bottom right -#### bbox\_distance + + +#### bbox\_distance(bbox\_a: BoundingBoxF, bbox\_b: BoundingBoxF) + +```python +def bbox_distance( + bbox_a: BoundingBoxF, bbox_b: BoundingBoxF +) -> Tuple[float, Tuple[XYPair, XYPair], Tuple[XYPair, XYPair]] +``` Gets our specific "distance measure" between two different bounding boxes. 
This distance is roughly the sum of the horizontal and vertical difference in alignment of @@ -221,7 +396,16 @@ around a field, is the most likely to be the actual text label for the PDF field bboxes are 4 floats, x, y, width and height -#### get\_possible\_fields + + +#### get\_possible\_fields(in\_pdf\_file: Union[str, Path], textboxes: Optional[List[List[Textbox]]] = None) + +```python +def get_possible_fields( + in_pdf_file: Union[str, Path], + textboxes: Optional[List[List[Textbox]]] = None +) -> List[List[FormField]] +``` Given an input PDF, runs a series of heuristics to predict where there might be places for user enterable information (i.e. PDF fields), and returns @@ -248,33 +432,42 @@ print(fields[0][0]) For each page in the input PDF, a list of predicted form fields -## LowestVertVisitor Objects + + +#### get\_possible\_checkboxes(img: Union[str, cv2.Mat], find\_small=False) ```python -class LowestVertVisitor() +def get_possible_checkboxes(img: Union[str, cv2.Mat], + find_small=False) -> Union[np.ndarray, List] ``` -Gets just the closest text to the field, and returns that - -#### replace\_in\_original - -Given the original text of a PDF (extract_text(...)), adds the field's names in their best places. -Doesn't always work, especially with duplicate text. - -#### get\_possible\_checkboxes - Uses boxdetect library to determine if there are checkboxes on an image of a PDF page. Assumes the checkbox is square. find_small: if true, finds smaller checkboxes. Sometimes will "find" a checkbox in letters, like O and D, if the font is too small -#### get\_possible\_radios + + +#### get\_possible\_radios(img: Union[str, BinaryIO, cv2.Mat]) + +```python +def get_possible_radios(img: Union[str, BinaryIO, cv2.Mat]) +``` Even though it's called "radios", it just gets things shaped like circles, not doing any semantic analysis yet. -#### get\_possible\_text\_fields + + +#### get\_possible\_text\_fields(img: Union[str, BinaryIO, cv2.Mat], text\_lines: List[Textbox], default\_line\_height: int = 44) + +```python +def get_possible_text_fields( + img: Union[str, BinaryIO, cv2.Mat], + text_lines: List[Textbox], + default_line_height: int = 44) -> List[Tuple[BoundingBox, int]] +``` Uses openCV to attempt to find places where a PDF could expect an input text field. @@ -283,10 +476,17 @@ Won't find field inputs as boxes default_line_height: the default height (16 pt), in pixels (at 200 dpi), which is 45 -#### auto\_add\_fields + + +#### auto\_add\_fields(in\_pdf\_file: Union[str, Path], out\_pdf\_file: Union[str, Path]) + +```python +def auto_add_fields(in_pdf_file: Union[str, Path], out_pdf_file: Union[str, + Path]) +``` -Uses [get_possible_fields](#get_possible_fields) and -[set_fields](#set_fields) to automatically add new detected fields +Uses [get_possible_fields](#formfyxer.pdf_wrangling.get_possible_fields) and +[set_fields](#formfyxer.pdf_wrangling.set_fields) to automatically add new detected fields to an input PDF. **Example**: @@ -308,7 +508,13 @@ auto_add_fields('no_fields.pdf', 'newly_added_fields.pdf') Nothing -#### is\_tagged + + +#### is\_tagged(in\_pdf\_file: Union[str, Path, pikepdf.Pdf]) + +```python +def is_tagged(in_pdf_file: Union[str, Path, pikepdf.Pdf]) -> bool +``` Determines if the input PDF file is tagged for accessibility. 
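+
+A short hedged sketch combining `is_tagged` with `auto_add_fields` (the file names are placeholders):
+
+```python
+from formfyxer.pdf_wrangling import auto_add_fields, is_tagged
+
+# Warn about untagged PDFs before adding the detected fields
+if not is_tagged("no_fields.pdf"):
+    print("Warning: this PDF is not tagged for accessibility.")
+auto_add_fields("no_fields.pdf", "newly_added_fields.pdf")
+```
+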
diff --git a/docs/components/formfyxer/docx_wrangling.md b/docs/components/formfyxer/docx_wrangling.md deleted file mode 100644 index a444a2602..000000000 --- a/docs/components/formfyxer/docx_wrangling.md +++ /dev/null @@ -1,185 +0,0 @@ -# Table of Contents - -* [formfyxer.docx\_wrangling](#formfyxer.docx_wrangling) - * [update\_docx](#formfyxer.docx_wrangling.update_docx) - * [get\_docx\_repr](#formfyxer.docx_wrangling.get_docx_repr) - * [get\_labeled\_docx\_runs](#formfyxer.docx_wrangling.get_labeled_docx_runs) - * [get\_modified\_docx\_runs](#formfyxer.docx_wrangling.get_modified_docx_runs) - * [make\_docx\_plain\_language](#formfyxer.docx_wrangling.make_docx_plain_language) - * [modify\_docx\_with\_openai\_guesses](#formfyxer.docx_wrangling.modify_docx_with_openai_guesses) - ---- -sidebar_label: docx_wrangling -title: formfyxer.docx_wrangling ---- - - - -#### update\_docx(document: Union[docx.document.Document, str], modified\_runs: List[Tuple[int, int, str, int]]) - -```python -def update_docx( - document: Union[docx.document.Document, str], - modified_runs: List[Tuple[int, int, str, - int]]) -> docx.document.Document -``` - -Update the document with the modified runs. - -Note: OpenAI is probabilistic, so the modified run indices may not be correct. -When the index of a run or paragraph is out of range, a new paragraph -will be inserted at the end of the document or a new run at the end of the -paragraph's runs. - -Take a careful look at the output document to make sure it is still correct. - -**Arguments**: - -- `document` - the docx.Document object, or the path to the DOCX file -- `modified_runs` - a tuple of paragraph number, run number, the modified text, a question (not used), and whether a new paragraph should be inserted (for conditional text) - - -**Returns**: - - The modified document. - - - -#### get\_docx\_repr(docx\_path: str, paragraph\_start: int = 0, paragraph\_end: Optional[int] = None) - -```python -def get_docx_repr(docx_path: str, - paragraph_start: int = 0, - paragraph_end: Optional[int] = None) -``` - -Return a JSON representation of the paragraphs and runs in the DOCX file. - -**Arguments**: - -- `docx_path` - path to the DOCX file - - -**Returns**: - - A JSON representation of the paragraphs and runs in the DOCX file. - - - -#### get\_labeled\_docx\_runs(docx\_path: Optional[str] = None, docx\_repr=Optional[str], custom\_people\_names: Optional[Tuple[str, str]] = None, openai\_client: Optional[OpenAI] = None, api\_key: Optional[str] = None) - -```python -def get_labeled_docx_runs( - docx_path: Optional[str] = None, - docx_repr=Optional[str], - custom_people_names: Optional[Tuple[str, str]] = None, - openai_client: Optional[OpenAI] = None, - api_key: Optional[str] = None) -> List[Tuple[int, int, str, int]] -``` - -Scan the DOCX and return a list of modified text with Jinja2 variable names inserted. - -**Arguments**: - -- `docx_path` - path to the DOCX file -- `docx_repr` - a string representation of the paragraphs and runs in the DOCX file, if docx_path is not provided. This might be useful if you want -- `custom_people_names` - a tuple of custom names and descriptions to use in addition to the default ones. Like: ("clients", "the person benefiting from the form") - - -**Returns**: - - A list of tuples, each containing a paragraph number, run number, and the modified text of the run. 
- - - -#### get\_modified\_docx\_runs(docx\_path: Optional[str] = None, docx\_repr: Optional[str] = None, custom\_example: str = "", instructions: str = "", openai\_client: Optional[OpenAI] = None, api\_key: Optional[str] = None, temperature=0.5) - -```python -def get_modified_docx_runs(docx_path: Optional[str] = None, - docx_repr: Optional[str] = None, - custom_example: str = "", - instructions: str = "", - openai_client: Optional[OpenAI] = None, - api_key: Optional[str] = None, - temperature=0.5) -> List[Tuple[int, int, str, int]] -``` - -Use GPT to rewrite the contents of a DOCX file paragraph by paragraph. Does not handle tables, footers, or -other structures yet. - -This is a light wrapper that provides the structure of DOCX paragraphs and runs to your prompt -to OpenAI to facilitate the rewriting of the document without disrupting formatting. - -For example, this could be used to: -* Remove any passive voice -* Replace placeholder text with variable names -* Rewrite to a 6th grade reading level -* Do an advanced search and replace, without requiring you to use a regex - -By default, the example prompt includes a sample like this: - -[ -[0, 0, "Dear "], -[0, 1, "John Smith:"], -[1, 0, "I hope this letter finds you well."], -] - -Your custom instructions should include an example of how the sample will be modified, like the one below: - -Example reply, indicating paragraph, run, the new text, and a number indicating if this changes the -current paragraph, adds one before, or adds one after (-1, 0, 1): - -\{"results": -[ -[0, 1, "Dear \{\{ other_parties[0] \}\}:", 0], -[2, 0, "\{%p if is_tenant %\}", -1], -[3, 0, "\{%p endif %\}", 1], -] -\} - -You may also want to customize the input example to better match your use case. - -**Arguments**: - -- `docx_path` _str_ - path to the DOCX file -- `docx_repr` _str_ - a string representation of the paragraphs and runs in the DOCX file, if docx_path is not provided. -- `custom_example` _Optional[str]_ - a string containing the purpose and overview of the task - instructions (str) a string containing specific instructions for the task -- `openai_client` _Optional[OpenAI]_ - an OpenAI client object. If not provided a new one will be created. -- `api_key` _Optional[str]_ - an OpenAI API key. If not provided, it will be obtained from the environment -- `temperature` _float_ - the temperature to use when generating text. Lower temperatures are more conservative. - - -**Returns**: - - A list of tuples, each containing a paragraph number, run number, and the modified text of the run. - - - -#### make\_docx\_plain\_language(docx\_path: str) - -```python -def make_docx_plain_language(docx_path: str) -> docx.document.Document -``` - -Convert a DOCX file to plain language with the help of OpenAI. - - - -#### modify\_docx\_with\_openai\_guesses(docx\_path: str) - -```python -def modify_docx_with_openai_guesses(docx_path: str) -> docx.document.Document -``` - -Uses OpenAI to guess the variable names for a document and then modifies the document with the guesses. - -**Arguments**: - -- `docx_path` _str_ - Path to the DOCX file to modify. 
- - -**Returns**: - -- `docx.Document` - The modified document, ready to be saved to the same or a new path - diff --git a/docs/components/formfyxer/lit_explorer.md b/docs/components/formfyxer/lit_explorer.md deleted file mode 100644 index bd2548754..000000000 --- a/docs/components/formfyxer/lit_explorer.md +++ /dev/null @@ -1,528 +0,0 @@ -# Table of Contents - -* [formfyxer.lit\_explorer](#formfyxer.lit_explorer) - * [recursive\_get\_id](#formfyxer.lit_explorer.recursive_get_id) - * [spot](#formfyxer.lit_explorer.spot) - * [re\_case](#formfyxer.lit_explorer.re_case) - * [regex\_norm\_field](#formfyxer.lit_explorer.regex_norm_field) - * [reformat\_field](#formfyxer.lit_explorer.reformat_field) - * [norm](#formfyxer.lit_explorer.norm) - * [vectorize](#formfyxer.lit_explorer.vectorize) - * [normalize\_name](#formfyxer.lit_explorer.normalize_name) - * [cluster\_screens](#formfyxer.lit_explorer.cluster_screens) - * [InputType](#formfyxer.lit_explorer.InputType) - * [field\_types\_and\_sizes](#formfyxer.lit_explorer.field_types_and_sizes) - * [AnswerType](#formfyxer.lit_explorer.AnswerType) - * [classify\_field](#formfyxer.lit_explorer.classify_field) - * [get\_adjusted\_character\_count](#formfyxer.lit_explorer.get_adjusted_character_count) - * [time\_to\_answer\_field](#formfyxer.lit_explorer.time_to_answer_field) - * [time\_to\_answer\_form](#formfyxer.lit_explorer.time_to_answer_form) - * [cleanup\_text](#formfyxer.lit_explorer.cleanup_text) - * [text\_complete](#formfyxer.lit_explorer.text_complete) - * [complete\_with\_command](#formfyxer.lit_explorer.complete_with_command) - * [needs\_calculations](#formfyxer.lit_explorer.needs_calculations) - * [tools\_passive](#formfyxer.lit_explorer.tools_passive) - * [get\_passive\_sentences](#formfyxer.lit_explorer.get_passive_sentences) - * [get\_citations](#formfyxer.lit_explorer.get_citations) - * [get\_sensitive\_data\_types](#formfyxer.lit_explorer.get_sensitive_data_types) - * [substitute\_phrases](#formfyxer.lit_explorer.substitute_phrases) - * [substitute\_neutral\_gender](#formfyxer.lit_explorer.substitute_neutral_gender) - * [substitute\_plain\_language](#formfyxer.lit_explorer.substitute_plain_language) - * [transformed\_sentences](#formfyxer.lit_explorer.transformed_sentences) - * [parse\_form](#formfyxer.lit_explorer.parse_form) - * [form\_complexity](#formfyxer.lit_explorer.form_complexity) - ---- -sidebar_label: lit_explorer -title: formfyxer.lit_explorer ---- - - - -#### recursive\_get\_id(values\_to\_unpack: Union[dict, list], tmpl: Optional[set] = None) - -```python -def recursive_get_id(values_to_unpack: Union[dict, list], - tmpl: Optional[set] = None) -``` - -Pull ID values out of the LIST/NSMI results from Spot. - - - -#### spot(text: str, lower: float = 0.25, pred: float = 0.5, upper: float = 0.6, verbose: float = 0, token: str = "") - -```python -def spot(text: str, - lower: float = 0.25, - pred: float = 0.5, - upper: float = 0.6, - verbose: float = 0, - token: str = "") -``` - -Call the Spot API (https://spot.suffolklitlab.org) to classify the text of a PDF using -the NSMIv2/LIST taxonomy (https://taxonomy.legal/), but returns only the IDs of issues found in the text. 
- - - -#### re\_case(text: str) - -```python -def re_case(text: str) -> str -``` - -Capture PascalCase, snake_case and kebab-case terms and add spaces to separate the joined words - - - -#### regex\_norm\_field(text: str) - -```python -def regex_norm_field(text: str) -``` - -Apply some heuristics to a field name to see if we can get it to match AssemblyLine conventions. -See: https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/document_variables - - - -#### reformat\_field(text: str, max\_length: int = 30, tools\_token: Optional[str] = None) - -```python -def reformat_field(text: str, - max_length: int = 30, - tools_token: Optional[str] = None) -``` - -Transforms a string of text into a snake_case variable close in length to `max_length` name by -summarizing the string and stitching the summary together in snake_case. -h/t https://towardsdatascience.com/nlp-building-a-summariser-68e0c19e3a93 - - - -#### norm(row) - -```python -def norm(row) -``` - -Normalize a word vector. - - - -#### vectorize(text: Union[List[str], str], tools\_token: Optional[str] = None) - -```python -def vectorize(text: Union[List[str], str], tools_token: Optional[str] = None) -``` - -Vectorize a string of text. - -**Arguments**: - -- `text` - a string of multiple words to vectorize -- `tools_token` - the token to tools.suffolklitlab.org, used for micro-service - to reduce the amount of memory you need on your machine. If - not passed, you need to have `en_core_web_lg` installed. NOTE: this - last bit is nolonger correct, you have to use the micor-service - as we have had to remove SpaCY due to a breaking change - - - -#### normalize\_name(jur: str, group: str, n: int, per, last\_field: str, this\_field: str, tools\_token: Optional[str] = None) - -```python -def normalize_name(jur: str, - group: str, - n: int, - per, - last_field: str, - this_field: str, - tools_token: Optional[str] = None) -> Tuple[str, float] -``` - -Normalize a field name, if possible to the Assembly Line conventions, and if -not, to a snake_case variable name of appropriate length. - -HACK: temporarily all we do is re-case it and normalize it using regex rules. -Will be replaced with call to LLM soon. - - - -#### cluster\_screens(fields: List[str] = [], damping: float = 0.7, tools\_token: Optional[str] = None) - -```python -def cluster_screens(fields: List[str] = [], - damping: float = 0.7, - tools_token: Optional[str] = None) -> Dict[str, List[str]] -``` - -Groups the given fields into screens based on how much they are related. - -**Arguments**: - -- `fields` - a list of field names -- `damping` - a value >= 0.5 and < 1. Tunes how related screens should be -- `tools_token` - the token to tools.suffolklitlab.org, needed of doing - micro-service vectorization - -- `Returns` - a suggested screen grouping, each screen name mapped to the list of fields on it - - - -## InputType Objects - -```python -class InputType(Enum) -``` - -Input type maps onto the type of input the PDF author chose for the field. We only -handle text, checkbox, and signature fields. - - - -#### field\_types\_and\_sizes(fields: Optional[Iterable[FormField]]) - -```python -def field_types_and_sizes( - fields: Optional[Iterable[FormField]]) -> List[FieldInfo] -``` - -Transform the fields provided by get_existing_pdf_fields into a summary format. 
-Result will look like: -[ -\{ -"var_name": var_name, -"type": "text | checkbox | signature", -"max_length": n -\} -] - - - -## AnswerType Objects - -```python -class AnswerType(Enum) -``` - -Answer type describes the effort the user answering the form will require. -"Slot-in" answers are a matter of almost instantaneous recall, e.g., name, address, etc. -"Gathered" answers require looking around one's desk, for e.g., a health insurance number. -"Third party" answers require picking up the phone to call someone else who is the keeper -of the information. -"Created" answers don't exist before the user is presented with the question. They may include -a choice, creating a narrative, or even applying legal reasoning. "Affidavits" are a special -form of created answers. -See Jarret and Gaffney, Forms That Work (2008) - - - -#### classify\_field(field: FieldInfo, new\_name: str) - -```python -def classify_field(field: FieldInfo, new_name: str) -> AnswerType -``` - -Apply heuristics to the field's original and "normalized" name to classify -it as either a "slot-in", "gathered", "third party" or "created" field type. - - - -#### get\_adjusted\_character\_count(field: FieldInfo) - -```python -def get_adjusted_character_count(field: FieldInfo) -> float -``` - -Determines the bracketed length of an input field based on its max_length attribute, -returning a float representing the approximate length of the field content. - -The function chunks the answers into 5 different lengths (checkboxes, 2 words, short, medium, and long) -instead of directly using the character count, as forms can allocate different spaces -for the same data without considering the space the user actually needs. - -**Arguments**: - -- `field` _FieldInfo_ - An object containing information about the input field, - including the "max_length" attribute. - - -**Returns**: - -- `float` - The approximate length of the field content, categorized into checkboxes, 2 words, short, - medium, or long based on the max_length attribute. - - -**Examples**: - - >>> get_adjusted_character_count(\{"type"\}: InputType.CHECKBOX) - 4.7 - >>> get_adjusted_character_count(\{"max_length": 100\}) - 9.4 - >>> get_adjusted_character_count(\{"max_length": 300\}) - 230 - >>> get_adjusted_character_count(\{"max_length": 600\}) - 115 - >>> get_adjusted_character_count(\{"max_length": 1200\}) - 1150 - - - -#### time\_to\_answer\_field(field: FieldInfo, new\_name: str, cpm: int = 40, cpm\_std\_dev: int = 17) - -```python -def time_to_answer_field(field: FieldInfo, - new_name: str, - cpm: int = 40, - cpm_std_dev: int = 17) -> Callable[[int], np.ndarray] -``` - -Apply a heuristic for the time it takes to answer the given field, in minutes. -It is hand-written for now. -It will factor in the input type, the answer type (slot in, gathered, third party or created), and the -amount of input text allowed in the field. -The return value is a function that can return N samples of how long it will take to answer the field (in minutes) - - - -#### time\_to\_answer\_form(processed\_fields, normalized\_fields) - -```python -def time_to_answer_form(processed_fields, - normalized_fields) -> Tuple[float, float] -``` - -Provide an estimate of how long it would take an average user to respond to the questions -on the provided form. -We use signals such as the field type, name, and space provided for the response to come up with a -rough estimate, based on whether the field is: -1. fill in the blank -2. gathered - e.g., an id number, case number, etc. -3. 
third party: need to actually ask someone the information - e.g., income of not the user, anything else? -4. created: -a. short created (3 lines or so?) -b. long created (anything over 3 lines) - - - -#### cleanup\_text(text: str, fields\_to\_sentences: bool = False) - -```python -def cleanup_text(text: str, fields_to_sentences: bool = False) -> str -``` - -Apply cleanup routines to text to provide more accurate readability statistics. - - - -#### text\_complete(prompt: str, max\_tokens: int = 500, creds: Optional[OpenAiCreds] = None, temperature: float = 0) - -```python -def text_complete(prompt: str, - max_tokens: int = 500, - creds: Optional[OpenAiCreds] = None, - temperature: float = 0) -> str -``` - -Run a prompt via openAI's API and return the result. - -**Arguments**: - -- `prompt` _str_ - The prompt to send to the API. -- `max_tokens` _int, optional_ - The number of tokens to generate. Defaults to 500. -- `creds` _Optional[OpenAiCreds], optional_ - The credentials to use. Defaults to None. -- `temperature` _float, optional_ - The temperature to use. Defaults to 0. - - - -#### complete\_with\_command(text, command, tokens, creds: Optional[OpenAiCreds] = None) - -```python -def complete_with_command(text, - command, - tokens, - creds: Optional[OpenAiCreds] = None) -> str -``` - -Combines some text with a command to send to open ai. - - - -#### needs\_calculations(text: Union[str]) - -```python -def needs_calculations(text: Union[str]) -> bool -``` - -A conservative guess at if a given form needs the filler to make math calculations, -something that should be avoided. If - - - -#### tools\_passive(input: Union[List[str], str], tools\_token: Optional[str] = None) - -```python -def tools_passive(input: Union[List[str], str], - tools_token: Optional[str] = None) -``` - -Ping passive voice API for list of sentences using the passive voice - - - -#### get\_passive\_sentences(text: Union[List, str], tools\_token: Optional[str] = None) - -```python -def get_passive_sentences( - text: Union[List, str], - tools_token: Optional[str] = None -) -> List[Tuple[str, List[Tuple[int, int]]]] -``` - -Return a list of tuples, where each tuple represents a -sentence in which passive voice was detected along with a list of the -starting and ending position of each fragment that is phrased in the passive voice. -The combination of the two can be used in the PDFStats frontend to highlight the -passive text in an individual sentence. - -Text can either be a string or a list of strings. -If provided a single string, it will be tokenized with NTLK and -sentences containing fewer than 2 words will be ignored. - - - -#### get\_citations(text: str, tokenized\_sentences: List[str]) - -```python -def get_citations(text: str, tokenized_sentences: List[str]) -> List[str] -``` - -Get citations and some extra surrounding context (the full sentence), if the citation is -fewer than 5 characters (often eyecite only captures a section symbol -for state-level short citation formats) - - - -#### get\_sensitive\_data\_types(fields: List[str], fields\_old: Optional[List[str]] = None) - -```python -def get_sensitive_data_types( - fields: List[str], - fields_old: Optional[List[str]] = None) -> Dict[str, List[str]] -``` - -Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive -fields grouped by type. A list of the old field names can also be provided. These fields should be in the same -order. 
Passing the old field names allows the sensitive field algorithm to match more accurately. The return value -will not contain the old field name, only the corresponding field name from the first parameter. - -The sensitive data types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security -Number. - - - -#### substitute\_phrases(input\_string: str, substitution\_phrases: Dict[str, str]) - -```python -def substitute_phrases( - input_string: str, - substitution_phrases: Dict[str, - str]) -> Tuple[str, List[Tuple[int, int]]] -``` - -Substitute phrases in the input string and return the new string and positions of substituted phrases. - -**Arguments**: - -- `input_string` _str_ - The input string containing phrases to be replaced. -- `substitution_phrases` _Dict[str, str]_ - A dictionary mapping original phrases to their replacement phrases. - - -**Returns**: - - Tuple[str, List[Tuple[int, int]]]: A tuple containing the new string with substituted phrases and a list of - tuples, each containing the start and end positions of the substituted - phrases in the new string. - - -**Example**: - - >>> input_string = "The quick brown fox jumped over the lazy dog." - >>> substitution_phrases = \{"quick brown": "swift reddish", "lazy dog": "sleepy canine"\} - >>> new_string, positions = substitute_phrases(input_string, substitution_phrases) - >>> print(new_string) - "The swift reddish fox jumped over the sleepy canine." - >>> print(positions) - [(4, 17), (35, 48)] - - - -#### substitute\_neutral\_gender(input\_string: str) - -```python -def substitute_neutral_gender( - input_string: str) -> Tuple[str, List[Tuple[int, int]]] -``` - -Substitute gendered phrases with neutral phrases in the input string. -Primary source is https://github.com/joelparkerhenderson/inclusive-language - - - -#### substitute\_plain\_language(input\_string: str) - -```python -def substitute_plain_language( - input_string: str) -> Tuple[str, List[Tuple[int, int]]] -``` - -Substitute complex phrases with simpler alternatives. -Source of terms is drawn from https://www.plainlanguage.gov/guidelines/words/ - - - -#### transformed\_sentences(sentence\_list: List[str], fun: Callable) - -```python -def transformed_sentences( - sentence_list: List[str], - fun: Callable) -> List[Tuple[str, str, List[Tuple[int, int]]]] -``` - -Apply a function to a list of sentences and return only the sentences with changed terms. -The result is a tuple of the original sentence, new sentence, and the starting and ending position -of each changed fragment in the sentence. - - - -#### parse\_form(in\_file: str, title: Optional[str] = None, jur: Optional[str] = None, cat: Optional[str] = None, normalize: bool = True, spot\_token: Optional[str] = None, tools\_token: Optional[str] = None, openai\_creds: Optional[OpenAiCreds] = None, rewrite: bool = False, debug: bool = False) - -```python -def parse_form(in_file: str, - title: Optional[str] = None, - jur: Optional[str] = None, - cat: Optional[str] = None, - normalize: bool = True, - spot_token: Optional[str] = None, - tools_token: Optional[str] = None, - openai_creds: Optional[OpenAiCreds] = None, - rewrite: bool = False, - debug: bool = False) -``` - -Read in a pdf, pull out basic stats, attempt to normalize its form fields, and re-write the -in_file with the new fields (if `rewrite=1`). If you pass a spot token, we will guess the -NSMI code. If you pass openai creds, we will give suggestions for the title and description. 
- - - -#### form\_complexity(stats) - -```python -def form_complexity(stats) -``` - -Gets a single number of how hard the form is to complete. Higher is harder. - diff --git a/docs/components/formfyxer/pdf_wrangling.md b/docs/components/formfyxer/pdf_wrangling.md deleted file mode 100644 index 7268a80d7..000000000 --- a/docs/components/formfyxer/pdf_wrangling.md +++ /dev/null @@ -1,529 +0,0 @@ -# Table of Contents - -* [formfyxer.pdf\_wrangling](#formfyxer.pdf_wrangling) - * [FieldType](#formfyxer.pdf_wrangling.FieldType) - * [TEXT](#formfyxer.pdf_wrangling.FieldType.TEXT) - * [AREA](#formfyxer.pdf_wrangling.FieldType.AREA) - * [LIST\_BOX](#formfyxer.pdf_wrangling.FieldType.LIST_BOX) - * [CHOICE](#formfyxer.pdf_wrangling.FieldType.CHOICE) - * [FormField](#formfyxer.pdf_wrangling.FormField) - * [\_\_init\_\_](#formfyxer.pdf_wrangling.FormField.__init__) - * [set\_fields](#formfyxer.pdf_wrangling.set_fields) - * [rename\_pdf\_fields](#formfyxer.pdf_wrangling.rename_pdf_fields) - * [unlock\_pdf\_in\_place](#formfyxer.pdf_wrangling.unlock_pdf_in_place) - * [has\_fields](#formfyxer.pdf_wrangling.has_fields) - * [get\_existing\_pdf\_fields](#formfyxer.pdf_wrangling.get_existing_pdf_fields) - * [swap\_pdf\_page](#formfyxer.pdf_wrangling.swap_pdf_page) - * [copy\_pdf\_fields](#formfyxer.pdf_wrangling.copy_pdf_fields) - * [get\_textboxes\_in\_pdf](#formfyxer.pdf_wrangling.get_textboxes_in_pdf) - * [get\_bracket\_chars\_in\_pdf](#formfyxer.pdf_wrangling.get_bracket_chars_in_pdf) - * [intersect\_bbox](#formfyxer.pdf_wrangling.intersect_bbox) - * [intersect\_bboxs](#formfyxer.pdf_wrangling.intersect_bboxs) - * [contain\_boxes](#formfyxer.pdf_wrangling.contain_boxes) - * [get\_dist\_sq](#formfyxer.pdf_wrangling.get_dist_sq) - * [get\_dist](#formfyxer.pdf_wrangling.get_dist) - * [get\_connected\_edges](#formfyxer.pdf_wrangling.get_connected_edges) - * [bbox\_distance](#formfyxer.pdf_wrangling.bbox_distance) - * [get\_possible\_fields](#formfyxer.pdf_wrangling.get_possible_fields) - * [get\_possible\_checkboxes](#formfyxer.pdf_wrangling.get_possible_checkboxes) - * [get\_possible\_radios](#formfyxer.pdf_wrangling.get_possible_radios) - * [get\_possible\_text\_fields](#formfyxer.pdf_wrangling.get_possible_text_fields) - * [auto\_add\_fields](#formfyxer.pdf_wrangling.auto_add_fields) - * [is\_tagged](#formfyxer.pdf_wrangling.is_tagged) - ---- -sidebar_label: pdf_wrangling -title: formfyxer.pdf_wrangling ---- - - - -## FieldType Objects - -```python -class FieldType(Enum) -``` - - - -#### TEXT - -Text input Field - - - -#### AREA - -Text input Field, but an area - - - -#### LIST\_BOX - -allows multiple selection - - - -#### CHOICE - -allows only one selection - - - -## FormField Objects - -```python -class FormField() -``` - -A data holding class, used to easily specify how a PDF form field should be created. - - - -#### \_\_init\_\_(field\_name: str, type\_name: Union[FieldType, str], x: int, y: int, font\_size: Optional[int] = None, tooltip: str = "", configs: Optional[Dict[str, Any]] = None) - -```python -def __init__(field_name: str, - type_name: Union[FieldType, str], - x: int, - y: int, - font_size: Optional[int] = None, - tooltip: str = "", - configs: Optional[Dict[str, Any]] = None) -``` - -Constructor - -**Arguments**: - -- `x` - the x position of the lower left corner of the field. Should be in X,Y coordinates, - where (0, 0) is the lower left of the page, x goes to the right, and units are in - points (1/72th of an inch) -- `y` - the y position of the lower left corner of the field. 
-  where (0, 0) is the lower left of the page, y goes up, and units are in points
-  (1/72th of an inch)
-- `configs` - a dictionary containing any keyword argument to the reportlab field functions,
-  which will vary depending on what type of field this is. See section 4.7 of the
-  [reportlab User Guide](https://www.reportlab.com/docs/reportlab-userguide.pdf)
-- `field_name` - the name of the field, exposed via most APIs. Not the tooltip, but e.g. `users1_name__0`
-
-
-
-#### set\_fields(in\_file: Union[str, Path, BinaryIO], out\_file: Union[str, Path, BinaryIO], fields\_per\_page: Iterable[Iterable[FormField]], \*, overwrite=False)
-
-```python
-def set_fields(in_file: Union[str, Path, BinaryIO],
-               out_file: Union[str, Path, BinaryIO],
-               fields_per_page: Iterable[Iterable[FormField]],
-               *,
-               overwrite=False)
-```
-
-Adds fields per page to the in_file PDF, writing the new PDF to a new file.
-
-Example usage:
-
-```python
-set_fields('no_fields.pdf', 'four_fields_on_second_page.pdf',
-  [
-    [],  # nothing on the first page
-    [  # Second page
-      FormField('new_field', 'text', 110, 105, configs=\{'width': 200, 'height': 30\}),
-      # Choice needs value to be one of the possible options, and options to be a list of strings or tuples
-      FormField('new_choices', 'choice', 110, 400, configs=\{'value': 'Option 1', 'options': ['Option 1', 'Option 2']\}),
-      # Radios need to have the same name, with different values
-      FormField('new_radio1', 'radio', 110, 600, configs=\{'value': 'option a'\}),
-      FormField('new_radio1', 'radio', 110, 500, configs=\{'value': 'option b'\})
-    ]
-  ]
-)
-```
-
-**Arguments**:
-
-- `in_file` - the input file name or path of a PDF that we're adding the fields to
-- `out_file` - the output file name or path where the new version of in_file will
-  be written. Doesn't need to exist.
-- `fields_per_page` - for each page, a series of fields that should be added to that
-  page.
-- `overwrite` - if the input file already has some fields (AcroForm fields specifically)
-  and this value is true, it will erase those existing fields and just add
-  `fields_per_page`. If not true and the input file has fields, this won't generate
-  a PDF, since there isn't currently a way to merge AcroForm fields from
-  different PDFs.
-
-
-**Returns**:
-
-  Nothing.
-
-
-
-#### rename\_pdf\_fields(in\_file: Union[str, Path, BinaryIO], out\_file: Union[str, Path, BinaryIO], mapping: Mapping[str, str])
-
-```python
-def rename_pdf_fields(in_file: Union[str, Path, BinaryIO],
-                      out_file: Union[str, Path, BinaryIO],
-                      mapping: Mapping[str, str]) -> None
-```
-
-Given a dictionary that maps old to new field names, rename the AcroForm
-field with a matching key to the specified value.
-
-**Example**:
-
-```python
-rename_pdf_fields('current.pdf', 'new_field_names.pdf',
-    \{'abc123': 'user1_name', 'abc124': 'user1_address_city'\})
-```
-
-**Arguments**:
-
-- `in_file` - the filename of an input file
-- `out_file` - the filename of the output file. Doesn't need to exist,
-  will be overwritten if it does exist.
-- `mapping` - the python dict that maps from a current field name to the desired name
-
-
-**Returns**:
-
-  Nothing
-
-
-
-#### unlock\_pdf\_in\_place(in\_file: Union[str, Path, BinaryIO])
-
-```python
-def unlock_pdf_in_place(in_file: Union[str, Path, BinaryIO]) -> None
-```
-
-Try using pikepdf to unlock the PDF if it is locked. This won't work if it has a non-zero length password.
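
One way to combine the helpers above, sketched with hypothetical file names: unlock the PDF in place first, then write a copy whose AcroForm fields use names in the `users1_name` style shown earlier.

```python
from formfyxer.pdf_wrangling import unlock_pdf_in_place, rename_pdf_fields

# unlock_pdf_in_place overwrites its input, so work on a copy if you need the original.
unlock_pdf_in_place("motion_to_dismiss.pdf")

# Rename the existing AcroForm fields; keys are the current names, values the new ones.
rename_pdf_fields(
    "motion_to_dismiss.pdf",
    "motion_to_dismiss_renamed.pdf",
    {
        "Name Text Box 1": "users1_name",
        "Address Line 1": "users1_address_line_one",
    },
)
```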
-
-
-
-#### has\_fields(pdf\_file: str)
-
-```python
-def has_fields(pdf_file: str) -> bool
-```
-
-Check if a PDF has at least one form field using PikePDF.
-
-**Arguments**:
-
-- `pdf_file` _str_ - The path to the PDF file.
-
-
-**Returns**:
-
-- `bool` - True if the PDF has at least one form field, False otherwise.
-
-
-
-#### get\_existing\_pdf\_fields(in\_file: Union[str, Path, BinaryIO, Pdf])
-
-```python
-def get_existing_pdf_fields(
-        in_file: Union[str, Path, BinaryIO, Pdf]) -> List[List[FormField]]
-```
-
-Use PikePDF to get fields from the PDF.
-
-
-
-#### swap\_pdf\_page(\*, source\_pdf: Union[str, Path, Pdf], destination\_pdf: Union[str, Path, Pdf], source\_offset: int = 0, destination\_offset: int = 0, append\_fields: bool = False)
-
-```python
-def swap_pdf_page(*,
-                  source_pdf: Union[str, Path, Pdf],
-                  destination_pdf: Union[str, Path, Pdf],
-                  source_offset: int = 0,
-                  destination_offset: int = 0,
-                  append_fields: bool = False) -> Pdf
-```
-
-(DEPRECATED: use copy_pdf_fields) Copies the AcroForm fields from one PDF to another blank PDF form. Optionally, choose a starting page for both
-the source and destination PDFs. By default, it will remove any existing annotations (which include form fields)
-in the destination PDF. If you wish to append annotations instead, specify `append_fields = True`
-
-
-
-#### copy\_pdf\_fields(\*, source\_pdf: Union[str, Path, Pdf], destination\_pdf: Union[str, Path, Pdf], source\_offset: int = 0, destination\_offset: int = 0, append\_fields: bool = False)
-
-```python
-def copy_pdf_fields(*,
-                    source_pdf: Union[str, Path, Pdf],
-                    destination_pdf: Union[str, Path, Pdf],
-                    source_offset: int = 0,
-                    destination_offset: int = 0,
-                    append_fields: bool = False) -> Pdf
-```
-
-Copies the AcroForm fields from one PDF to another blank PDF form (without AcroForm fields).
-Useful for getting started with an updated PDF form, where the old fields are pretty close to where
-they should go on the new document.
-
-Optionally, you can choose a starting page for both
-the source and destination PDFs. By default, it will remove any existing annotations (which include form fields)
-in the destination PDF. If you wish to append annotations instead, specify `append_fields = True`
-
-**Example**:
-
-```python
-new_pdf_with_fields = copy_pdf_fields(
-    source_pdf="old_pdf.pdf",
-    destination_pdf="new_pdf_with_no_fields.pdf")
-new_pdf_with_fields.save("new_pdf_with_fields.pdf")
-```
-
-
-**Arguments**:
-
-- `source_pdf` - a file name or path to a PDF that has AcroForm fields
-- `destination_pdf` - a file name or path to a PDF without AcroForm fields. Existing fields will be removed.
-- `source_offset` - the starting page that fields will be copied from. Defaults to 0.
-- `destination_offset` - the starting page that fields will be copied to. Defaults to 0.
-- `append_fields` - controls whether formfyxer will try to append form fields instead of
-  overwriting. Defaults to false; when enabled may lead to undefined behavior.
-
-
-**Returns**:
-
-  A pikepdf.Pdf object with new fields. If `destination_pdf` was a pikepdf.Pdf object, the
-  same object is returned.
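
A rough sketch of how the inspection helpers above fit together; the file name is a placeholder, and the loop relies on `FormField`'s printable representation rather than any particular attribute names.

```python
from formfyxer.pdf_wrangling import has_fields, get_existing_pdf_fields

pdf_path = "existing_form.pdf"  # hypothetical input

if has_fields(pdf_path):
    # get_existing_pdf_fields returns one list of FormField objects per page.
    for page_number, page_fields in enumerate(get_existing_pdf_fields(pdf_path)):
        for field in page_fields:
            print(f"page {page_number}: {field}")
else:
    print(f"No AcroForm fields found in {pdf_path}")
```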
-
-
-
-#### get\_textboxes\_in\_pdf(in\_file: Union[str, Path, BinaryIO], line\_margin=0.02, char\_margin=2.0)
-
-```python
-def get_textboxes_in_pdf(in_file: Union[str, Path, BinaryIO],
-                         line_margin=0.02,
-                         char_margin=2.0) -> List[List[Textbox]]
-```
-
-Gets all of the text boxes found by pdfminer in a PDF, as well as their bounding boxes
-
-
-
-#### get\_bracket\_chars\_in\_pdf(in\_file: Union[str, Path, BinaryIO], line\_margin=0.02, char\_margin=0.0)
-
-```python
-def get_bracket_chars_in_pdf(in_file: Union[str, Path, BinaryIO],
-                             line_margin=0.02,
-                             char_margin=0.0) -> List
-```
-
-Gets all of the bracket characters ('[' and ']') found by pdfminer in a PDF, as well as their bounding boxes
-TODO: Will eventually be used to find [ ] as checkboxes, but right now we can't tell the difference between [ ] and [i].
-This simply gets all of the brackets, and the characters of [hi] in a PDF and [ ] are the exact same distance apart.
-For now we assume that "[hi]" simply doesn't happen; let's hope that assumption holds.
-
-
-
-#### intersect\_bbox(bbox\_a, bbox\_b, vert\_dilation=2, horiz\_dilation=2)
-
-```python
-def intersect_bbox(bbox_a, bbox_b, vert_dilation=2, horiz_dilation=2) -> bool
-```
-
-bboxes are [left edge, bottom edge, horizontal length, vertical length]
-
-
-
-#### intersect\_bboxs(bbox\_a, bboxes, vert\_dilation=2, horiz\_dilation=2)
-
-```python
-def intersect_bboxs(bbox_a,
-                    bboxes,
-                    vert_dilation=2,
-                    horiz_dilation=2) -> Iterable[bool]
-```
-
-Returns an iterable of booleans, one for each of the input bboxes, true if it collides with bbox_a
-
-
-
-#### contain\_boxes(bbox\_a: BoundingBoxF, bbox\_b: BoundingBoxF)
-
-```python
-def contain_boxes(bbox_a: BoundingBoxF, bbox_b: BoundingBoxF) -> BoundingBoxF
-```
-
-Given two bounding boxes, return a single bounding box that contains both of them.
-
-
-
-#### get\_dist\_sq(point\_a: XYPair, point\_b: XYPair)
-
-```python
-def get_dist_sq(point_a: XYPair, point_b: XYPair) -> float
-```
-
-returns the distance squared between two points. Faster than the true euclidean dist
-
-
-
-#### get\_dist(point\_a: XYPair, point\_b: XYPair)
-
-```python
-def get_dist(point_a: XYPair, point_b: XYPair) -> float
-```
-
-euclidean (L^2 norm) distance between two points
-
-
-
-#### get\_connected\_edges(point: XYPair, point\_list: Sequence)
-
-```python
-def get_connected_edges(point: XYPair, point_list: Sequence)
-```
-
-point list is always ordered clockwise from the bottom left,
-i.e. bottom left, top left, top right, bottom right
-
-
-
-#### bbox\_distance(bbox\_a: BoundingBoxF, bbox\_b: BoundingBoxF)
-
-```python
-def bbox_distance(
-    bbox_a: BoundingBoxF, bbox_b: BoundingBoxF
-) -> Tuple[float, Tuple[XYPair, XYPair], Tuple[XYPair, XYPair]]
-```
-
-Gets our specific "distance measure" between two different bounding boxes.
-This distance is roughly the sum of the horizontal and vertical difference in alignment of
-the closest shared field-bounding box edge. We are trying to find which, given a list of text boxes
-around a field, is the most likely to be the actual text label for the PDF field.
-
-bboxes are 4 floats, x, y, width and height
-
-
-
-#### get\_possible\_fields(in\_pdf\_file: Union[str, Path], textboxes: Optional[List[List[Textbox]]] = None)
-
-```python
-def get_possible_fields(
-    in_pdf_file: Union[str, Path],
-    textboxes: Optional[List[List[Textbox]]] = None
-) -> List[List[FormField]]
-```
-
-Given an input PDF, runs a series of heuristics to predict where there
-might be places for user enterable information (i.e. PDF fields), and returns
-those predictions.
-
-**Example**:
-
-```python
-fields = get_possible_fields('no_field.pdf')
-print(fields[0][0])
-# Type: FieldType.TEXT, Name: name, User name: , X: 67.68, Y: 666.0, Configs: \{'fieldFlags': 'doNotScroll', 'width': 239.4, 'height': 16\}
-```
-
-
-**Arguments**:
-
-- `in_pdf_file` - the input PDF
-- `textboxes` _optional_ - the location of various lines of text in the PDF.
-  If not given, will be calculated automatically. This allows us to
-  pass expensive-to-calculate info through several functions.
-
-
-**Returns**:
-
-  For each page in the input PDF, a list of predicted form fields
-
-
-
-#### get\_possible\_checkboxes(img: Union[str, cv2.Mat], find\_small=False)
-
-```python
-def get_possible_checkboxes(img: Union[str, cv2.Mat],
-                            find_small=False) -> Union[np.ndarray, List]
-```
-
-Uses the boxdetect library to determine if there are checkboxes on an image of a PDF page.
-Assumes the checkbox is square.
-
-find_small: if true, finds smaller checkboxes. Sometimes will "find" a checkbox in letters,
-like O and D, if the font is too small
-
-
-
-#### get\_possible\_radios(img: Union[str, BinaryIO, cv2.Mat])
-
-```python
-def get_possible_radios(img: Union[str, BinaryIO, cv2.Mat])
-```
-
-Even though it's called "radios", it just gets things shaped like circles, not
-doing any semantic analysis yet.
-
-
-
-#### get\_possible\_text\_fields(img: Union[str, BinaryIO, cv2.Mat], text\_lines: List[Textbox], default\_line\_height: int = 44)
-
-```python
-def get_possible_text_fields(
-    img: Union[str, BinaryIO, cv2.Mat],
-    text_lines: List[Textbox],
-    default_line_height: int = 44) -> List[Tuple[BoundingBox, int]]
-```
-
-Uses OpenCV to attempt to find places where a PDF could expect an input text field.
-
-Caveats so far: only considers straight, normal horizontal lines that don't touch any vertical lines as fields
-Won't find field inputs as boxes
-
-default_line_height: the default height (16 pt), in pixels (at 200 dpi), which is 45
-
-
-
-#### auto\_add\_fields(in\_pdf\_file: Union[str, Path], out\_pdf\_file: Union[str, Path])
-
-```python
-def auto_add_fields(in_pdf_file: Union[str, Path], out_pdf_file: Union[str, Path])
-```
-
-Uses [get_possible_fields](#formfyxer.pdf_wrangling.get_possible_fields) and
-[set_fields](#formfyxer.pdf_wrangling.set_fields) to automatically add new detected fields
-to an input PDF.
-
-**Example**:
-
-```python
-auto_add_fields('no_fields.pdf', 'newly_added_fields.pdf')
-```
-
-
-**Arguments**:
-
-- `in_pdf_file` - the input file name or path of the PDF where we'll try to find possible fields
-- `out_pdf_file` - the output file name or path of the PDF where a new version of `in_pdf_file` will
-  be stored, with the new fields. Doesn't need to exist, but if a file does exist at that
-  filename, it will be overwritten.
-
-
-**Returns**:
-
-  Nothing
-
-
-
-#### is\_tagged(in\_pdf\_file: Union[str, Path, pikepdf.Pdf])
-
-```python
-def is_tagged(in_pdf_file: Union[str, Path, pikepdf.Pdf]) -> bool
-```
-
-Determines if the input PDF file is tagged for accessibility.
-
-**Arguments**:
-
-- `in_pdf_file` _Union[str, Path]_ - The path to the PDF file, as a string or a Path object.
-
-
-**Returns**:
-
-- `bool` - True if the PDF is tagged, False otherwise.
-
diff --git a/fix-doc-titles.sh b/fix-doc-titles.sh
index 8384a69f8..39dece3f3 100755
--- a/fix-doc-titles.sh
+++ b/fix-doc-titles.sh
@@ -6,6 +6,8 @@ echo "Fixing documentation titles and navigation..."
 
+mv docs/components/formfyxer/* docs/components/FormFyxer
+
 for file in $(find docs/components -name "*.md" -exec grep -l "# Table of Contents" {} \;); do
   # Extract the module name from the first TOC entry
   module_name=$(grep -m 1 "^\* \[.*\]" "$file" | sed 's/^\* \[\(.*\)\](#.*)/\1/' | sed 's/\\_/_/g')