|
112 | 112 | "source": [ |
113 | 113 | "## Project Lifecycle\n", |
114 | 114 | "\n", |
| 115 | + "Not every project will proceed in the same way, but projects generally have some \n", |
| 116 | + "important components in common.\n", |
| 117 | + "\n", |
| 118 | + "\n", |
| 119 | + "\n", |
| 120 | + "The solid arrows show the primary progressions or steps, while the dotted line \n", |
| 121 | + "represents the ongoing nature of problem understanding - uncovering more about\n", |
| 122 | + "the customer domain will influence every step of the process. We will examine \n", |
| 123 | + "several of these iterative cycles of refinement in detail below. \n", |
115 | 124 | "Not every project will proceed in the same way, but projects generally have some common\n", |
116 | 125 | "important components.\n", |
117 | 126 | "\n", |
|
133 | 142 | "It's very rare that a real-world project will start with all the data necessary to get\n", |
134 | 143 | "to a satisfactory solution, much less to establish confidence.\n", |
135 | 144 | "\n", |
| 145 | + "In our case, we're going to assume that we have a decent sample of system *inputs*, \n", |
| 146 | + "in the form of receipt images, but start without any fully annotated data. We find \n", |
| 147 | + "this is a common situation when automating an existing process. We'll build out \n", |
| 148 | + "that annotated data as we go along by collaborating with\n", |
| 149 | + "domain experts, and make our evals progressively more comprehensive.\n", |
136 | 150 | "In our case, we're going to assume that we have a decent sample of system *inputs*\n", |
137 | 151 | "(here, photographs of receipts), but start without any fully annotated data. We'll walk\n", |
138 | 152 | "through the process of incrementally expanding our test and training sets as we go along\n", |
|
498 | 512 | "### Action Decision\n", |
499 | 513 | "\n", |
500 | 514 | "Next, we need to close the loop and get to an actual decision based on receipts. This\n", |
| 515 | + "looks pretty similar, so we'll present the code without comment.\n", |
| 516 | + "\n", |
| 517 | + "Ordinarily one would start with the most capable model (`o3`, at this time) for a \n", |
| 518 | + "first pass and then, once correctness is established, experiment with different models\n", |
| 519 | + "to analyze the business impact of any tradeoffs, and potentially consider whether \n", |
| 520 | + "they can be remedied with iteration. A client may be willing to take a certain accuracy \n", |
| 521 | + "hit for lower latency or cost, or it may be more effective to change the architecture\n", |
| 522 | + "to hit cost, latency, and accuracy goals. We'll get into how to make these tradeoffs\n", |
| 523 | + "explicitly and objectively later on. \n", |
| 524 | + "\n", |
| 525 | + "For this cookbook, `o3` might be too good. We'll use `o4-mini` for our first pass, so \n", |
| 526 | + "that we get a few reasoning errors we can use to illustrate the means of addressing\n", |
| 527 | + "them when they occur.\n", |
| 528 | + "\n", |
| 529 | + "Next, we need to close the loop and get to an actual decision based on receipts. This\n", |
501 | 530 | "looks pretty similar, so we'll present the code without comment." |
502 | 531 | ] |
503 | 532 | }, |
|
887 | 916 | "metadata": {}, |
888 | 917 | "source": [ |
889 | 918 | "After you run that eval you'll be able to view it in the UI, and should see something\n", |
| 919 | + "like the below. \n", |
| 920 | + "\n", |
| 921 | + "(Note: if you have a Zero-Data-Retention agreement, this data is not stored\n", |
| 922 | + "by OpenAI, so it will not be available in this interface.)\n", |
890 | 923 | "like:\n", |
891 | 924 | "\n", |
892 | 925 | "\n", |
|
1617 | 1650 | "ARE NOT TRAVEL-RELATED, THEN IT MUST BE AUDITED.\n", |
1618 | 1651 | "```\n", |
1619 | 1652 | "\n", |
| 1653 | + "4. We added three examples, JSON input/output pairs wrapped in XML tags.\n", |
1620 | 1654 | "3. We added three examples, JSON input/output pairs wrapped in XML tags.\n", |
1621 | 1655 | "\n", |
1622 | 1656 | "With our prompt revisions, we'll regenerate the data to evaluate and re-run the same\n", |
|