datasheet parser - page.json

Overview Files Versions Dependencies Discussions Activity
Raw
{
  "schema_version": 1,
  "type": "app",
  "slug": "datasheet-parser",
  "title": "datasheet parser",
  "brief": "Parses manufacturer PDF datasheets into structured wiki markdown with extracted diagrams, electrical specs, pin descriptions, and design charts, then publishes them to the Adom Wiki. Delegates extract",
  "version": "1.0.0",
  "tags": [],
  "license": "MIT",
  "discovery_triggers": [
    "parse a datasheet",
    "convert a datasheet",
    "download datasheet",
    "standardize a datasheet",
    "extract specs from datasheet",
    "show me the datasheet",
    "datasheet to wiki",
    "parse pdf datasheet",
    "import datasheet",
    "process datasheets",
    "batch parse datasheets",
    "datasheet parser"
  ],
  "discovery_pitch": "Convert manufacturer PDF datasheets into interactive wiki pages with stat cards, pin cards, and diagram galleries. Extracts text, images, and specs automatically.",
  "sample_prompts": [
    {
      "label": "Parse this datasheet",
      "prompt": "Parse the BME680 datasheet and publish it"
    },
    {
      "label": "Convert to wiki",
      "prompt": "Convert the BHI360 datasheet to wiki format"
    },
    {
      "label": "Extract specs",
      "prompt": "Extract the electrical specs from this VL53L8CX PDF"
    },
    {
      "label": "Standardize",
      "prompt": "Standardize the AP63357DV datasheet for our wiki"
    },
    {
      "label": "Show datasheet",
      "prompt": "Show me the datasheet for the bq27441"
    }
  ],
  "install": {
    "binary_name": "datasheet-parser",
    "install_dir": "",
    "install_hint": "",
    "version_cmd": ""
  },
  "readme": "---\nname: datasheet-parser\ndescription: Parses manufacturer PDF datasheets into structured wiki markdown with extracted diagrams, electrical specs, pin descriptions, and design charts, then publishes them to the Adom Wiki. Use when the user says \"parse a datasheet\", \"convert a datasheet\", \"download datasheet for [part]\", \"standardize a datasheet\", \"extract specs from datasheet\", or \"show me the datasheet for [part]\". Delegates extraction to the ds-extract service (docling + pdfplumber + PyMuPDF + confidence-routed rules engine) and only uses Claude vision on the escalation-queue bboxes.\n---\n\n# Datasheet Parser\n\nParse manufacturer PDF datasheets into structured wiki markdown. **Claude's role here is orchestrator + reviewer, not extractor.** The heavy lifting — rendering pages, running OCR, detecting layout/tables/figures, enumerating figure bboxes from the PDF object tree, computing confidence signals — happens on the `ds-extract` service. Claude only gets called on the bbox crops the service couldn't resolve on its own.\n\n## Architecture\n\n```\nPDF ──► POST /extract (ds-extract)\n          │\n          ├─► 1141 blocks typed by docling\n          ├─► table cells from pdfplumber (fallback when docling empty)\n          ├─► figure bboxes from PyMuPDF object tree\n          ├─► cross-extractor agreement (containment)\n          └─► rules engine → escalation_queue (~40 bboxes, not 67 pages)\n          │\nClaude ◄──┘\n  │\n  ├─► Batch escalation_queue crops via /extract-region → single vision call\n  ├─► Merge answers back into blocks\n  └─► Generate wiki markdown from the merged structured data\n      └─► adom-wiki page publish + asset upload\n```\n\nEverything before the \"Claude\" node is deterministic and runs on the service in a few minutes of CPU. Vision tokens are only spent on the ~40 bbox crops the service's confidence routing couldn't resolve.\n\n## Service URL\n\n```\nhttps://ds-extract-fa4sdo7pnkrl.adom.cloud/\n```\n\n## Interactive Invocation\n\nWhen the user triggers this skill with a bare phrase like \"parse the LM358 datasheet\" — and they have **not** already specified `--visualize` — use `AskUserQuestion` to get their preference:\n\n> **Open the live visualizer while parsing?**\n> - `Yes — open live view in a webview tab` → set `--visualize`\n> - `No — run silently` → proceed without the flag\n\nIf the user already specified, or the skill was invoked from `process-datasheets`, honor that and skip the question. Confirm in one line — `\"Parsing BQ76920 — starting.\"` — and proceed.\n\n## Arguments\n\n- `--visualize` — Open the `datasheet-visualizer` webview tab and emit live progress events through each step\n- `--no-visualize` — Run silently (suppresses the interactive question)\n\n## Queue Integration\n\nA shared queue at `https://wtqihf5e8fsv.adom.cloud` coordinates parsing across agents.\n\n```bash\nds-queue list                             # what needs parsing\nds-queue claim --by $(hostname)           # claim next item (returns id + pdf_url + part)\nds-queue complete <id> --by $(hostname) --wiki-slug \"datasheets/<part>\"\nds-queue fail <id> --by $(hostname) --reason \"<what broke>\"\n```\n\nClaim FIRST to prevent duplicate work when parsing from the queue.\n\n## Workflow\n\n### Step 1: Acquire the PDF\n\nThree sources:\n\n- **User URL** → `curl -sL -o /tmp/<part>.pdf \"$URL\"`\n- **Local path** → use directly\n- **Queue item** — check the `pdf_url` field:\n  - `http*` → download via curl\n  - `/uploads/*` → `curl -sL -o /tmp/<part>.pdf \"https://wtqihf5e8fsv.adom.cloud<upload_path>\"`\n- **No source** → WebSearch for the manufacturer's official PDF. Prefer `ti.com`, `bosch-sensortec.com`, `st.com`, `nxp.com`, `microchip.com`, `analog.com` over aggregators.\n\n### Step 2: Extract via ds-extract service\n\n```bash\ncurl -sS -F pdf=@/tmp/<part>.pdf https://ds-extract-fa4sdo7pnkrl.adom.cloud/extract \\\n  -o /tmp/<part>-extract.json\n```\n\nThis takes ~5–7 minutes on CPU for a 60-page datasheet (cached by sha256; re-requests are instant). Full service contract — endpoints, JSON schema, escalation-queue shape, how to re-run rules without re-extracting: [ds-extract-reference.md](ds-extract-reference.md).\n\n<<<<<<< Updated upstream\nAbridged shape:\n=======\n```bash\nemit_stage() { :; }\nemit_page_rendered() { :; }\nemit_page_annotated() { :; }\nemit_crop() { :; }\nemit_section() { :; }\nemit_published() { :; }\nemit_done() { :; }\n```\n\n### Step 1: Find the Datasheet\n\n```bash\nemit_stage download start\n```\n\nThere are three sources for the PDF:\n\n**A) User provides a URL** — use it directly.\n\n**B) User provides a local file path** — use it directly, skip Step 2.\n\n**C) Queue item** — check the `pdf_url` field:\n- If it starts with `http` → it's a download URL, proceed to Step 2.\n- If it starts with `/uploads/` → it's stored on the ds-queue server. Download it from the queue server:\n  ```bash\n  DS_QUEUE_URL=\"https://wtqihf5e8fsv.adom.cloud\"\n  curl -sL -o /tmp/<partname>.pdf \"$DS_QUEUE_URL<upload_path>\"\n  ```\n\n**D) No URL provided** — use WebSearch to find the official manufacturer datasheet PDF URL. Prefer the manufacturer's site over third-party aggregators. Common sources: ti.com, bosch-sensortec.com, st.com, nxp.com, microchip.com, analog.com.\n\n### Step 2: Download the PDF\n\n```bash\ncurl -sL -o /tmp/<partname>.pdf \"<URL>\"\nls -la /tmp/<partname>.pdf\nemit_stage download done\n```\n\nSkip this step if the PDF is already local (source B or C with `/uploads/` path already downloaded in Step 1).\n\n### Step 3: Extract Raw Text\n\n```bash\nemit_stage extract start\n```\n\n```bash\npdftotext /tmp/<partname>.pdf /tmp/<partname>.txt\n```\n\nRead the extracted text to get a rough overview of the datasheet contents. This text will have formatting issues, wrong reading order, mangled tables, etc. — that's expected. It serves as a **guide** for what's on each page, not the source of truth.\n\nAlso check page count:\n\n```bash\npdfinfo /tmp/<partname>.pdf | grep Pages\nemit_stage extract done\n```\n\n### Step 4: Render Full-Page PNGs and Perform Visual OCR\n\nThis is the core processing step. Full-page PNGs are rendered from the PDF, then Claude's native vision reads each page to extract accurate, well-structured content.\n\n#### 4a. Render pages with pdftoppm\n\n```bash\nemit_stage render start\nmkdir -p /home/adom/project/project-content/datasheets/<partname>\npdftoppm -png -r 300 /tmp/<partname>.pdf /home/adom/project/project-content/datasheets/<partname>/<partname>\n```\n\nThis produces `<partname>-1.png`, `<partname>-2.png`, etc. at 300 DPI — high enough for Claude to read all text, table cells, and fine diagram details.\n\n**CRITICAL — resize before reading.** The Read tool hard-limits at 2000 px per side in many-image requests. A 300-DPI letter/A4 render is ~2480×3508 px, which will crash the conversation the first time you try to Read a batch of pages. Produce a downscaled mirror at ≤1500 px long edge and Read from that mirror during Step 4b; keep the full-resolution originals for crop operations in Step 4c.\n\n```bash\nmkdir -p /home/adom/project/project-content/datasheets/<partname>/pages-1500\nfor png in /home/adom/project/project-content/datasheets/<partname>/<partname>-*.png; do\n  convert \"$png\" -resize '1500x1500>' \"/home/adom/project/project-content/datasheets/<partname>/pages-1500/$(basename \"$png\")\"\ndone\n```\n\nThe `>` flag only shrinks, never upscales — already-small renders pass through untouched.\n\nIf `--visualize` is set, also emit a downscaled preview for each page and a `page_rendered` event:\n\n```bash\nfor png in /home/adom/project/project-content/datasheets/<partname>/<partname>-*.png; do\n  n=$(basename \"$png\" | sed 's/.*-//;s/\\.png$//')\n  convert \"$png\" -resize 800x \"$VIZ_PAGES_DIR/$n.png\"\n  emit_page_rendered \"$n\" \"$n.png\"\ndone\nemit_stage render done\n```\n\n#### 4b. Page-by-page visual analysis\n\n```bash\nemit_stage ocr start\n```\n\nProcess each page PNG with Claude vision. For each page, provide:\n- The full-page PNG (via the Read tool)\n- The corresponding section of `pdftotext` output as a rough reference\n\nFor each page, Claude must extract:\n\n1. **Corrected text** — proper reading order (columns, sidebars, captions), with all formatting issues from `pdftotext` resolved by trusting the visual. When `pdftotext` and the visual disagree, **always trust the visual**.\n\n2. **Tables** — reconstruct as proper markdown tables with correct column alignment, merged cells expanded, and all values accurate. Pay close attention to min/typ/max columns, units, and footnote markers.\n\n3. **Formulas** — convert any mathematical formulas to LaTeX. Wrap in `$$...$$` for display math or `$...$` for inline. Read the formula directly from the PNG — do not rely on `pdftotext` for formula content.\n\n4. **Translations** — if any text is in a non-English language, translate it to English. Do not use any third-party translation tool. Claude translates natively. Preserve the original meaning precisely; for technical terms, keep the standard English engineering term.\n\n5. **Diagram manifest** — for every diagram, figure, chart, graph, photo, or illustration on the page, output a bounding box and metadata:\n   ```\n   DIAGRAM: {\n     id: \"page<N>_fig<M>\",\n     category: \"<category>\",\n     caption: \"<descriptive caption>\",\n     bbox: { x: <left_px>, y: <top_px>, w: <width_px>, h: <height_px> }\n   }\n   ```\n   **Bounding box accuracy is critical.** Follow these rules to avoid mis-crops:\n\n   - **Anchor on visual edges, not captions.** The bbox top (`y`) should be the top pixel of the diagram's graphical content (border, axis line, top of an image element), NOT the \"Figure N\" caption or preceding body text. Similarly the bottom should be the bottom of the graphical content. Include the figure caption/label only if it's visually attached to the diagram (e.g., directly below with no text paragraphs between).\n   - **Use structural landmarks.** At 300 DPI, a typical A4 page is ~2480×3508 px. Use known elements (page margins ~100-150px, headers/footers, column gutters) to cross-check your Y estimates. If a diagram sits in the bottom third of the page, `y` should be >~2300 — if your estimate is <1500, something is wrong.\n   - **Separate body text from diagram labels.** Paragraph text (full-width lines of body copy) is NOT part of the diagram. Callout labels, axis labels, and annotations that are spatially within or adjacent to the diagram ARE part of it.\n   - **Include all associated elements:** arrows, callout text, labels, dimension lines, legends, axis labels, and titles that logically belong to the diagram.\n   - **Add 5% padding** on all sides to catch stray annotations, but clamp so padding doesn't extend into unrelated text regions or off the page.\n\n   Categories:\n   | Category | Examples |\n   |----------|----------|\n   | `overview` | Product photos, block diagrams, system architecture |\n   | `schematic` | Internal schematics, application circuits, reference designs |\n   | `characteristic` | V-I curves, temperature curves, frequency response |\n   | `waveform` | Oscilloscope captures, timing diagrams |\n   | `mechanical` | Package drawings, dimensional drawings, footprints |\n   | `integration` | Mounting guides, thermal design, layout recommendations |\n   | `electrical` | Pin diagrams, pinout views, connector details |\n   | `timing` | Timing diagrams, sequencing charts, state machines |\n   | `design` | Design nomographs, selection charts |\n\n**Token efficiency tips:**\n- For simple pages (mostly one large table), processing is fast — one call per page.\n- For very long datasheets (>50 pages), batch 2-3 consecutive pages of pure spec tables into a single call since they share context.\n- Skip pages that are entirely blank, legal boilerplate, or revision history — note them as skipped.\n- Use parallel subagents to process pages concurrently when possible.\n\n**If `--visualize` is set**, emit a `page_annotated` event after each page's OCR completes, with the bboxes normalized to `[0,1]`. Example — compute normalized coords from the 300 DPI page geometry (e.g. `xn = x / 2480`, `yn = y / 3508` for A4):\n\n```bash\nBOXES='[{\"id\":\"b1\",\"kind\":\"table\",\"label\":\"Electrical Characteristics\",\"bbox\":[0.10,0.28,0.82,0.56]},\n        {\"id\":\"b2\",\"kind\":\"diagram\",\"label\":\"Figure 4: Pinout\",\"bbox\":[0.55,0.72,0.38,0.18]}]'\nemit_page_annotated <page_number> \"$BOXES\"\n```\n\nAt the end of 4b: `emit_stage ocr done`.\n\n#### 4c. Crop diagrams — two-pass approach\n\n```bash\nemit_stage crop start\n```\n\nDiagram cropping uses two passes: a fast programmatic pass for captioned figures, then a vision validation pass to catch issues and find uncaptioned diagrams.\n\n**Pass 1: Automated crop with `crop_figures.py`**\n\n```bash\npython3 ~/.claude/skills/datasheet-parser/crop_figures.py \\\n  /tmp/<partname>.pdf \\\n  /home/adom/project/project-content/datasheets/<partname>/ \\\n  --dpi 300 --padding 20\n```\n\nThis uses PyMuPDF to find `Figure N` captions in the PDF, identify nearby vector drawings and embedded images, exclude body text / headings, and render cropped PNGs directly. It produces a `figures-manifest.json` listing all extracted figures. Runs in seconds, zero vision tokens.\n\n**Pass 2: Vision validation + uncaptioned diagrams**\n\nRead each cropped PNG from Pass 1 and verify:\n- Diagram content is fully visible (not clipped)\n- No unrelated body text / headings leaked in\n- Caption is included\n\nFor any bad crops, use the bounding box from the page PNG to re-crop with `convert`:\n```bash\nconvert <partname>-<page>.png -crop <W>x<H>+<X>+<Y> +repage <output>.png\n```\n\nAlso identify **uncaptioned diagrams** found during the page-by-page visual OCR (Step 4b) that `crop_figures.py` could not detect (e.g., inline diagrams without \"Figure N\" labels). Crop these manually using `convert` with bounding boxes from the visual analysis.\n\n**Post-processing** (apply to all cropped diagrams):\n\n```bash\n# Anti-alias, resize, optimize, trim whitespace, add padding — single command\nconvert <diagram>.png -blur 0x0.5 -unsharp 0x1+0.5+0.05 \\\n  -resize 1200x1200\\> -colors 128 -trim +repage \\\n  -bordercolor white -border 30 <diagram>.png\n```\n\nTarget: each image 30–150KB after optimization.\n\n**If `--visualize` is set**, emit a `crop` event for each diagram as it's validated or rejected:\n\n```bash\nemit_crop \"<crop_id>\" \"<bbox_id_from_ocr>\" validated    # or: detected, rejected\n```\n\nAt the end of 4c: `emit_stage crop done`.\n\n#### 4d. Synthesize across pages\n\n```bash\nemit_stage synthesize start\n```\n\nAfter all pages are processed, combine the per-page outputs into a unified document:\n- Merge tables that span multiple pages (watch for repeated headers)\n- Deduplicate page headers/footers\n- Resolve cross-references (\"see Table 3\", \"refer to Figure 12\")\n- Organize content into the standardized wiki sections (see Step 5)\n\n```bash\nemit_stage synthesize done\n```\n\n### Step 5: Generate Wiki Markdown\n\n```bash\nemit_stage generate start\n```\n\nBuild a structured markdown file that the wiki server renders into the tabbed UI. The wiki's `parseDatasheetSections()` function splits on `## ` headings and maps them to tabs.\n\n**CRITICAL: The heading names must match exactly** — the wiki uses `DS_TAB_MAP` to route sections to tabs:\n\n| Heading | Tab |\n|---------|-----|\n| `## Description` | Overview |\n| `## Key Specifications` | Overview |\n| `## Features` | Overview |\n| `## Pin Configuration` | Pinout |\n| `## Pinout` | Pinout |\n| `## Absolute Maximum Ratings` | Specifications |\n| `## Recommended Operating Conditions` | Specifications |\n| `## Electrical Characteristics` | Specifications |\n| `## Power Consumption` | Specifications |\n| `## Power Domains` | Specifications |\n| `## Thermal Information` | Specifications |\n| `## Timing Accuracy` | Specifications |\n| `## Communication Interface` | Specifications |\n| `## Packages` | Specifications |\n| `## Software API` | Software |\n| `## Applications` | Applications |\n| `## Key Formulas` | Applications |\n| Any unrecognized heading | Specifications (fallback) |\n\n**DO NOT include `## Diagrams`** — the Diagrams tab is populated automatically from screenshot assets uploaded to the wiki.\n\n#### Badge metadata (first 15 lines)\n\nThe wiki scans the first 15 lines for `**Key:** Value` patterns to extract badges:\n\n```markdown\n**Source:** [Texas Instruments Datasheet (SNAS548D)](https://www.ti.com/lit/ds/symlink/lm555.pdf)\n**Manufacturer:** Texas Instruments\n**Part Number:** LM555\n**Document:** SNAS548D — Rev D, January 2015\n```\n\n#### Special rendering rules\n\n1. **Key Specifications** — use a **2-column** table (`Parameter | Value`). The wiki renders this as stat cards:\n   ```markdown\n   ## Key Specifications\n\n   | Parameter | Value |\n   | --- | --- |\n   | Supply Voltage | 4.5V to 16V |\n   | Output Current | 200mA |\n   | Pin Count | 8 |\n   ```\n\n2. **Pin Configuration** — table must have a `Name` column. Optional: `Pin` or `#`, `Type` or `Direction`, `Description` or `Desc` or `Function`. Renders as pin cards:\n   ```markdown\n   ## Pin Configuration\n\n   | Pin | Name | Type | Description |\n   | --- | --- | --- | --- |\n   | 1 | GND | Power | Ground reference |\n   | 2 | TRIG | Input | Trigger input |\n   ```\n\n3. **Features** — use `- ` unordered lists. Renders with triangle bullet markers:\n   ```markdown\n   ## Features\n\n   - Direct replacement for SE555/NE555\n   - Timing from microseconds through hours\n   ```\n\n4. **Formulas** — use LaTeX notation. Display math with `$$...$$`, inline with `$...$`:\n   ```markdown\n   ## Key Formulas\n\n   ### Monostable Mode\n   $$t = 1.1 \\times R \\times C$$\n\n   ### Astable Mode\n   $$f = \\frac{1.44}{(R_A + 2R_B) \\times C}$$\n\n   $$\\text{Duty Cycle} = \\frac{R_A + R_B}{R_A + 2R_B}$$\n   ```\n\n5. **Tables** — use standard GFM pipe tables. All other tables render as styled HTML tables.\n\n6. **Inline formatting** — `**bold**`, `*italic*`, `` `code` ``, `[text](url)` are supported.\n\n#### Complete example\n\n```markdown\n**Source:** [Texas Instruments Datasheet (SNAS548D)](https://www.ti.com/lit/ds/symlink/lm555.pdf)\n**Manufacturer:** Texas Instruments\n**Part Number:** LM555\n**Document:** SNAS548D — Rev D, January 2015\n\n## Description\n\nThe LM555 is a highly stable device for generating accurate time delays or oscillation.\n\n## Key Specifications\n\n| Parameter | Value |\n| --- | --- |\n| Supply Voltage | 4.5V to 16V |\n| Output Current | 200mA |\n| Pin Count | 8 |\n| Temperature Stability | 0.005%/°C |\n\n## Features\n\n- Direct replacement for SE555/NE555\n- Timing from microseconds through hours\n- Operates in both astable and monostable modes\n- Output can source or sink 200mA\n\n## Pin Configuration\n\n| Pin | Name | Type | Description |\n| --- | --- | --- | --- |\n| 1 | GND | Power | Ground reference |\n| 2 | TRIG | Input | Trigger input; starts timing below 1/3 VCC |\n| 3 | OUT | Output | Timer output; sources/sinks 200mA |\n| 4 | RESET | Input | Active-low reset |\n| 5 | CTRL | Input | Control voltage; bypass with 10nF |\n| 6 | THR | Input | Threshold; ends cycle above 2/3 VCC |\n| 7 | DIS | Output | Open-collector discharge |\n| 8 | VCC | Power | Supply 4.5V to 16V |\n\n## Absolute Maximum Ratings\n\n| Parameter | Min | Max | Unit |\n| --- | --- | --- | --- |\n| Supply Voltage | — | 18 | V |\n| Power Dissipation | — | 600 | mW |\n| Storage Temperature | -65 | 150 | °C |\n\n## Electrical Characteristics\n\n| Parameter | Conditions | Min | Typ | Max | Unit |\n| --- | --- | --- | --- | --- | --- |\n| Supply Voltage | — | 4.5 | — | 16 | V |\n| Supply Current (Low) | VCC=5V, RL=∞ | — | 3 | 6 | mA |\n| Supply Current (High) | VCC=15V, RL=∞ | — | 10 | 15 | mA |\n\n## Thermal Information\n\n| Package | RθJA | Unit |\n| --- | --- | --- |\n| PDIP-8 | 97 | °C/W |\n| SOIC-8 | 149 | °C/W |\n\n## Packages\n\n| Package | Pins | Body Size |\n| --- | --- | --- |\n| PDIP (P) | 8 | 9.81mm × 6.35mm |\n| SOIC (D) | 8 | 4.90mm × 3.91mm |\n\n## Applications\n\n- Precision timing\n- Pulse generation\n- Sequential timing\n- PWM generation\n\n## Key Formulas\n\n### Monostable Mode\n$$t = 1.1 \\times R \\times C$$\n\n### Astable Mode\n$$f = \\frac{1.44}{(R_A + 2R_B) \\times C}$$\n\n$$\\text{Duty Cycle} = \\frac{R_A + R_B}{R_A + 2R_B}$$\n```\n\nSave this file as:\n```\n/home/adom/project/project-content/datasheets/<partname>/<partname>.md\n```\n\nAnd also write a temporary copy for publishing:\n```bash\ncp project-content/datasheets/<partname>/<partname>.md /tmp/<partname>-body.md\n```\n\n**If `--visualize` is set**, emit a `section` event for each generated section so the wiki-preview pane can fill it in live. Use the tab IDs from `DS_TAB_MAP`: `overview`, `pinout`, `specifications`, `software`, `applications`.\n\n```bash\n# Example: emit the overview section after writing it\nHTML_B64=$(markdown-to-html \"/tmp/overview.md\" | base64 -w0)   # use any md→html tool available\nemit_section overview \"$HTML_B64\"\n```\n\nIf no converter is available, the simplest path is to send the raw markdown as HTML wrapped in `<pre>` — the UI will render it verbatim and that's good enough for the live-preview case.\n\nAt the end of Step 5: `emit_stage generate done`.\n\n### Step 6: Prepare Metadata JSON\n\nCreate a metadata JSON file for the wiki page:\n>>>>>>> Stashed changes\n\n```json\n{\n  \"pdf_hash\": \"sha256...\",\n  \"page_count\": 67,\n  \"pages\": [\n    {\n      \"page\": 1, \"width_pt\": 612, \"height_pt\": 792,\n      \"page_png_url\": \"/artifact/<hash>/p1.png\",\n      \"blocks\": [\n        {\n          \"block_id\": \"p1_b4\", \"type\": \"title\",\n          \"bbox_pt\": [x0, y0, x1, y1],\n          \"text\": \"BQ76920, BQ76930, BQ76940\",\n          \"confidence\": 0.9,\n          \"cross_agreement\": {\"pdftotext\": 1.0, \"pdfplumber\": 1.0},\n          \"findings\": [...]\n        }\n      ],\n      \"object_figures\": [...]\n    }\n  ],\n  \"document_findings\": [...],\n  \"escalation_queue\": [\n    {\n      \"block_id\": \"p8_b4\",\n      \"page\": 8, \"bbox_pt\": [...], \"block_type\": \"table\",\n      \"priority\": 20, \"rule\": \"table.has_cells\",\n      \"question\": \"Extract this table as markdown...\"\n    }\n  ],\n  \"rules_summary\": {\n    \"block_findings\": {\"info\": 113, \"warn\": 145, \"escalate\": 40},\n    \"escalations\": 40\n  }\n}\n```\n\n### Step 3: Process the escalation queue\n\nFor every entry in `escalation_queue`, fetch a tight bbox crop and ask Claude vision the rule-specific question. Each crop is ~100–500 px wide — tiny. Batching strategy, merge semantics per block type: [ds-extract-reference.md](ds-extract-reference.md) § Processing the queue.\n\n```bash\n# For each escalation:\ncurl -sS -X POST https://ds-extract-fa4sdo7pnkrl.adom.cloud/extract-region \\\n  -H 'Content-Type: application/json' \\\n  -d '{\"pdf_hash\": \"...\", \"page\": 8, \"bbox_pt\": [x0, y0, x1, y1], \"dpi\": 300}' \\\n  -o /tmp/crop-<block_id>.png\n```\n\nThen **Read** each crop PNG and answer the `question` field. Batch where possible — reading 5–10 small crops in one turn is cheap.\n\nFor tables: expect a markdown table back. For figures: a \"complete / partial / phantom\" verdict plus a brief description. For low-agreement text: the verbatim text.\n\nWrite each verdict back into the corresponding block's `text` or `table_cells` field in memory.\n\n### Step 4: Assemble the wiki markdown\n\nWalk the merged `pages[].blocks` and group by docling's semantic section structure. The wiki renders sections by exact heading match — see [publish-and-events-reference.md](publish-and-events-reference.md) § Heading-to-tab routing for the full `DS_TAB_MAP`.\n\nKey rules (the wiki silently drops content that breaks these):\n\n- **Key Specifications** must be a 2-column table (`Parameter | Value`)\n- **Pin Configuration** must have a `Name` column; optional `Pin`, `Type`, `Description`\n- **Formulas** use LaTeX: `$$...$$` (display) or `$...$` (inline)\n- **Features** use `- ` unordered lists\n- **Do NOT include `## Diagrams`** — that tab is populated from `screenshot` assets\n\nSave the body to `/home/adom/project/project-content/datasheets/<part>/<part>.md` plus a temp copy at `/tmp/<part>-body.md`.\n\n### Step 5: Publish\n\n```bash\nadom-wiki page publish \"datasheets/<part>\" \\\n  --title \"<Part> — <one-line description>\" \\\n  --brief \"<2–3 sentence summary>\" \\\n  --body-md /tmp/<part>-body.md \\\n  --changelog \"Parsed from <manufacturer> <doc number>\" \\\n  --sample-prompt \"Show me the datasheet for <part>\" \\\n  --sample-prompt \"Parse the <family> datasheet\" \\\n  --sample-prompt \"<3–5 trigger phrases total>\"\n\n# Set metadata\nadom-wiki page edit --field metadata \\\n  --body-md /home/adom/project/project-content/datasheets/<part>/metadata.json \\\n  \"datasheets/<part>\"\n```\n\n### Step 6: Upload the hero + screenshot assets\n\nFor each `figure` block in the extraction that passed rules (or that Claude validated), fetch its crop and upload:\n\n```bash\n# Hero: pick the functional block diagram or overview\nadom-wiki asset upload datasheets/<part> --asset-type hero-image \\\n  --file <best-figure>.png \\\n  --caption \"<description>\"\n\n# Screenshots: every notable figure\nadom-wiki asset upload datasheets/<part> --asset-type screenshot \\\n  --file <figure>.png \\\n  --caption \"<figure caption from paired caption block>\"\n```\n\nTarget 5–20 screenshots. Prefer the figures with captions already paired by ds-extract. **Always explicitly set `hero_asset_id`** to avoid the implicit-fallback bug — see [publish-and-events-reference.md](publish-and-events-reference.md) § Hero image pitfall.\n\n### Step 7: Store artifacts + complete the queue\n\n```bash\n# PDF, extracted JSON, and all crops go to project-content\ncp /tmp/<part>.pdf /home/adom/project/project-content/datasheets/<part>/\nmv /tmp/<part>-extract.json /home/adom/project/project-content/datasheets/<part>/\n\n# Mark queue item complete (if this came from the queue)\nds-queue complete <id> --by $(hostname) --wiki-slug \"datasheets/<part>\"\n```\n\nFull directory layout + queue semantics: [publish-and-events-reference.md](publish-and-events-reference.md) § Step 7.\n\n## Live Visualization\n\nWhen `--visualize` is set, run the `datasheet-visualizer` service and emit stage/page/section/crop events so the user can watch live. Details (service install, event schema, stage names, emit helpers, batch behavior between datasheets): see [publish-and-events-reference.md](publish-and-events-reference.md) § Live Visualization.\n\n## Troubleshooting\n\n| Symptom | Cause | Fix |\n| --- | --- | --- |\n| `/extract` 500 or hangs > 10 min | Service container OOM or docling crashed | `curl /health`; if down, SSH in and `bash ~/ds-extract/start.sh`. See service logs at `/tmp/ds-extract.log` |\n| Vendor PDF GET hangs / `HTTP/2 stream INTERNAL_ERROR` (curl exit 92) | ST / Akamai (and some other Akamai-fronted vendor CDNs) silently RST the connection on curl/reqwest's HTTP/2 fingerprint | **Fall through to `wget`** — its HTTP/1.1 negotiation looks browser-enough to get through. Pattern: `wget -q --tries=2 --timeout=30 --user-agent \"Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0\" -O /tmp/<part>.pdf <url>`. Confirmed 2026-04-26 against `https://www.st.com/resource/en/datasheet/vl53l8cx.pdf` — curl/reqwest both reset; wget got the full 3.6 MB %PDF in ~12 s. The `aci component` orchestrator now does this fall-through automatically. |\n| Mouser/DigiKey URL returns ~13 KB HTML \"Access denied\" | Vendor anti-bot fronting | Always pass a Firefox UA; if still blocked, switch to the manufacturer's canonical CDN (e.g. `diodes.com/assets/Datasheets/<part>.pdf`, `datasheets.raspberrypi.com/<part>/<part>-datasheet.pdf`). Always verify the response starts with `%PDF-` magic bytes before posting to ds-extract — the 13 KB HTML page would otherwise eat 7 minutes in the parser |\n| Many tables in escalation queue | docling + pdfplumber both failed on this table style | Expected — hand off to Claude vision on those bboxes. If > 50% of tables fail, the PDF is likely image-only (scanned) and needs OCR preprocessing |\n| Wiki page has no tabs | Heading names don't match `DS_TAB_MAP` exactly | See `DS_TAB_MAP` in publish-and-events-reference.md — heading strings are case-sensitive |\n| Hero image shows something unexpected | `hero_asset_id` not set explicitly and wiki picked an older asset | Always call `page edit --field hero_asset_id <id>` after uploading the hero |\n| Slug collision with existing molecule page | `datasheets/<slug>` and an existing molecule share the slug | Datasheet content merges into the molecule page (unified `pages` table). Decide intentionally: keep merged, or publish to `<slug>-datasheet` |\n| `adom-wiki page publish` rejects \"requires 3-8 sample prompts\" | Missing `--sample-prompt` flags | Add 3–5 `--sample-prompt` flags (trigger phrases) |\n\n## Dependencies\n\nThe ds-extract service handles all Python/OCR deps. On the user's container you only need:\n\n- `curl` (every container has this)\n- `adom-wiki` CLI (installed by `node install.mjs`)\n- `ds-queue` CLI (same)\n",
  "author": {
    "id": "695820315b5f1e4db2fcf602",
    "name": "Kyle Bergstedt",
    "email": "[email protected]"
  },
  "visibility": {
    "public": true
  },
  "hero": null,
  "metadata": {},
  "created_at": "2026-05-28T05:29:12.641Z",
  "updated_at": "2026-05-28T05:29:12.641Z",
  "skills": []
}