Issue: PDFs can be image-based (scanned photos) rather than text-based (digital exports). A pure scan has no text layer, so an anchor phrase has nothing to match against.
Fix: Run image-based PDFs through an OCR layer (Google Vision, Microsoft Read) before attempting anchor-based extraction.
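A minimal sketch of how this fix might look in practice: check whether a page already has a usable text layer, fall back to OCR when it does not, and only then search for the anchor. The function names, the 20-character threshold, and the `ocr_engine` callable are illustrative assumptions. `ocr_engine` is a stand-in for whichever OCR service is used, not a real vendor API.

```python
import re
from typing import Callable, Optional

# Hypothetical threshold: pages whose text layer has fewer non-space
# characters than this are treated as image-based (scanned).
MIN_TEXT_CHARS = 20


def needs_ocr(page_text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when the embedded text layer looks empty or near-empty."""
    visible = re.sub(r"\s+", "", page_text or "")
    return len(visible) < min_chars


def get_page_text(page_text: str, page_image: bytes,
                  ocr_engine: Callable[[bytes], str]) -> str:
    """Use the text layer when present; otherwise fall back to OCR.

    `ocr_engine` is a placeholder callable (image bytes -> text),
    not a real vendor API.
    """
    if needs_ocr(page_text):
        return ocr_engine(page_image)
    return page_text


def extract_after_anchor(text: str, anchor: str) -> Optional[str]:
    """Return the rest of the line following `anchor`, or None if absent."""
    match = re.search(re.escape(anchor) + r"[ \t]*(.+)", text)
    return match.group(1).strip() if match else None
```

With a scanned page, `get_page_text("", image_bytes, ocr_engine)` would call the OCR engine, and `extract_after_anchor(text, "Total Due:")` would then return whatever follows the anchor on that line. A character-count heuristic is crude; real documents may need a per-page check or a confidence score from the OCR engine.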
Start: Is the data in a structured table?
├─ Yes → Use Data Scraper (UiPath) / Extract Data (AA)
│        If table rows/cols change → Use wildcard selectors
│
├─ No → Is it plain text on screen?
│   ├─ Yes → Screen Scrape (FullText / OCR if image-based)
│   └─ No → Is it inside a PDF / scanned doc?
│       ├─ Yes → OCR + anchor phrases (e.g., "Total Due:")
│       └─ No → Use regex on raw text source
│
└─ Is the data inside an email or API response?
    → Use specific connectors (IMAP, HTTP) + parse JSON/HTML
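The decision tree above can also be written as a small routing function, which may be useful for documenting or unit-testing the choice of method. This is a rough sketch: the `DataSource` class, its field names, and the returned labels are hypothetical, and checking the email/API branch before the regex fallback is one interpretation of the tree, not something the tools prescribe.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """Hypothetical description of where the target data lives."""
    is_structured_table: bool = False
    is_plain_text_on_screen: bool = False
    is_pdf_or_scan: bool = False
    is_email_or_api: bool = False


def choose_extraction_method(src: DataSource) -> str:
    """Walk the decision tree top to bottom and return the suggested method."""
    if src.is_structured_table:
        return "table extractor (Data Scraper / Extract Data)"
    if src.is_plain_text_on_screen:
        return "screen scrape (FullText, or OCR if image-based)"
    if src.is_pdf_or_scan:
        return "OCR + anchor phrases"
    if src.is_email_or_api:
        # Assumed placement: checked before the generic regex fallback.
        return "connector (IMAP / HTTP) + JSON/HTML parsing"
    return "regex on raw text source"
```

For example, `choose_extraction_method(DataSource(is_pdf_or_scan=True))` returns the OCR-plus-anchor route, while a source with no flags set falls through to regex on the raw text.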
Most enterprise RPA tools (UiPath, Automation Anywhere, Blue Prism, Microsoft Power Automate) include extractor wizards. These typically fall into four distinct methodologies: