Web crawling
Add public websites as sources without signing in to a third-party document provider. Web crawling is useful for help centers, marketing sites, or any other HTML pages your team wants Mention to treat like other documentation.
What gets synced
You give Mention a starting URL and options such as how deep to follow links and how many pages to include. The crawl visits linked pages (respecting normal web rules like robots.txt where applicable) and stores each page as readable text (similar to Markdown) so headings and structure stay useful.
Individual pages can end up in different states: completed, skipped because of depth or page limits, blocked by site policy, or errored if the page could not be loaded. Your admin UI shows which pages succeeded so you can decide what to activate.
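To make the depth limit, page limit, and page states concrete, here is a minimal sketch of a breadth-first crawl over an in-memory link graph. It is an illustration only, not Mention's implementation: the `PageStatus` names, the `crawl` function, and its parameters are all hypothetical, and no network requests are made.

```python
from collections import deque
from enum import Enum


class PageStatus(Enum):
    # Hypothetical state names mirroring the states described above.
    COMPLETED = "completed"
    SKIPPED = "skipped"
    BLOCKED = "blocked"
    ERRORED = "errored"


def crawl(root, links, max_depth, max_pages,
          disallowed=frozenset(), broken=frozenset()):
    """Walk `links` (URL -> list of linked URLs) breadth-first from `root`.

    `disallowed` stands in for pages blocked by site policy and `broken`
    for pages that fail to load. Returns a dict of URL -> PageStatus.
    """
    statuses = {}
    completed = 0
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in statuses:
            continue  # already handled via another link
        if url in disallowed:
            statuses[url] = PageStatus.BLOCKED
        elif depth > max_depth or completed >= max_pages:
            statuses[url] = PageStatus.SKIPPED
        elif url in broken:
            statuses[url] = PageStatus.ERRORED
        else:
            statuses[url] = PageStatus.COMPLETED
            completed += 1
            # Only completed pages contribute new links to follow.
            for target in links.get(url, []):
                if target not in statuses:
                    queue.append((target, depth + 1))
    return statuses


# A tiny made-up site: paths stand in for full URLs.
site = {
    "/": ["/docs", "/pricing", "/admin"],
    "/docs": ["/docs/setup", "/docs/missing"],
    "/docs/setup": ["/docs/setup/advanced"],
}
result = crawl("/", site, max_depth=2, max_pages=10,
               disallowed={"/admin"}, broken={"/docs/missing"})
for url, status in result.items():
    print(f"{url}: {status.value}")
```

In this example the page three links away from the root is skipped because of the depth limit, while `/admin` is blocked and `/docs/missing` is errored. Only the completed pages would be candidates for activation.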
How to connect
There is no separate vendor login for crawling itself.
- In Mention, create a web crawl (or site crawl) with the root URL you want to start from.
- Set depth and maximum pages to match how much of the site you want (your admin guide may recommend starting small).
- Run the crawl and wait until it finishes or reports a status.
- Review discovered pages and activate the ones Mention should use.
Large sites can take several minutes. Very long runs may time out; if that happens, retry the crawl or narrow it by lowering the depth or the maximum number of pages. See Connecting sources for where this fits in your overall source setup.
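The advice to start small and narrow a crawl after a timeout can be sketched as a settings object. This is a hypothetical illustration: `CrawlOptions`, its field names, its default values, and the `narrowed` helper are assumptions made for the example and do not correspond to a documented Mention API. In the product you set these values in the admin UI.

```python
from dataclasses import dataclass, replace
from urllib.parse import urlparse


@dataclass(frozen=True)
class CrawlOptions:
    """Hypothetical crawl settings; defaults are deliberately small."""
    root_url: str
    max_depth: int = 1
    max_pages: int = 25

    def __post_init__(self):
        parsed = urlparse(self.root_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(
                f"root_url must be an absolute http(s) URL: {self.root_url!r}"
            )
        if self.max_depth < 0:
            raise ValueError("max_depth must be 0 or greater")
        if self.max_pages < 1:
            raise ValueError("max_pages must be at least 1")

    def narrowed(self):
        """Return a smaller crawl to try after a timeout (one assumed policy):
        one level shallower and half as many pages."""
        return replace(
            self,
            max_depth=max(self.max_depth - 1, 0),
            max_pages=max(self.max_pages // 2, 1),
        )


first_try = CrawlOptions("https://help.example.com/", max_depth=3, max_pages=200)
retry = first_try.narrowed()
print(retry)
```

Halving the page limit is only one way to narrow a crawl. Starting from a more specific root URL, such as a single section of the site, is another.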
Things to know
- Only publicly reachable pages that you are allowed to crawl are good candidates; pages behind a login and heavily dynamic sites may crawl poorly or be skipped.
- Robots.txt and similar rules can mark pages as disallowed; those pages will not be imported as usable content.
- After a crawl completes, new pages often start inactive until you explicitly activate them, as with other sources.
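If you want to check ahead of time whether a page is likely to be disallowed, Python's standard `urllib.robotparser` module can evaluate robots.txt rules locally. The sketch below uses a made-up robots.txt and a made-up user agent string (`example-crawler`); the user agent that Mention's crawler actually sends is not stated here, so treat the result as a rough preview.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site serves these at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("https://example.com/help/getting-started",
            "https://example.com/internal/notes"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, "example-crawler", url) else "disallowed"
    print(f"{url}: {verdict}")
```

Here the help page is allowed and the page under `/internal/` is disallowed, so only the first would be expected to import as usable content.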
Sync behavior
Crawls can run on a daily schedule so your snapshot of the site stays current. You can also start a new crawl manually (recrawl) when you publish major documentation updates and want Mention to re-fetch the site.
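The interplay between the daily schedule and a manual recrawl can be pictured with a small check. This is a sketch under assumptions: the `is_crawl_due` function, the 24-hour interval measured from the last finished crawl, and the `manual_recrawl` flag are illustrative and do not describe how Mention schedules crawls internally.

```python
from datetime import datetime, timedelta, timezone


def is_crawl_due(last_finished, now,
                 interval=timedelta(days=1), manual_recrawl=False):
    """Decide whether a crawl should start.

    A manual recrawl always starts; otherwise a crawl is due when it has
    never run or when `interval` has passed since it last finished.
    """
    if manual_recrawl or last_finished is None:
        return True
    return now - last_finished >= interval


last = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
same_day = datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc)
next_day = datetime(2024, 1, 2, 8, 0, tzinfo=timezone.utc)

print(is_crawl_due(last, same_day))                       # False: too soon
print(is_crawl_due(last, next_day))                       # True: a day has passed
print(is_crawl_due(last, same_day, manual_recrawl=True))  # True: forced recrawl
```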
Activation limit
You can activate up to 500 pages per crawl.
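If a crawl discovers more than 500 pages, you have to choose which ones to activate. The sketch below shows one way to split a prioritized list of pages against that limit. The `select_pages_to_activate` function is hypothetical, and it assumes that pages already active in the same crawl count toward the limit, which the text above does not specify.

```python
MAX_ACTIVE_PAGES_PER_CRAWL = 500  # limit stated above


def select_pages_to_activate(completed_urls, already_active=0,
                             limit=MAX_ACTIVE_PAGES_PER_CRAWL):
    """Split `completed_urls` (ordered by priority) into pages that fit
    under the limit and pages that do not."""
    remaining = max(limit - already_active, 0)
    return completed_urls[:remaining], completed_urls[remaining:]


# 620 made-up page URLs, already sorted with the most important first.
pages = [f"https://help.example.com/articles/{n}" for n in range(620)]
to_activate, left_inactive = select_pages_to_activate(pages)
print(len(to_activate), len(left_inactive))
```

Ordering the list before splitting it matters, because everything past the limit stays inactive.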