web surface inspector
fetch robots.txt, sitemap.xml, and the home page's <head> in parallel to summarise the public web surface a domain advertises. part of the drwho.me domain dossier.
web-surface
infofetching…
Overview
three signals, one glance. `robots.txt` at the domain root declares crawler rules (which user-agents may fetch which paths) — its presence and contents hint at indexability intent. `sitemap.xml` (or a linked `sitemap_index.xml`) enumerates the URLs a site wants indexed; the count of `<loc>` entries is a crude measure of site footprint. the home-page `<head>` carries the visible-to-search-and-social-previews metadata: the `<title>`, the `<meta name="description">`, the OpenGraph (`og:*`) set used by facebook/linkedin/slack link unfurls, and the twitter card (`twitter:*`) set. this tool issues three parallel GETs with a shared 5s timeout, truncates bodies for safety (4KB for robots, 64KB for head), and does best-effort regex extraction — not a full HTML parser. if the home page fails, the whole check errors; robots and sitemap are treated as optional and silently marked absent on non-2xx or connection errors.
How to use
- enter a bare domain — public fqdn only. no schemes, ports, or paths.
- read robots / sitemap as indexability signals — presence of both is what a well-SEO'd content site looks like. a landing page may have neither — that's fine, it just means crawlers get no explicit guidance and the home page alone is the indexable surface. click the robots `<details>` to inspect the first 4KB of rules.
- inspect the OG / Twitter block for social previews — when you paste a URL into slack/discord/linkedin/twitter, the unfurl card is built from `og:title`, `og:description`, `og:image`, and `twitter:card`. missing `og:image` means a link to this page will render as a plain text row with no thumbnail. missing `twitter:card` defaults to `summary` in most clients — fine for text-heavy pages, bad for marketing pages.
Examples
input
github.comoutput
robots.txt: present (standard crawl-rules for google/bing/etc). sitemap.xml: present, a sitemap-index with thousands of sub-sitemaps (this tool reports the top-level `<loc>` count, not the recursed total). head: title "GitHub · Build and ship software on a single, collaborative platform", full OG set with `og:image`, `twitter:card: summary_large_image`. a textbook content-heavy site.input
example.comoutput
robots.txt: absent. sitemap.xml: absent. head: title "Example Domain", no description, no OG, no twitter tags. a deliberately minimal landing page — no indexable surface beyond the home page and no social-preview metadata. a link to example.com in slack unfurls as just the URL.Common mistakes
regex head parsing misses some HTML edge cases— this tool extracts `<title>`, `<meta>`, and OG/Twitter tags using regex against the first 64KB of the home-page body — it does NOT build a DOM. unusual but valid HTML (CDATA in `<title>`, attribute values with single-quoted contents that themselves contain double quotes, comments hiding fake `<meta>` elements, SSR'd head content injected via client-side JS) will confuse it. if the fields come back empty on a page you know has them, the real HTML parser in a browser is authoritative — this is a best-effort view.the sitemap you see may be a sitemap-index— larger sites publish `/sitemap.xml` as a list of sub-sitemap URLs rather than a list of content URLs. this tool counts top-level `<loc>` elements unconditionally — so an index file of 50 sub-sitemaps reports `urlCount: 50`, not the sum of each sub-sitemap's URL count. to get the real number you'd have to fetch and recurse each `<loc>`. that's out of scope here.robots is only fetched at /robots.txt — no per-subpath check— the `robots.txt` standard says the file lives at exactly the domain root. but crawlers also respect `<meta name="robots">` tags inside specific pages, and some CDN setups serve a different robots.txt per subpath via rewrite rules. this tool only reports the root-level file. a page with `<meta name="robots" content="noindex">` will still get crawled past robots.txt; the noindex happens later.
FAQ
what's the difference between `og:*` and `twitter:*` tags?
`og:*` (OpenGraph) is the facebook-originated standard that most platforms (linkedin, slack, discord, whatsapp) now read. `twitter:*` (Twitter Cards) is twitter-specific and predates most OG readers falling back to OG. best practice: set both. twitter will use `twitter:*` if present, otherwise fall back to `og:*`. most other platforms read only OG. to avoid drift, set `twitter:card` (for layout), `twitter:site`, and let everything else come from OG.
why doesn't this tool follow `sitemap: <url>` lines inside robots.txt?
robots.txt may include one or more `Sitemap: https://…` directives pointing at non-standard sitemap locations. this tool does not parse those — it only fetches the conventional `/sitemap.xml`. so a site that publishes its sitemap at `/sitemap_google.xml` will render as `sitemap.xml: absent` here even though search engines find it just fine via robots.txt. a future refinement could parse robots and follow sitemap lines; for now the check is fixed-path.
how big is the response body I'm parsing?
the home page body is truncated to the first 64KB before head-parsing, so if the `<head>` closes past byte 65536 the tool will silently miss it. in practice heads are almost always in the first 8-16KB. the robots.txt body is truncated to 4KB for display purposes. the sitemap body is not truncated for parsing — we need to count all `<loc>` tags.
why is the whole check `error` if robots is missing, but `ok` if sitemap is missing?
it isn't — both robots and sitemap are treated as optional. a 404 on either is fine; the check stays `ok`. the check only errors if the *home page* fetch (`GET https://<domain>/`) itself fails — DNS failure, TCP refusal, or the request body throws mid-read. in that case the other two signals are meaningless because we have no head to summarise.
why doesn't the tool send my cookies or custom user-agent?
we send a fixed `User-Agent: drwho-dossier/1.0 (+https://drwho.me)` and no cookies. sites that serve a different home page to logged-in users or that bot-detect on the UA will show you the anonymous-crawler view, which is also what search engines and social-preview bots see. so this view is what the public web sees, which is usually what you want for an SEO dossier.
Related tools
dns records lookup— resolve A, AAAA, NS, SOA, CAA, and TXT records for a domain in one go.security headers checker— inspect the response headers served at https://<domain>/ — HSTS, CSP, X-Frame-Options, etc.redirect checker— trace the HTTP(S) redirect chain from https://<domain>/ up to 10 hops.