web surface inspector

fetch robots.txt, sitemap.xml, and the home page's <head> in parallel to summarise the public web surface a domain advertises. part of the drwho.me domain dossier.

web-surface

info

HTTPS surface reachable (robots ✓, sitemap ✓, title ✓)

Why it matters: Public files — robots.txt, sitemap.xml, head meta — are what attackers see first during reconnaissance. Misadvertised paths, stale sitemaps, and verbose generators leak more than intended (ISO 27001 A.8.9).

robots.txt

present

Sitemap: https://firstdraftnews.org/sitemap.xml
Sitemap: https://firstdraftnews.org/news-sitemap.xml
# START YOAST BLOCK
# ---------------------------
User-agent: *
Disallow:

Sitemap: https://firstdraftnews.org/sitemap_index.xml
# ---------------------------
# END YOAST BLOCK

sitemap.xml

present — 3 url(s)

head

title: Home
description: —

social

og:url: https://firstdraftnews.org/
og:type: website
og:title: Home
og:description
og:locale: en_US
og:site_name: First Draft
twitter:card: summary_large_image
twitter:site: @firstdraftnews

fetched 2026-05-23T09:33:17.862Z

Overview

three signals, one glance. `robots.txt` at the domain root declares crawler rules (which user-agents may fetch which paths); its presence and contents hint at indexability intent. `sitemap.xml` (or a linked `sitemap_index.xml`) enumerates the URLs a site wants indexed; the count of `<loc>` entries is a crude measure of site footprint. the home-page `<head>` carries the metadata that search results and social previews display: the `<title>`, the `<meta name="description">`, the OpenGraph (`og:*`) set used by facebook/linkedin/slack link unfurls, and the twitter card (`twitter:*`) set. this tool issues three parallel GETs with a shared 5s timeout, truncates the robots and home-page bodies for safety (4KB and 64KB respectively; the sitemap body is left whole so every `<loc>` can be counted), and does best-effort regex extraction rather than full HTML parsing. if the home page fails, the whole check errors; robots and sitemap are treated as optional and silently marked absent on non-2xx responses or connection errors.
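
the fetch flow described above can be sketched roughly as follows. this is a minimal illustration, not the tool's actual source: the `Fetcher` callable, `optional_fetch`, `inspect_surface`, and the result keys are all hypothetical names, and the fetcher is injected so the flow can be exercised without a network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

ROBOTS_LIMIT = 4 * 1024   # bytes of robots.txt kept
HEAD_LIMIT = 64 * 1024    # bytes of the home page kept for head parsing
TIMEOUT_S = 5.0           # one timeout value shared by all three requests

# Hypothetical fetcher type: takes (url, timeout) and returns the body bytes,
# raising on non-2xx responses or connection errors.
Fetcher = Callable[[str, float], bytes]


def optional_fetch(fetch: Fetcher, url: str) -> Optional[bytes]:
    """Fetch a resource that is allowed to be missing; None means absent."""
    try:
        return fetch(url, TIMEOUT_S)
    except Exception:
        return None


def inspect_surface(domain: str, fetch: Fetcher) -> dict:
    base = f"https://{domain}"
    with ThreadPoolExecutor(max_workers=3) as pool:
        # All three requests are submitted before any result is awaited.
        robots_f = pool.submit(optional_fetch, fetch, f"{base}/robots.txt")
        sitemap_f = pool.submit(optional_fetch, fetch, f"{base}/sitemap.xml")
        home_f = pool.submit(fetch, f"{base}/", TIMEOUT_S)
        robots, sitemap = robots_f.result(), sitemap_f.result()
        try:
            home = home_f.result()
        except Exception as exc:
            # A failed home page fails the whole check.
            return {"status": "error", "reason": str(exc)}
    return {
        "status": "ok",
        "robots": None if robots is None else robots[:ROBOTS_LIMIT],
        "sitemap": sitemap,          # left whole so every <loc> can be counted
        "home": home[:HEAD_LIMIT],
    }
```

injecting the fetcher keeps the optional-versus-required distinction visible in one place: only the home-page future is allowed to raise.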

How to use

enter a bare domain — public fqdn only. no schemes, ports, or paths.
read robots / sitemap as indexability signals — presence of both is what a well-SEO'd content site looks like. a landing page may have neither — that's fine, it just means crawlers get no explicit guidance and the home page alone is the indexable surface. click the robots `<details>` to inspect the first 4KB of rules.
inspect the OG / Twitter block for social previews — when you paste a URL into slack/discord/linkedin/twitter, the unfurl card is built from `og:title`, `og:description`, `og:image`, and `twitter:card`. missing `og:image` means a link to this page will render as a plain text row with no thumbnail. missing `twitter:card` defaults to `summary` in most clients — fine for text-heavy pages, bad for marketing pages.
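
the input rule from the first step (bare public fqdn, no schemes, ports, or paths) could be checked along these lines. the exact rule the tool applies is not documented here, so the regex and the `is_bare_domain` name below are assumptions chosen for illustration.

```python
import re

# One DNS label: 1-63 chars of letters/digits/hyphens, no leading or trailing hyphen.
_LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
# Assumed rule: at least two labels, and a final label that starts with a letter
# (this rejects IPv4 literals and single-label names such as "localhost").
_FQDN = re.compile(rf"^(?:{_LABEL}\.)+[a-z][a-z0-9-]{{0,61}}[a-z0-9]$")


def is_bare_domain(value: str) -> bool:
    """Return True for input like 'github.com'; False for URLs, ports, or paths."""
    candidate = value.strip().lower()
    return len(candidate) <= 253 and _FQDN.fullmatch(candidate) is not None
```

because `:` and `/` are not valid label characters, schemes, ports, and paths are rejected without any special-casing.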

Examples

example 1 — when the root sitemap is itself an index file, the `urlCount` number underrepresents the real URL surface — the tool does not recurse.

input
github.com

output
robots.txt: present (standard crawl-rules for google/bing/etc). sitemap.xml: present, a sitemap-index with thousands of sub-sitemaps (this tool reports the top-level `<loc>` count, not the recursed total). head: title "GitHub · Build and ship software on a single, collaborative platform", full OG set with `og:image`, `twitter:card: summary_large_image`. a textbook content-heavy site.

example 2 — this is the baseline: it shows you what 'no SEO, no social cards' looks like. anything above this is intentional.

input
example.com

output
robots.txt: absent. sitemap.xml: absent. head: title "Example Domain", no description, no OG, no twitter tags. a deliberately minimal landing page — no indexable surface beyond the home page and no social-preview metadata. a link to example.com in slack unfurls as just the URL.

Common mistakes

regex head parsing misses some HTML edge cases: this tool extracts `<title>`, `<meta>`, and OG/Twitter tags by running regexes over the first 64KB of the home-page body; it does NOT build a DOM. unusual but valid HTML (CDATA in `<title>`, single-quoted attribute values that themselves contain double quotes, comments hiding fake `<meta>` elements) will confuse it, and head content injected at runtime by client-side JS never appears in the fetched HTML at all. if the fields come back empty on a page you know has them, trust a real HTML parser such as the browser's; this is a best-effort view.
the sitemap you see may be a sitemap-index: larger sites publish `/sitemap.xml` as a list of sub-sitemap URLs rather than a list of content URLs. this tool counts top-level `<loc>` elements unconditionally, so an index file of 50 sub-sitemaps reports `urlCount: 50`, not the sum of each sub-sitemap's URL count. to get the real number you'd have to fetch each `<loc>` and recurse into it. that's out of scope here.
robots is only fetched at /robots.txt, with no per-subpath check: the `robots.txt` standard places the file at exactly `/robots.txt` on each host. but crawlers also respect `<meta name="robots">` tags inside specific pages, and some CDN setups serve a different robots.txt per subpath via rewrite rules. this tool only reports the root-level file. note that robots.txt governs crawling while `noindex` governs indexing: a page with `<meta name="robots" content="noindex">` that robots.txt allows is still fetched, and the noindex takes effect only once the crawler has read the page.
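
the sitemap-index caveat above is easy to see in a small sketch. it counts `<loc>` tags by regex, as described, and additionally flags a `<sitemapindex>` root; the `isIndex` field and the `summarise_sitemap` name are hypothetical and not part of the tool's output.

```python
import re

_LOC = re.compile(r"<loc\b[^>]*>", re.IGNORECASE)
_INDEX_ROOT = re.compile(r"<sitemapindex\b", re.IGNORECASE)


def summarise_sitemap(xml: str) -> dict:
    """Count <loc> tags without recursing into any sub-sitemaps."""
    return {
        "urlCount": len(_LOC.findall(xml)),
        # Hypothetical extra flag: True when the count is sub-sitemaps, not pages.
        "isIndex": _INDEX_ROOT.search(xml) is not None,
    }


index_xml = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/posts-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/pages-sitemap.xml</loc></sitemap>
</sitemapindex>"""

summary = summarise_sitemap(index_xml)
```

here `summary["urlCount"]` is 2 even though the two sub-sitemaps may list thousands of pages between them.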

FAQ

what's the difference between `og:*` and `twitter:*` tags?

`og:*` (OpenGraph) is the facebook-originated standard that most platforms (linkedin, slack, discord, whatsapp) now read. `twitter:*` (Twitter Cards) is twitter-specific. best practice: set both. twitter uses `twitter:*` if present and otherwise falls back to the matching `og:*` tag; most other platforms read only OG. to avoid drift, set `twitter:card` (for layout) and `twitter:site`, and let everything else come from OG.
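
the fallback order can be sketched as a small lookup. the mapping below covers only title, description, and image, and `resolve_twitter_field` is a hypothetical helper, not something the tool exposes.

```python
from typing import Dict, Optional

# Twitter field -> OpenGraph field consulted when the Twitter one is missing.
_FALLBACKS = {
    "twitter:title": "og:title",
    "twitter:description": "og:description",
    "twitter:image": "og:image",
}


def resolve_twitter_field(meta: Dict[str, str], field: str) -> Optional[str]:
    """Prefer the twitter:* value; otherwise use the mapped og:* value, if any."""
    value = meta.get(field)
    if value:
        return value
    fallback = _FALLBACKS.get(field)
    return meta.get(fallback) if fallback else None


# Tags taken from the firstdraftnews.org result shown earlier on this page.
meta = {
    "og:title": "Home",
    "og:site_name": "First Draft",
    "twitter:card": "summary_large_image",
    "twitter:site": "@firstdraftnews",
}
```

with this input, `twitter:title` resolves to the OG title `Home`, while `twitter:image` resolves to nothing because neither tag is set.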

why doesn't this tool follow `sitemap: <url>` lines inside robots.txt?

robots.txt may include one or more `Sitemap: https://…` directives pointing at non-standard sitemap locations. this tool does not parse those — it only fetches the conventional `/sitemap.xml`. so a site that publishes its sitemap at `/sitemap_google.xml` will render as `sitemap.xml: absent` here even though search engines find it just fine via robots.txt. a future refinement could parse robots and follow sitemap lines; for now the check is fixed-path.
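
for reference, the refinement mentioned above would start with something like the following sketch. it is not what the tool does today, and `sitemap_urls_from_robots` is a hypothetical name; it only collects the declared URLs and does not fetch them.

```python
from typing import List


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Collect the URLs of `Sitemap:` lines, in order, without duplicates."""
    urls: List[str] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        key, sep, value = line.partition(":")    # split at the first colon only
        url = value.strip()
        if sep and key.strip().lower() == "sitemap" and url and url not in urls:
            urls.append(url)
    return urls


# The robots.txt body from the firstdraftnews.org result shown earlier.
robots_txt = """Sitemap: https://firstdraftnews.org/sitemap.xml
Sitemap: https://firstdraftnews.org/news-sitemap.xml
# START YOAST BLOCK
# ---------------------------
User-agent: *
Disallow:

Sitemap: https://firstdraftnews.org/sitemap_index.xml
# ---------------------------
# END YOAST BLOCK
"""

declared = sitemap_urls_from_robots(robots_txt)
```

partitioning at the first colon keeps the `https://` in the value intact, and the field name is compared case-insensitively.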

how big is the response body I'm parsing?

the home page body is truncated to the first 64KB before head-parsing, so any head tag that appears after byte 65536 is silently missed. in practice heads are almost always in the first 8-16KB. the robots.txt body is truncated to 4KB for display purposes. the sitemap body is not truncated for parsing, because every `<loc>` tag has to be counted.
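
the effect of the 64KB cutoff can be reproduced with a short sketch. the regex and the `extract_title` helper are illustrative assumptions, not the tool's actual pattern.

```python
import html
import re
from typing import Optional

HEAD_LIMIT = 64 * 1024  # 65536 bytes

_TITLE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(body: bytes, limit: int = HEAD_LIMIT) -> Optional[str]:
    """Regex-extract <title> from the first `limit` bytes; None if not found."""
    text = body[:limit].decode("utf-8", errors="replace")
    match = _TITLE.search(text)
    if match is None:
        return None
    # Collapse internal whitespace and decode entities such as &amp;.
    return html.unescape(" ".join(match.group(1).split()))


early = b"<html><head><title> Example  Domain </title></head></html>"
# A page whose <title> only appears after roughly 70KB of inline padding.
late = b"<html><head><!-- " + b"x" * 70000 + b" --><title>Late</title></head></html>"
```

`extract_title(early)` finds the title, while `extract_title(late)` returns `None` unless the limit is raised past the padding.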

why is the whole check `error` if robots is missing, but `ok` if sitemap is missing?

it isn't: both robots and sitemap are treated as optional. a 404 on either is fine; the check stays `ok`. the check only errors if the *home page* fetch (`GET https://<domain>/`) itself fails: DNS failure, TCP refusal, or an error while reading the response body. in that case the other two signals are meaningless because we have no head to summarise.

why doesn't the tool send my cookies or custom user-agent?

we send a fixed `User-Agent: drwho-dossier/1.0 (+https://drwho.me)` and no cookies. sites that serve a different home page to logged-in users or that bot-detect on the UA will show you the anonymous-crawler view, which is also what search engines and social-preview bots see. so this view is what the public web sees, which is usually what you want for an SEO dossier.
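
an equivalent anonymous request can be built with the standard library, which is handy for reproducing what the tool sees. only the `User-Agent` string comes from this page; `build_request` is a hypothetical helper.

```python
import urllib.request

USER_AGENT = "drwho-dossier/1.0 (+https://drwho.me)"


def build_request(domain: str) -> urllib.request.Request:
    """Build the home-page GET with the fixed user-agent and no cookies."""
    return urllib.request.Request(
        f"https://{domain}/",
        headers={"User-Agent": USER_AGENT},
        method="GET",
    )


request = build_request("example.com")
# To send it: urllib.request.urlopen(request, timeout=5)
```

`urllib.request` attaches no cookies unless an opener with a cookie processor is installed, so the request goes out as an anonymous client.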

Related tools

dns records lookup — resolve A, AAAA, NS, SOA, CAA, and TXT records for a domain in one go.
security headers checker — inspect the response headers served at https://<domain>/ — HSTS, CSP, X-Frame-Options, etc.
redirect checker — trace the HTTP(S) redirect chain from https://<domain>/ up to 10 hops.
