HTTPS surface reachable (robots ✓, sitemap ✗, title ✓)
Why it matters: Public files such as robots.txt, sitemap.xml, and the head meta tags are among the first things an attacker reads during reconnaissance. Paths advertised by mistake, stale sitemaps, and verbose generator tags leak more than intended (ISO 27001 A.8.9).
robots.txt
present
# NINE ENTERTAINMENT CO. POLICY STATEMENT
# Nine Entertainment Co expressly prohibits the use of any Nine
# content or data, including associated metadata, for any machine
# learning and/or artificial intelligence including for the purposes
# of training or development of AI technology, tools and machine
# learning language models.
# View our terms of use - https://login.nine.com.au/terms?client_id=smh
# Sitemaps
Sitemap: https://www.smh.com.au/sitemaps/news/brands/smh
Sitemap: https://www.smh.com.au/sitemaps/smh-sitemaps-videos.xml
Sitemap: https://www.smh.com.au/sitemaps/smh-sitemaps-articles.xml
Sitemap: https://www.smh.com.au/rss/feed.xml
# -----------------------------------------------------------------
# 1. GENERAL CRAWLER RULES (Allowing standard search engines)
# -----------------------------------------------------------------
# All visitors
User-agent: *
Allow: /
Disallow: /search?text=*
Disallow: *?app=*
Disallow: *?do=*
Disallow: *?ocid=*
Disallow: *?ref=*
# -----------------------------------------------------------------
# 2. SPECIFIC BLOCKS FOR AI, LLM, AND DATA-SCRAPING AGENTS
# -----------------------------------------------------------------
##########
# Google AI Agents (Allows standard Googlebot to continue crawling)
User-agent: Google-CloudVertexBot
Disallow: /
User-agent: Google-Extended
Disallow: /
##########
# OpenAI
User-agent: ChatGPT-User
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: OAISearch
Disallow: /
##########
# Anthropic
User-agent: anthropic-ai
Disallow: /
User-agent: claude-web
Disallow: /
User-agent: claudebot
Disallow: /
##########
# Meta (Facebook/LLaMA)
User-agent: facebookbot
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: meta-externalfetcher
Disallow: /
##########
# Apple
User-agent: applebot-extended
Disallow: /
##########
# Perplexity AI
User-agent: perplexitybot
Disallow: /
##########
# Cohere
User-agent: cohere-ai
Disallow: /
##########
# You.com
User-agent: youbot
Disallow: /
##########
# Amazon
User-agent: amazonbot
Disallow: /
##########
# Alibaba Cloud
User-agent: aliyunsecbot
Disallow: /
##########
# Audigent
User-agent: audigentadbot
Disallow: /
##########
# Awario
User-agent: awariorssbot
Disallow: /
User-agent: awariosmartbot
Disallow: /
##########
# BLEX AI
User-agent: blexbot
Disallow: /
##########
# ByteDance
User-agent: bytespider
Disallow: /
##########
# Common Crawl
User-agent: ccbot
Disallow: /
##########
# DataForSEO
User-agent: dataforseobot
Disallow: /
##########
# Diffbot
User-agent: diffbot
Disallow: /
##########
# DuckDuckGo
User-agent: duckassistbot
Disallow: /
##########
# Echobox
User-agent: echoboxbot
Disallow: /
##########
# Friendly Technologies
User-agent: friendlycrawler
Disallow: /
##########
# Internet Archive / "Wayback Machine"
User-agent: ia_archiver
Disallow: /
##########
# ImageSift
User-agent: imagesiftbot
Disallow: /
##########
# MyCentralAI
User-agent: mycentralaiscraperbot
Disallow: /
##########
# NewsNow
User-agent: newsnow
Disallow: /
##########
# News-Please (Open-source)
User-agent: news-please
Disallow: /
##########
# Omgili
User-agent: omgili
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: webzio-extended
Disallow: /
##########
# Peer39
User-agent: peer39_crawler
Disallow: /
User-agent: peer39_crawler/1.0
Disallow: /
##########
# QuillBot
User-agent: quillbot.com
Disallow: /
##########
# Quora
User-agent: quora-bot
Disallow: /
##########
# Scrapy (Open-source)
User-agent: scrapy
Disallow: /
##########
# Seekr
User-agent: seekrbot
Disallow: /
##########
# Seznam.cz
User-agent: seznamhomepagecrawler
Disallow: /
##########
# TaraGroup
User-agent: taragroup intelligent bot
Disallow: /
##########
# Timpi
User-agent: timpibot
Disallow: /
##########
# Turnitin
User-agent: turnitinbot
Disallow: /
##########
# Others
User-agent: viennatinybot
Disallow: /
User-agent: jetslide
Disallow: /
User-agent: magpie-crawler
Disallow: /
User-agent: poseidon research crawler
Disallow: /
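The directives above can be summarized programmatically. What follows is a minimal sketch, assuming Python 3.8+ and its standard `urllib.robotparser` module; the `SAMPLE` string is a shortened excerpt of the file above, and `summarize_robots` is a hypothetical helper name, not part of the scanner. The standard-library parser matches path prefixes only and does not interpret the `*` wildcards in rules such as `Disallow: *?ref=*`, so this sketch is suited to checking whole-site blocks rather than wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the robots.txt shown above (illustrative only).
SAMPLE = """\
Sitemap: https://www.smh.com.au/sitemaps/smh-sitemaps-articles.xml
Sitemap: https://www.smh.com.au/rss/feed.xml
User-agent: *
Allow: /
Disallow: /search?text=*
User-agent: GPTBot
Disallow: /
User-agent: ccbot
Disallow: /
"""


def summarize_robots(robots_text, agents, probe_url):
    """Return advertised sitemaps and, per agent, whether probe_url may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap directive is present.
    sitemaps = parser.site_maps() or []
    allowed = {agent: parser.can_fetch(agent, probe_url) for agent in agents}
    return sitemaps, allowed


sitemaps, allowed = summarize_robots(
    SAMPLE, ["Googlebot", "GPTBot", "CCBot"], "https://www.smh.com.au/"
)
print(sitemaps)
print(allowed)
```

Agent names are matched case-insensitively, so `CCBot` resolves to the `ccbot` group. Agents without a dedicated group fall back to the `User-agent: *` rules.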
head
- title: Australian Breaking News Headlines & World News Online | SMH.com.au
- description: (none)
social
no OpenGraph or Twitter meta tags found
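The head and social findings above could be reproduced with a check along the following lines. This is a sketch only, assuming Python's standard `html.parser` module; `HeadMetaAudit`, `audit_head`, and the `PAGE` string are hypothetical names and sample data, not the scanner's actual implementation.

```python
from html.parser import HTMLParser


class HeadMetaAudit(HTMLParser):
    """Collect the <title> text and all named <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}  # lowercased name/property -> content
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # OpenGraph uses property=..., most other tags use name=...
            key = attrs.get("name") or attrs.get("property")
            if key:
                self.meta[key.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def audit_head(html):
    """Report title, description, generator, and any og:/twitter: tags."""
    audit = HeadMetaAudit()
    audit.feed(html)
    audit.close()
    social = sorted(k for k in audit.meta if k.startswith(("og:", "twitter:")))
    return {
        "title": audit.title.strip() or None,
        "description": audit.meta.get("description") or None,
        "generator": audit.meta.get("generator") or None,
        "social": social,
    }


# Hypothetical page mirroring the findings: a title, no description, no social tags.
PAGE = "<html><head><title>Example title</title></head><body></body></html>"
result = audit_head(PAGE)
print(result)
```

A `None` value for `description` and an empty `social` list correspond to the missing description and the absent OpenGraph or Twitter tags reported above. A non-empty `generator` value would be worth reviewing for version disclosure.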