Handy commands with notes on how to use them when logged into the search server's terminal.
We now report more actionable failures in the scrape logs. We've had this for three days, so we look at the last 12 logs and tally failures by site.
cd logs
cat `ls -t | head -12` |\
  grep sitemap: |\
  sort | uniq -c | sort -n
Look for retired sites that somehow got indexed again. Maybe we can catch this when it happens?
ls retired |\
  while read r; do ls -ld sites/$r; done 2>/dev/null
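One way to catch it as it happens: a small check that cron can run periodically. This is a sketch, not something already on the server; the retired/ and sites/ paths are the same ones used above.

```shell
# check_retired: print any retired site that still exists under sites/,
# and return nonzero when one is found. Sketch only; assumes the
# retired/ and sites/ layout used in the commands above.
check_retired() {
  found=0
  for r in $(ls retired); do
    if [ -e "sites/$r" ]; then
      echo "retired site re-indexed: $r"
      found=1
    fi
  done
  return $found
}
```

Since cron mails any output of a job to its owner by default, just printing the offenders on stdout is enough to get alerted.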
Leading spaces in a title show up as files whose names look like command-line options in the shell. A prevalent problem: 157 such pages.
cd sites
ls -d */pages/-* | wc -l
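Touching these files individually is its own hazard, because the shell hands the leading dash to the utility as an option. A `--` separator (or a `./` prefix) keeps the name from being parsed that way; a quick demonstration with a throwaway name:

```shell
touch ./-leading-dash      # plain `touch -leading-dash` would be read as options
ls -ld -- -leading-dash    # `--` marks the end of options
rm -- -leading-dash        # same trick for cleanup
```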
We find the top ten of the 35 sites exhibiting the problem.
ls -d */pages/-* |\
  cut -d '/' -f1 |\
  sort | uniq -c | sort -nr
 69 hexa.viki.wiki
 16 wiki.ralfbarkow.ch
 14 lua.dojo.fed.wiki
 12 dreyeck.ch
  9 found.ward.bay.wiki.org
  3 don.noyes.asia.wiki.org
  2 uvp.viki.wiki
  2 roots.ward.bay.wiki.org
  2 marc.tries.fed.wiki
  2 lfi.wiki.dbbs.co
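To confirm the stored titles really do begin with a space (and not just the filenames), the page JSON can be grepped directly. A sketch, run from the sites directory as above, and assuming each page file is a JSON document with a "title" field:

```shell
# Pull the title strings out of the dash-named page files and tally them.
# Assumes each page is JSON with a "title" field.
grep -h -o '"title" *: *"[^"]*"' */pages/-* | sort | uniq -c | sort -nr | head
```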