A few days ago I was trying to explain to a colleague how useful a few shell commands can be, and today I came across a good example to illustrate the point. My problem was that I had 245 log files, each about 70-80MB in size and containing roughly 4 million lines. Each line in the log files uses the following (squid) format:
1378297522.050 4 111.222.111.222 TCP_MISS/200 2600 GET http://somewebsite.com/favicon.ico 12586072 DIRECT/111.222.111.222 text/html
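The part that matters here is the third whitespace-separated field, the client IP address, which is what the awk step in the script below picks out. As a quick one-off sanity check on a single day's file (the file name here is just an example following the naming pattern used in the script), something like this lists the distinct client IPs for that day:

zcat /var/log-archive/squid/2013/access_20130101_combined.log.gz | awk '{print $3}' | sort | uniq | head

Each line of that output is one distinct client IP; piping the full list through wc -l instead of head gives the count the script records.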
Now, what I wanted was to examine or graph the number of unique IP addresses seen in each day's log file, to get a rough idea of how many computers had been using the service each day. The reasoning: I wanted to check the effect of new computer deployments.
So to get the number of distinct IP addresses per day, a simple shell script produces CSV values I can import into a spreadsheet and graph.
#!/bin/bash
# Count distinct client IP addresses per day and emit CSV lines: YYYYMMDD,count
DIR="/var/log-archive/squid/2013"
MONTHS=("01" "02" "03" "04" "05" "06" "07" "08")

for MONTH in "${MONTHS[@]}"
do
    for DAY in $(seq -w 01 31)
    do
        MYDATE="$MONTH$DAY"
        LOGFILE="$DIR/access_2013${MYDATE}_combined.log.gz"
        # Skip dates with no log file (short months, gaps in the archive)
        if [ -f "$LOGFILE" ]
        then
            # Field 3 is the client IP; sort | uniq leaves the distinct addresses
            UNIQUE_IPS_SEEN=$(zcat "$LOGFILE" | awk '{print $3}' | sort | uniq | wc -l)
            echo "2013$MYDATE,$UNIQUE_IPS_SEEN"
        fi
    done
done
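Saved under some name (unique_ips.sh here is just a placeholder), running it and redirecting the output produces the CSV file, one date,count pair per line, ready to pull into a spreadsheet:

chmod +x unique_ips.sh
./unique_ips.sh > unique_ips_2013.csv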
So I reckon it would be hard to find a quicker, friendlier way to solve that problem.