I had to parse an access_log of a website in order to generate a sitemap. More precisely, a list of all URLs in the site. After playing around I found a solution using sed, grep, sort and uniq. The good thing is that each of these tools is available by default on most Linux distributions.
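For reference, a typical line in the log looks roughly like this (a hypothetical entry in Apache’s combined log format; the exact fields depend on your server’s LogFormat):
192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /blog/post-1?page=2 HTTP/1.1" 200 2326 "-" "Mozilla/5.0"
The part the following commands care about is the quoted request: the method, the URL and the protocol.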
I had the access log file under access_log (if you have it under a different name or location, just substitute it in the following code). My first attempt parsed out all the URLs which were accessed by POST or GET and sorted the output.
sed -r "s/.*(GET|POST) (.*?) HTTP.*/\2/" access_log | sort
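Run against a hypothetical log containing the sample line above plus a few more requests, this keeps only what the second capture group matched (the URL, query string still included) and sorts it:
/
/
/about
/blog/post-1?page=2
/blog/post-1?page=2
/contact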
After doing so, it turned out that I don’t need the query string (the part after the ‘?’ in the URL) and that I can discard URLs consisting only of ‘/’. So I altered the code to:
sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^\?]*?)(\?.*?)? HTTP.*/\2/" \
access_log | grep -v "^/$" | sort
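On the same hypothetical input, the second capture group now stops before the ‘?’, and grep -v "^/$" drops the bare ‘/’ requests, so the output shrinks to:
/about
/blog/post-1
/blog/post-1
/contact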
This time I also took care of URLs accessed by methods other than just POST and GET.
After I got this list, I thought it would be nice to have all the duplicate URLs stripped out. A quick search turned up a nice command-line utility called uniq that does just that and is part of the coreutils package.
sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^\?]*?)(\?.*?)? HTTP.*/\2/" \
access_log | grep -v "^/$" | sort | uniq
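One detail worth noting: uniq only collapses adjacent identical lines, which is why sort has to come before it in the pipeline. Since sort also has a -u option that drops duplicates itself, an equivalent, slightly shorter variant would be:
sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^\?]*?)(\?.*?)? HTTP.*/\2/" \
access_log | grep -v "^/$" | sort -u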
So the final solution uses sed to take out the URL part that I wanted, grep discards URLs consisting of only ‘/’, and sort and uniq sort the results and remove the duplicate lines.
It’s nice how one can integrate different command-line utilities to do this task in a one-liner.