Generating a URL List from an Access Log (access_log)

I had to parse the access_log of a website in order to generate a sitemap, or more precisely, a list of all URLs in the site. After playing around I found a solution using sed, grep, sort and uniq. The good thing is that each of these tools is available by default on most Linux distributions.

I had the access log file under access_log (if you have it under a different name/location, just substitute it in the following code). My first attempt parsed out all the URLs which were accessed via POST or GET and sorted the output.

sed -r "s/.*(GET|POST) (.*?) HTTP.*/\2/" access_log | sort
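To see what the expression extracts, here is a hypothetical log line in the usual Combined Log Format (the IP address, date and sizes are made up for illustration):

echo '127.0.0.1 - - [10/Oct/2008:13:55:36 +0000] "GET /about?ref=nav HTTP/1.1" 200 2326' | \
sed -r "s/.*(GET|POST) (.*?) HTTP.*/\2/"

This prints /about?ref=nav — the request path, with the query string still attached.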

After doing so, it turned out that I didn’t need the query string (the part after the ‘?’ in the URL) and that I could discard URLs consisting only of ‘/’. So I altered the code to be:

sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^\?]*?)(\?.*?)? HTTP.*/\2/" \
access_log | grep -v "^/$" | sort
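Running the same kind of hypothetical input through the new pipeline shows both changes at work:

printf '%s\n' \
'127.0.0.1 - - [10/Oct/2008:13:55:36 +0000] "GET / HTTP/1.1" 200 512' \
'127.0.0.1 - - [10/Oct/2008:13:55:37 +0000] "GET /about?ref=nav HTTP/1.1" 200 2326' | \
sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^\?]*?)(\?.*?)? HTTP.*/\2/" | grep -v "^/$" | sort

Only /about comes out: the query string falls outside the second capture group, and the bare ‘/’ line is filtered away by grep.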

This time I also took care of URLs accessed by methods other than just POST and GET.

After I got this list, I thought it would be nice to have all the duplicate URLs stripped out. A quick search turned up a nice command-line utility called uniq that does just that and is part of the coreutils package.
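One thing worth knowing about uniq: it only collapses adjacent duplicate lines, which is why the sort has to come before it in the pipeline. A quick illustration with made-up paths:

printf '/a\n/b\n/a\n' | uniq
printf '/a\n/b\n/a\n' | sort | uniq

The first command prints /a, /b, /a — the duplicates aren’t adjacent, so nothing collapses. The second prints just /a and /b. Note that uniq -u would do something different: it prints only lines that are not repeated at all, so any URL accessed more than once would vanish from the list entirely — not what you want for a sitemap.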

sed -r "s/.*(GET|POST|HEAD|PROPFIND) ([^\?]*?)(\?.*?)? HTTP.*/\2/" \
access_log | grep -v "^/$" | sort | uniq

So the final solution uses sed to extract the URL part that I wanted, grep discards URLs consisting only of ‘/’, and sort and uniq sort the results and collapse all the duplicate lines into a single entry each.
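Since sort | uniq is such a common pair, sort -u does the same thing in one step. And just as a sketch of an alternative approach, the same extraction could be done with awk alone, assuming the standard Combined Log Format where the method is field 6 (with a leading quote) and the path is field 7:

awk '$6 ~ /^"(GET|POST|HEAD|PROPFIND)$/ { sub(/\?.*/, "", $7); if ($7 != "/") print $7 }' \
access_log | sort -u

Here awk matches the quoted method in field 6, strips the query string from field 7 with sub(), skips the bare ‘/’, and sort -u deduplicates the output.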

It’s nice how one can integrate different command-line utilities to do a task like this in a one-liner.
