
Thursday, July 17, 2014

New in Electra: RIDE

I have just added the following digital resource to the Electra Atlantis feed aggregator:

title = RIDE: A review journal for digital editions and resources
url = http://ride.i-d-e.de/
creators = Alexander Czmiel, et al. (eds.)
license = http://creativecommons.org/licenses/by/4.0/
description = RIDE is a review journal dedicated to digital editions and resources. RIDE aims to direct attention to digital editions and to provide a forum in which expert peers criticise and discuss the efforts of digital editors in order to improve current practices and advance future developments. It will do so by asking its reviewers to pay attention not only to the traditional virtues and vices of any edition, but also to the progressing methodology and its technical implications.
feed = http://ride.i-d-e.de/feed/

Wednesday, July 16, 2014

New in Electra: EpiDoc Workshop

I have just added the following blog to the Electra Atlantis feed aggregator:

title = EpiDoc workshop
url = http://epidocworkshop.blogspot.co.uk/
creators = Simona Stoyanova, et al.
description = Share markup examples; give and receive feedback
keywords = EpiDoc, epigraphy, inscriptions, XML, TEI
feed = http://epidocworkshop.blogspot.com/feeds/posts/default?alt=rss

Monday, July 14, 2014

Hacking on Apache Log Files with Python

There are plenty of tools out there for doing web request analysis. I wanted to pull some information out of some Apache log files without all that overhead. Here's what I did:

I got Rory McCann's apache-log-parser (after some googling; it's on GitHub and on PyPI). I set up a Python virtual environment using Doug Hellmann's virtualenvwrapper, activated it, and then ran:

 pip install apache-log-parser  

Since I'd never used apache-log-parser before, I had to get familiar with it. I discovered that it needs to know the format string Apache was using to log requests for my site. Figuring that string out was a two-step process, worked out by reading the Log Files section of the Apache documentation and poking about with grep.

First, I searched in the Apache configuration files for the CustomLog directive that's associated with the virtual host I wanted to analyze. This gave me a 'nickname' for the log configuration. More spelunking in Apache config files -- this time in the main configuration file -- turned up the definition of that nickname (Apache uses the LogFormat directive for this purpose):

 $ cd /etc/apache2/  
 $ grep CustomLog sites-enabled/foo.nowhere.org  
  CustomLog /var/log/apache2/foo.nowhere.org-access.log combined  
 $ grep combined apache2.conf   
 LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined  

It's that LogFormat string that needs to be given to Rory's code so it knows how to parse your log files.
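Just to make that concrete, here's a minimal sketch of handing that format string to apache-log-parser (the log line and the values in the comments are invented for illustration):

 import apache_log_parser  
   
 # the 'combined' LogFormat string recovered above, passed verbatim to make_parser  
 combined = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""  
 parse_line = apache_log_parser.make_parser(combined)  
   
 # an invented request, just to show what the parser hands back  
 sample = ('66.249.1.1 - - [14/Jul/2014:09:00:00 -0400] '  
           '"GET /index.html HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"')  
 parsed = parse_line(sample)     # a dict keyed by field name  
 print(parsed['remote_host'])    # 66.249.1.1  
 print(parsed['status'])         # 200  
 print(parsed['request_url'])    # /index.html  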

After some experimenting in the Python interpreter to get a feel for the code and its capabilities, I wrote a few lines of my own to wrap the setup and file reading operations:

 #!/usr/bin/env python  
 # -*- coding: utf-8 -*-  
   
 import apache_log_parser  
 import glob  
 import logging  
   
 # supported log file formats  
 APACHE_COMBINED = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""  
 APACHE_COMMON = "%h %l %u %t \"%r\" %>s %b"  
   
 def gulp(log_file_path, pattern=APACHE_COMBINED):  
     """Import and parse all log files matching log_file_path."""  
     log_data = []  
     line_parser = apache_log_parser.make_parser(pattern)  
     for file_name in glob.glob(log_file_path):  
         logging.info("file_name: %s" % file_name)  
         with open(file_name, 'r') as log_file:  
             lines = log_file.readlines()  
         logging.info(" read %s lines" % len(lines))  
         for line in lines:  
             log_data.append(line_parser(line))  
     logging.info("total number of events parsed: %s" % len(log_data))  
     return log_data  

For this particular server, I had multiple log files, but I wanted to have all the requests from all of them (parsed into dictionaries by Rory's code) in a single list for subsequent analysis. So, back to the Python interpreter:

 >>> import logging  
 >>> logging.basicConfig(level=logging.INFO)  
 >>> import loggulper  
 >>> d = loggulper.gulp("/path/to/my/log/files/foo.nowhere.org-access.*")  

I'll spare you the logging messages. This took several minutes. I ended up with about 1.5 million requests in the list. Real life intervened. How to save this data for later without having to run through the whole import process again? Pickle it.

 >>> import cPickle as pickle  
 >>> out_name = "logdata.pickle"  
 >>> outf = open(out_name, 'wb')  
 >>> pickler = pickle.Pickler(outf, pickle.HIGHEST_PROTOCOL)  
 >>> pickler.dump(d)  
 <cPickle.Pickler object at 0x1044044e8>  
 >>> outf.close()  

The whole list was saved to a 933.3 MB file in just a few seconds (full disclosure: I have a solid-state drive). It was nearly as quick to read back in a couple of days later (new interpreter session and all):

 >>> import cPickle as pickle  
 >>> in_name="logdata.pickle"  
 >>> inf=open(in_name, 'rb')  
 >>> unpickler=pickle.Unpickler(inf)  
 >>> d=unpickler.load()  
 >>> len(d)  
 1522015  
 >>> d[0].keys()  
 ['status', 'request_header_referer', 'remote_user', 'request_header_user_agent__browser__family', 'request_header_user_agent__is_mobile', 'request_header_user_agent__browser__version_string', 'request_header_user_agent', 'request_http_ver', 'request_header_user_agent__os__version_string', 'remote_logname', 'time_recieved_isoformat', 'time_recieved', 'request_first_line', 'request_header_user_agent__os__family', 'request_method', 'request_url', 'remote_host', 'time_recieved_datetimeobj', 'response_bytes_clf']  

It's important to notice at this point that the word "received" is misspelled "recieved" in the keys of the dictionaries returned by apache-log-parser. If you don't notice this early on, it will cause some amount of frustration.
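If the misspelling bothers you, one option (a sketch of my own; the snippets below stick with the library's original spelling) is to add correctly spelled aliases alongside the originals:

 # optional: add correctly spelled aliases for the 'recieved' keys;  
 # the rest of this post keeps using the library's original spelling  
 def alias_received_keys(requests):  
     for req in requests:  
         for key in list(req.keys()):  
             if 'recieved' in key:  
                 req[key.replace('recieved', 'received')] = req[key]  
     return requests  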

It turned out that my log data included events past the end of the reporting period I'd been given (which ended 31 May 2014), so I needed to filter the list down to just those requests that fell within the period. Python list comprehensions to the rescue:

 >>> dates=[req['time_recieved_datetimeobj'] for req in d]  
 >>> max(dates)  
 datetime.datetime(2014, 7, 13, 11, 37, 31)  
 >>> min(dates)  
 datetime.datetime(2013, 7, 21, 3, 41, 26)  
 >>> from datetime import datetime  
 >>> d_relevant=[req for req in d if req['time_recieved_datetimeobj'] < datetime(2014, 6, 1)]  
 >>> dates=[req['time_recieved_datetimeobj'] for req in d_relevant]  
 >>> max(dates)  
 datetime.datetime(2014, 5, 31, 23, 59, 17)  
 >>> min(dates)  
 datetime.datetime(2013, 7, 21, 3, 41, 26)  

Now to separate requests made by self-identified bots and spiders from the rest of the traffic:

 >>> robots=[req for req in d_relevant if 'bot' in req['request_header_user_agent'].lower() or 'spider' in req['request_header_user_agent'].lower()]  
 >>> len(robots)  
 848450  
 >>> humans=[req for req in d_relevant if 'bot' not in req['request_header_user_agent'].lower() and 'spider' not in req['request_header_user_agent'].lower()]  
 >>> len(humans)  
 486249  
 >>> len(robots) + len(humans) == len(d_relevant)  
 True  
 >>> unique_bots=[]  
 >>> for bot in robots:  
 ...   if bot['request_header_user_agent'] not in unique_bots:  
 ...     unique_bots.append(bot['request_header_user_agent'])  
 ...   
 >>> len(unique_bots)  
 229
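I could have done the same thing with a set comprehension, which skips the repeated membership tests; a quick sketch of that alternative:

 # same result as the loop above, via a set of the user-agent strings  
 unique_bot_agents = {req['request_header_user_agent'] for req in robots}  
 len(unique_bot_agents)  # 229, same as len(unique_bots)  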

Aside: yes, I know there could well still be automated agents in the "humans" list; I've only filtered out the ones that identify themselves as bots or spiders, so anything run by the sneaky or the uninformed is still in there. Let's not worry about that issue for now.

Stay tuned for the next installment, wherein I hope we actually learn something about how our server is being used.