Stone Steps Inc.

Article ID: Q20051226-01

Q: How does Google Analytics compare to log file analysis tools, such as Stone Steps Webalizer?

A: Website usage reports generated by different analysis tools differ so much from each other that it only makes sense to compare the methods of collecting website usage data, not the actual reports. Let's first review how usage data is collected in each case.

Log Files

All servers, including web servers, produce log files that are usually used for security and traffic analysis purposes. A typical sequence of requests and responses that results in an entry being added to the log file is shown in the picture below. At some point the log file is processed with a log file analysis tool, such as Stone Steps Webalizer, to generate website traffic analysis reports.

Log file formats differ greatly from server to server, but most log files contain information that identifies the visitor (e.g. IP address, user name, cookies), the requested resource (e.g. the request URL, including the query string), the type of the user agent (e.g. browser or spider type), the referring page (e.g. the URL of a page containing a link to the requested page), and some other data, such as the request method, request processing time, response size, etc.
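As an illustration of how these fields appear in practice, the sketch below parses one line of the Apache "combined" log format. The format choice, the parseLogLine function name and the sample line are assumptions made for this example; other servers log different fields in a different order, and this is not how Stone Steps Webalizer itself is implemented.

```javascript
// One line of the Apache "combined" log format (an assumed format):
// host ident user [time] "method url protocol" status size "referrer" "user agent"
const COMBINED_LOG = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) ?(\S*)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$/;

// Hypothetical helper: returns an object with the logged fields,
// or null if the line does not match the expected format.
function parseLogLine(line) {
  const m = COMBINED_LOG.exec(line);
  if (m === null) return null;
  return {
    ip: m[1],
    userName: m[3] === '-' ? null : m[3],
    timeStamp: m[4],
    method: m[5],
    url: m[6],
    protocol: m[7],
    status: Number(m[8]),
    responseSize: m[9] === '-' ? 0 : Number(m[9]),
    referrer: m[10] === '-' ? null : m[10],
    userAgent: m[11],
  };
}

// Made-up sample entry for demonstration purposes.
const sample = '192.0.2.10 - alice [26/Dec/2005:10:15:30 -0500] ' +
  '"GET /page.php?query HTTP/1.1" 200 5120 ' +
  '"http://www.google.ca/search?q=search+phrase" ' +
  '"Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/1.5"';

console.log(parseLogLine(sample));
```

Every field in the returned object comes from what the server itself observed, which is why it is available even when the visitor's browser runs no scripts at all.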

Client-Side Scripting

Google Analytics, on the other hand, uses client-side scripting to track the activity of a website. In the typical scenario shown in the picture below, a visitor requests a page from the monitored website (1) and receives back a page (2) with an embedded link to a JavaScript file located on one of the Google Analytics servers. The browser requests the script file (4) and executes the code inside. The JavaScript code forms a brand-new HTTP request for a tiny image file, embeds all analysis data into the image URL and sends this request to one of the Google Analytics servers (5). The server that receives this request decodes the submitted data and updates the database for the website identified in the request (6).
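The image-request technique can be sketched in a few lines. This is a minimal illustration of the general idea, not the actual Google Analytics code: the buildBeaconUrl function, the stats.example.com host and the parameter names are all hypothetical.

```javascript
// Hypothetical helper: packs collected values into the query string
// of an image URL, percent-encoding each name and value.
function buildBeaconUrl(baseUrl, data) {
  const query = Object.keys(data)
    .map((key) => encodeURIComponent(key) + '=' + encodeURIComponent(String(data[key])))
    .join('&');
  return baseUrl + '?' + query;
}

// Made-up values; a real script would read them from the browser
// (screen size, language, current page, and so on).
const beaconUrl = buildBeaconUrl('http://stats.example.com/hit.gif', {
  sr: '1024x768',
  ul: 'en-us',
  p: '/page.php?query',
  ac: 'XY-12345-6',
});
console.log(beaconUrl);

// In a browser, assigning the URL to an image object triggers the
// HTTP request; outside of a browser this step is skipped.
if (typeof Image !== 'undefined') {
  new Image().src = beaconUrl;
}
```

Note that the data only reaches the collecting server if the browser runs the script and is allowed to load the image.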

Google Analytics

In order for this technology to work, all pages of the monitored website must embed a link to the JavaScript file located on one of the Google Analytics servers and a few lines of JavaScript code that activate the functionality of the linked script. A simplified version of such a link, along with the related JavaScript code, is shown below.

<script src="http://www.google-analytics.com/urchin.js" type="text/javascript"></script>
<script type="text/javascript">
_uacct = "XY-12345-6";
urchinTracker();
</script>

The URL generated by the JavaScript code, decoded and simplified for display purposes, is shown below. The actual URL contains additional bits of data that make it more difficult to alter the URL data or generate an arbitrary URL.

/__utm.gif?
    utmwv=1                                              -- version
    utmsr=1024x768                                       -- screen resolution
    utmsc=32-bit                                         -- color depth
    utmul=en-us                                          -- browser language
    utmje=0                                              -- JavaScript flag
    utmfl=8.0 r22                                        -- Flash flag
    utmhn=host-name                                      -- web site host name
    utmr=http://www.google.ca/search?q=search+phrase     -- referrer
    ie=utf-8                                             -- character set
    oe=utf-8                                             -- character set
    client=firefox-a                                     -- user agent type
    rls=org.mozilla:en-US:official                       -- user agent release
    utmp=/page.php?query                                 -- requested URL
    utmac=XY-12345-6                                     -- account ID

Because this URL points to an image file located at www.google-analytics.com, it is easy to stay invisible to Google Analytics by blocking requests for this file or to this domain.

Usage Data

Website traffic analysis can only be as good as the logged data. The table below lists side by side the data items that can be collected by each of the described methods. A blue checkmark indicates that the data item may be collected by the respective method. A gray checkmark indicates that, while it is technically possible to collect the data item, some additional work is required to do so. A missing checkmark indicates that the data item cannot be collected by the respective method.

Log File    Client-Side Script
Time Stamp
Client IP Address
User Name 
User Agent
Referrer
HTTP Method 
HTTP Status 
URL
Cached URL
Non-page URL 
TCP Port
Request Size 
Response Size 
Processing Time 
Screen Resolution
Color Depth
JavaScript Detection
Flash Detection
Language Detection

In general, web server log files contain information about the actual traffic served by the web server, while the method based on client-side scripting collects data about the actual traffic generated by the client. These two types of traffic are not the same because of the various caching devices and/or components located between the client and the server (e.g. if the requested page is served from the browser cache, it will not be logged, but the script on the page will still be executed).

Stone Steps Webalizer does not provide any support for processing data items marked with gray checkmarks beyond the standard analysis functionality. For example, suppose a page (e.g. /page.html) contains a usage analysis script that sends a special request to the website monitored with Stone Steps Webalizer (e.g. /__a.gif?url=/page.html&lang=ja). This script-generated URL will be logged by the web server and processed by Stone Steps Webalizer as a standard URL. No additional correlation (i.e. that /page.html was requested by a Japanese visitor) will be made between the original page and the client-script-provided information about this page.
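The "additional work" needed for such a correlation could look like the sketch below. It is a hypothetical post-processing step over logged URLs, not a Stone Steps Webalizer feature; the correlateBeacons function name is an assumption, and the /__a.gif path with its url and lang parameters is taken from the example above.

```javascript
// Hypothetical post-processing step: counts visitor languages per page
// from script-generated /__a.gif requests found among logged URLs.
function correlateBeacons(loggedUrls) {
  const languagesByPage = {};
  for (const logged of loggedUrls) {
    // The base is only needed so that URL can parse a relative path.
    const url = new URL(logged, 'http://localhost');
    if (url.pathname !== '/__a.gif') continue; // an ordinary request
    const page = url.searchParams.get('url');
    const lang = url.searchParams.get('lang');
    if (page === null || lang === null) continue; // incomplete beacon
    const counts = languagesByPage[page] || (languagesByPage[page] = {});
    counts[lang] = (counts[lang] || 0) + 1;
  }
  return languagesByPage;
}

// Made-up list of logged request URLs.
const report = correlateBeacons([
  '/page.html',
  '/__a.gif?url=/page.html&lang=ja',
  '/__a.gif?url=/page.html&lang=ja',
  '/__a.gif?url=/page.html&lang=en',
]);
console.log(report);
```

Since the values in the beacon URL are supplied by the client, a report built this way is only as trustworthy as the scripts that sent them.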

Invisibility

One important quality of server-side logging is that all website activity is always recorded. Of course, a visitor can disguise information that may help the website administrator identify the visitor's whereabouts or browser configuration (e.g. the IP address, browser type, the referring page, etc.), but the administrator will still always know whether somebody is trying to break into the forum administrative interface, hot-link website images or scrape some pages, and can take preventive measures.

Client-side scripting, on the other hand, does not have this quality. In fact, one does not even need any special software to stay completely invisible to Google Analytics and other client-side scripting-based analysis engines that use fake images to deliver usage data: simply checking the "Load images ... for the originating website only" checkbox in the Firefox configuration (Tools > Options > Content) does the trick.

With this in mind, keep website logging turned on at all times, even if the logs are not being analyzed on a day-to-day basis.
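As one example of what server logs make possible, the sketch below flags likely image hot-linking. It assumes the log entries have already been parsed into objects with url and referrer properties; the findHotLinks function and the host names are hypothetical. A request with a missing or forged referrer would not be flagged, so this is a rough indicator rather than a reliable detector.

```javascript
// Hypothetical check: returns the entries that request an image while
// the referring page belongs to a host other than the site's own.
function findHotLinks(entries, ownHost) {
  const imagePattern = /\.(gif|jpe?g|png)$/i;
  return entries.filter((entry) => {
    if (!entry.referrer) return false; // nothing to compare against
    let path;
    let refHost;
    try {
      path = new URL(entry.url, 'http://' + ownHost).pathname;
      refHost = new URL(entry.referrer).hostname;
    } catch (err) {
      return false; // malformed URL or referrer in the log
    }
    return imagePattern.test(path) && refHost !== ownHost;
  });
}

// Made-up parsed log entries.
const hotLinked = findHotLinks([
  { url: '/images/logo.png', referrer: 'http://www.example.com/index.html' },
  { url: '/images/photo.jpg', referrer: 'http://forum.example.net/thread/42' },
  { url: '/page.html', referrer: 'http://forum.example.net/thread/42' },
  { url: '/images/photo.jpg', referrer: null },
], 'www.example.com');
console.log(hotLinked);
```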

Conclusion

Any client-side technology should be considered unreliable by definition, regardless of whether it concerns form validation or website usage analysis. Any client-side technology that is not backed by some form of server-side support (e.g. server-generated digital signatures) should be trusted even less and should be used for entertainment purposes only. Granted, some client-side scripting website usage analysis tools, such as Google Analytics, may provide quite interesting information about website visitors. However, such tools should be considered a useful addition to the usual everyday log-based analysis, not a replacement for it.