A Comparative Analysis of Browsing Behavior of Human Visitors and Automatic Software Agents

3.1. Data Set Description

Web server access logs: A web log file consists of four kinds of records: access log, error log, referrer log and agent log. Web log files usually come in one of two formats, the common log format (CLF) and the extended log format (ELF). A web server access log stores the detailed information of each request made from users' web browsers to the web server in chronological order ^[18]. An example of a classic web server log entry is given below, and its fields are briefly described in Table 1.

11.111.11.111 - - [15/Dec/2013:00:01:02 -0800] "GET /forum/member.php?45067-Carla-Zenis&tab=activitystream&type=all HTTP/1.1" 200 10463 "http://www.google.com/bot.html" "Mozilla/5.0 (compatible; Googlebot/2.1)"
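A minimal sketch of how such an entry could be split into its fields is given below. It assumes the entry follows the widely used combined log layout (host, identity, user, timestamp, request line, status, size, referrer, user agent). The pattern and the `parse_entry` helper are illustrative and are not the customized program used in this study.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method path protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Return a dict of fields for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamp such as 15/Dec/2013:00:01:02 -0800 (month names assume an English locale).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means that no body was returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = (
    '11.111.11.111 - - [15/Dec/2013:00:01:02 -0800] '
    '"GET /forum/member.php?45067-Carla-Zenis&tab=activitystream&type=all HTTP/1.1" '
    '200 10463 "http://www.google.com/bot.html" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1)"'
)
entry = parse_entry(sample)
```

Lines that do not match the assumed layout return `None`, so malformed records can be counted or skipped separately rather than silently distorting the statistics.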

Table 1. Brief description of web log entry headers

3.2. Pre-Processing

The log was parsed and pre-processed with a customized program based on an open source tool ^[19]. From the log we extracted statistical information such as the number of requests, request duration, number of users, page hits, domains and countries of visiting hosts, operating systems and browsers used, robot activity and HTTP errors.

3.3. Identification of Robot and Human Sessions

Web robot sessions were extracted with a multi-step approach. In the first step we applied the well-known heuristics proposed in ^[2], but this left a number of robot sessions unidentified. In the second step, our log analyzer therefore used a database of the IP addresses and user agent fields of well-known bots ^{[20, 21]}. If a session's IP address or user agent matches the IP address or user agent of a well-known crawler, the session is labelled as a web robot session. This yields a sample of robot requests large enough to infer significant trends. Human visitors' sessions were identified using time-oriented heuristics, of which there are two types: the session-duration heuristic and the page-stay time heuristic. These two heuristics can be used individually or jointly to segment the user activity log into sessions. Each heuristic h scans the user activity logs into which the web server log is partitioned after user identification, and outputs a set of constructed sessions:

• h1 (session-duration heuristic): The total session duration may not exceed a threshold θ. Given t₀, the timestamp of the first request in a constructed session S, a request with timestamp t is assigned to S iff t − t₀ ≤ θ. The threshold used in practice varies from 25.5 minutes to 24 hours, with 30 minutes being the most commonly used default timeout for session duration ^[22]. This information is summarized in Table 2.
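The two labelling steps described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: the bot tokens and the IP address are hypothetical placeholders for the databases of well-known bots, and `is_known_robot` and `sessionize` are names introduced here, not part of the log analyzer used in the study.

```python
from datetime import datetime, timedelta

# Hypothetical placeholders for the databases of well-known bots.
KNOWN_BOT_AGENT_TOKENS = ("googlebot", "bingbot", "slurp")
KNOWN_BOT_IPS = {"192.0.2.10"}  # documentation-range address, for illustration only


def is_known_robot(ip, user_agent):
    """Label a session as a robot if its IP or user agent matches a known crawler."""
    agent = user_agent.lower()
    return ip in KNOWN_BOT_IPS or any(token in agent for token in KNOWN_BOT_AGENT_TOKENS)


def sessionize(timestamps, theta=timedelta(minutes=30)):
    """Apply h1 to one user's requests: t joins session S iff t - t0 <= theta."""
    sessions = []
    for t in sorted(timestamps):
        # sessions[-1][0] is t0, the first request of the current session.
        if sessions and t - sessions[-1][0] <= theta:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


base = datetime(2013, 12, 15, 0, 0, 0)
requests = [base + timedelta(minutes=m) for m in (0, 10, 30, 31, 70)]
sessions = sessionize(requests)
```

With the default θ of 30 minutes, the requests at minutes 0, 10 and 30 fall into one session, while the requests at minutes 31 and 70 each open a new one, because h1 measures elapsed time from the first request of the session rather than from the previous request.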

Table 2. Web server access log of twenty days' duration

4. Experiments

In this section we perform several experiments to compare the access behaviour of human visitors and web robots. As shown in Table 2, the web server logs contain a very large number of requests, users and sessions. To capture a microscopic view and avoid processing overhead, we use a sample of this log for the comparisons.

4.1. Experiment-I

Comparison of resource acquisition patterns: Here we analyze and compare the percentage of requests, percentage of visitors and percentage of bandwidth consumed by humans and robots for each specific type of resource. The most striking observation to emerge from the comparison (Figure 2) is that robots aggressively target web resources (*.html, *.php, *.htm etc.) and account for more requests, more visitors and more bandwidth on these resources than human visitors do. Human visitors, in contrast, show uniform access behaviour across all types of resources. They account for fewer requests, fewer visitors and less bandwidth on web resources, but their aggregate value over all resource types is considerably higher than that of robots. Interestingly, the ratio of web to image resources accessed also differs between the two groups: it is much larger for robots than for humans. It is reasonable to expect humans to request many web resources as they browse from page to page to retrieve information and download files. Their percentage of web resources may nevertheless be low because human requests are skewed towards the resources embedded in web pages (*.jpg, *.png, *.gif etc.).
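The per-resource request shares and the web-to-image ratio discussed above could be computed along the following lines. This is a sketch only: the grouping of extensions into "web" and "image" follows the examples given in the text, and the function names and sample paths are hypothetical.

```python
from collections import Counter
from os.path import splitext
from urllib.parse import urlsplit

# Extension groups taken from the examples in the text; real logs need a fuller list.
WEB_EXTENSIONS = {".html", ".htm", ".php"}
IMAGE_EXTENSIONS = {".jpg", ".png", ".gif"}


def resource_type(path):
    """Classify a requested path as 'web', 'image' or 'other' by its file extension."""
    extension = splitext(urlsplit(path).path)[1].lower()
    if extension in WEB_EXTENSIONS:
        return "web"
    if extension in IMAGE_EXTENSIONS:
        return "image"
    return "other"


def request_shares(paths):
    """Return the percentage of requests per resource type."""
    counts = Counter(resource_type(p) for p in paths)
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}


# Hypothetical sample of paths requested by robot sessions.
robot_paths = ["/index.html", "/forum/member.php?tab=all", "/about.htm", "/logo.png"]
robot_shares = request_shares(robot_paths)
web_to_image_ratio = robot_shares["web"] / robot_shares["image"]
```

The same functions can be applied to the paths of human sessions, so that both groups are measured with identical resource categories.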

Figure 2. % of Requests Received, % of Visitors Attracted and % of Bandwidth consumed by different types of resources

Figure 3. Comparison of access behavior of Web robots Vs Humans

4.2. Experiment-II

Comparison of general browsing behaviour: In this experiment, we examine the hourly distributions of the percentage of requests, percentage of visitors and percentage of bandwidth consumed by robots and humans.

We also examined the access behaviour of robots and humans for the most popular resources, top entry pages and top exit pages. It is evident from the experiment (Figure 3) that human visitors exhibit a consistent access tendency throughout the day, whereas robots initially (red spikes) generate a vast amount of traffic, requesting a large number of resources and consuming a significant amount of bandwidth. The popular resources accessed by robots and by humans are concentrated in different locations. Human visitors are monotonous and restricted to a few web resources, while robots search exhaustively across a multiplicity of resources and follow diverse entry and exit paths.
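The hourly distribution of requests examined in this experiment can be illustrated with a short sketch. The helper name `hourly_request_share` and the sample timestamps are hypothetical, and the same binning would apply to visitors or bandwidth.

```python
from collections import Counter
from datetime import datetime


def hourly_request_share(timestamps):
    """Return a 24-element list with the percentage of requests in each hour of the day."""
    counts = Counter(t.hour for t in timestamps)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 24
    return [100.0 * counts.get(hour, 0) / total for hour in range(24)]


# Hypothetical robot session: a burst shortly after midnight, then one late request.
robot_requests = [
    datetime(2013, 12, 15, 0, 1, 2),
    datetime(2013, 12, 15, 0, 1, 5),
    datetime(2013, 12, 15, 0, 2, 0),
    datetime(2013, 12, 15, 13, 30, 0),
]
robot_hourly = hourly_request_share(robot_requests)
```

A bursty agent concentrates its share in a few hours, whereas a consistent access tendency throughout the day produces a flatter profile across the 24 bins.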

Figure 4. Comparison of demographic browsing behaviour of Web robots Vs Humans

4.3. Experiment-III

Comparison of demographic browsing behaviour: In this analysis we examine the demographic origin of robots and human visitors (Figure 4). The largest share of visitor sessions (both robot and human) comes from the USA, followed by India. China is the only country that contributes significantly to web robot sessions but only a very small share of human sessions. Among cities worldwide, most robot sessions originate from Beijing, Mountain View, Chicago and Singapore, while human sessions are generated across the world, with the major share contributed by Indian cities where hi-tech industries are booming (Hyderabad, Bangalore, Mumbai etc.), US cities (San Francisco, Los Angeles, Chicago etc.), Singapore and Dubai. In India, robot and human sessions are generated from almost the same cities, but the volume of human sessions is much larger than that of robot sessions.

4.4. Experiment-IV

Comparison of access paths and response codes: Here we discuss the paths followed, responses received and operating systems used by web robots and human visitors (Figure 5). Human visitors followed diverse paths and, compared to robots, received a successful response from the server most of the time. Web robots used very long paths and traversed them with high frequency because they are programmed to do the same job repeatedly. Web robots also received different types of error responses from servers because, being automatic software agents, they may follow broken links on web pages or request resources that are not available. Both humans and robots used different types of operating systems, but humans predominantly used GUI-based desktop operating systems, while robots used server operating systems along with GUI-based desktop operating systems.
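One simple way to compare the responses received by the two groups is to aggregate HTTP status codes by class (2xx success, 3xx redirection, 4xx client error, 5xx server error). The sketch below assumes integer status codes have already been extracted from the log; the function name and the sample codes are hypothetical.

```python
from collections import Counter


def status_class_shares(status_codes):
    """Return the percentage of responses per HTTP status class, e.g. '2xx' or '4xx'."""
    counts = Counter(f"{code // 100}xx" for code in status_codes)
    total = sum(counts.values())
    return {status_class: 100.0 * n / total for status_class, n in counts.items()}


# Hypothetical status codes of one robot session that follows broken links.
robot_codes = [200, 200, 404, 500]
robot_status = status_class_shares(robot_codes)
```

A higher combined share of the 4xx and 5xx classes for robot sessions than for human sessions would correspond to the erroneous responses described above.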

Figure 5. Comparison of Access paths, Response codes and Operating Systems
