Internet Web Site Geographical Counter.

James Etheredge [1]

The Internet provides many interesting challenges as its usage becomes more common in the consumer marketplace. Most forms of media provide some means of obtaining profiles of the people using that media. The purpose of the project described in this paper was to develop an Internet web page counter that provides a geographical representation of the people visiting the site. This information can be useful to the businesses hosting the site as well as to advertisers supporting the site. Very little information is available about the particular user downloading a web page. Hypertext Transfer Protocol (HTTP) provides a means of obtaining personal information about the user, such as their e-mail address, through the request header, but for reasons of privacy, commercial programs generally do not implement those elements of the protocol. One piece of information that is available is the user's Internet domain, and this is what the program uses to determine the person's geographical location. For people using national domains, such as America On-Line and CompuServe, however, no information can be obtained regarding that particular person's location. Based on testing performed to date, the program determines the person's location approximately 70% of the time. The output of the program is a map of the United States showing the number of hits by state. In addition to the development of a new Internet application, the project served as a vehicle for the exploration and utilization of complex and often poorly documented networking concepts.

Virtually every form of media has some means of obtaining information about the people viewing that media. The Internet provides some interesting challenges in this regard. As commercial interests on the Internet, such as advertising and electronic commerce, continue to increase, so does the demand to obtain information about the users of the media. Several mechanisms are available for obtaining user information, each with its advantages and disadvantages.

One method of tracking users is by using "cookies." A cookie is an ASCII text string, which is set in the user's computer. By using a Common Gateway Interface (CGI) program, the text string can be sent and stored on the client machine as well as retrieved from the client. The notion of allowing a server to install a file onto a user's computer, often times without the user's knowledge, has caused many debates between the government and the companies writing the commercial software. The two leading web browser developers, Netscape and Microsoft, claim it is not an invasion of privacy because the user has the option of not accepting cookies. The government's claim is that the user is not given any indication that the server is installing data onto their computer because the default setting in both browsers is to accept all cookies without warning. For security reasons, cookies have several restrictions:

* They can contain only ASCII text. This prevents the spread of viruses.

* Only one cookie may be installed from each domain.

* The size is limited to 4096 bytes.

Several types of information can be gathered using cookies, such as tracking how a user navigates through the web site and tracking how often the user visits the site. An advertiser or web developer may be more interested in the number of repeat visits to the site, instead of a raw count number.

Another method of tracking users is through a registration process. The user may be presented with an HTML form, which is used to gather information about the user. The form contents can be passed to a program on the server that assigns the user a password. Sometimes this can be effective if the information is of sufficient value that the user will register. Often times though, the user will go seek the information elsewhere or put invalid information into the form.

The project described in this paper is another means of obtaining information about the user. Geographical information can be used by web hosting companies and advertisers alike to better understand their audience.

OVERVIEW

The geographical counter is implemented as a CGI program. Several methods are available for calling a CGI program. Felton (1997), Schwartz and Christiansen (1997), Wall et al. (1996), and Medinets (1996) provide detailed explanations and examples of different methods and techniques used in CGI programs. In this case, the program is called from an image tag on the web page such as:

<IMG SRC="http://www.domain.com/cgi-bin/geo">

The CGI program is started when the client browser requests the "Image" described in the tag. The path in the image tag is actually the path to a CGI program. Since the browser thinks it is retrieving an image, the CGI program has two methods of providing an appropriate response. The method used in this program is to open an image file and send it to the client once the connection is established. Another method is to return an HTTP redirection header. This type of header is used to redirect the client browser to an alternate location to obtain the information. The redirection header has the format:

Location: http://www.hostdomain.com/image.gif
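The two response methods described above can be sketched in C as follows. This is a minimal illustration rather than the program's actual code: the function names send_image and send_redirect are hypothetical, the image is assumed to be a GIF, and PIXEL_GIF holds a commonly used 43-byte encoding of a 1 x 1 transparent GIF. In a CGI program the output stream would be stdout.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* A commonly used 43-byte encoding of a 1 x 1 transparent GIF.
   The GIF format is an assumption; the paper only says "bitmap". */
const unsigned char PIXEL_GIF[] = {
    0x47, 0x49, 0x46, 0x38, 0x39, 0x61, 0x01, 0x00, 0x01, 0x00, 0x80, 0x00, 0x00,
    0x00, 0x00, 0x00, 0xff, 0xff, 0xff,
    0x21, 0xf9, 0x04, 0x01, 0x00, 0x00, 0x00, 0x00,
    0x2c, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x00,
    0x02, 0x02, 0x44, 0x01, 0x00, 0x3b
};

/* Method 1: send a content-type header followed by the image bytes.
   Returns 0 on success, -1 on error. */
int send_image(FILE *out, const unsigned char *data, size_t len)
{
    if (fprintf(out, "Content-Type: image/gif\r\n\r\n") < 0)
        return -1;
    if (fwrite(data, 1, len, out) != len)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}

/* Method 2: send a redirection header pointing at the real image.
   Returns 0 on success, -1 on error or if url is empty. */
int send_redirect(FILE *out, const char *url)
{
    if (url == NULL || strlen(url) == 0)
        return -1;
    if (fprintf(out, "Location: %s\r\n\r\n", url) < 0)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

A call such as send_image(stdout, PIXEL_GIF, sizeof PIXEL_GIF) corresponds to the first method; send_redirect(stdout, "http://www.hostdomain.com/image.gif") corresponds to the second.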

The CGI program retrieves the user's Internet domain from the server environment. When the client browser makes an HTTP request, the server sets an environment variable called "REMOTE_HOST" for the CGI program. Retrieving this variable returns a string representing the user's domain, for example, "datasync.com."
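Reading the variable might look like the sketch below. The helper names cgi_env and remote_host are hypothetical. Note that a server may leave REMOTE_HOST unset when it cannot or does not resolve the client's host name, so the sketch treats a missing value as a failure instead of assuming the variable is always present.

```c
#include <stdlib.h>
#include <string.h>

/* Return the value of a CGI environment variable, or "" if the
   server did not set it. */
const char *cgi_env(const char *name)
{
    const char *value = getenv(name);
    return value != NULL ? value : "";
}

/* Copy the visitor's host name into buf (capacity `size` bytes).
   Returns 0 on success, -1 if REMOTE_HOST is unset, empty, or too long. */
int remote_host(char *buf, size_t size)
{
    const char *host = cgi_env("REMOTE_HOST");
    size_t len = strlen(host);

    if (len == 0 || len >= size)
        return -1;
    memcpy(buf, host, len + 1);  /* include the terminating NUL */
    return 0;
}
```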

All domains on the Internet are registered and publicly available from the InterNIC, which is the organization in the United States responsible for domain registration. Each domain name must be unique in order to avoid conflicts. The program opens a socket to the appropriate InterNIC server to obtain the registration information for the domain. The data received from InterNIC is parsed to obtain the state from which the domain is registered. The parser looks for the following sequence of characters:

<comma><space><2 uppercase letters><space><5 digits>

This would represent a sequence like ", VA 22089." The two uppercase letters represent the state and the five digits represent the zip code. A counter for the state is incremented in a local file. Figure 1 shows the response from the InterNIC server when it was queried with "datasync.com."
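The pattern search can be sketched as a simple scan over the reply text. The function name find_state is hypothetical, and this is one possible way to implement the match, not necessarily the one used in the program.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Scan text for the first occurrence of
   <comma><space><2 uppercase letters><space><5 digits>.
   On a match, copy the two-letter state code into state (3 bytes,
   NUL-terminated) and return 1. Return 0 if no match is found. */
int find_state(const char *text, char state[3])
{
    size_t len = strlen(text);

    /* The whole pattern is 10 characters long. */
    for (size_t i = 0; i + 10 <= len; i++) {
        const unsigned char *p = (const unsigned char *)text + i;
        int ok = p[0] == ',' && p[1] == ' ' &&
                 isupper(p[2]) && isupper(p[3]) && p[4] == ' ';

        for (int k = 5; ok && k < 10; k++)
            ok = isdigit(p[k]) != 0;

        if (ok) {
            state[0] = (char)p[2];
            state[1] = (char)p[3];
            state[2] = '\0';
            return 1;
        }
    }
    return 0;
}
```

Given a reply containing ", VA 22089", find_state stores "VA" in state. A match only shows that the text looks like a state and zip code; it does not check that the two letters are a real state abbreviation.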

A separate utility program is used to display the counter results. This also runs as a CGI program and reads the count numbers for each state. A map of the United States is displayed to the user with dots representing the location and number of hits. The dots are sized according to the number of hits for that state. The map is displayed in a Java applet window to allow the dots to be dynamically drawn. Harold (1996) and Flanagan (1997) provide information on how Java applets execute within the context of a web browser. A table is also displayed giving the actual numbers. Refer to Figure 2 for a sample of the program output.

FUNCTIONAL DESCRIPTION

Refer to Figure 3 for the functional description of this program. This model assumes that the counter program and web site are running on different network servers. This model was chosen for practical reasons. Many commercial Internet Service Providers (ISPs) do not allow any type of customer-installed executable, other than those that are made public by the system administrator. This model allows the program to be installed on one system and accessed by any number of web sites. The name of the web site making the request can be passed to the program in the QUERY_STRING environment variable. Six steps are required to obtain the user's location, as follows:

(1) URL Entry: The process starts when the user enters a URL into the web browser or the user selects a hyperlink from another web page. The client browser sends a request for the HTML coded page.

(2) Server Response: The web site server responds by sending the HTML coding for the page. The HTML coding provides the web browser with the formatting information for the page. This will typically include such items as text, tables, frames, images, links, etc.

(3) Call CGI Program: Once the HTML coding is downloaded, the browser makes additional requests based on the coding. This will typically include items like images. If a web page has 10 different images, the browser must make 10 separate connections to the server to obtain the images (assuming they have not been previously downloaded). The images don't necessarily have to reside on the same server where the HTML coding originated. For this particular model, one of the image tags tells the browser to go to a different server to obtain the image. This causes the browser to make a request to the server where the geographical counter program resides. At this point, the browser is simply making a request for an image, except the path provided by the image tag is actually to a CGI program, not an image file. This causes the CGI program to instantiate and the server will attach the standard output file handle of the CGI program to the socket connection back to the browser. The browser will receive the output of the CGI program.

(4) CGI Program Response: When the CGI program is instantiated, it retrieves the user's domain name from the environment. The domain name will be an ASCII string like "datasync.com." The server will set the domain as an environment variable when the connection is established. The client browser is waiting for the image file so the CGI program opens an image file and sends it to the client. The image is a 1 x 1 pixel, transparent bitmap. The image will not be visible in the client browser. The purpose of the image tag is to force a request to the server where the counter resides. If the server fails to return an image or a redirection, then the client machine will display an error to the user. Once the image data is returned to the client, the server can proceed to the next step, which is to check the user's geographical location.

(5) Query InterNIC: Once the CGI program is instantiated, the user's domain is retrieved from the local environment. The server places the domain string into the environment variable "REMOTE_HOST," which can be retrieved by a call to the C language runtime function getenv(). The domain string will be of the form "datasync.com." The program queries the InterNIC server with the domain name. A TCP socket is opened to port 43 on the InterNIC server; the appropriate server to query, however, depends on the domain suffix.

(6) InterNIC Response: The InterNIC server responds with the registration information about the domain, including the city and state where it is registered. This should only be used for determining the general geographical area. Most ISPs operate in a region that may cover a 50 mile radius, with hubs located in various locations to provide local dial-up numbers. The query to InterNIC will only return the city and state where the company is registered. Refer to Figure 1 for a sample of the response from the InterNIC server. The program retrieves the first 255 bytes of data, then parses it to obtain the state where the domain is registered. The program opens a binary file and updates the count for that state. Since the CGI program is instantiated for each request, it is possible to have multiple instances updating the same file. If the data is considered critical, then file locking can be used to maintain data integrity. This introduces significant overhead and should only be used if necessary. Once the count file is incremented, the program terminates.
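The count update in step 6, with the optional file locking, can be sketched as follows for a POSIX system. The file layout is an assumption made for illustration: an array of 32-bit counters in host byte order, one record per state, with the caller supplying the record index. The function name increment_count is hypothetical.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Increment the 32-bit counter stored at record `index` of the count
   file, holding an exclusive lock while the record is read and written.
   Returns the new count, or -1 on error. */
long increment_count(const char *path, int index)
{
    long result = -1;
    uint32_t count = 0;
    off_t pos = (off_t)index * (off_t)sizeof(uint32_t);
    struct flock lock;
    int fd;

    if (index < 0)
        return -1;
    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* l_start = 0 and l_len = 0 lock the whole file. F_SETLKW waits
       until any other instance holding the lock has finished. */
    memset(&lock, 0, sizeof lock);
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;

    if (fcntl(fd, F_SETLKW, &lock) == 0 && lseek(fd, pos, SEEK_SET) == pos) {
        ssize_t got = read(fd, &count, sizeof count);
        if (got >= 0) {
            if ((size_t)got != sizeof count)
                count = 0;  /* record not written yet: start from zero */
            count++;
            if (lseek(fd, pos, SEEK_SET) == pos &&
                write(fd, &count, sizeof count) == (ssize_t)sizeof count)
                result = (long)count;
        }
    }
    close(fd);  /* closing the descriptor releases the lock */
    return result;
}
```

Without the fcntl() call, two instances could read the same value and both write back count + 1, losing a hit. Whether that loss matters is the trade-off against overhead described in step 6.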

Several steps are involved in implementing the counter. The process may appear somewhat cumbersome, but it is relatively quick: it typically takes about 2 seconds from the time the client obtains the HTML coding until the counter is incremented, and the longest delay is usually the wait for the InterNIC server to respond. The whole process is invisible to the user. The web server where the counter is located responds to the client by sending a 1 x 1 pixel, transparent image. This is done prior to connecting to the InterNIC server to check the user's geographical location, so the user does not experience any noticeable delay. The only purpose of the image is to provide a means of instantiating the CGI program. It should be noted that the process of connecting to the InterNIC could also be performed off-line, using server logs to obtain the domain information. Which method is best depends on the business application. Off-line processing could be done during non-peak hours at the expense of real-time data collection.

The program opens a socket to one of three InterNIC servers to obtain the registration information as follows:

Each server uses port 43 for the registration information. The sockets used in this program are non-persistent. If an error occurs while connecting, sending or receiving information through the socket, then the socket is closed and the program terminates. Also, if the InterNIC server is busy at the time the query is made, the program aborts the attempt and terminates.
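A single non-persistent query of this kind can be sketched with POSIX sockets as shown below. The function names build_query and whois_query are hypothetical, and the server host name is left as a parameter because the list of servers is not reproduced here. The sketch assumes the usual port 43 convention of sending the domain name followed by a carriage return and line feed, and it gives up on the first error without retrying.

```c
#define _POSIX_C_SOURCE 200112L
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Build the query line: the domain name followed by CRLF.
   Returns the query length, or -1 if it does not fit in buf. */
int build_query(const char *domain, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%s\r\n", domain);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Connect to port 43 on server, send the query, and read up to
   size - 1 bytes of the reply. Returns the number of bytes read,
   or -1 on any error. The socket is always closed before returning. */
long whois_query(const char *server, const char *domain,
                 char *reply, size_t size)
{
    char query[300];
    struct addrinfo hints, *res, *ai;
    long total = -1;
    int fd = -1;
    int qlen = build_query(domain, query, sizeof query);

    if (qlen < 0 || size == 0)
        return -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    if (getaddrinfo(server, "43", &hints, &res) != 0)
        return -1;

    /* Try each resolved address until one connects. */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    if (fd < 0)
        return -1;

    if (send(fd, query, (size_t)qlen, 0) == qlen) {
        total = 0;
        while ((size_t)total < size - 1) {
            ssize_t n = recv(fd, reply + total, size - 1 - (size_t)total, 0);
            if (n < 0) { total = -1; break; }
            if (n == 0) break;  /* server closed the connection */
            total += n;
        }
        if (total >= 0)
            reply[total] = '\0';
    }
    close(fd);
    return total;
}
```

Passing a 256-byte reply buffer matches the behavior described earlier, in which only the first 255 bytes of the response are retrieved before parsing.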

This program could be easily extended to determine international domain locations. At this time, however, all domains outside the United States are logged under one category called "foreign."

More persistent sockets can introduce large amounts of overhead, which is not necessary for this application. Any errors in connecting sockets are related to the host server talking to the InterNIC server and have nothing to do with the domain of the user accessing the page. With this line of reasoning, making the sockets persistent will not provide any more accurate data on the physical location of users visiting the web site.

IMPLEMENTATION

Consideration has to be given to the load put on the server to run this type of counter. The purpose is primarily to provide a geographical representation. The total hit count can be obtained through a simple counter program and then compared to the geographical distribution. For a very busy site, the number of users checked can be reduced with some simple algorithms. One method is to apply the modulus operator to the current time to determine whether to check the user domain. In C, the code would look something like:

if ((t % 4) == 0) {

    /* check user domain */

}

In this case, the condition is true for one out of every four values of t, so roughly 25% of the user domains will be checked. Here t is the standard time value, the number of seconds elapsed since January 1, 1970.
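The same test can be wrapped in a small function so that the sampling rate is adjustable. This is a sketch under the assumption that requests arrive spread evenly over time; the name should_check and the parameter n are hypothetical.

```c
#include <time.h>

/* Return 1 when t is a multiple of n, which selects about one request
   in n. Returns 0 when n is 0 so the check is never performed. */
int should_check(time_t t, unsigned int n)
{
    if (n == 0)
        return 0;
    return ((long long)t % (long long)n) == 0;
}
```

A call such as should_check(time(NULL), 4) reproduces the 25% case. Because all requests arriving within the same second share one value of t, a burst of hits is either checked in full or skipped in full, so the rate is only approximate over short periods.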

The counter can be activated and de-activated quite easily by just changing the image tag in the HTML page.

LIMITATIONS

As with any image-based counter, the CGI program will not be instantiated if the user is using a text-only browser or has set the browser to text-only mode. In order for the CGI program to instantiate, the client browser must make a request for the image. Other technologies are available which overcome this limitation. Active Server Pages or Java servlets could be used to deliver the page, provided the Internet host allows such processes to be installed. The image should be placed near the top of the page. This gives it the best chance of being requested from the server. If the calling image is near the bottom of the page, the user may click the "stop" button in the browser before the page and all its components finish downloading.

An alternate method can be used to call the program. The entire web page can be the output of a CGI program. This is referred to as "through-put" mode. The CGI is called from a link to the page. This ensures that the program is always called. Maintenance can be a little more difficult with this approach.

The InterNIC server returns the registration information for the user's domain. In some cases, this does not represent the user's physical location. For this reason, users with national domains cannot be verified. Typical examples of national domains are America On-Line, Microsoft Network or CompuServe. These can be logged separately from the geographical hits. In many circumstances, the number of users from national domains can be a very useful piece of information. People using these services tend to fit a particular demographic profile, which may be of interest to web developers or advertisers.

CONCLUSION

A geographical counter is a tool that can provide a representation of people visiting a web site. This information can be used to determine which geographic areas to target for other forms of media and advertising.

Inherent limitations in the network environment and practical considerations will prevent 100% of the users being checked for geographical location. The goal, however, is to produce a geographic distribution that approximates the location of the people visiting the web site.

The source code for this program may be viewed at http://www.bestweb.net/~pywacket/. The compiled code is not presently installed in a publicly accessible directory.

(1.) Author for correspondence.

LITERATURE CITED

Felton, M. 1997. CGI Internet Programming with C++ and C. Prentice Hall (ISBN 0-13-712358-2). 514 pp.

Schwartz, R.L., and T. Christiansen. 1997. Learning Perl, 2nd ed. O'Reilly and Associates (ISBN 1-56592-284-0). 269 pp.

Wall, L., T. Christiansen, and R.L. Schwartz. 1996. Programming Perl, 2nd ed. O'Reilly and Associates (ISBN 1-56592-149-6). 645 pp.

Medinets, D. 1996. Perl by Example. Que Corporation (ISBN 0-7897-0866-3). 58 pp.

Harold, E.R. 1996. Java Network Programming. O'Reilly and Associates (ISBN 1-56592-227-1). 422 pp.

Flanagan, D. 1997. Java in a Nutshell, 2nd ed. O'Reilly and Associates (ISBN 1-56592-262-X). 610 pp.
COPYRIGHT 2000 Mississippi Academy of Sciences
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2000, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Author:Etheredge, James
Publication:Journal of the Mississippi Academy of Sciences
Date:Apr 1, 2000