Ten tips for building cache-friendly Web sites. (Internet Focus)

In computer science lingo, caching refers to the process of storing frequently used information. Caching is prevalent in many aspects of computer systems and networks: CPUs, disks, file systems, and routers all use caching. A Web cache is a server dedicated to the task of storing Web pages as people surf the Internet. Internet service providers, corporations, and universities often use Web caches, also known as caching proxies, on their local networks to increase download speeds and reduce network traffic. When people surf the Web through the proxies, cached pages load significantly faster than pages that are not found in the cache.

A 1999 study by Zona Research concluded that e-commerce sites may have been losing up to $362 million (USD) per month due to page-loading delays and network failures. Content providers and server administrators should try to build cache-friendly Web sites. Your customers and visitors will thank you.

The following list describes the top ten steps you can take to build a cache-friendly Web site.
Don't feel you have to implement all of these steps; it is still helpful if you put just one or two into practice. The most beneficial and practical ideas are listed first.

1. Avoid using CGI, Active Server Pages, and server-side includes unless absolutely necessary. In general, these techniques are bad for caches because they usually produce dynamic content. Dynamic content is not a bad thing per se, but it may be abused. CGI and ASP can also generate cache-friendly, static content, but this requires special effort by the author and seems to occur infrequently in practice.

The main problem with CGI scripts is that many caches simply do not store a response when the URL includes cgi-bin or even cgi. The reason for this heuristic is perhaps historical. When caching was first in use, this was the easiest way to identify dynamic content. Today, with HTTP
1.1, we only need to look at the response headers to determine what may be cached. Even so, the heuristic remains, and some caches might be hardwired to never store CGI responses.

From a cache's point of view, Active Server Pages (ASP) are very similar to CGI scripts. Both are generated by the server, on the fly, for each request. As such, ASP responses usually have neither a Last-Modified nor an Expires header. On the plus side, it is uncommon to find special cache heuristics for ASP (unlike CGI), probably because ASP was invented well after caching was in widespread use.

Finally, you should avoid server-side includes (SSI) for the same reasons. This is a feature of some HTTP servers that parses HTML at request time and replaces certain markers with special text. For example, with Apache you can insert the current date and time or the current file size into an HTML page. Because the server generates new content, the Last-Modified header is either absent in the response or set to the current time. Both cases are bad for caches.

2. Use the GET method instead of the POST method, if possible. Both methods are used for HTML forms and query-type requests. With the POST method, query terms are transmitted in the request body. A GET request, on the other hand, puts the query terms in the URI.
It's easy to see the difference in your browser's Location box. A GET query has all the terms in the box, with lots of & and = characters. This means POST is somewhat more secure because the query terms are hidden in the message body. However, this difference also means that POST responses cannot be cached unless specifically allowed. POST requests may have side effects on the server (e.g., updating a database), but those side effects wouldn't be triggered if the cache gave back a cached response. Section 9.1 of RFC 2616 explains the important differences between GET and POST. In practice, it is rare to find a cachable POST response, so I doubt most caching products cache any POST responses at all. If you want to have cachable query results, you certainly should use GET instead of POST.

3. Avoid renaming Web site files; use unique filenames instead. This might be difficult or impossible for some situations, but consider this example: A Web site lists a schedule of talks for a conference. For each talk there is an abstract stored in a separate HTML file. These files are named to match the order of their presentation during the conference, something like talk01.html, talk02.html, talk03.html, and so on. At some point, the schedule changes and the filenames are no longer in order. If the files are renamed so that they match the new order of presentation, Web caches are likely to become confused. Renaming usually does not update the file-modification time, so an If-Modified-Since request for a renamed file can have unpredictable consequences. Renaming files in this manner is similar to cache poisoning.

In this example, it is better to use a file-naming scheme that does not depend on the order; perhaps base the file naming on the presenter's name. Then, if the order of presentation changes, the HTML file with the schedule must be rewritten, but the other files can still be served from the cache. Another solution is to touch the files to adjust the time stamp.

4. Give your content a default expiration time, even if it is very short. If your content is relatively static, adding an Expires header can significantly speed up access to your site. The explicit expiration time means clients know exactly when they should issue revalidation requests. An expires-based cache hit is almost always faster than a validation-based near hit. With Apache, you can use the mod_expires module to add expiration times to your responses. After configuring and compiling the server with mod_expires, you'll need to add the ExpiresActive directive to your httpd.conf file:

    ExpiresActive on

Then, you can use the ExpiresByType and ExpiresDefault directives to control expiration values for different responses.
For example:

    ExpiresByType text/html "access plus 12 hours"
    ExpiresByType image/jpeg "access plus 1 day"
    ExpiresDefault "access plus 6 hours"

If you have content that changes at regular intervals (for example, daily), you can base the expiration time on the file-modification time:

    ExpiresByType image/gif "modification plus 1 day"

For more information on the mod_expires module, take a look at the Apache documentation.

5. If you have a mixture of static and dynamic content, it is helpful to have a separate HTTP server for each. This way, you can set server-wide defaults to improve the cachability of your static content without affecting the dynamic data. Since the entire server is dedicated to static objects, you only need to maintain one configuration file. A number of large Web sites have taken this approach. Yahoo serves all its images from a server at images.yahoo.com, as does CNN with images.cnn.com. Wired serves advertisements and other images from static.wired.com, and Hotbot uses a server named static.hotbot.com.

6. Don't use content negotiation. Occasionally, people like to create pages that are customized for the user's browser. For example, Netscape may have a nifty feature that Internet Explorer does not have. An origin server can examine the User-agent request header and generate special HTML to take advantage of a browser feature. To use the terminology from HTTP, an origin server may have any number of variants for a single URI.
The mechanism for selecting the most appropriate variant is known as content negotiation, and it has negative consequences for Web caches. First, if either the cache or the origin server does not correctly implement content negotiation, a cache client might receive the wrong response. For example, if an HTML page has something specific to Internet Explorer and gets cached, the cache might send it to a Netscape user. To prevent this from happening, the origin server is supposed to add a response header telling caches that the response depends on the User-agent value:

    Vary: User-agent

If the cache ignores the Vary header, or if the origin server does not send it, cache users can get incorrect responses.

Even when content negotiation is correctly implemented, it reduces the number of cache hits for the URL. If a response varies on the User-agent header, a cache must store a separate response for every User-agent value it encounters. Note that this is more than just Netscape or MSIE. Rather, it is a string like Mozilla/4.05 [en] (X11; I; FreeBSD 2.2.5-RELEASE i386; Nav). Thus, when a response varies on the User-agent header, we can only get a cache hit for clients running the exact same version of the browser on the same operating system.

7. Synchronize your system clocks with a reference clock. This ensures that your server sends accurate Last-Modified and Expires time stamps in its responses. Even though newer versions of HTTP use techniques that are less susceptible to clock skew, many Web clients and servers still rely on the absolute time stamps. ntpd implements the Network Time Protocol (NTP) and is widely used to keep system clocks synchronized.

8.
Avoid using address-based authentication. Most caching proxies hide the addresses of clients. An origin server sees connections coming from the proxy's address, not the client's. Furthermore, there is no standard and safe way to convey the client's address in an HTTP request. Address-based authentication can also deny legitimate users access to protected information when they use a proxy cache. Many organizations use a DMZ (demilitarized zone) ...

9. Think different. Sometimes, those of us in the United States forget about Internet users in other parts of the world. In some countries, Internet bandwidth is so constrained that we would find it appalling. What takes seconds or minutes to load in the U.S. may take hours or even days in some locations. I strongly encourage you to remember bandwidth-starved users when designing your Web sites, and remember that improved cachability speeds up your Web site for such users.

10. Even if you think shared proxy caches are evil, consider allowing single-user browser caches to store your pages. There is a simple way to accomplish this with HTTP 1.1. Just add the following header to your server's replies:

    Cache-control: Private

This header allows only browser caches to store responses. The browser may then perform a validation request on the cached object as necessary.

Duane Wessels is the author of the book 'Web Caching', available from O'Reilly (www.oreilly.com).
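
As a rough illustration of how tips 1, 2, 4, 6, and 10 look from a shared cache's side, the following Python sketch makes the same decisions the article describes. It is a deliberate simplification, not the logic of any real caching product: the function names shared_cache_may_store and cache_key, the example.com URLs mentioned in the usage note, and the rule of refusing any response that carries neither Expires nor Last-Modified are all assumptions made for this example.

```python
from typing import Mapping, Tuple


def _get(headers: Mapping[str, str], name: str) -> str:
    """Case-insensitive header lookup; returns '' when the header is absent."""
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value.strip()
    return ""


def shared_cache_may_store(method: str, url: str,
                           response_headers: Mapping[str, str]) -> bool:
    """Hypothetical, simplified storability check for a shared proxy cache."""
    # Tip 2: only GET responses are considered; POST responses are skipped.
    if method.upper() != "GET":
        return False
    # Tip 1: the historical heuristic that rejects any URL containing "cgi".
    if "cgi" in url.lower():
        return False
    # Tip 10: "Cache-control: Private" restricts storage to browser caches.
    cache_control = _get(response_headers, "Cache-Control")
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "private" in directives:
        return False
    # Tips 1 and 4: a response with neither an expiration time nor a
    # modification time is treated here as not worth storing (an assumption;
    # real caches apply their own heuristics).
    has_expires = bool(_get(response_headers, "Expires"))
    has_last_modified = bool(_get(response_headers, "Last-Modified"))
    return has_expires or has_last_modified


def cache_key(url: str, request_headers: Mapping[str, str],
              response_headers: Mapping[str, str]) -> Tuple[str, ...]:
    """Tip 6: build a key with one entry per request header named in Vary."""
    vary = _get(response_headers, "Vary")
    varied = sorted((h.strip().lower() for h in vary.split(",") if h.strip()))
    return (url,) + tuple(_get(request_headers, h) for h in varied)
```

Calling shared_cache_may_store("GET", "http://example.com/talks/smith.html", {"Last-Modified": "Mon, 01 Jan 2001 00:00:00 GMT"}) returns True, while the same call with method "POST", with a cgi-bin URL, or with a "Cache-control: Private" response header returns False. With a "Vary: User-agent" response header, cache_key produces a different key for every distinct User-agent string, which is why content negotiation on that header fragments the cache.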