6.824 Lab 1: A simple web proxy

1987web · 2024-03-25 · Web development

6.824 - Spring 2004

Due: Tuesday, February 10th, 1:00pm.

Introduction

Please read Getting started with 6.824 labs before starting this assignment. You will also need Using TCP through sockets at a later stage.

If you have questions, please first read Office hours and asking questions. After you have done that, you can send e-mail to 6.824-staff@pdos.lcs.mit.edu.

In this lab assignment you will write a simple web proxy. A web proxy is a program that reads a request from a browser, forwards that request to a web server, reads the reply from the web server, and forwards the reply back to the browser. People typically use web proxies to cache pages for better performance, to modify web pages in transit (e.g. to remove annoying advertisements), or for weak anonymity.

You'll be writing a web proxy to learn about how to structure servers. For this assignment you'll start simple; in particular your proxy need only handle a single connection at a time. It should accept a new connection from a browser, completely handle the request and response for that browser, and then start work on the next connection. (A real web proxy would be able to handle many connections concurrently.)

In this handout, we use "client" to mean an application program that establishes connections for the purpose of sending requests [3], typically a web browser (e.g., lynx or Netscape). We use "server" to mean an application program that accepts connections in order to service requests by sending back responses (e.g., the Apache web server) [1]. Note that a proxy acts as both a client and a server. Moreover, a proxy could communicate with other proxies (e.g., a cache hierarchy).

Design Requirements

Your proxy will speak a subset of the HTTP/1.0 protocol, which is defined in RFC 1945. You're only responsible for a small subset of HTTP/1.0, so you can ignore most of the spec. You should make sure your proxy satisfies these requirements:

- GET requests work.
- Images/Binary files are transferred correctly.
- Your web proxy should properly handle Full-Requests (RFC 1945, Section 4.1) up to, and including, 65535 bytes. You should close the connection if a Full-Request is larger than that.
- You must support URLs with a numerical IP address instead of the server name (e.g. http://18.181.0.31/).
- You are not allowed to use fork().
- You may not allocate more than 100MB of memory.
- You cannot have more than 32 open file descriptors.
- Your proxy should correctly service each request if possible. If an error occurs, and it is possible for the proxy to continue with subsequent requests, it should close the connection and then proceed to the next request. If an error occurs from which the proxy cannot reasonably recover, the proxy should print an error message on the standard error and call exit(1). There are not many non-recoverable errors; perhaps the only ones are failure of the initial socket(), bind(), listen() calls, or a call to accept(). The proxy should never dump core except in situations beyond your control (e.g. a hardware or operating system failure).
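One way to honor the 65535-byte limit from the requirements above is to cap the request buffer as bytes arrive. The following is only a sketch: `RequestBuffer` and `kMaxRequestBytes` are hypothetical names invented here, not part of the code handed out with the lab.

```cpp
#include <cstddef>
#include <string>

// Largest Full-Request the proxy must accept (from the requirements above).
constexpr std::size_t kMaxRequestBytes = 65535;

// Hypothetical helper: accumulates request bytes and refuses to grow past
// the limit, so a single client cannot make the proxy buffer without bound.
class RequestBuffer {
public:
    // Appends n bytes. Returns false and leaves the buffer unchanged if the
    // request would exceed kMaxRequestBytes; the caller is then expected to
    // close the connection and move on to the next client.
    bool append(const char* data, std::size_t n) {
        if (n > kMaxRequestBytes - buf_.size()) return false;
        buf_.append(data, n);
        return true;
    }

    std::size_t size() const { return buf_.size(); }
    const std::string& bytes() const { return buf_; }

private:
    std::string buf_;  // never longer than kMaxRequestBytes
};
```

A read loop would call `append()` with each chunk it reads and treat a `false` return as the "larger than 65535 bytes" case.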

You do not have to worry about correct implementation of any of the following features; just ignore them as best you can:

- POST or HEAD requests.
- URLs of any type other than http.
- HTTP-headers (RFC 1945, Section 4.2).

If your browser can fetch pages and images through your proxy, and your proxy passes our tester (see below), you're done.

HTTP example without a web proxy

HTTP is a request/response protocol that runs over TCP. A client opens a connection to a web server and sends a request for a file; the server responds with some status information and the file contents, and then closes the connection.

You can try out HTTP yourself:

% telnet web.mit.edu 80

This connects to web.mit.edu on port 80, the default port for HTTP (web) servers.

Then type

GET / HTTP/1.0

followed by two carriage returns. This ends the header section of the request. The server locates the web page and sends it back. You should see it on your screen.

To form the path to the file to be retrieved on a server, the client takes everything after the machine name. For example, http://web.mit.edu/resources.html means we should ask for the file /resources.html. If you see a URL with nothing after the machine name and port, then / is assumed; the server figures out what page to return when just given /. Typically this default page is index.html or home.html.

On most servers, the HTTP server lives on port 80. However, one can specify a different port number in the URL. For example, typing http://web.mit.edu:2206 in your browser will tell it to find a web server on port 2206 on web.mit.edu. (No, this doesn't work for this address.)
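The URL rules just described (the path is everything after the machine name, "/" when nothing follows, port 80 unless one is given) can be sketched as a small function. `UrlParts` and `split_http_url` are names made up for this illustration; the lab's own http.C parser does this work for you, and its internals may differ.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical result type for this sketch (not part of the lab's code).
struct UrlParts {
    std::string host;
    int port;
    std::string path;
};

// Splits an http:// URL into host, port, and path. Returns false for
// non-http URLs, an empty host, or an invalid port; `out` may be partly
// filled in when false is returned.
bool split_http_url(const std::string& url, UrlParts& out) {
    const std::string scheme = "http://";
    if (url.compare(0, scheme.size(), scheme) != 0) return false;

    // Everything from the first '/' after the machine name is the path.
    const std::size_t host_start = scheme.size();
    const std::size_t path_start = url.find('/', host_start);
    const std::string authority =
        path_start == std::string::npos
            ? url.substr(host_start)
            : url.substr(host_start, path_start - host_start);
    out.path = path_start == std::string::npos ? std::string("/")
                                               : url.substr(path_start);

    const std::size_t colon = authority.find(':');
    if (colon == std::string::npos) {
        out.host = authority;
        out.port = 80;  // default HTTP port
    } else {
        out.host = authority.substr(0, colon);
        const std::string port_str = authority.substr(colon + 1);
        if (port_str.empty() || port_str.size() > 5) return false;
        for (char c : port_str) {
            if (c < '0' || c > '9') return false;
        }
        const int port = std::atoi(port_str.c_str());
        if (port < 1 || port > 65535) return false;
        out.port = port;
    }
    return !out.host.empty();
}
```

For example, `http://web.mit.edu:2206` yields host `web.mit.edu`, port 2206, and path `/`. A numeric host such as `18.181.0.31` passes through unchanged as the host string.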

HTTP (request) example with a web proxy

Before you can do this example, you need to tell your web browser to use a web proxy. This explanation assumes you are running Mozilla, but things should be remarkably similar for Netscape. Choose "Edit" -> "Preferences". Then choose "Advanced" -> "Proxies". Click on "Manual proxy configuration". Now set the "HTTP proxy" to speakeasy-mit-ron.lcs.mit.edu and port 3128. Mozilla will now send all HTTP requests to this web proxy rather than directly to web servers.

Lynx, a poor man's browser, can be told to use this web proxy by setting the environment variable http_proxy to speakeasy-mit-ron.lcs.mit.edu:3128.

Now to the real stuff.

You can use nc to peek at HTTP requests that a browser sends to a web proxy. nc lets you read and write data across network connections using UDP or TCP [4]. The class machines have nc installed.

First we'll examine the requests that a browser sends to the proxy. We'll use nc to listen on a port and direct our web browser (Lynx) to use that host and port as a proxy. We're going to let nc listen on port 8888 and tell Lynx to use a web proxy on port 8888.

% nc -lp 8888

This tells nc to listen on port 8888. Chances are that you will have to choose a different port number than 8888 because someone else may be using that port. Choose a number greater than 1024 and less than 65536. Now try, on the same machine, to retrieve a web page using port 8888 as a proxy:

% env http_proxy=http://localhost:8888/ lynx -source http://www.yahoo.com

This tells Lynx to fetch http://www.yahoo.com using a web proxy on port 8888, which happens to be our spy friend nc.

Netcat neatly prints out the request headers that Lynx sent:

% nc -lp 8888
GET http://www.yahoo.com/ HTTP/1.0
Host: www.yahoo.com
Accept: text/html, text/plain, application/vnd.rn-rn_music_package, application/x-freeamp-theme, audio/mp3, audio/mpeg, audio/mpegurl, audio/scpls, audio/x-mp3, audio/x-mpeg, audio/x-mpegurl, audio/x-scpls, audio/mod, image/*, video/mpeg, video/*
Accept: application/pgp, application/pdf, application/postscript, message/partial, message/external-body, x-be2, application/andrew-inset, text/richtext, text/enriched, x-sun-attachment, audio-file, postscript-file, default, mail-file
Accept: sun-deskset-message, application/x-metamail-patch, application/msword, text/sgml, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6b

The GET request on the first line tells the proxy to get the file http://www.yahoo.com using HTTP version 1.0. Notice how this request is quite different from the example without a web proxy! The protocol and machine name (http://www.yahoo.com) are now part of the request. In the previous example this part was omitted. Look in RFC 1945 for details on the remaining lines. (It's effective reading material if you really can't sleep and Dostoevsky didn't do the trick.)

HTTP (reply) example with a web proxy

The previous example shows the HTTP request. Now we'll try to see what a real web proxy (speakeasy-mit-ron.lcs.mit.edu port 3128) sends to a web server. To achieve this we use nc to be a fake web server. Start the "fake server" on anguish.lcs.mit.edu with the following command:

% nc -lp 8888

Again, you may have to choose a different number if 8888 turns out to be taken by someone else.

% env http_proxy=http://speakeasy-mit-ron.lcs.mit.edu:3128/ lynx -source http://anguish.lcs.mit.edu:8888

Needless to say, you should replace 8888 by whatever port you chose to run nc on. nc will show the following request:

% nc -lp 8888
GET / HTTP/1.0
Accept: text/html, text/plain, audio/x-pn-realaudio, audio/vnd.rn-realaudio, application/smil, text/vnd.rn-realtext, video/vnd.rn-realvideo, image/vnd.rn-realflash, application/x-shockwave-flash2-preview, application/sdp, application/x-sdp
Accept: application/vnd.rn-realmedia, image/vnd.rn-realpix, audio/wav, audio/x-wav, audio/x-pn-wav, audio/x-pn-windows-acm, audio/basic, audio/x-pn-au, audio/aiff, audio/x-aiff, audio/x-pn-aiff, text/sgml, video/mpeg, image/jpeg, image/tiff
Accept: image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.4rel.1 libwww-FM/2.14
Host: anguish.lcs.mit.edu:8888
X-RAN-Loopstop: true
X-RAN-Loopstop: true
Via: 1.0 speakeasy.ron.lcs.mit.edu:3128 (squid/2.5.STABLE2), 1.0 speakeasy.ron.lcs.mit.edu:3148 (squid/2.5.STABLE4), 1.0 nyu.ron.lcs.mit.edu:3128 (squid/2.5.STABLE4)
X-Forwarded-For: 18.26.4.9, 127.0.0.1, unknown
Cache-Control: max-age=259200
Connection: keep-alive

Notice how the web proxy stripped away the http://anguish.lcs.mit.edu:8888 part from the request!

Your web proxy

Your web proxy will have to translate the requests that the client makes (the ones that start with "GET http://machinename") into requests that the server understands. So much for the bad news. The good news is that we provide you with some helpful code that will make this very easy to do.
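As a rough illustration of that translation, the sketch below builds the request a web server expects from an already-parsed method and path. It assumes those two strings come from a parser such as the provided httpreq; `origin_request` is a hypothetical name. It deliberately forwards no headers, which this lab permits. A production proxy would pass most of them along.

```cpp
#include <string>

// Builds the server-side request from the pieces of a proxy-style request,
// e.g. ("GET", "/index.html") becomes "GET /index.html HTTP/1.0" followed
// by a blank line. The protocol and machine name are dropped, as in the
// nc capture shown earlier.
std::string origin_request(const std::string& method, const std::string& path) {
    // An empty path means the URL had nothing after the machine name.
    const std::string p = path.empty() ? std::string("/") : path;
    // The second CRLF is the blank line that ends the header section.
    return method + " " + p + " HTTP/1.0\r\n\r\n";
}
```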

Your web proxy will listen on a port other than port 80, so as to avoid conflicts with regular web servers.
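Setting up the listening socket might look like the sketch below. It follows the error policy from the design requirements (a failed socket(), bind(), or listen() is non-recoverable: print to standard error and exit(1)). The function name `listen_or_die`, the backlog of 5, and the use of SO_REUSEADDR are choices made for this illustration, not requirements of the lab.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Creates a TCP socket listening on `port` (host byte order) on all local
// interfaces and returns its file descriptor. Port 0 asks the kernel to
// pick any free port. Failures are fatal: message on stderr, then exit(1).
int listen_or_die(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        exit(1);
    }

    // Lets the proxy be restarted quickly on the same port.
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("bind");
        exit(1);
    }
    if (listen(fd, 5) < 0) {
        perror("listen");
        exit(1);
    }
    return fd;
}
```

The returned descriptor is what the proxy would pass to accept() in its main loop, handling one connection to completion before accepting the next.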

Once the request line has been received, the web proxy should continue reading the input from the client until it encounters a blank line. The proxy should then fetch the URL from the appropriate server, forward the response back to the client, and close the connection. The proxy should forward response data as it arrives, rather than buffering the entire response; this allows the proxy to handle huge responses without running out of memory.
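Forwarding data as it arrives can be done with a single fixed-size buffer, so memory use stays constant however large the response is. The following is a minimal sketch using plain read() and write(); the name `relay` and the 8192-byte buffer size are arbitrary choices for this example.

```cpp
#include <errno.h>
#include <unistd.h>

// Copies bytes from descriptor `from` to descriptor `to` until end-of-file
// on `from`. Returns the number of bytes forwarded, or -1 on a read or
// write error.
long long relay(int from, int to) {
    char buf[8192];
    long long total = 0;
    for (;;) {
        ssize_t n = read(from, buf, sizeof(buf));
        if (n == 0) return total;          // peer closed: response complete
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal, retry
            return -1;
        }
        // write() may accept fewer bytes than requested; loop until the
        // whole chunk has been handed to the kernel.
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(to, buf + off, static_cast<size_t>(n - off));
            if (w < 0) {
                if (errno == EINTR) continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

When `to` is a socket whose peer has already gone away, write() raises SIGPIPE, which terminates the process by default. A proxy using a loop like this would normally ignore SIGPIPE so that the failure shows up as a -1 return instead.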

Your web proxy has to support the GET method only [3]. A GET method takes two arguments: the file to be retrieved and the HTTP version. Additional headers may follow the request.

Getting Started

We have provided a skeleton webproxy directory. It is available at http://pdos.lcs.mit.edu/6.824/labs/webproxy1.tar.gz. The following sequence of commands should yield a compiled version of the server you should extend to pass the tests.

% wget http://pdos.lcs.mit.edu/6.824/labs/webproxy1.tar.gz
% tar xzvf webproxy1.tar.gz
% cd webproxy1
% gmake

The tarball contains http.C, http.h, Makefile, webproxy1.C and webproxy1-test.C. The first two files will help you parse HTTP requests. The Makefile is, as its meaningful name implies, a Makefile. webproxy1.C is a pretty useless web server that, nonetheless, should help you on your way. webproxy1-test.C is our testing program which checks your program for correctness.

http.C and http.h: an HTTP parser

We have provided a parser for proxy-style HTTP requests. It is implemented in the files http.C and http.h that are included in the tarball.

http.h defines the class httpreq that inherits from the class httpparse. (If you are unfamiliar with C++ inheritance, consult the Stroustrup C++ language guide referenced in the course information page. Don't drop this book on someone's face; it's a pretty hefty book.)

To parse a request, first create an httpreq object. Then, parse the (potentially incomplete) HTTP request by feeding it to int parse (char *buf, ssize_t len) until it returns 1, indicating that the headers are complete. buf should be the buffer that contains the (potentially incomplete) HTTP request. len is the length of the HTTP request fragment in buf. Notice that parse needs to see the whole request you have read so far.

parse returns 1 if the HTTP request is complete, 0 if it needs more data to complete, or -1 on a parse error. parse does not modify the contents of buf. Once parse returns 1, you can call (amongst others) the following methods on the httpreq object:

char* method()    The type of request (POST, GET, HEAD)
char* host()      The destination host
short port()      The destination port
char* path()      The filename part of the requested URL
char* url()       The requested URL

Here's a simple program that illustrates the use of httpreq.

// standard headers for printf, exit, and strcpy/strcat/strlen
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "http.h"

int
main ()
{
  httpreq *r = new httpreq ();
  char buf[512];
  int ret;

  // incomplete header
  strcpy (buf, "GET http://web.mit.edu/index.html");
  ret = r->parse (buf, strlen (buf));
  printf ("ret %d file %s\n", ret, ret == 1 ? r->path () : "(none)");

  // complete header
  strcat (buf, " HTTP/1.0\r\n\r\n");
  ret = r->parse (buf, strlen (buf));
  printf ("ret %d file %s\n", ret, ret == 1 ? r->path () : "(none)");

  delete r;
  exit (0);
}
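In a real proxy the request arrives in pieces from read(), so the same pattern becomes "append the new chunk, then re-parse everything read so far". The sketch below shows that shape. Because http.h is not reproduced here, it uses `FakeParser`, a stand-in invented for this example that only mimics the 1 / 0 / -1 return convention described above. With the lab code you would pass an httpreq instead, assuming its parse() behaves as documented.

```cpp
#include <sys/types.h>  // ssize_t
#include <algorithm>
#include <string>

// Stand-in for httpreq so this sketch compiles on its own. It accepts only
// text that starts like "GET " and reports completion at the blank line.
struct FakeParser {
    int parse(char* buf, ssize_t len) {
        const std::string s(buf, static_cast<std::size_t>(len));
        const std::string get = "GET ";
        const std::size_t k = std::min(s.size(), get.size());
        if (s.compare(0, k, get, 0, k) != 0) return -1;  // parse error
        return s.find("\r\n\r\n") != std::string::npos ? 1 : 0;
    }
};

// Appends a freshly read chunk to `sofar` and re-parses the whole buffer,
// since parse() needs to see the entire request read so far, not just the
// newest chunk. Returns whatever parse() returns.
template <class Parser>
int feed(Parser& parser, std::string& sofar, const char* chunk, std::size_t n) {
    sofar.append(chunk, n);
    return parser.parse(&sofar[0], static_cast<ssize_t>(sofar.size()));
}
```

A connection handler would call `feed()` after every successful read(): keep reading on 0, fetch the URL on 1, and close the connection on -1 or when read() reports that the client went away.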

Documentation

You may want to read Using TCP through sockets to learn about socket programming in C/C++. Also, take a look at the references at the bottom of this page.

Running and testing the proxy

Your proxy program should take exactly one argument, a port number on which to listen. For example, to run the proxy on port 2000:

% ./webproxy1 2000
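Validating that single argument is worth a few lines, since a bad port should produce a clear error rather than a confusing bind() failure. This is a sketch only: `parse_port` is a hypothetical helper, and rejecting 0 and anything above 65535 is a choice made here, based on the valid range of TCP port numbers.

```cpp
#include <errno.h>
#include <stdlib.h>

// Parses a command-line port argument written in decimal.
// Returns the port (1 to 65535), or -1 if the text is empty, has trailing
// junk, or is out of range. Like strtol itself, it tolerates leading
// whitespace and a leading '+'.
int parse_port(const char* arg) {
    if (arg == nullptr || *arg == '\0') return -1;
    errno = 0;
    char* end = nullptr;
    const long v = strtol(arg, &end, 10);
    if (errno != 0 || *end != '\0') return -1;
    if (v < 1 || v > 65535) return -1;
    return static_cast<int>(v);
}
```

In main(), a proxy would check that argc is exactly 2, call `parse_port(argv[1])`, and print a usage message to standard error before exiting if the result is -1.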

As a first test of the proxy you should attempt to use it to browse the web. Set up your web browser to use one of the class machines running your proxy as a proxy and experiment with a variety of different pages.

When you think your proxy is ready, you can run it against our test program, webproxy1-test. Run the tester with your proxy as an argument:

% ./webproxy1-test ./webproxy1

Note that this may take several minutes to complete. The test program runs the following tests:

Ordinary fetch

This test is the "normal case". We send a normal HTTP 1.0 GET request and expect the correct web page.

Split request

This test splits the HTTP request into two chunks. The first chunk contains a partial HTTP request. The second chunk completes the first, after which the tester expects the correct web page contents to come back.

Large request

The tester does a request of exactly 65535 bytes.

Large response

The tester fetches a web page larger than the maximum amount of memory available to your web proxy.

Zero-size response

The tester fetches a web page without a body.

Recover after bad connect

The tester sends a request with a URL that specifies a false port. Your proxy will attempt to make a connection to a bogus port. Soon thereafter, the tester tries to fetch a valid page to see if your proxy is still doing ok.

Malformed request

The tester sends an HTTP request that is not syntactically correct. After that, it tries to fetch a valid page to see if your proxy is still doing ok.

Premature client close()

The tester sends a partial HTTP request and then closes the connection. After that, it tries to fetch a valid page to see if your proxy is still doing ok.

Infinitely long request

The tester swamps your proxy with a request larger than 65535 bytes. The tester expects your proxy to close the connection. After that, it tries to fetch a valid page to see if your proxy is still doing ok.

Stress test

The tester stress tests your web proxy with a ruthless combination of ordinary fetches, split requests, malformed requests, and large responses. This may expose memory leaks, unclosed connections, and random other bugs.

Collaboration policy

You must write all the code you hand in for the programming assignments, except for code that we give you as part of the assignment. You are not allowed to look at anyone else's solution (and you're not allowed to look at solutions from previous years). You may discuss the assignments with other students, but you may not look at or copy each other's code.

Handin procedure

You should hand in a gzipped tarball webproxy1-handin.tgz produced by gmake dist. Copy this file to ~/handin/webproxy1-handin.tgz. Do not make this file world readable! We will use the first copy of the file that we can find after the deadline; we try every few minutes. Don't bother to copy a new version over the old one hoping that we will use it instead. We won't.

References

[1] Apache Web Proxy, http://www.apache.org/docs/mod/mod_proxy.html.
[2] T. Berners-Lee, et al. RFC 1945: Hypertext Transfer Protocol - HTTP/1.0, May 1996.
[3] CERN Web Proxy, http://www.w3.org/Daemon/User/Proxies/Proxies.html.
[4] Netcat, http://www.atstake.com/research/tools/nc110.txt.