Web Services using a TCP proxy server (Part 2 of 3)
Last edited 2003-04-13 by Jay Nelson

This tutorial demonstrates a simple, but non-trivial, example of a TCP proxy server using the OTP construct gen_server. The goal is to implement a process that can be used to deliver web services to HTML browsers. The source code is here.

This tutorial contains three parts:

  1. A generic TCP proxy server based on gen_server
  2. How to implement the ProxyModule interface
  3. A Web Service using the TCP proxy server

Please report all errors, omissions, or improvements to the author.

  1. Description of a proxy server
    1. The proxy server interface
    2. The gen_server behaviour
    3. Architecting a server design
    4. Implementing gen_server callbacks
    5. Implementing the Accept Connection process
  2. Implementing the ProxyModule interface
    1. ProxyModule explained
    2. Testing with an echo proxy server
    3. Echoing a browser request
    4. A proxy server to block websites
  3. Deploying Web Services
    1. Recognizing requests
    2. Extracting from web pages
      1. Synonym dictionary
      2. Pattern matching dictionary
      3. Zoning a page
      4. Pulling content from a zone
    3. Reconstructing web pages

2.1 ProxyModule explained

In the first part of this tutorial we defined a tcp_proxy server which made calls at strategic points to a ProxyModule. The ProxyModule was not defined because it is the task of the application writer to provide an appropriate ProxyModule that fits within the protocol defined by the tcp_proxy server.

The ProxyModule is expected to implement the following functions:

init() -> {ok, State} | {error, Reason}
The purpose of this function is to perform any initial setup that is required to support the other functions in the interface. This function might create an ets table, allocate a memory structure or read from a database or disk file. The function is called by tcp_proxy:init when the server is started.

terminate(State) -> Result ignored
This function cleans up any data structures, processes or other aspects of the ProxyModule that will not automatically be taken care of when the process ends.

server_busy(Socket) -> Result ignored
If the tcp_proxy server is currently handling too many clients, the accept connection process will call this function. The ProxyModule is supposed to be 'nice': it should issue a message to the client quickly and then return. The purpose of this function is to allow, for example, a webserver proxy module to return a server busy page to the client so that the refusal isn't confused with a network problem.

react_to(Server, Socket, BinData) -> ok
The client request is read from the Socket by the tcp_proxy server in the tcp_proxy:relay/2 function. This allows the contents of the request to be used in determining the appropriate ProxyModule to handle the request. The contents of the initial request and the still open Socket are passed to this function of the ProxyModule for a response. The ProxyModule will be running in a new process that is separate from the tcp_proxy server and can act on the Socket any way it wishes, knowing that it is the sole owner of the Socket.

ProxyModule:Request(From, State, Data) -> {reply, ok, NewState} | {error, Reason, NewState}
This is a catch all that lets the ProxyModule do anything it wants. It is invoked by calling tcp_proxy:handle_request(Server, Request, Data). The final argument can be any parameter or data structure, while Request is the name of the function to be called. This function may be called from the initiating application which called tcp_proxy:start_link or it can be called from the ProxyModule:react_to/3 function. The key features of using this function are: 1) it is a synchronous call to the tcp_proxy gen_server process and 2) it is the only way to gain access and/or to modify the State that was created by ProxyModule:init/0.
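
As a rough illustration of how the five functions fit together, here is a minimal sketch of a ProxyModule skeleton. The module name hitcount and the request function bump/3 are hypothetical and exist only for this example. The return shapes follow the signatures listed above; how tcp_proxy reacts to each value is defined in Part 1 and is assumed here, not verified.

```erlang
-module(hitcount).
-export([init/0, terminate/1, server_busy/1, react_to/3, bump/3]).

%% The State is a simple counter of handled requests.
init() ->
    {ok, 0}.

%% Nothing to clean up: the counter lives only in the server State.
terminate(_State) ->
    ok.

server_busy(Socket) ->
    gen_tcp:send(Socket, <<"Busy, try again later.\n">>),
    ok.

%% Runs in its own process, which is the sole owner of Socket.
react_to(Server, Socket, _BinData) ->
    %% Assumed call shape, taken from the description above:
    %% tcp_proxy:handle_request(Server, Request, Data).
    tcp_proxy:handle_request(Server, bump, 1),
    gen_tcp:send(Socket, <<"Counted.\n">>),
    gen_tcp:close(Socket),
    ok.

%% Hypothetical request function. It is invoked inside the tcp_proxy
%% gen_server process, so it may read and replace the State.
bump(_From, Count, N) when is_integer(N) ->
    {reply, ok, Count + N};
bump(_From, Count, _Other) ->
    {error, bad_increment, Count}.
```

Because bump/3 is the only function that receives the State, it is the only place the counter can change, which matches the second key feature described above.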

2.2 Testing with an echo proxy server

To test the tcp_proxy server, we start with the simple echo server shown in Listing 1, which is accessed via telnet. The echo server just repeats back what it receives. It is not very useful, but it is a simple way to test whether processes are spawned properly.

When the echo server starts, nothing needs to be done. When it receives a message to react to, it just writes "You said: " followed by the original message back to the Socket that it received the message from. Then it closes the Socket and ends the process. Since initialization did nothing, terminate need do nothing. We will see a case later where the initialization data is important to subsequent message handling.

Listing 1. echo module implementation.
-module(echo).

-export([init/0, terminate/1, server_busy/1, react_to/3]).

init() ->
    {ok, []}.

terminate(State) ->
    ok.

server_busy(Socket) ->
    gen_tcp:send(Socket, <<"The Server is too busy right now.\n">>),
    ok.

react_to(Server, Socket, BinData) ->
    gen_tcp:send(Socket, [<<"You said: ">>, BinData]),
    gen_tcp:close(Socket),
    ok.

To test out the server, compile the tcp_proxy module in an Erlang shell and then compile the echo module. Start the toolbar application as shown in Listing 2. When the toolbar window appears, click on the process monitor icon to show a listing of all the current processes and click on the checkbox that hides the system processes.

Start the tcp_proxy server running with the ProxyModule of echo as shown at prompt 4. echo will appear as a gen_server in the process monitor and the process id will match what was returned on the shell command line when the server was started. The spawned accept listener will be listed as well.

Next open four shell windows. Type telnet localhost 8000 in each one to make a connection to the tcp_proxy server. Each time a connection is made, a new accept process should appear in the process monitor, and the old one should change to be a recv process. Since tcp_proxy is configured for only three simultaneous connections, the fourth attempt will report that the server is too busy and will refuse the connection. Under normal circumstances the limit can be set much higher, but for now it is useful for testing purposes to use a low number.
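
If telnet is not available, the same check can be made from a second Erlang shell. The following is a minimal sketch of such a client; the module name echo_client and the function say/3 are hypothetical, and the five second timeouts are arbitrary choices. It assumes the server closes the connection after replying, as the echo module in Listing 1 does.

```erlang
-module(echo_client).
-export([say/3]).

%% Connect, send one message, and return whatever the server writes
%% before it closes the connection.
say(Host, Port, Msg) ->
    Opts = [binary, {packet, 0}, {active, false}],
    case gen_tcp:connect(Host, Port, Opts, 5000) of
        {ok, Socket} ->
            %% The send result is ignored on purpose: a busy server
            %% may already have closed its end of the connection.
            _ = gen_tcp:send(Socket, Msg),
            Reply = read_all(Socket, []),
            gen_tcp:close(Socket),
            {ok, Reply};
        {error, Reason} ->
            {error, Reason}
    end.

%% Collect data until the peer closes the socket or 5 seconds pass.
read_all(Socket, Acc) ->
    case gen_tcp:recv(Socket, 0, 5000) of
        {ok, Data} -> read_all(Socket, [Data | Acc]);
        {error, _} -> iolist_to_binary(lists:reverse(Acc))
    end.
```

A call such as echo_client:say("localhost", 8000, <<"hello\n">>) sends the message immediately, so unlike an idle telnet session it does not hold a connection open while you inspect the process monitor.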

After the four telnet sessions have been attempted, request a list of the currently attached clients. As shown below, the counts for total_requests, server_busy and active_clients will reflect the activity that has taken place so far. The clients property contains a list of the Monitor reference and Process id for each of the three telnet sessions that succeeded. Next type 'hello' into each of the three remaining telnet sessions. The attached process will echo back what you typed and then will terminate. Requesting a report of the clients shows none attached.

To exit the server, request it to stop as shown in the final entry of Listing 2. Two reports appear: an error report that reflects the termination of the tcp_proxy server, and an info report that reflects the failure of the accept process that was waiting for a connection.

Listing 2. Creating and interacting with the echo server process.
Erlang (BEAM) emulator version 5.2 [source] [hipe]

Eshell V5.2  (abort with ^G)
1> c(tcp_proxy).
{ok,tcp_proxy}

2> c(echo). 
{ok,echo}

3> toolbar:start().
<0.40.0>

4> {ok, P1} = tcp_proxy:start_link(echo).
{ok,<0.53.0>}

5> tcp_proxy:report_clients(P1).
{ok,[{proxy_module,echo},
     {active_clients,3},
     {clients,[{#Ref<0.0.0.486>,<0.56.0>},
               {#Ref<0.0.0.488>,<0.57.0>},
               {#Ref<0.0.0.484>,<0.54.0>}]},
     {max_active_clients,3},
     {total_requests,4},
     {server_busy,1},
     {accept_failures,0}]}

6> tcp_proxy:report_clients(P1).
{ok,[{proxy_module,echo},
     {active_clients,0},
     {clients,[]},
     {max_active_clients,3},
     {total_requests,4},
     {server_busy,1},
     {accept_failures,0}]}

7> tcp_proxy:stop(P1).
** exited: requested **

=ERROR REPORT==== 22-Mar-2003::22:47:21 ===
** Generic server echo terminating 
** Last message in was stop
** When Server state == {tp_state,[binary,
                                   {packet,0},
                                   {active,false},
                                   {reuseaddr,true}],
                                  8000,
                                  #Port<0.101>,
                                  <0.59.0>,
                                  0,
                                  echo_tcp_proxy_clients,
                                  0,
                                  3,
                                  4,
                                  1,
                                  echo,
                                  []}
** Reason for termination == 
** requested
8>
=INFO REPORT==== 22-Mar-2003::22:47:21 ===
    "gen_tcp:accept": {error,closed}

2.3 Echoing a browser request

Ultimately, we want clients to connect using a normal browser. Listing 3 shows the echoweb module. This module is a little more useful because it shows the information your browser sends to a website when it requests an HTML page. The code that is listed depends on a module of binary utilities and a module of web utilities, as well as a header file for web page output.

Since user requests are read as binaries, and because there are few examples of binary manipulation utilities, this example parses the HTTP request as a binary rather than converting it to a list. The main advantage is that the results are going to be fed back to the open socket, and sending a binary is a little more efficient than sending a list. Searching and parsing binaries can also be more efficient, depending on how they are manipulated. Note that using a binary does not automatically guarantee better performance, either in speed or in memory footprint. Always benchmark if you are tuning for performance, but never tune for performance until the code is correct.
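
To give a feel for parsing a request without leaving the binary representation, here is a small sketch that splits off the request line. It is not the web_utils code used by the tutorial. The module name reqline is hypothetical, and the sketch assumes an Erlang/OTP release that provides the binary module, which is newer than the release shown in Listing 2.

```erlang
-module(reqline).
-export([parse/1]).

%% Split "GET <url> HTTP/1.0\r\n<headers>" into the method, the URL
%% and the remaining header block, all still as binaries.
parse(Request) when is_binary(Request) ->
    case binary:split(Request, <<"\r\n">>) of
        [Line, Rest] ->
            case binary:split(Line, <<" ">>, [global]) of
                [Method, URL, _Version] -> {ok, Method, URL, Rest};
                _ -> {error, bad_request_line}
            end;
        [_NoLineEnd] ->
            %% No CRLF yet: the request line has not fully arrived.
            {error, incomplete}
    end.
```

The pieces returned by binary:split/2,3 can be passed straight to gen_tcp:send/2, so nothing needs to be converted to a list on the way back to the socket.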

Listing 3. echoweb displays browser requests.
-module(echoweb).

-export([init/0, terminate/1, server_busy/1, react_to/3]).
-include("web_utils.hrl").

init() ->
    {ok, []}.

terminate(State) ->
    ok.

% When the server is busy, send a real HTML page.
server_busy(Socket) ->
    gen_tcp:send(Socket, ?SERVER_BUSY),
    ok.

% GET request
react_to(Server, Socket, Data = <<"GET ", Rest/binary>>) ->

    % Parse the data in the header
    {URL, Hdrs} = web_utils:parse_get_request(Rest),
    Host = web_utils:get_matching_vals(<<"Host">>, Hdrs),

    % Echo it back to the browser
    gen_tcp:send(Socket,
		 concat_binary([<<"HTTP/1.0 200 Ok\r\n">>,
				<<"Connection: close\r\n">>,
				<<"Content-Type: text/plain\r\n\r\n">>,
				Data,
				<<"\r\n\r\n">>,
				<<"-------- Parsed Results -------\r\n">>,
				<<"URL: ">>, URL, <<"\r\n">>,
				Host])),
    gen_tcp:close(Socket),
    ok;

% POST request
react_to(Server, Socket, Data = <<"POST ", Rest/binary>>) ->

    % Parse the data in the header
    {URL, Hdrs, FormAttrs} = web_utils:parse_post_request(Rest),
    Accept = web_utils:get_matching_attrs(<<"Accept">>, Hdrs),
    Attrs = [concat_binary([Param, <<" -> ">>, Value, <<"\r\n">> ])
	     || [Param,Value] <- FormAttrs],

    % Echo it back to the browser
    gen_tcp:send(Socket,
		 concat_binary([<<"HTTP/1.0 200 Ok\r\n">>,
				<<"Connection: close\r\n">>,
				<<"Content-Type: text/plain\r\n\r\n">>,
				Data,
				<<"\r\n\r\n">>,
				<<"-------- Parsed Results -------\r\n">>,
				<<"URL: ">>, URL, <<"\r\n">>,
				Accept, <<"\r\n">>,
				Attrs])),
    gen_tcp:close(Socket),
    ok;

% All others are considered Bad Requests (although some should be valid).
react_to(Server, Socket, Other) ->
    gen_tcp:send(Socket, ?BAD_REQUEST),
    gen_tcp:close(Socket),
    {error, bad_request, Other}.

The code in this module is very similar to the echo module, except that additional header information needs to precede the returned data so that the browser will display the results as text. The header data starts with "HTTP" and ends with the "Content-Type" clause followed by two carriage return / newline pairs.

To try out the code, compile the new modules in the Erlang shell and then start the tcp_proxy server using the command tcp_proxy:start_link(echoweb). Then go into your browser configuration and change it from using a modem or directly connected LAN to using a proxy server. Specify the HTTP proxy server as "localhost" port "8000" (in Mozilla this is found under Edit -> Preferences... Advanced, Proxies, and in a similar location on the preferences or configuration menu for other browsers). Now type www.yahoo.com into the address box of the browser and the tcp_proxy server will respond with the information that your browser sent in its request.
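
The header block that Listing 3 sends before the echoed data can be factored into one small helper. This is only a sketch; the module name plain_reply and the function build/1 are hypothetical. It returns a nested list rather than one concatenated binary, on the assumption that the caller passes the result directly to gen_tcp:send/2, which accepts such lists (Listing 1 already sends one).

```erlang
-module(plain_reply).
-export([build/1]).

%% Wrap Body in the minimal header block used in Listing 3: a status
%% line, a Connection header, and a Content-Type header followed by
%% the blank line that ends the headers.
build(Body) ->
    [<<"HTTP/1.0 200 Ok\r\n">>,
     <<"Connection: close\r\n">>,
     <<"Content-Type: text/plain\r\n\r\n">>,
     Body].
```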

To test out a POST request, turn off the proxy option in the browser and visit www.google.com, click on "Advertise With Us" and then click on "contact us today". Now turn the proxy preference back on, fill out the form and hit the submit button. You will receive the full request followed by the parsed parameters and attributes that were submitted in the body of the form. As you can see this utility is a handy tool for debugging browser requests.

2.4 A proxy server to block websites

The last simple example is an application called WebCensor which blocks naughty websites from being displayed in the user's browser. The server loads an external file containing website URLs that are to be blocked. The URLs are stored in an ordered ets table for easy access when verifying the user's browser requests. In addition, another program may call the server and add new URLs to the list to be blocked by using tcp_proxy:handle_request(Server, add_url, URL). When the server shuts down, it writes the new set of URLs to the external disk file so that when the server is restarted it will contain the current full set of URLs. Since the ets table is ordered, the URLs will be in alphabetical order in the external file.

Listing 4 contains the code for initialization and termination of the WebCensor ProxyModule. This time we have to use the init/0 function to create the ets table and load the external file of URLs. The function add_url/3 is used not only in initialization but also as an external server function which allows other software to add URLs to the ets table. When the server terminates, it runs over the ets table and writes all the URLs to disk as one big binary block. Note that the combination of a list comprehension and binaries provides a succinct way of handling large blocks of I/O data (although appending to a binary will result in a copy being created).

Listing 4. Initialization and termination of webcensor.
-module(webcensor).

-export([init/0, add_url/3, terminate/1, server_busy/1, react_to/3]).
-include("web_utils.hrl").
-define(BLOCK_LIST_FILE, "webcensor.block").
-define(BLOCK_LIST_TABLE, 'WebCensor').

init() ->
    Table = ets:new(?BLOCK_LIST_TABLE, [ordered_set, named_table]),
    URLs = file:read_file(?BLOCK_LIST_FILE),
    case URLs of
	{ok, Binary} ->
	    Sites = bin_utils:split_lines(Binary),
	    io:format("~w blocked websites loaded~n", [length(Sites)]),
	    load(Table, Sites);
	Other ->
	    ok
    end,
    {ok, Table}.

% Load the table of URLs to block
load(Table, [URL | More]) ->
    add_url(init, Table, URL),
    load(Table, More);
load(Table, []) ->
    ok.

% Used by load and as an external server call
add_url(From, URL) ->
    add_url(From, ?BLOCK_LIST_TABLE, URL).

add_url(From, Table, URL) when list(URL) ->
    add_url(From, Table, list_to_binary(URL));
add_url(_From, Table, URL) when binary(URL) ->
    case URL of
	<<"http://", _Rest/binary>> ->
	    ets:insert(Table, {URL});
	_Other ->
	    ets:insert(Table, {<<"http://", URL/binary>>})
    end,
    {ok, Table}.

% Write the table to disk in case new URLs were added    
terminate(Table) ->
    AllURLs = [<<URL/binary, "\n">> || {URL} <- ets:tab2list(Table)],
    file:write_file(?BLOCK_LIST_FILE, concat_binary(AllURLs)),
    ok.
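
The terminate/1 function in Listing 4 concatenates every URL into one large binary before writing it. One way to sidestep the copying mentioned earlier is to hand the file module a nested list instead. The following is a sketch of that alternative, not the tutorial's code; the module name blocklist_io and the function save/2 are hypothetical.

```erlang
-module(blocklist_io).
-export([save/2]).

%% Write every {URL} row of Table to File, one URL per line.
%% The nested list is passed to file:write_file/2 as it is, so this
%% module never builds a concatenated binary itself.
save(Table, File) ->
    Lines = [[URL, $\n] || {URL} <- ets:tab2list(Table)],
    file:write_file(File, Lines).
```

Whether this is measurably faster for a block list of realistic size is something to benchmark rather than assume.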

The react_to/3 function shown in Listing 5 parses the request, checks whether the URL should be blocked and then either delivers a blocked message page or retrieves the webpage for the URL. Note that because the browser request is parsed, it is a simple matter of modifying the request by trimming out the cookie or spoofing a different browser type to create other useful tcp_proxy tools.

Listing 5. Responding to client requests in webcensor.
% Retrieve a GET page if it is not blocked
react_to(Server, Client, Data = <<"GET ", Rest/binary>>) ->
    {URL, Hdrs} = web_utils:parse_get_request(Rest),

    %% Uncomment this line to see what pages are requested
    %% io:format("~s~n", [binary_to_list(URL)]),

    case ets:member(?BLOCK_LIST_TABLE, URL) of
	true ->
	    gen_tcp:send(Client, ?ACCESS_BLOCKED);
	false ->
	    [Host] = web_utils:get_matching_vals(<<"Host">>, Hdrs),
	    get_html_page(Client, URL, Host, Hdrs, Data)
    end,
    gen_tcp:close(Client),
    ok;

% Retrieve a POST page if it is not blocked
react_to(Server, Client, Data = <<"POST ", Rest/binary>>) ->
    {URL, Hdrs, FormAttrs} = web_utils:parse_post_request(Rest),
    case ets:member(?BLOCK_LIST_TABLE, URL) of
	true ->
	    gen_tcp:send(Client, ?ACCESS_BLOCKED);
	false ->
	    [Host] = web_utils:get_matching_vals(<<"Host">>, Hdrs),
	    post_html_page(Client, URL, Host, Hdrs, FormAttrs, Data)
    end,
    gen_tcp:close(Client),
    ok;
   
% All others are considered Bad Requests (although some should be valid).
react_to(Server, Client, Other) ->
    gen_tcp:send(Client, ?BAD_REQUEST),
    gen_tcp:close(Client),
    {error, {bad_request, Other}}.

% When the server is busy, send a real HTML page.
server_busy(Client) ->
    gen_tcp:send(Client, ?SERVER_BUSY),
    ok.
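
The remark before Listing 5 about trimming out the cookie can be sketched as a filter over the raw request. The module name strip_cookie is hypothetical, the sketch assumes an Erlang/OTP release that provides the binary module and lists:join/2, and it works on the raw request binary rather than on the parsed headers returned by web_utils.

```erlang
-module(strip_cookie).
-export([strip/1]).

%% Remove every "Cookie:" header line from a raw HTTP request.
%% Everything after the first blank line (the body) is left untouched.
strip(Request) when is_binary(Request) ->
    case binary:split(Request, <<"\r\n\r\n">>) of
        [Head, Body] ->
            iolist_to_binary([strip_head(Head), <<"\r\n\r\n">>, Body]);
        [Head] ->
            iolist_to_binary(strip_head(Head))
    end.

strip_head(Head) ->
    Lines = binary:split(Head, <<"\r\n">>, [global]),
    Kept = [Line || Line <- Lines, not is_cookie(Line)],
    lists:join(<<"\r\n">>, Kept).

%% Header names are case-insensitive, so compare a lowercased prefix.
is_cookie(<<Name:7/binary, _/binary>>) ->
    Lower = << <<(to_lower(C))>> || <<C>> <= Name >>,
    Lower =:= <<"cookie:">>;
is_cookie(_Short) ->
    false.

to_lower(C) when C >= $A, C =< $Z -> C + 32;
to_lower(C) -> C.
```

A react_to/3 clause could apply strip/1 to Data before forwarding it to the real webserver; how the forwarding itself is done is covered by the support functions in webcensor.erl.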

Study the support functions in webcensor.erl to understand how a connection to a webserver is made, read and parsed. It is left as an exercise to the reader to make the implementation less brittle. As it stands, a complete URL must be specified in the block list; however, there are several special cases, such as when the URL has no trailing "/" or when the page defaults to index.html. As an exercise, replace the ets implementation with a balanced tree that contains a node for each hostname and a balanced tree of blocked pages on that server. You will need to update init/0, add_url/3 and terminate/1 to maintain the database properly, and then update the react_to/3 code to search the database. As an alternative, you may find it easier to use mnesia to store the database of blocked URLs.
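
As a starting point for the exercise, here is one possible shape for the two-level structure, sketched with the standard gb_trees and gb_sets modules. The module name blocktree and all of its functions are hypothetical. The normalize/1 rules cover only the two special cases named above and are an assumption about what counts as the same page.

```erlang
-module(blocktree).
-export([new/0, add/3, is_blocked/3]).

%% Outer tree: hostname -> set of blocked paths on that host.
new() ->
    gb_trees:empty().

add(Host, Path, Tree) ->
    Paths = case gb_trees:lookup(Host, Tree) of
                {value, Set} -> Set;
                none -> gb_sets:empty()
            end,
    gb_trees:enter(Host, gb_sets:add(normalize(Path), Paths), Tree).

is_blocked(Host, Path, Tree) ->
    case gb_trees:lookup(Host, Tree) of
        {value, Set} -> gb_sets:is_member(normalize(Path), Set);
        none -> false
    end.

%% Treat "", "/" and "/index.html" as the same page. Real servers
%% may use other default pages, so extend this as needed.
normalize(<<>>) -> <<"/">>;
normalize(<<"/index.html">>) -> <<"/">>;
normalize(Path) -> Path.
```

Unlike the named ets table, this tree is an ordinary Erlang term, so it has to live in the ProxyModule State and be reached through tcp_proxy:handle_request/3 from the react_to/3 process.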

Continued in Part 3.