Saturday, March 20, 2010

HTTP Debug Proxy with Twisted

Motivation

Recently, I've been building a few distributed components for an upcoming project. The components talk to each other over XMLRPC. So far, the connectivity is PHP-Java, Java-PHP and Java-Java. On the Java side, I use the Apache XMLRPC library to build the clients and servers. The PHP side is basically Drupal's XMLRPC service.

Apache XMLRPC provides Java abstractions at both the client and the server ends, so a programmer only needs to work with Java objects. The library takes care of the generation and parsing of the XML request and response - while this is mostly very convenient, sometimes it is helpful for debugging to see the actual XML request and response. This is what initially prompted me to look for this kind of script, and ultimately build one.

Background

I initially tried netcat with tee, but couldn't make it work the way I wanted it to across both my CentOS and Mac OSX machines. To be honest, I didn't try too hard, because the nc/tee combination outputs to two separate files, and I wanted it in one single output.

There are actually two Python scripts which do about the same thing as the one I built. The HTTP/XMLRPC Debug proxy from myelin came closest to what I wanted, but I would have to hack it a bit to accommodate arbitrary source ports. Another proxy was Xavier Defrang's HTTP Debugging Proxy, which looked promising, but it's HTTP only, and I wanted to use it (in the future) for protocols other than HTTP.

One nice (but non-free) tool in this space is Charles. This would be a good model for someone looking to build an Eclipse plugin :-).

I started out building something with Python sockets based on Gordon McMillan's Python Socket Programming HOWTO, but gave up when I started having problems with blocking in send() and recv() - my knowledge of socket programming wasn't enough to follow him down the rabbit hole of select() calls.

I ultimately settled on using Twisted, based on a rather lively discussion which pointed me to Twisted in the first place. What I liked about Twisted is that it's very object oriented and feels almost like Java. A Twisted network component (client or server) is built using a protocol and a factory class, plus an optional "business logic" class. The components are not started directly; they are injected into the Twisted reactor object (similar to an IoC container) and the reactor is then started.

Twisted does have a steep learning curve, but Twisted Matrix Labs provides excellent documentation. There is also the O'Reilly book by Abe Fettig, which I tried to get but couldn't find at my local Borders bookstore. But the online documentation and tutorial are quite good, and you can actually figure Twisted out from there.

Architecture

What I wanted was something that will hook in between the client and the server, and print out the request and response on the console, as shown in the figure below. The dotted blue and red lines represent the normal request and response flows respectively. The idea is to repoint the client at the HTTP proxy, and have the proxy forward the request over to the server application.

As you can see, the HTTP proxy is actually a pipeline of a server component and a client component. The server just listens on the port, spawning off a new client-server pair for each incoming connection. Once the server gets the data, it starts a client that sends the data over to the target (server application), gets back the response, and hands it back to the server, which sends it back to the source (client application) and terminates the connection.
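
The same pipeline can be sketched without Twisted, using blocking sockets from the standard library. This is only an illustration of the data flow, assuming Python 3; relay_once is a made-up helper that handles exactly one connection and reads a single chunk in each direction.

```python
import socket


def relay_once(listener, target_addr, log=print):
    """Accept one client on `listener`, forward its request to `target_addr`
    and send the target's reply back to the client."""
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(65536)            # server side: read the request
        log("< %r" % request)
        with socket.create_connection(target_addr) as upstream:
            upstream.sendall(request)         # client side: forward it
            response = upstream.recv(65536)   # client side: read the reply
        log("> %r" % response)
        conn.sendall(response)                # server side: relay the reply
    return request, response
```

A blocking version like this can only serve one connection at a time, which is one reason to let an event-driven framework such as Twisted do the waiting.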

Here's the script to do this all.

#!/usr/local/bin/python
# $Id$
# $Source$
import getopt
import string
import sys

from twisted.internet import protocol
from twisted.internet import reactor

class ConsoleWriter(object):
  """ Write request (on source port) and response (from target host:port)
      to console. """

  def write(self, data, type):
    if (data):
      lines = data.split("\n")
      prefix = "<" if type == "request" else ">"
      for line in lines:
        sys.stdout.write("%s %s\n" % (prefix, line))
    else:
      sys.stdout.write("No response from server\n")


class DebugHttpClientProtocol(protocol.Protocol):
  """ Client protocol. Writes out the request to the target HTTP server.
      Response is written to stdout on receipt, and back to the server's
      transport when the client connection is lost. """

  def __init__(self, serverTransport):
    self.serverTransport = serverTransport

  def sendMessage(self, data):
    self.transport.write(data)
  
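  def dataReceived(self, data):
    # NOTE: only the first chunk is kept; a response that arrives in
    # several chunks is cut off because the connection is closed here.
    self.data = data
    ConsoleWriter().write(data, "response")
    self.transport.loseConnection()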

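  def connectionLost(self, reason):
    # self.data only exists once a response chunk has arrived
    data = getattr(self, "data", None)
    if data:
      self.serverTransport.write(data)
    else:
      ConsoleWriter().write(None, "response")
    self.serverTransport.loseConnection()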


class DebugHttpServerProtocol(protocol.Protocol):
  """ Server Protocol. Handles data received from client application.
      Writes the data to console, then creates a proxy client component
      and sends the data through, then terminates the client and server
      connections. """

  def dataReceived(self, data):
    self.data = data
    ConsoleWriter().write(self.data, "request")
    creator = protocol.ClientCreator(reactor, DebugHttpClientProtocol, self.transport)
    d = creator.connectTCP(self.factory.targetHost, self.factory.targetPort)
    # the Deferred fires with the connected DebugHttpClientProtocol instance
    d.addCallback(self.forwardToClient)

  def forwardToClient(self, client):
    client.sendMessage(self.data)


class DebugHttpServerFactory(protocol.ServerFactory):
  """ Server Factory. A holder for the protocol and for user-supplied args """

  protocol = DebugHttpServerProtocol

  def __init__(self, targetHost, targetPort):
    self.targetHost = targetHost
    self.targetPort = targetPort


def usage():
  sys.stdout.write("Usage: %s --help|--source port --target host:port\n"
    % (sys.argv[0]))
  sys.stdout.write("-h|--help: Show this message\n")
  sys.stdout.write("-s|--source: The port on the local host on which this \n")
  sys.stdout.write("             proxy listens\n")
  sys.stdout.write("-t|--target: The host:port which this proxy talks to\n")
  sys.stdout.write("Both -s and -t must be specified. There are no defaults.\n")
  sys.stdout.write("To use this proxy between client app A and server app B,\n")
  sys.stdout.write("point A at this proxy's source port, and point this\n")
  sys.stdout.write("proxy's target host:port at B. The request and response\n")
  sys.stdout.write("data flowing through A and B will be written to stdout for\n")
  sys.stdout.write("your visual pleasure.\n")
  sys.stdout.write("To stop the proxy, press CTRL+C\n")
  sys.exit(2)


def main():
  (opts, args) = getopt.getopt(sys.argv[1:], "s:t:h",
    ["source=", "target=", "help"])
  sourcePort, targetHost, targetPort = None, None, None
  for option, argval in opts:
    if (option in ("-h", "--help")):
      usage()
    if (option in ("-s", "--source")):
      sourcePort = int(argval)
    if (option in ("-t", "--target")):
      (targetHost, targetPort) = argval.split(":")
  # remember no defaults?
  if (not(sourcePort and targetHost and targetPort)):
    usage()
  # start twisted reactor
  reactor.listenTCP(sourcePort,
    DebugHttpServerFactory(targetHost, int(targetPort)))
  reactor.run()


if __name__ == "__main__":
  main()

The server is defined using the DebugHttpServerProtocol and DebugHttpServerFactory, and the client is defined using the DebugHttpClientProtocol. The ConsoleWriter just writes the formatted request and response data to the console.

When a request comes in from the client application, it is sent to DebugHttpServerProtocol.dataReceived, where the data is first written out to the console. A client object is then created using ClientCreator, which takes the DebugHttpClientProtocol and a reference to the server's transport object. The client then connects to the target host and port, and a callback is added so that the client relays the request over to the target server once it connects.

Once the client connects, the callback is triggered, which relays the request across to the target host. The response from the target host is captured by DebugHttpClientProtocol.dataReceived(), which writes the data to the console, then loses the connection. The connection lost event is captured by the connectionLost() method, which writes the response back to the caller and closes the connection.

Testing/Usage

To test the proxy, I started it up to listen on port 1234 and forward to my test Drupal instance running on port 80. I then repointed the service URL in my JUnit test from http://localhost/services/xmlrpc to http://localhost:1234/services/xmlrpc. The JUnit test sends a comment to Drupal's XMLRPC comment.save service.

sujit@cyclone:network$ ./httpspy.py -s 1234 -t localhost:80

I then ran my JUnit test from another console. On the console where I started httpspy.py, I saw the following (the "< " and "> " prefixes signify request and response respectively).

< POST /services/xmlrpc HTTP/1.1
< Content-Type: text/xml
< User-Agent: Apache XML RPC 3.0 (Jakarta Commons httpclient Transport)
< Host: localhost:1234
< Content-Length: 604
< 
< <?xml version="1.0" encoding="UTF-8"?><methodCall><methodName>comment.save
</methodName><params><param><value><struct><member><name>mail</name><value>f
oo@bar.com</value></member><member><name>subject</name><value>my stupid subj
ect (1269113982563)</value></member><member><name>nid</name><value><i4>8</i4
></value></member><member><name>name</name><value>godzilla</value></member><
member><name>comment</name><value>a test comment entry at 1269113982563 ms s
ince epoch</value></member><member><name>homepage</name><value>http://homesw
eethome.us</value></member></struct></value></param></params></methodCall>
<
<
> HTTP/1.1 200 OK
> Date: Sat, 20 Mar 2010 19:39:42 GMT
> Server: Apache/2.0.63 (Unix) PHP/5.2.11 DAV/2
> X-Powered-By: PHP/5.2.11
> Set-Cookie: SESS421aa90e079fa326b6494f812ad13e79=da32e0283f634a62937761a01c
0fb91d; expires=Mon, 12-Apr-2010 23:13:02 GMT; path=/
> Expires: Sun, 19 Nov 1978 05:00:00 GMT
> Last-Modified: Sat, 20 Mar 2010 19:39:42 GMT
> Cache-Control: store, no-cache, must-revalidate
> Cache-Control: post-check=0, pre-check=0
> Connection: close
> Content-Length: 142
> Content-Type: text/xml
> 
> <?xml version="1.0"?>
> 
> <methodResponse>
>   <params>
>   <param>
>     <value><string>10</string></value>
>   </param>
>   </params>
> </methodResponse>
> 
> 

As you can see, there is still a bit of work needed to beautify the raw request if it's XML (probably by parsing out the Content-Type header), but the script is usable right now, so that is something I will do in the future.
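
One way the beautification could work is sketched below. This is not code from the script: beautify is a hypothetical helper, written for Python 3, that assumes it is handed one complete HTTP message as text with CRLF line endings. It indents the body only when the Content-Type header mentions XML and otherwise returns the message unchanged.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def beautify(raw):
    """Return `raw` with an XML body re-indented; other messages unchanged."""
    head, sep, body = raw.partition("\r\n\r\n")
    content_type = ""
    for line in head.split("\r\n")[1:]:       # skip the request/status line
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            content_type = value.strip().lower()
    if "xml" not in content_type or not body.strip():
        return raw
    try:
        pretty = minidom.parseString(body.strip()).toprettyxml(indent="  ")
    except ExpatError:                        # malformed or partial XML
        return raw
    return head + sep + pretty
```

ConsoleWriter could call such a helper before splitting the data into lines, at the cost of no longer showing the bytes exactly as they went over the wire.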

Another use case I have tried is to put it between a client HTTP GET call and a Drupal webpage. I still need to test this stuff extensively through various use-cases - if I find bugs in the code, I will update it here. Meanwhile, if you find bugs or strange behavior (or even better, bugs with fixes :-)), I would really appreciate knowing.

Update 2010-04-04: The code here breaks down when (a) the request/response payload is large and/or (b) servers are slow. I am trying to fix the code, and will post the updated code when done.
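
A plausible reason for this is that Twisted calls dataReceived once per chunk that arrives, and a large or slow response can arrive in several chunks, while the script treats the first chunk as the whole message. The sketch below shows one possible way to collect chunks until the message is complete. It is an assumption on my part rather than the fix that was eventually posted: ResponseBuffer is a made-up class for Python 3, it relies on a Content-Length header, and it does not handle chunked transfer encoding.

```python
class ResponseBuffer:
    """Collects chunks until a whole HTTP message has been received."""

    def __init__(self):
        self.chunks = []

    def add(self, chunk):
        self.chunks.append(chunk)

    def data(self):
        return b"".join(self.chunks)

    def is_complete(self):
        """True once the headers plus Content-Length body bytes are buffered."""
        head, sep, body = self.data().partition(b"\r\n\r\n")
        if not sep:
            return False                      # headers not finished yet
        for line in head.split(b"\r\n")[1:]:  # skip the status line
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                return len(body) >= int(value.strip().decode("ascii"))
        return False  # no Content-Length: wait for the peer to close
```

A client protocol could call add() from every dataReceived, relay data() once is_complete() returns True, and fall back to relaying whatever has been collected in connectionLost.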

Update 2010-12: Because of its unreliability, I basically abandoned the script in favor of this less general solution. Mikedominice was kind enough to prod me into looking at this again. As a result, I ended up rewriting most of the script, and it seems to work quite well now. The updated code is posted above.

11 comments (moderated to prevent spam):

mikedominice said...

I've been trying to use your solution (great idea by the way!), but when I run your code as a proxy between a Client App and a Server App (as named in the diagram), I'm finding that the Client Application I'm using (either a web browser or wget) is simply receiving the request it sent as a response, and the request from the Server Application is getting lost. Any hints?

Sujit Pal said...

Thanks Mike. I looked at the code again (after a long time, had forgotten about this, thanks for the reminder), and ended up basically rewriting it. I have posted the updated code and made the necessary updates in the descriptions. Can you please try this out and see if it works for you?

Akash Talole said...

How do I use SSL for connecting? I wrote code for it, but it is not working: reactor.listenSSL(sourcePort, HttpServerFactory(targetHost, int(targetPort)), ssl.DefaultOpenSSLContextFactory('server.key', 'server.crt'))

Sujit Pal said...

Hi Akash, I haven't used SSL so not sure about this, sorry.

Akash Talole said...

Hi Sujit, I used SSL, but the problem is that the web page content is not loading; only the response headers are received by the proxy. I think we have to use a buffer for storing the data. When I removed self.transport.loseConnection() from the dataReceived function in the DebugHttpClientProtocol class it works, but if the page content is too large then the webpage does not load. Can you tell me how to use a buffer to store the data when it is received? This problem occurs for both HTTP and HTTPS connections.

Akash Talole said...

Hi Sujit, both HTTP and HTTPS connections are now working fine. I have changed the code of the DebugHttpClientProtocol class. Your code supports concurrent request handling. I tried the Apache benchmark tool to check this code and got great results. Thank you so much.

Sujit Pal said...

Hi Akash, that's very good news. If you would like to share the fix or have a pointer to your code on GitHub or other public resource, let me know and I will be happy to include it here so others can benefit from your fix also.

Akash Talole said...

Hi Sujit, here is the link for the modified code. https://github.com/akashtalole/Twisted-Proxy-.git One issue is the performance of the proxy. I tested the proxy using Apache benchmark; below is the comparison, with times in seconds:

Concurrency  Requests  Direct Webserver (s)  Only Proxy (s)
10           10000     44.292                90.14
20           10000     50.991                85.952
30           10000     46.928                100.852
40           10000     38.849                86.201
50           10000     89.179                106.521

Sujit Pal said...

Thank you for the link. Some overhead is expected for the proxy since it's doing two more hops.

Akash Talole said...

One reason for the proxy slowing down is that SSL connections split the response data across several calls to the DebugHttpClientProtocol dataReceived() method. So how can we store the response data in a buffer and then write it to the client? If this can be done, there would be much less overhead.

Sujit Pal said...

Well, if by buffering you mean storing the debug information in a buffer until it becomes full before printing it, then there is a risk that the behavior becomes "laggy". Given that the proxy is supposed to be for development (so you can see what's going over the wire), the current performance may be fine?