Wednesday, July 18, 2018

Today's Scripting - Extracting HTTP Performance Data from Wireshark with Python

I'm often asked what kind of data can be exported from Wireshark, especially when we're troubleshooting performance issues.  Most recently, someone said, "It would be great if we could suck this HTTP timing data into a spreadsheet." I'm not a big fan of spreadsheets, but I said to myself, "Hmm...can't be that difficult to do"...and sat down to write some code.

A traditional shell script with a few Linux/Unix utilities could do this (I did a LOT of awk and sed back in the day...), but I'm in the process of teaching myself Python, so I set out to do some snake charming.  I'm DEFINITELY a novice, so I'm sure that those more experienced with Python will offer improvements to my brute-force, trial-and-error code. Having said that...

The first step is to collect network packet data while also collecting the TLS session keys from the browser.  (If you haven't done that before, check out my brief video on the technique.)  For ease of use, we're going to name the packet capture files "testX.pcapng" and the matching TLS key logfile "testX-keys."
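For reference, the browsers do this via the SSLKEYLOGFILE environment variable: if it's set when Firefox/Chrome starts, the browser appends TLS session keys to that file as it negotiates sessions. Here's a minimal sketch (the browser launch is commented out and the filename matches our naming convention above; adapt to taste):

```python
import os
import subprocess

# Sketch: build an environment where SSLKEYLOGFILE points at our keyfile,
# so a browser launched with it will log TLS session keys there.
env = dict(os.environ, SSLKEYLOGFILE=os.path.abspath("testX-keys"))

# Then launch the browser with that environment, e.g.:
# subprocess.Popen(["firefox"], env=env)
```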

Now, we turn to Wireshark's command-line kin, tshark.  Since the timing information we need is computed by Wireshark (it isn't in the native packet data), we'll need to run a two-pass analysis in tshark.  So, our tshark command looks like this (I'll split lines for readability):

/usr/local/bin/tshark -2                    # 2 passes to pick up computed values
   -o tcp.check_checksum:FALSE              # tshark ignores packets that fail TCP checksum, so skip that check
   -o ssl.keylog_file:testX-keys            # here's our TLS session keyfile
   -Y "http.time || http.request.full_uri"  # find packets with these fields - requests have URI, responses have http.time
   -T fields                                # we want to output certain fields (specified with -e)
   -e frame.number                     # output frame number (from all packets)
   -e http.request.method              # output HTTP method (GET, POST, etc.) (from requests)
   -e http.request.full_uri            # output full URI requested (from requests)
   -e http.response_in                 # output frame # of response (from requests)
   -e http.request_in                  # output frame # of request (from responses)
   -e http.time                        # output elapsed time (from responses)
   -e http.response.code               # output HTTP response code (from responses)
   -E separator=,                      # separate output fields with commas
   -r testX.pcapng                     # read from testX.pcapng

With these options, tshark provides an output stream of mixed requests and responses that looks like this:

2324,GET,,2340,,, (request - note empty fields)
[...any number of intermediate lines...]
2340,,,,2324,0.059056000,302 (response - note empty fields)

(I didn't yet know it, but this format was going to come back and bite me - stay tuned...)

So, I'm going to have to match requests and responses to aggregate the needed data into a single CSV line - let's go write some Python!  In a fit of originality, I named the script httpstats.

First, basic housekeeping.  In general use, I'm assuming that the packet capture files are <name>.pcapng, and that the TLS session key logfiles are <name>-keys.  I'll expect the user to invoke my script with "httpstats name" and go from there.  After importing the Python libraries I'll need, the first steps are to validate the command-line argument and let the user know where the output is going:

import sys
import os.path
import subprocess
import csv

if len(sys.argv) != 2:
        print("Syntax: %s filestem \n  %s will look for filestem.pcapng and filestem-keys" % (sys.argv[0],sys.argv[0]))
        sys.exit(1)

pcapfile = sys.argv[1] + ".pcapng"
keyfile = sys.argv[1] + "-keys"

if os.path.isfile(pcapfile) and os.access(pcapfile,os.R_OK):
        print("Processing %s" % (pcapfile))
else:
        print("ERROR: Capture file %s missing or unreadable" % (pcapfile))
        sys.exit(1)

if os.path.isfile(keyfile) and os.access(keyfile,os.R_OK):
        print("Using keyfile %s" % (keyfile))
else:
        print("ERROR: Key file %s missing or unreadable" % (keyfile))
        sys.exit(1)

output_csv = sys.argv[1] + ".csv"
print("CSV file %s will be overwritten if it exists..." % (output_csv))

If we get this far, both the capture file and its accompanying TLS keyfile are present. We're ready to set a few variables, invoke tshark and start parsing its output:

stats_list = list()

tshark_cmd = '/usr/local/bin/tshark -2 -o tcp.check_checksum:FALSE -o ssl.keylog_file:' + keyfile + ' -Y "http.time || http.request.full_uri" -T fields -e frame.number -e http.request.method -e http.request.full_uri -e http.response_in -e http.request_in -e http.time -e http.response.code -E separator=, -r ' + pcapfile

p = subprocess.Popen(tshark_cmd, shell=True, stdout=subprocess.PIPE, universal_newlines=True)
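As an aside, here's an alternative way to build that command that I didn't use in the script: passing subprocess an argument list instead of one big string.  It avoids shell=True entirely, so filenames containing spaces or shell metacharacters can't break the quoting.  (A sketch only, with the placeholder filenames from earlier; the Popen call is shown commented out since it needs tshark installed.)

```python
import subprocess

# Placeholder filenames matching the examples above.
keyfile = "testX-keys"
pcapfile = "testX.pcapng"

# The same tshark invocation, expressed as an argument list: one list
# element per token, so no shell quoting is involved at all.
tshark_args = [
    "/usr/local/bin/tshark", "-2",
    "-o", "tcp.check_checksum:FALSE",
    "-o", "ssl.keylog_file:" + keyfile,
    "-Y", "http.time || http.request.full_uri",
    "-T", "fields",
    "-e", "frame.number",
    "-e", "http.request.method",
    "-e", "http.request.full_uri",
    "-e", "http.response_in",
    "-e", "http.request_in",
    "-e", "http.time",
    "-e", "http.response.code",
    "-E", "separator=,",
    "-r", pcapfile,
]

# p = subprocess.Popen(tshark_args, stdout=subprocess.PIPE, universal_newlines=True)
```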

Now to parse tshark's output into Python lists:

for line in p.stdout:
        line = line.rstrip()    # get rid of trailing newlines
        line = line.split(",")

It was at this point that my first test runs blew up in my face.  The .split method simply says "make this line of data a Python list, with commas delimiting list elements"...but I had forgotten that URIs can contain commas.  As a result, what I thought would be a simple Python list of 7 elements (some empty) in every case turned into Python lists of up to 79 elements when .split encountered commas in the URI!  So, I had to catch those cases and use an on-the-fly .join method to undo what .split had done to the third field of the line AND put literal commas back into the URI data...while leaving the first two fields and last four fields intact.  Only then could I append the (corrected) list to my master list-of-lists:

        if(len(line) > 7):
           line[2:len(line)-4] = [','.join(line[2:len(line)-4])]
        stats_list.append(line)

(Yeah, figuring THAT one out took a few minutes.  *laugh*)
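To see the rejoin in action, here's a worked example using a made-up request line whose URI contains two literal commas:

```python
# A made-up request line: frame 2324, a GET whose URI contains two commas,
# answered in frame 2340.  The trailing empty fields are the response-only
# columns.
line = "2324,GET,http://example.com/a?ids=1,2,3,2340,,,".split(",")
print(len(line))  # 9 - the URI's commas split it into extra fields

# Rejoin everything between the first two and last four fields back into
# a single URI field, restoring the literal commas.
if len(line) > 7:
    line[2:len(line)-4] = [','.join(line[2:len(line)-4])]

print(line)
# -> ['2324', 'GET', 'http://example.com/a?ids=1,2,3', '2340', '', '', '']
```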

We're ready to match up request and response data, then write our CSV data. Here's the data structure:

# stats_list[x][0] = Frame number
# stats_list[x][1] = HTTP request method (only present in requests)
# stats_list[x][2] = Full URI requested (only present in requests)
# stats_list[x][3] = Frame number containing response (only present in requests)
# stats_list[x][4] = Frame number containing request (only present in responses)
# stats_list[x][5] = HTTP response time (only present in responses)
# stats_list[x][6] = HTTP response code (only present in responses)

I used nested loops to search out the request/response pairs, then did a single write that pulled elements from both entries and wrote a single CSV line.  Since each line written represents a single HTTP transaction, I also counted them and informed the user of the total number of transactions found:

outputlinecount = 0

with open(output_csv,'w+') as out_file:
        outwriter = csv.writer(out_file,delimiter=',')
        for packet in range(len(stats_list)):
                for target in range(len(stats_list)):
                        if(stats_list[target][4] == stats_list[packet][0]):
                                outwriter.writerow([stats_list[packet][0],stats_list[target][0],stats_list[target][5],stats_list[target][6],stats_list[packet][1],stats_list[packet][2]])
                                outputlinecount += 1

print("%d HTTP transactions written to %s" % (outputlinecount,output_csv))
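(One possible refinement, not in my script: since every response carries the frame number of its request, a dictionary keyed on that field would turn the O(n²) nested loops into a single O(n) pass.  A sketch, using made-up sample rows in the script's 7-field layout:)

```python
# Made-up sample rows: one request (frame 2324) and its response (frame 2340).
stats_list = [
    ['2324', 'GET', 'http://example.com/', '2340', '', '', ''],
    ['2340', '', '', '', '2324', '0.059056000', '302'],
]

# Index responses by the request frame number they point back to (field 4,
# which is empty on request rows).
responses = {row[4]: row for row in stats_list if row[4]}

# One pass over the list: for each request, look up its response directly.
pairs = []
for row in stats_list:
    match = responses.get(row[0])
    if match:
        pairs.append([row[0], match[0], match[5], match[6], row[1], row[2]])

print(pairs)
# -> [['2324', '2340', '0.059056000', '302', 'GET', 'http://example.com/']]
```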

The end result was a CSV file with one line per HTTP transaction: the packet number of the request, the packet number of the response, the elapsed time, the HTTP return code, the HTTP method used (GET, POST, OPTIONS, etc.), and the URI requested.  (Remember those URIs with commas?  csv.writer automatically quotes any field containing commas, so that didn't bite me a second time.)  This CSV file can be imported directly into any tool that accepts CSV data.  It isn't perfect - for instance, it doesn't (yet) catch HTTP requests that never completed - but it took less than an hour to write/test, and it's sufficient to the task at hand.  I tested it against a 15MB packet capture containing roughly 25,000 packets.

Let me know what you think - or any Python tips/tricks to improve things - in the comments!

VIDEO: Where in the World Are Your Users? Geolocation with Wireshark

You've probably seen websites that greeted you with something like "Oh, you're in New York City? Here's our local store" or asked to "know your location".  If you've ever wondered how they do that, the answer is IP geolocation.  It's an interesting technique...and you can apply it to your own network capture data in Wireshark!

It's a neat trick; I've known mobile service providers who used it to create a dynamic map of locations they were "currently serving", and I've worked with data center operators who used it to create a dynamic heatmap of transaction loads from different parts of the world.  The best part is that - at least for simple, introductory purposes - you can start working with it for free!

In this video, I'll demonstrate how to enable IP geolocation in Wireshark, export the data in CSV format, and upload it to a mapping provider.  Basically, we'll go from a packet capture to a worldwide contact map in about 12 minutes.

As always - if you enjoy the video, please consider giving it a YouTube like and/or comment!

VIDEO: Decrypting End-User SSL/TLS Browser Sessions with Wireshark

Given that just about everyone is using HTTPS these days (and well they should!), troubleshooting web applications can be a major pain when it comes to network-layer analysis.  Fiddler is a solid tool, but its man-in-the-middle approach to capturing HTTPS sessions doesn't work in many secure environments, thanks to certificate issues.  What if you could just grab the end user's browser sessions and decrypt those?  Well, you can!

In this video, I'll demonstrate how to collect TLS session keys from Firefox/Chrome, import them into Wireshark, and work with the decrypted data.

If you enjoy the video, please consider giving it a like and/or a favorable comment...