Wednesday, July 18, 2018

Today's Scripting - Extracting HTTP Performance Data from Wireshark with Python


I'm often asked what kind of data can be exported from Wireshark, especially when we're troubleshooting performance issues.  Most recently, someone said, "It would be great if we could suck this HTTP timing data into a spreadsheet." I'm not a big fan of spreadsheets, but I said to myself, "Hmm...can't be that difficult to do"...and sat down to write some code.

A traditional shell script with a few Linux/Unix utilities could do this (I did a LOT of awk and sed back in the day...), but I'm in the process of teaching myself Python, so I set out to do some snake charming.  I'm DEFINITELY a novice, so I'm sure that those more experienced with Python will offer improvements to my brute-force, trial-and-error code. Having said that...

The first step is to collect network packet data while also collecting the TLS session keys from the browser.  (If you haven't done that before, check out my brief video on the technique.)  For ease of use, we're going to name the packet capture files "testX.pcapng" and the matching TLS key logfile "testX-keys."

Now, we turn to Wireshark's command-line kin, tshark.  Since the timing information we need is computed by Wireshark (it isn't in the native packet data), we'll need to run a two-pass analysis in tshark.  So, our tshark command looks like this (I'll split lines for readability):

/usr/local/bin/tshark -2                    # 2 passes to pick up computed values
   -o tcp.check_checksum:FALSE              # tshark ignores packets that fail TCP checksum, so skip that check
   -o ssl.keylog_file:testX-keys            # here's our TLS session keyfile
   -Y "http.time || http.request.full_uri"  # find packets with these fields - requests have URI, responses have http.time
   -T fields                                # we want to output certain fields (specified with -e)
   -e frame.number                     # output frame number (from all packets)
   -e http.request.method              # output HTTP method (GET, POST, etc.) (from requests)
   -e http.request.full_uri            # output full URI requested (from requests)
   -e http.response_in                 # output frame # of response (from requests)
   -e http.request_in                  # output frame # of request (from responses)
   -e http.time                        # output elapsed time (from responses)
   -e http.response.code               # output HTTP response code (from responses)
   -E separator=,                      # separate output fields with commas
   -r testX.pcapng                     # read from testX.pcapng

With these options, tshark provides an output stream of mixed requests and responses that looks like this:

2324,GET,https://apps.na.collabserv.com/,2340,,, (request - note empty fields)
[...any number of intermediate lines...]
2340,,,,2324,0.059056000,302 (response - note empty fields)

(I didn't yet know it, but this format was going to come back and bite me - stay tuned...)

So, I'm going to have to match requests and responses to aggregate the needed data into a single CSV line - let's go write some Python!  In a fit of originality, I named the script httpstats.

First, basic housekeeping.  In general use, I'm assuming that the packet capture files are <name>.pcapng, and that the TLS session key logfiles are <name>-keys...so I'll expect the user to invoke my script with "httpstats name" and go from there. After importing the Python libraries I'll need, the first steps are to validate the command-line argument and let the user know where the output is going:

#!/usr/bin/python3
import sys
import os.path
import subprocess
import csv

if len(sys.argv) != 2:
        print("Syntax: %s filestem \n  %s will look for filestem.pcapng and filestem-keys" % (sys.argv[0],sys.argv[0]))
        quit()

pcapfile = sys.argv[1] + ".pcapng"
keyfile = sys.argv[1] + "-keys"

if os.path.isfile(pcapfile) and os.access(pcapfile,os.R_OK):
        print("Processing %s" % (pcapfile))
else:
        print("ERROR: Capture file %s missing or unreadable" % (pcapfile))
        quit()

if os.path.isfile(keyfile) and os.access(keyfile,os.R_OK):
        print("Using keyfile %s" % (keyfile))
else:
        print("ERROR: Key file %s missing or unreadable" % (keyfile))
        quit()



output_csv = sys.argv[1] + ".csv"
print("CSV file %s will be overwritten if it exists..." % (output_csv))


If we get this far, both the capture file and its accompanying TLS keyfile are present. We're ready to set a few variables, invoke tshark and start parsing its output:

stats_list = list()

tshark_cmd = '/usr/local/bin/tshark -2 -o tcp.check_checksum:FALSE -o ssl.keylog_file:' + keyfile + ' -Y "http.time || http.request.full_uri" -T fields -e frame.number -e http.request.method -e http.request.full_uri -e http.response_in -e http.request_in -e http.time -e http.response.code -E separator=, -r ' + pcapfile

p = subprocess.Popen(tshark_cmd, shell=True, stdout=subprocess.PIPE,universal_newlines=True)
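(As an aside, shell=True can be avoided by passing the command as an argument list, which also sidesteps any quoting issues in filenames.  A minimal sketch of the same invocation, with stand-in filenames for the real keyfile/pcapfile variables:)

```python
# Sketch: the same tshark invocation built as an argument list, so no shell
# is needed.  Filenames here stand in for the real keyfile/pcapfile variables.
keyfile, pcapfile = "testX-keys", "testX.pcapng"
tshark_cmd = [
    '/usr/local/bin/tshark', '-2',
    '-o', 'tcp.check_checksum:FALSE',
    '-o', 'ssl.keylog_file:' + keyfile,
    '-Y', 'http.time || http.request.full_uri',
    '-T', 'fields',
    '-e', 'frame.number', '-e', 'http.request.method',
    '-e', 'http.request.full_uri', '-e', 'http.response_in',
    '-e', 'http.request_in', '-e', 'http.time',
    '-e', 'http.response.code',
    '-E', 'separator=,',
    '-r', pcapfile,
]
# p = subprocess.Popen(tshark_cmd, stdout=subprocess.PIPE, universal_newlines=True)
```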

Now to parse tshark's output into Python lists:

for line in p.stdout:
        line = line.rstrip()    # get rid of trailing newlines
        line = line.split(",")

It was at this point that my first test runs blew up in my face.  The .split method simply says "make this line of data a Python list, with commas delimiting list elements"...but I had forgotten that URIs can contain commas.  As a result, what I thought would be a simple Python list of 7 elements (some empty) in every case turned into Python lists of up to 79 elements when .split encountered commas in the URI!  So, I had to catch those cases and use an on-the-fly .join method to undo what .split had done to the third field of the line AND put literal commas back into the URI data...while leaving the first two fields and last four fields intact.  Only then could I append the (corrected) list to my master list-of-lists:

        if(len(line) > 7):
                line[2:len(line)-4] = [','.join(line[2:len(line)-4])]
        stats_list.append(line)

(Yeah, figuring THAT one out took a few minutes.  *laugh*)
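To see the fix in action, here's the same slice-assignment applied to a hypothetical request line whose (made-up) URI contains two literal commas:

```python
# A hypothetical request line whose URI contains two literal commas
# (the URI is invented for illustration).
line = "2324,GET,https://example.com/a,b,c,2340,,,".split(",")
print(len(line))        # 9 fields instead of the expected 7

# Re-join everything between the first two and last four fields into a
# single URI element, restoring the commas that .split removed.
if len(line) > 7:
        line[2:len(line)-4] = [','.join(line[2:len(line)-4])]
print(line)
# ['2324', 'GET', 'https://example.com/a,b,c', '2340', '', '', '']
```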

We're ready to match up request and response data, then write our CSV data. Here's the data structure:

# stats_list[x][0] = Frame number
# stats_list[x][1] = HTTP request method (only present in requests)
# stats_list[x][2] = Full URI requested (only present in requests)
# stats_list[x][3] = Frame number containing response (only present in requests)
# stats_list[x][4] = Frame number containing request (only present in responses)
# stats_list[x][5] = HTTP response time (only present in responses)
# stats_list[x][6] = HTTP response code (only present in responses)

...so, I used nested loops to search out the request/response pairs, then did a single write that pulled elements from both entries and wrote a single CSV line.  Since each line written represents a single HTTP transaction, I also counted them and informed the user of the total number of transactions found:

outputlinecount = 0

with open(output_csv,'w+') as out_file:
        outwriter = csv.writer(out_file,delimiter=',')
        for packet in range(len(stats_list)):
                for target in range(len(stats_list)):
                        if(stats_list[target][4] == stats_list[packet][0]):
                                outwriter.writerow([stats_list[packet][0],
                                        stats_list[target][0],stats_list[target][5],
                                        stats_list[target][6],stats_list[packet][1],
                                        stats_list[packet][2]])
                                outputlinecount += 1

print("%d HTTP transactions written to %s" % (outputlinecount,output_csv))
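(The nested loops do an O(n²) scan; for big captures, a dictionary keyed on frame number would match each pair in one pass.  A sketch of that idea - untested against the full script, assuming the same 7-element stats_list layout, and using the sample request/response pair shown earlier as data:)

```python
# Sketch: index each response by the frame number of the request it answers,
# then walk the list once.  Assumes the same 7-element stats_list layout.
stats_list = [
    ['2324', 'GET', 'https://apps.na.collabserv.com/', '2340', '', '', ''],
    ['2340', '', '', '', '2324', '0.059056000', '302'],
]
responses = {row[4]: row for row in stats_list if row[4]}   # keyed by request frame
for row in stats_list:
        resp = responses.get(row[0])
        if resp:
                print([row[0], resp[0], resp[5], resp[6], row[1], row[2]])
```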

The end result was a CSV file with entries like this:

6308,6315,0.125648,200,POST,http://www.foobieblex.com/cgi-bin/snarf

That's the packet number of the request, the packet number of the response, the elapsed time, the HTTP return code, the HTTP method used (GET, POST, OPTIONS, etc.), and the URI requested.  (Remember those URIs with commas?  csv.writer automatically quotes any field containing commas, so that didn't bite me a second time.)  This CSV file can be imported directly into any tool that accepts CSV data.  It isn't perfect - for instance, it doesn't (yet) catch HTTP requests that never completed - but it took less than an hour to write and test, and it's sufficient to the task at hand.  Here's a sample run against a 15MB packet capture containing roughly 25,000 packets:
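(Catching those incomplete requests wouldn't take much.  A sketch of one way to flag them - made-up data, same 7-element layout as stats_list:)

```python
# Sketch: flag requests for which no matching response was ever captured.
# Assumed sample data in the same 7-element stats_list layout.
stats_list = [
    ['2324', 'GET', 'https://example.com/ok', '2340', '', '', ''],
    ['2340', '', '', '', '2324', '0.059', '200'],
    ['2401', 'GET', 'https://example.com/lost', '', '', '', ''],  # no response
]
# Frame numbers of requests that some response points back to.
answered = {row[4] for row in stats_list if row[4]}
for row in stats_list:
        if row[1] and row[0] not in answered:   # a request nobody answered
                print("No response seen for frame %s (%s)" % (row[0], row[2]))
```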


Let me know what you think - or any Python tips/tricks to improve things - in the comments!
