As my entire career as a sysadmin (~7 years) has been within academia, you’d think that by now I’d be a master of collecting, plotting and analyzing data. However, I wasn’t bred in academia, and the fact that I work where I do is more circumstance than anything else. I was never properly taught much about data collection, plotting and analysis beyond high school, and anything I can practically use today I learned because I needed it to get the job done or to prove a point. I’ve always been able to find a way to whip out xmgrace or generate simple plots with gnuplot, but it’s never been something I’m super confident with, especially being surrounded by people who live and breathe this stuff day in, day out.
So why bother with knowing anything about this whole plotting thing? It’s clear how it can be useful in monitoring-style applications where data points are collected over time and then visualized via a plot or graph. Such plotting exposes trends in our environments and that’s usually a helpful tool to have around. Of course, there are other more specific problems and/or questions where collecting, plotting and analyzing data is very helpful as well. I will do my best to describe one such example.
Over the last few days I’ve been trying to find an answer to the question:
“Does the VPN add latency to our remote NX connections and if so, is it significant?”
This is a question where I believe plotting data will prove useful. There are some other sub-questions I’d like answered as well but that is the overarching issue at hand. I realized that this would be a great opportunity to re-learn some of the basics and maybe try out a few new tools at my disposal so I decided to document my journey through this foreign land for all to criticize and enjoy.
! Scientific Method
Of course, I’m not following a strict scientific method with this endeavor. The question simply doesn’t warrant a drawn-out, statistically rigorous study, despite my best intentions of delivering exactly that. What I’m trying to get is an accurate sense rather than an exact measurement, as flawed as that might be. It’s all I can justify in terms of time and effort for this project. From a strictly academic point of view, I’m sure to fail. My hope is that the results will be pseudo-science’d enough to provide confidence in my answer and that I’ll improve my skills throughout the exercise.
In order to determine if the VPN is affecting our latency I need at least two tests:
- NX connection without VPN
- NX connection with VPN
But while I’m at it, I figured I would gather additional data in order to attempt an answer to other round trip time (RTT) related questions. Adding tests based on client system “location” (local LAN, local wireless, various locations on campus wireless, home internet connection, etc.) and NX compression settings (MODEM, ISDN, ADSL, WAN and LAN) greatly increases the amount of testing required but will provide richer data to visualize.
On top of that, each of these additional variables is to be tested both with and without the VPN. To add even more tests, each of these combinations needs to be performed multiple times in order to normalize the data and to increase the statistical relevance. More samples = better data = more accurate results (at least this is the hope).
In order to start analyzing data, I need data. And that data needs to be of quality. And to have quality data, I need multiple samples. And to make useful comparisons, I need multiple variable data sets and at least one control data set. For all that to work, I need a reproducible set of actions to generate traffic, collect data and extract the relevant parts.
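To get a feel for how quickly the number of captures grows, here is a minimal Python sketch that enumerates the combinations. The location list contains only the examples named above, so it is illustrative rather than complete, and the repeat count of 3 is a hypothetical placeholder, not a figure from the actual test plan. The function name `build_test_matrix` is made up for this example.

```python
from itertools import product

# Values named in the text; the location list is illustrative, not exhaustive.
locations = ["local LAN", "local wireless", "campus wireless", "home internet"]
compression = ["MODEM", "ISDN", "ADSL", "WAN", "LAN"]
vpn = [False, True]
REPEATS = 3  # hypothetical number of runs per combination


def build_test_matrix(locations, compression, vpn, repeats):
    """Return one (location, compression, vpn, run) tuple per capture to perform."""
    return [
        (loc, comp, use_vpn, run)
        for loc, comp, use_vpn in product(locations, compression, vpn)
        for run in range(1, repeats + 1)
    ]


matrix = build_test_matrix(locations, compression, vpn, REPEATS)
print(len(matrix))  # 4 * 5 * 2 * 3 = 120
```

Even with these modest numbers the two original tests turn into 120 captures, which is why a reproducible procedure matters.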
My basic method is as follows:
- Configure Wireshark or tcpdump on the remote host to capture packets related to the NX/SSH connection being tested. Capture filters are used to prevent the capture of any other packets.
- Initiate NX connection to remote host (login)
- Perform predefined action X on remote host via NX
- Log out of the NX connection to the remote host
- Stop and save packet capture
- Export RTT statistics from the capture file with `tcptrace`
- Extract only the RTT data from the `tcptrace` output (discard the TCP sequence # column because its absolute value doesn’t matter; the index is used for the x-axis instead)
- Label and save the extracted RTT data as a plain-text file for input to the plotting function
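The last two steps above can be sketched in Python. This is only a sketch under assumptions: it assumes the exported RTT data is plain text with two whitespace-separated columns (TCP sequence number, then RTT in milliseconds), which may not match your `tcptrace` output exactly. The function names `extract_rtt` and `save_rtt` are hypothetical, and the sample values are made up for illustration.

```python
def extract_rtt(lines):
    """Return the RTT column (as floats) from lines of '<seq> <rtt_ms>' text.

    Assumes two whitespace-separated columns; the first (sequence number)
    is discarded. Blank, malformed, or non-numeric lines are skipped.
    """
    rtts = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line
        try:
            rtts.append(float(fields[1]))
        except ValueError:
            continue  # header or other non-numeric line
    return rtts


def save_rtt(rtts, path):
    """Write one RTT value per line; the line position becomes the x-axis index."""
    with open(path, "w") as out:
        for value in rtts:
            out.write(f"{value}\n")


# Made-up sample input, for illustration only.
sample = ["1000 12.5", "", "seq rtt", "2448 30"]
print(extract_rtt(sample))  # [12.5, 30.0]
```

In practice you would pass an open file to `extract_rtt` and then call something like `save_rtt(rtts, "lan_novpn_run1.txt")`, where the file name encodes the label for that test (the name here is just an example).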
Scatter plots visualize one or more data sets in which each point has two values. In this case, I’m plotting the round trip time (RTT) in milliseconds against each sample’s index (its order in the capture, standing in for the TCP sequence number) for various data sets. What’s more interesting, though, is juxtaposing combinations of data sets against each other in order to quickly observe qualitative differences.
Histograms are a way of visualizing the distribution of a data set. In this case, a histogram will plot the number of TCP sequences observed at each millisecond increment in the data set. Visualizing the distribution will help clarify which round trip times are the least and most frequent, something that cannot be quickly seen in a dense scatter plot.
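To make the histogram idea concrete, here is a minimal Python sketch of the binning step only (no plotting). It assumes the RTT values are already in milliseconds, and the function name `rtt_histogram` and the sample values are made up for illustration.

```python
import math
from collections import Counter


def rtt_histogram(rtts_ms, bin_width_ms=1):
    """Count how many RTT samples fall into each bin of bin_width_ms.

    Each sample is assigned to the bin starting at the largest multiple of
    bin_width_ms that does not exceed it. Returns (bin_start, count) pairs
    sorted by bin start.
    """
    counts = Counter(
        int(math.floor(rtt / bin_width_ms)) * bin_width_ms for rtt in rtts_ms
    )
    return sorted(counts.items())


# Made-up sample values, for illustration only.
print(rtt_histogram([12.2, 12.9, 13.0, 40.5]))  # [(12, 2), (13, 1), (40, 1)]
```

The resulting pairs can be fed to whatever plotting tool you prefer; a wider `bin_width_ms` trades detail for a smoother picture.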
! Looking Forward to Part 2
Now that you’ve made it through the snooze-fest that was part 1, I hope you’re eager for part 2! Oh boy! More blabbering, right? Hopefully not. Part 2 is where I’ll share some scripts, tips, techniques and finally, some finished plots for all to behold. You know, the technical stuff that we all love.
It shall be grand. Now I just need to write it…
Comments are highly welcome.