Paul's Internet Landfill/ 2014/ Sysadmin Skills

Sysadmin Skills

I found myself ranting about this topic the other day (sorry Aden) and since I rant about this topic often I might as well get it down in electrons.

The topic is the IT field, and in particular the skills it takes to be successful in this field. Some people think you need to be up to date on all of the latest, greatest technologies to be successful. Some people think you need to have worked with technology since you were six months old to be successful. Some people think that getting a lot of certifications is the key. I think that all of these are misguided. As far as I can tell, there are three primary things you need to be successful:

  1. The ability to model systems in your head. This helps answer the question of "what is the computer trying to do?". I call these "modelling skills".

  2. The ability to ask questions and make hypotheses that will help clarify and verify those models. This helps answer the question of "is this really what is going on?" I call these "troubleshooting skills".

  3. A toolkit that allows you to conduct experiments that test these hypotheses. I call this "the troubleshooting toolkit".

Everybody thinks that the toolkit is the important part, because the toolkit is the domain-specific knowledge related to computers. Some things in my toolkit:

  - network diagnostics like ping, traceroute and Wireshark
  - Windows-specific commands like dfsutil and its many flags
  - knowing where configuration lives: Active Directory sites, DFS namespaces, file shares
  - the syntax of a hundred other commands I have picked up over the years

These are all things I picked up on the job. Unfortunately, these are the very things that industry certifications obsess over -- information that is useful for a short while, but becomes obsolete within years or months, and which very often can be looked up online as needed.

Some of you might be wondering why "Google" (or a comparable search engine, such as DuckDuckGo) is not on that list. When I am involved with job interviews and the question of "how would you solve this problem" comes up, the most common response is "Google. You can Google anything!" There is an element of truth to this, but I would list search engines as a tool in the toolkit. It is a very important tool, but search engines are only useful if you know what to search for, and you only know what to search for if you can ask effective questions for the search engine to answer. In particular, Google is no substitute for the first two skills.

The first two skills are related. They may in fact be the same skill, because in order to develop a model for what is going on you have to ask sensible questions, and the questions you ask are determined by the model in your head. The model in your head does not need to be a correct model (when troubleshooting a problem, my initial guesses about the problem are often wrong) but the modeller needs to be flexible, so that the model can be updated (or discarded) based on new evidence.

A Troubleshooting Story

Is this too abstract? Let us consider an example from my own life, which is sufficiently recent to be painful and (I hope) does not come across as being too braggy. Here's the story: I was setting up some network shares in a remote building, connected to our main building via a virtual private network. I did not want access to these shares to be slow, so I had set up a server (let's call it the "branch server") in the remote building that would host the shares. But access was still really slow!

Here are some nerd details: I was using DFS (Microsoft's Distributed File System) to make the remote building's shares look as if they were part of the tree of folders in the main building. The branch network was a different site (but the same domain) in Active Directory, so the clients on the branch site should have preferred servers on their site. The branch shares were hung off of the DFS tree using linked folders. Unfortunately, the DFS roots were set up poorly, and there were a lot of local (unlinked) folders directly under the DFS root.
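
For readers who have never touched DFS, here is a rough sketch of the kind of namespace layout I am describing. All of the names are hypothetical; the point is that some folders under the root live on the central server, while others are links out to shares hosted elsewhere:

    \\org\dfsroot                  <- namespace root, served by the namespace server(s)
        Accounting                 <- local (unlinked) folder on the central server
        Scans                      <- local (unlinked) folder on the central server
        Branch                     <- linked folder -> \\branchserver\branchshare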

The model in my head was that because the shares were hosted within the local building, access would be fast. But now I had evidence that this was not so. Something about my model was wrong.

I started out by spending a lot of time with a search engine, trying to learn how DFS worked, and asking dumb search queries like "DFS slow access on queries". This turned up some useful information about the inner workings of DFS (which I mostly skimmed) and a lot of information that seemed irrelevant. I was able to determine that this information was irrelevant because I had a sense of this server's setup. From time to time people would suggest tools to run for diagnostic information, and I ran those tools. This exercise built up two things: a model in my head about how DFS works, and some tools I could run to diagnose DFS problems. It also took much longer than it should have.

At some point it occurred to me to test whether DFS was the problem, so I tried accessing the shares directly: \\branchserver\sharename instead of \\org\sharename . Behold! Access to the shares was noticeably faster. This made me suspect that something related to DFS was the problem.
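
If you want to put a number on "noticeably faster" instead of eyeballing it, a few lines of Python will do. This is a minimal sketch; the UNC paths are hypothetical stand-ins for the two routes to the same share, and it has to run from a domain-joined Windows client:

    # time a directory listing over each route to the same share
    import os
    import time

    for path in (r"\\org\sharename", r"\\branchserver\sharename"):
        start = time.perf_counter()
        entries = os.listdir(path)
        elapsed = time.perf_counter() - start
        print(f"{path}: {len(entries)} entries in {elapsed:.3f} seconds")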

In response, I pulled out a tool commonly used to diagnose network problems: Wireshark. This showed me all of the traffic coming into and going out of the machine as I tried to access the shares. To my horror, I discovered that every time I navigated a folder in the share (which was hosted on the branch server) there were a bunch of SMB queries made to the server in the central building, which was connected over a relatively slow VPN. This broke my model.
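
You can do this triage interactively in Wireshark itself, but the same idea can be scripted. Here is a sketch using the third-party pyshark library (an assumption on my part; any capture tool would do) to count SMB2 packets per destination in a saved capture file:

    # count SMB2 packets per destination address in a saved capture
    # (assumes pyshark is installed and "share-access.pcap" exists)
    import pyshark

    counts = {}
    cap = pyshark.FileCapture("share-access.pcap", display_filter="smb2")
    for pkt in cap:
        counts[pkt.ip.dst] = counts.get(pkt.ip.dst, 0) + 1
    cap.close()

    # if most packets go to the central server's address, then the
    # "local" share access is not actually local
    for dst, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(dst, n)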

My next job was to figure out whether the slowness was because I had misconfigured something, or because the problem was unavoidable. After some more web searching, I learned something about "DFS namespace servers" (which I had set up but not understood). Apparently clients query these namespace servers a lot. Fortunately, I could set up multiple namespace servers, but unfortunately I suspected that doing so would mirror all of the content from the central server on the branch server (which I definitely wanted to avoid). Would this be safe? I did not know, and search engines were being unhelpful.

I ended up taking a risk and setting up the second namespace server. A more prudent action would have been to set up a test DFS share to see whether data replicated automatically with multiple namespaces, but I am a mediocre and careless sysadmin. I got lucky, though. The second namespace server (set up on the branch server) worked.

But the problem was still not solved, because my clients were still referring to the original namespace server over the VPN. I pulled out my toolkit again, and thought that rebooting clients might fix the problem because caches would clear. That didn't help. Then I tried replicating Active Directory information across the VPN. I do not know whether that helped or not; there seemed to be no effect, but then things started working after a while; running dfsutil /pktinfo showed that the clients became aware of both namespace servers, and then eventually they started preferring the new namespace server. Problem solved! Problem solved in an unsatisfying way!
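
Incidentally, the referral check can be scripted instead of run by hand. A minimal sketch, assuming the hypothetical hostnames "branchserver" and "centralserver", with deliberately naive output parsing:

    # dump the client's DFS referral cache and pull out the lines that
    # mention each namespace server, to see which one the client uses
    import subprocess

    result = subprocess.run(["dfsutil", "/pktinfo"],
                            capture_output=True, text=True)
    for line in result.stdout.splitlines():
        lowered = line.lower()
        if "branchserver" in lowered or "centralserver" in lowered:
            print(line.strip())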

A Troubleshooting Story Analysis

To an outside observer, this reads like a long boring story about a lot of search engine queries, followed by typing in some weird magical computer incantations to fix the problem. This is not exactly wrong (I made a lot of search engine queries, and I did type in some magic to try and fix the problem) but it is misleading. If you read people's "war stories" blogging about troubleshooting problems, they almost always have this structure: a mysterious symptom shows up; the author forms a model of what is going on; the model turns out to be wrong; a series of questions, searches and experiments refines the model; and finally some magic incantation makes the problem go away.

We obsess over the specific tools and the specific fixes, but it is actually the story that is the important part.

Here are aspects of modelling the problem that were helpful in solving this issue:

  - I knew the shares were hosted on the branch server, so I knew access to them should have been fast
  - I knew the branch network was a separate Active Directory site, so I knew clients were supposed to prefer servers on their own site
  - I had a sense of how this server was set up, which let me discard search results that did not apply

Here are some of the hypotheses and troubleshooting questions that I went through:

  - Is DFS itself the problem, or is access to the underlying share slow too?
  - Is the slowness something I misconfigured, or is it unavoidable?
  - Would adding a second namespace server mirror all of the content onto the branch server?
  - Why are clients still referring to the original namespace server?

These questions may sound like they require a lot of nerd knowledge about technology. The questions are in the context of technology, and so necessarily use technological jargon. But the kinds of questions that come up are similar across domains.

And then there were the technical tools that I used to help answer the questions above:

  - search engines
  - direct UNC paths ( \\branchserver\sharename ) to bypass DFS
  - Wireshark, to watch the SMB traffic
  - dfsutil /pktinfo, to inspect the client referral cache

Out of these three categories, which do you think is the easiest to test for in an industry certification? Which do you think are easiest for a Human Resources bureaucrat (or a keyword-scanning bot) to screen for on a resume?

As an embarrassing sidenote, as I draft this blog post I am already forgetting the precise syntax of the troubleshooting commands I learned during this DFS troubleshooting experience. In my defense, I took notes throughout the troubleshooting process and so have some of that syntax documented somewhere, but I will have to look up that syntax when proofreading this entry later. (Yes, I do proofread entries occasionally.)

There is one detail of this troubleshooting story that I consider important, and which might go some way towards explaining the behaviour of "computer nerds" to non-nerds: at the end of this troubleshooting saga, I felt unsatisfied, even though to an outsider it looked as if I had "solved the problem" -- network access to the shares used to be noticeably slow, and now it wasn't. The problem is that my understanding (and therefore my modelling of the situation) was frustratingly incomplete: turning the branch server into a namespace server seemed to help, but for a while the network clients did not acknowledge the existence of this second namespace server until suddenly they did. I had not found out (and still do not know) how clients find their namespace servers, or how to get clients to use a specific server. Was it the Active Directory refresh? Was it something else?

To many people -- including some systems administrators -- this dissatisfaction might seem trivial or even annoying. The referrals work now, right? Why not leave well enough alone? However, I worry that my incomplete understanding of the situation will bite me later. If clients suddenly get slow again because they use the slow namespace server across the VPN, what do I do? I do not understand what is going on, so I will not be able to troubleshoot the problem easily. This recognition that my model is incomplete is valuable; knowing that there are gaps in my understanding means I can try to fill those gaps and make the model better. It is possible to take this quest for complete understanding too far, especially when there is other work to be done. But this tendency to keep probing at computer configurations even after the "problem is fixed" is a feature, not a bug.

Let's say that I lacked the first two skills, and could neither model the problem nor ask questions to confirm or refute my model, but that somebody handed me a bucket of the tools I ended up using to diagnose the problem: a search engine, Wireshark, dfsutil and so on. My claim is that it would have been much more difficult to solve the problem, and maybe it would have been impossible. Without those skills I would not be able to model the problem, and so would have to resort to blindly typing in commands from the toolkit, and hoping something fixed the problem. I have found myself in this situation before, and I do not like it at all. If I do not solve the problem then people continue to suffer, and if something I type does solve the problem then I have little recourse when the problem reoccurs. Sadly, I have seen lots of people -- even people who "work with computers" or want to get "computer jobs" -- who are in exactly this situation. They have learned a lot of computer jargon and buzzwords, but have no clue about how to use that toolkit to solve the problem that faces them. In my experience, these people cannot solve problems unless somebody else does their thinking for them.

Now say that in addition to getting the bucket of tools, somebody also handed me an installable brain cartridge all about DFS. This cartridge would upload all of the information I learned from my search engine searches directly into my brain, as well as a textbook that went into even greater detail about the workings of DFS. Maybe then I do not need modelling or troubleshooting skills to set up these DFS namespaces, because the textbook will take me through the process step by step. Maybe the textbook will even tell me how namespace lookups work and how to get clients looking for different namespace servers. But if I run into a DFS bug, or I have to set up things in a way that is not covered exactly by the textbook, I am still hosed, because I will not be able to troubleshoot the differences well enough to make DFS do what I need it to.

Schooling and Credentialism

Now it is time to discuss a topic that seems like a digression but really isn't: educational systems and credentialism.

I believe we suffer from credential inflation. A high school diploma used to be enough to get people good jobs. Now it is just about worthless; it seems that just about any job you can get with a high school diploma is a job you can get without one, and they all seem to be low-wage, low-status service jobs. Once high school diplomas became worthless, undergraduate education became the new standard. Now even that is being eclipsed; an undergraduate degree can get your foot in the door in some fields (engineering, computer science) but in most fields you need a Masters degree to get more prestigious positions. The alternative to getting a Masters degree is to go to college after university, so that you pick up "practical" skills. It is not sufficient to get only a college diploma; you need the university credential. But the university credential is not sufficient; you need the college supplement as well. It's ridiculous, and harmful, and keeps people in school for way too long, and serves mostly to line the pockets of educational institutions (have you noticed how much construction is happening at the University of Waterloo these days?). So overall I am pretty unhappy with the credentialism treadmill.

At the same time, I have a soft spot for my own undergraduate degree, which was in computer science. I feel that there are strong distinctions between the quality of university degrees in computer science, college diplomas in computer studies, and the reams of certifications you can get via some terrible private college. I feel that my degree in computer science helped develop and refine skills that are essential to being a successful systems administrator. That is not to say that I am a successful systems administrator; it is to say that I have noticed a qualitative difference between people who do have university education in the field and many (but not all) people who don't.

To be blunt, I suspect that undergraduate university educations are better at teaching essential IT skills than community colleges, and that community colleges are better at teaching these skills than terrible private colleges that focus on buzzwords and industry certifications. Non-terrible private certification colleges may exist, but I do not know of any yet, and the terrible private colleges in our area suck a lot of people (and money) into their terrible programs, so I am fairly unhappy with them.

On the surface, terrible private colleges ought to be the best at preparing people for careers in IT, and universities ought to be the worst. Terrible private colleges focus on the latest technologies (Exchange 2013! Windows Server 2012!) and buzzwords. People who have gone through terrible private colleges can list many buzzwords on their resumes. But the curriculum of those colleges appears to consist of a lot of rote memorization, and the practical exercises involve very shallow interactions with the technologies in question.

Here's one example: an exercise in Exchange 2013 might involve installing Exchange, setting up a user, and setting up a transport rule. Such an exercise would allow somebody to list "Microsoft Exchange 2013" as a skill on their resume (and some people go as far as to list it as a fluency) but the exercise itself is rote, and involves following a cookbook of technical instructions. Because the terrible private colleges have to cover so many buzzwordy technologies, they never go into great depth on anything. I also suspect that the students do not make too many mistakes during these exercises, because the exercises are set up so that students can work through them smoothly.

As a caveat, I guess I should mention that I have never attended a terrible private college, and maybe they are awesome. My knowledge is all second hand, based on talking to people who have gone through such programs, looking at some of the content in industry certifications, and examining the curriculum covered in these programs. My dislike could well be elitist snobbery, and I hope it is. But I have looked at some (dated) training materials for industry certifications, and they do not look promising (more about this below).

In contrast to terrible private colleges, most universities take a perverse pride in avoiding technologies that are useful to industry. In my first year undergraduate education, the computer language we used was "Object Oriented Turing", or OOT. I am fairly confident that OOT has not been used in any industry application even once, but the language was good enough to teach things like data structures (stacks, queues and friends), programming techniques such as recursion, and assist with program runtime analysis. This was no help at all when I was trying to get a summer job (perversely, listing MATLAB as a skill on that resume was much more helpful for that) but in the long term it was helpful, if only because it taught me that computer science is more than just programming. If I had learned a language that was useful at the time (C++ probably, or Delphi, or maybe C) then I am pretty confident that I would not be using that language today. My programming these days is limited to tiny scripts in easy scripting languages, but so many skills from my undergrad turned out to be transferable to systems administration (which was not covered in my formal undergraduate education at all). I did cheat and pick up systems administration skills while working part time at the Erindale Computer Centre (as did many of my computer science peers), but I am not confident that this experience alone would have been as effective as the experience combined with my studies.

Some of the courses that seemed least practical when I took them turned out to be quite useful in retrospect. My second year theory and third year data structures courses were all theoretical; exercises and problem sets involved no computer programming at all. But these courses were invaluable in breaking my brain. Computer science proofs might not be as rigorous as other math proofs, but they still require understanding problems thoroughly and then being able to reason about them in a way that is understandable to others. High school students who learn procedural programming in class come to university confident that they know how to program, until they are introduced to functional programming and their brains break trying to program in this new paradigm. Similarly, data structures classes open up new worlds of organizing data in ways that make certain operations fast at the expense of making other operations slow.
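
To make that fast-at-the-expense-of-slow tradeoff concrete, here is a toy comparison (in Python rather than OOT, for obvious reasons): a hash-based set answers membership questions much faster than a list, but pays for it in memory and gives up the list's ordering and duplicates:

    # membership tests: linear scan of a list vs hash lookup in a set
    import timeit

    items = list(range(100_000))
    as_set = set(items)

    print(timeit.timeit(lambda: 99_999 in items, number=1_000))   # O(n) per test: slow
    print(timeit.timeit(lambda: 99_999 in as_set, number=1_000))  # O(1) per test: fast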

"Breaking one's brain" is a fancy term for being exposed to new ways of modelling information. The procedural programming model people learn in high school (and often learn in first year computer science courses) is one way to model programming. But there are many other models, and being exposed to those models forces students to think in different ways. That in turn helps them develop their modelling skills and forces them to be less rigid about the models they adopt -- both of which are invaluable, transferable skills in the "real world". At its best, undergraduate education breaks the brains of its students again and again.

How about community colleges? Sitting as they do between the highly theoretical university education and the overly buzzwordy terrible private college experience, should they not be superior to both? I do not believe so. I believe that community colleges can be better than terrible private colleges -- they at least attempt to teach their students some programming skills, which implicitly develop modelling and probing skills. I believe that community colleges also try to help students use the tools in their toolkit (ping, traceroute, etc) in context. But overall I do not get the sense that they are as effective as universities are in developing modelling or troubleshooting skills. I could be wrong about this.

It is worth taking a moment to talk about troubleshooting skills. In my experience, troubleshooting skills are invaluable, but no computer education program teaches them explicitly (or rather, no computer education program does a good job of teaching them explicitly). In my experience, people end up learning troubleshooting skills on their own. You pick up troubleshooting skills quickly the night before a big programming project is due, because inevitably your programming project does not work right and you have to figure out what is wrong. If you cannot isolate problems and ask questions to understand better what is going on, then you won't have a working project when the deadline hits. But I do worry about this approach to teaching troubleshooting; I think a lot of people never pick up these skills despite (or perhaps because of) the pressure of deadlines.

Nature vs Nurture

Here is an alternative explanation: university programs might appear to do a better job of educating students in modelling and troubleshooting skills than community colleges or terrible private colleges, but this is all selection bias. Privileged kids with high marks go to university, and these kids either have better modelling and troubleshooting skills innately, or they learned these skills at a young age because they were surrounded by technology all their lives. Meanwhile, the skill levels of people attending community colleges are lower, and those of people attending terrible private colleges are lower still, so students do not demonstrate these skills when they graduate.

I worry about this scenario a lot. It would imply that modelling and troubleshooting skills cannot be taught in the classroom; people either have those skills or they don't. That would be a disastrous situation, because I know so many people who would like to get into the field of "computers" but do not possess strong modelling and troubleshooting skills. Similarly, there are tonnes of technology companies who are starving for employees, but cannot find people with the skills they need. If modelling and troubleshooting ability is innate and cannot be taught (and in particular, taught to people at an older age than typical university students) then there is little hope of bridging this digital divide. University educations are too expensive and too disruptive for people of limited means to access, and the intimidation factor is high: older people already feel out of place as undergraduate students, and computer science programs in particular require a lot of mathematical familiarity that older people have forgotten and must relearn.

I do believe that modelling and troubleshooting skills can be refined and improved to some degree, although I do not have citations to back this up. Anecdotally, I feel that I was blessed with some of these skills going into university, but that going to university strengthened and developed these skills further. Certainly, going to university exposed me to modelling ideas I likely never would have learned elsewhere. To this day I regret not taking a university networking course; I have picked up bits and pieces of how networking works over the years, but it would have been useful to get a broader theoretical overview in a structured setting.

What I don't know how to do is how to develop these skills in a more informal setting, such as a Freeskool course or the computer recycling program at my workplace. People in these places try to teach troubleshooting skills, but I am not confident that a lot of it sticks. I think part of the problem -- as usual -- is the confusion between technology toolkits and buzzwords and the modelling and troubleshooting skills that are so easily overlooked.

Sysadmin Skills and Hiring

If there is one thing I have realized from working at an employment centre, it is that I despise the job search process, and that in many ways it is deeply broken. However, I feel that hiring practices are more broken than usual in information technology fields. As I alluded above, technology buzzwords are easy to screen for, and it is easy to turn a large pile of resumes into a small one by looking for the people who have technology skills that match the particular ones used at your company as closely as possible. In my opinion, this is idiotic. Buzzwords and specific technologies are the easiest things for people to pick up on the job, but if you hire somebody who has poor troubleshooting and modelling skills, then you have made a bad hiring decision regardless of how well-versed somebody is in a particular technology.

I do not think this opinion is even that controversial. I believe that people who work with technology in a deep way know how quickly technology changes, and know that troubleshooting and modelling skills are much more valuable (and transferable) than knowledge in some domain. At best, requirements for deep domain knowledge ("Ten years of jQuery experience!" "Fifteen years of Drupal development with increasing responsibility!") are proxies for these other skills. If somebody has fifteen years of Drupal experience, presumably they have learned how to model and troubleshoot issues as well. But even this is broken in two ways.

Firstly, people lie on their resumes, and a listing of technology buzzwords is not a good indicator of modelling and troubleshooting ability. For example, I have developed the terrible habit of considering diplomas at terrible private colleges anticredentials, in the sense that I disbelieve all of the technology buzzwords listed by graduates of such programs unless they can demonstrate that they have actual understanding about the buzzwords they list. I try not to be discriminatory about this, but it is super difficult; I have been disappointed too many times.

Secondly, even if people have deep knowledge of a particular technology, there is no guarantee that they will be able to do more than "hit the ground running". In five years the technologies the industry uses will probably be different, and if these people with "deep knowledge" are not able to pick up those changes, their effectiveness will be reduced.

The real problem is that it is really hard to test for the skills we really care about when hiring. Some people attempt to solve this using technical interviews, but that tends to privilege particular technologies instead of generalized skills. For a while, companies went through a fad of asking troubleshooting and "creative thinking" questions not related to a particular domain ("How would you estimate the number of tennis balls that would fit in the CN tower?") but these questions leaked onto the Internet and job applicants started memorizing the techniques and answers, which defeated the purpose. I am not sure that these kinds of questions are a good test either; they might help you catch the brightest people who can think the quickest, but there are lots of people with good troubleshooting and modelling skills who need time and research in order to do their jobs. Job interviews are terrible environments for giving job applicants the time and research they need to consider their answers, which is just one reason job interviews are terrible environments.

I regretfully admit that I have participated in hiring decisions. I have made many, many mistakes in doing so. In addition to "organizational fit" (which is coded language for "are you nice and will you get along with others in our cultlike environment?") I am mostly concerned with discerning whether people applying for IT jobs have modelling and troubleshooting skills. Unfortunately, I am terrible at discerning these things. My current technique mostly consists of probing: asking people to relate their own troubleshooting stories, and then asking followup questions to get a sense of how they modelled their problems and what kinds of questions they asked to get closer to their answers. (Yes, future job applicants, I suppose you can exploit this revelation now.)

There are a few reasons for not obsessing over the contents of particular technological toolkits. As I have written many times already, I believe these skills are relatively easy to transfer, either via the Internet or from coworker to coworker. Secondly, we use such a mishmash of technologies at work that finding anybody who is familiar with any large fraction of them is going to be unreasonable. Thirdly, we dramatically underpay staff members at the cult (particularly with respect to IT wage scales), so being too picky about technological buzzwords is foolish.

Maybe it is just because we dramatically underpay people, but among our job applicants I have found very few who appear to possess good modelling and troubleshooting skills. I do not know exactly why.

Science

Sysadmin skills ought to be teachable, because (if you have not noticed already) my formulation of sysadmin skills is effectively the same as the scientific method:

  1. Observe some phenomenon.

  2. Construct a model (a hypothesis) that explains the phenomenon.

  3. Design and run experiments that test the hypothesis.

  4. Revise or discard the model based on the results, and repeat.

Experiments in the systems administration world usually require less data analysis than experiments in "real" sciences. In addition, we care less about making sure we have controls for our experiments, although the familiar systems administration heuristics of "change only one thing at a time" and "only try things you can reverse" are attempts to make experiments more controlled.
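
Here is a toy sketch of those two heuristics. Everything in it is hypothetical; the point is the shape of the loop: apply one reversible change, re-run the test, and revert the change if the symptom persists:

    # candidates: a list of (apply, revert) pairs of callables
    # problem_persists: a test that can be re-run after each change
    def find_fix(candidates, problem_persists):
        for apply_change, revert_change in candidates:
            apply_change()
            if problem_persists():
                revert_change()        # only try things you can reverse
            else:
                return apply_change    # the one change that cleared the symptom
        return None

    # toy usage: the "system" is a dict and the "problem" is a bad setting
    system = {"namespace_server": "central"}
    candidates = [
        (lambda: system.update(cache="cleared"),
         lambda: system.pop("cache")),
        (lambda: system.update(namespace_server="branch"),
         lambda: system.update(namespace_server="central")),
    ]
    find_fix(candidates, lambda: system["namespace_server"] != "branch")
    print(system)   # {'namespace_server': 'branch'}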

We believe we can teach the scientific method. So why are we (in particular, why am I) so bad at teaching the scientific method in context of computer skills?

Except for the particular domain knowledge involved in coming up with a troubleshooting toolkit, I believe these skills are helpful in many different fields. Things go wrong in every area of human endeavor, and when things go wrong somebody has to solve the problem. That requires troubleshooting ability.

Developing Sysadmin Skills

Maybe you are a person who has read this far and actually believes what I have written, but would like to develop these skills. What can you do? Is it hopeless?

Here are some techniques that I believe can help:

  - work on real problems that matter to you, where you have to put in effort to find a solution
  - when something breaks, resist the urge to type in magic fixes from the Internet before you have a guess about what is going on
  - keep probing at systems even after the "problem is fixed", so you can fill the gaps in your model
  - take notes as you troubleshoot, so the toolkit you build up does not evaporate
  - set up test environments so you can run experiments that would be too risky in production

In particular, I do not believe that reading a lot of technology books (or trendy technology websites) is that helpful. In my experience, I only learn when I have a specific problem to solve, and I have to put in effort to solve that problem. Passively reading information can help you gather buzzwords for your technology toolkit, but I feel that your effort is much better spent doing something actively.

Domain-Specific Knowledge

I have been pretty harsh on the "troubleshooting toolkit" in this article in favour of emphasizing modelling and troubleshooting skills. Largely this is to counterbalance the way modelling and troubleshooting are underemphasized when talking about "computer people" and the skills they should possess. But domain knowledge is important as well, and developing an effective troubleshooting toolkit for your domain (or domains) is critical. It does no good to model a network problem and wonder whether clients are slow to connect to their servers if you have no mechanism for actually finding that information out. At some point, solving my DFS namespace problem meant picking up a lot of domain-specific knowledge about how DFS works.

It has taken me years to develop some domain knowledge about troubleshooting techniques in the Windows ecosystem, and my knowledge remains inadequate in so many ways. But I can confidently say that developing a deeper systems administration toolkit has made it much quicker and easier to troubleshoot problems now than it has in the past. In some ways, this is why systems administrators get paid so much: because they have information about computers that the rest of us don't, and because they can use that knowledge to solve problems quickly. Comprehensive technological toolkits complement modelling and troubleshooting skills, not oppose them.

Other Skills

Of course, I am oversimplifying. There are no doubt many other qualities and skills that are useful in holding down a job as a systems administrator successfully. Here are a few that come to my mind:

  - communication skills, both for explaining problems to others and for getting along with coworkers
  - the discipline to document what you did, so your future self can benefit
  - prudence: knowing which experiments are safe to run on production systems, and which are not

Having said that, I still assert that the three skillsets I cover in this article are necessary to being an effective systems administrator, even if they are not sufficient.