Category Archives: SQL Server

Building a DBA Salary Calculator, Part 0: Initial findings

I’m planning on building a salary calculator based on the data from the salary survey I played around with earlier. This time I plan on going a lot deeper.

I want your help and feedback! I want to know what would make a calculator most useful to you. Feel free to poke holes in my methodology and tell me how a real data scientist would handle this project.

In this post, I’m going to outline some initial findings as well as how I’m planning to approach this project. All of the information below is based on a narrow subset:

  • USA, full postal code
  • DBA job
  • Between $15,000 and $165,000

Regarding zip codes, some people only entered a portion of their zip for privacy’s sake. In the final analysis, I plan on taking into account the ~200 US individuals who did that.

Initial findings

The data isn’t very predictive

So I’m using something called a multiple linear regression to make a formula that predicts your salary based on specific variables. Unfortunately, the highest coefficient of determination (or R²) I’ve been able to get is 0.37, which means, as far as I understand it, that at most the model explains 37% of the variation.

Additionally, the spread of the results isn’t great either. The standard deviation, a measure of spread, is about $25,000 on the original subset of data, which means we’d expect 68% of salaries to be within +/- $25,000 of the average and 95% to be within +/- $50,000 of the average. So what happens when we apply our model?

When we apply the model, we get something called residuals, which are basically the differences between what we predicted and what the actual salary was. The standard deviation of those residuals is $20,000, which means our confidence range is going to be roughly +/- $20,000 to $40,000. That doesn’t seem like a great range to me.
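To make those numbers concrete, here is a minimal sketch in R of the fit-and-check step described above. The file name and column layout are hypothetical placeholders (a salary column plus predictor columns), not the actual survey schema:

```r
# Minimal sketch -- file name and column names are hypothetical placeholders.
survey <- read.csv("salary_survey.csv")

# Multiple linear regression: predict salary from every other column.
fit <- lm(salary ~ ., data = survey)

summary(fit)$r.squared   # coefficient of determination (R^2); ~0.37 at best here
sd(survey$salary)        # spread of the raw salaries, about $25,000
sd(residuals(fit))       # spread of the prediction errors (residuals), about $20,000
```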

There are a few strong indicators

Let’s take a look at what we get when we do a multiple regression with the Excel Analysis ToolPak add-in:

[Image: regression summary output from the Excel Analysis ToolPak]

The two biggest factors by far seem to be how long you’ve worked and where you live. In fact, we can explain 30% of the variance using those two variables:

[Image: regression output using only years of experience and location]

The two other variables that are very strong are whether you telecommute and whether you are independent. When we add those, our adjusted R² goes up to 33%.
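As a rough illustration (again in R, with hypothetical column names standing in for the survey fields), that comparison amounts to fitting the two models side by side and looking at adjusted R²:

```r
# Hypothetical column names; the point is the nested-model comparison.
fit_two  <- lm(salary ~ years_experience + location, data = survey)
fit_four <- update(fit_two, . ~ . + telecommute_days + is_independent)

summary(fit_two)$adj.r.squared    # ~0.30 with experience and location alone
summary(fit_four)$adj.r.squared   # ~0.33 once telecommuting and independence are added
```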

Then after that we have a handful of variables that have less than a 5% chance of being erroneous (see the sketch after this list for how to read these off the regression output):

  • Gender. It’s still a bit early to jump to conclusions, but it looks like being female might cost you $6,000 per year. This is after controlling for years of experience, education, hours worked, and if this is your first job. Gender could still be tied to other factors like a gap in your career or if you negotiate pay raises.
  • First job. I identified a “first job” as having identical values for years of experience and years in this job. If you haven’t changed jobs, it could be costing you $4,000, which lines up with my personal experience.
  • Hours worked per week. This is basically what you would expect.
  • Education. This is the number of years of education you received outside of high school.
  • Build scripts and automation. One of the tasks people could check was whether they automate their work. Out of all the tasks people could list, this one seems to have the biggest impact.
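The “less than 5% chance of being erroneous” figure comes from the p-value column of the regression output. Continuing the hypothetical R sketch from above, one way to pull those out:

```r
# The coefficient table holds estimates, standard errors, t statistics,
# and p-values (the Pr(>|t|) column).
coefs <- summary(fit)$coefficients

# Keep only the predictors that are significant at the 5% level.
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]
```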

There are some interesting correlations

Part of doing a multiple regression is making sure your variables aren’t too strongly correlated, or “collinear”. As part of this, it’s possible to find some interesting correlations (a quick way to run that check is sketched after the list below).

  • If you are on-call, you are less likely to have post-secondary education. You are also probably overworked and learning PowerShell (no surprise there).
  • Certifications correlate negatively with being a dev-dba instead of a production dba.
  • If this is your first job, you are less likely to be working more than 40 hours per week. Maybe that $4,000 pay cut is worth it. ;)
  • Independents also work fewer hours per week. So maybe your second job should be going independent.
  • If you telecommute, you might make $2,000 more per year for every day of the week you telecommute; but you are going to be working more hours as well.
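As mentioned above, here is a rough sketch in R of one way to run that collinearity check: build a correlation matrix over the numeric survey columns and flag any strongly correlated pairs (the data frame and the 0.7 cutoff are assumptions for illustration):

```r
# Correlation matrix over the numeric columns of the (hypothetical) survey data.
num_cols <- survey[sapply(survey, is.numeric)]
cor_mat  <- cor(num_cols, use = "pairwise.complete.obs")

# Flag predictor pairs with |r| above an arbitrary 0.7 cutoff.
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           r    = round(cor_mat[high], 2))
```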

Plans moving forward

So here is the current outline for this blog series:

  1. Identifying features (variables)
  2. Data cleanup
  3. Extracting features
  4. Removing collinear features
  5. Performing multiple regression
  6. Coding a calculator in JavaScript
  7. Reimplementing everything in R

So let me know what you think. I plan on making all of the data and code freely available on GitHub.

T-SQL Tuesday #98: Learning Troubleshooting from Games

 


This week’s T-SQL Tuesday is about a time that you solved a difficult technical problem. Unfortunately my brain doesn’t store events that way, so I can’t think of any good stories. Instead, I want to talk briefly about how games have made me a better programmer and a better troubleshooter.

Map-making in TFC

[Image: Team Fortress Classic box art]

The first game I want to talk about is Team Fortress Classic, a team-based shooter from the late ’90s. I used to play this game all the time. But I did even more than that: I would make custom levels to play on with other people.

Mapmaking for TFC was generally a simple process. You would create simple polyhedrons and then apply textures/patterns to them. Then you would place non-terrain objects, called entities, inside of your terrain. Everything is pretty straightforward…until you get a leak.

A leak is when the outside of the level is accessible to the inside of the level. Imagine you are building a spaceship or a submarine: if you have a leak, it just won’t work. The challenge is that the level editor won’t tell you where you have a leak¹. So how do you solve it?

In my case, I would encase half the level in solid rock, so to speak. I would just make a big cube and cover up half of the level. If the level compiled, I knew my leak was somewhere in that half. Then I just kept repeating with smaller and smaller cubes.

I do the same thing all the time in my professional life. I’ll comment out whole swathes of code. I’ll jump halfway into the data pipeline to see where the error starts. TFC taught me to keep cutting the problem in half until I find it.

Guessing the secrets of the universe with Zendo


Undoubtedly proof that I was destined to be a programmer, one of my favorite board games ever is Zendo. It was actually one of my nicknames in college. It’s got a silly theme about discerning if something has the Buddha nature. In reality, it boils down to one player making up a rule, and everyone else trying to determine what the rule is.

If it sounds easy, I dare you to play something similar over at the New York Times. Chances are you are going to get it wrong.

The biggest thing Zendo taught me was fighting against confirmation bias. It taught me to ask, “What would prove my theory wrong?” Good troubleshooting involves guessing a cause, determining a test that will give you new information, and then running that test.

That test might be running a simpler version of a query that’s failing. It might mean adding a breakpoint to your code and inspecting variables.

Learning how to think systematically about this sort of thing has been tremendously useful.

Learning outside of programming

Troubleshooting is very often a set of skills and approaches that don’t have anything to do with technology per se. I think looking at how we can pick up these skills in other places, like games, can be very useful.

Speaking of other sources, there are two books I can recommend wholeheartedly. The first is How to Solve It, which is about how to solve mathematical problems, but it provides a number of ways to break down a problem or approach it from different angles. The second is called Conceptual Blockbusting. It focuses on the nebulous issue of how we think about problem solving. It’s very much a book about thinking, and I definitely enjoyed it.

Footnotes

¹ Only after writing this blog post did I find an article explaining how to get the level editor to tell you exactly where the leak is. Sigh.

The challenge with being a “data professional”

During PASS Summit, I wrote a post about the broadening data platform. I talked about the term Data Professional and how I feel it describes the changes going on in the SQL space. Here’s the problem: it’s a terrible guide. It’s a great description, it’s a wonderful attitude, it’s an ambitious goal; but it’s a terrible guide.

Being a data professional means being a jack of all trades. It means being a renaissance man (or woman). It means a career plan that looks like this:

And then you end up with Buck Woody telling you you are trying to do too much, cut it out kid.


So that’s the problem. Sometimes broadening your horizons is really a mask for being scared of commitment. Sometimes it’s a mask for being scared of an ever-changing future. You have to bet on a horse, you can’t bet on them all. Being a data “professional” is great in theory, but in practice it turns into majoring in the “universe” (see XKCD).

[XKCD comic: Major in the Universe]

I’m not saying don’t learn Docker or PowerShell. If you don’t learn those things, Kevin Feasel will warn you about becoming homeless. And who wants that?

What I am saying is that if someone asks you “Where do you want to be in 3 years?”, “everywhere” is not an answer. If someone asks you “What are you going to learn this week?”, “everything” is not an answer. So yes, generalize your skill set; who knows what you’ll be doing in 5 years. But right now you need a focus; it’s the only way to become an expert at anything.

Ultimately, I think it comes down to two quotes:

If you don’t know where you are going, any road will get you there.

-George Harrison, paraphrasing Lewis Carroll

and

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler,

[…]

I took the one less traveled by,
And that has made all the difference.
-Robert Frost, The Road Not Taken

For me, I’m looking into data science. The problem is I’m not sure what data science actually is! What I know for sure is that it involves R and pirate jokes. We’ll cover that in next week’s blog post.

An idea: SQL Server crisis labs

Many people in the SQL Server world have become DBAs by accident, myself included.  The problem with being an accidental DBA is that you usually learn things the hard way: when something is on fire. At that point, you are under intense pressure to solve everything NOW. It’s a real pain.

Of course, there are a number of great resources if you are learning SQL Server from scratch.  Here are two of them that have helped me:

These will lead you in the right direction, but the real challenge is learning how to deal with problems while under pressure. Reading about a problem isn’t the same as dealing with it first hand. What I think is really exciting is the idea of a safe environment where you can solve a problem without any risk.

Enter Azure (or any virtualization solution). Azure provides a really fun opportunity, in my mind. You can spin up a virtual machine with SQL Server in a matter of minutes. Then you are free to break things to your heart’s content. So why not take that a step further? Why not create labs where things are intentionally broken and your job is to fix them? Your only guides are a vague error message and your Google-Fu. Just like in real life.

So, I’m going to start a series of posts on how to create intentionally broken VMs to train your junior DBAs on. Next, I’ll start creating PowerShell scripts to automate the creation process. Finally, we’ll take a look at consolidating those scripts to create a single SQL Server machine with a lot of problems that need fixing.

For the first lab, I’m going to cover how to create a machine with an out-of-control transaction log. Please feel free to suggest any ideas or provide any comments.