Friday, December 2, 2011

10 Reasons why the build works locally but fails on the build server

This is a braindump:
1.       Developer did not check in all the files, or the developer doesn't have the latest files (sometimes TFS hiccups when getting the latest DLLs).
2.       Different modes (release vs. debug). Either #if DEBUG code, or the project is unchecked in Configuration Manager.
3.       Different bin structure – each project gets its own (the default for Visual Studio) vs. a single shared bin for all (the default for TFS). This is especially common when different versions of the same assembly are referenced by multiple projects in the same solution.
4.       Different platform/configuration (say x86 vs. Any CPU, or Debug vs. Release).
5.       The build is running other steps (perhaps a packaging step or command-line unit tests).
6.       Different bitness – say the developer workstation is 64-bit but the build server is 32-bit, and some extra step breaks because of this.
7.       Rebuild vs. build. The developer isn't running a rebuild, so a DLL that fails to compile still exists on the dev machine from some other process, while the clean build server fails.
8.       Workspace mapping is incorrect – TFS isn't getting all the files it needs.
9.       Unit test code coverage – Visual Studio (at least 2008) can be very brittle when running command-line unit tests and code coverage.
10.   Treat warnings as compile errors – depending on your settings, the build server may fail on these, while Visual Studio only flags a warning (which the dev ignores).
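The debug-vs-release mismatch in reason #2 isn't unique to .NET. As a rough sketch in Python (whose built-in `__debug__` flag plays a role loosely similar to `#if DEBUG`; the config values are made up for illustration), code can silently behave differently once optimizations are enabled:

```python
# A minimal sketch of the debug-vs-release trap (reason #2).
# Under a normal run, __debug__ is True; under `python -O` (think
# "Release build"), it is False and the guarded code disappears.

def load_config():
    config = {"retries": 3}
    if __debug__:
        # Developer-only override: present on the workstation, gone on
        # a build/run that enables optimizations.
        config["endpoint"] = "http://localhost:8080"
    return config

config = load_config()
# Locally this prints the localhost endpoint; under `python -O` the
# key is missing and anything relying on it breaks.
print(config.get("endpoint", "<missing>"))
```

Run it both ways (`python debug_trap.py` vs. `python -O debug_trap.py`) and you get two different behaviors from identical source – exactly the kind of gap that shows up only on the build server.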

Tuesday, November 29, 2011

Why I'm liking Pluralsight

My department had scheduled to send each of us to training. We had different training classes, and the specific vendor for my class needed to cancel. That left me short notice to squeeze in different training by year end. So, being creative, I got an online subscription to Pluralsight instead of the traditional training.
Pluralsight is a set of online .NET training videos created by industry experts. Each course runs 2-4 hours' worth of PowerPoint slides and code demos. It's worked out very well. What I'm liking so far:
·         Different medium - After 10 linear feet of books, I like the different medium. Hearing someone's voice seems to trigger a different part of the brain for remembering, and seeing the demo from end-to-end has obvious benefits over isolated screenshots in a book or article.
·         It's on-demand – It's hard to make it to physical events. I like the inherent benefit of on-demand training, where I can listen on my schedule (by which I mean everyone else's schedule - my kid's sleeping schedule, my company's work schedule, etc…)
·         Professional content - There are tons of free videos online, but these are often like reactionary scraps. To break the ice with a new technology, it helps to have a systematic 2-hour block that goes from end-to-end.
·         Track progress – Some personality types won't care about this, but I like how it tracks completion progress through courses. It's almost like finishing levels of a video game.
·         Coordinated – I don't need 10 videos all telling different or rehashed angles of the same thing (which is often what I'd find in a google search) – rather I need one good video that nails it, or a collection of videos that each explains their specific part well.
·         Continually Improving – They seem to come out with a few new "courses" every week.
It's getting to the point where rather than watch my favorite sitcom, I watch the next Pluralsight video.

Sunday, November 27, 2011

Measure what you actually care about

Our three kids are currently 2, 4, and 6. We are starting to potty train the youngest. She's a cute thing, but you can imagine it's always a trying experience. Because I'm very anti-ivory-tower, and think the best developers are the ones grounded in the practicality of everyday life (such as potty-training a two-year old), I can't help but think how this relates to software engineering.
Here's how: we found ourselves rewarding our daughter every time she successfully went potty. It sounded reasonable, but we remembered that it's actually misleading – we're rewarding the wrong thing. What we really want is not a two-year old that goes potty every 20 minutes in order to earn her chocolate-chip, but rather a two-year old that remains dry. Even two-year olds can figure out how to game the system.
This sort of misguided measurement often occurs in demoralized IT shops. For example, the main compensation is based on the number of bugs fixed or number of UI screens created (because it's easy to measure), but what they actually care about is increased functionality or quality. The irony is that this often encourages the exact opposite of what the boss really wants. Just like I don't want a two-year old going "tinkle" every 20 minutes, I don't want developers gaming the system by fixing large quantities of irrelevant or duplicate bugs.
This blog post doesn't have a quick answer; I mainly just wanted to write about my daughter's potty-training adventures while she was taking a nap. But a quick approach is to focus on what you actually care about (say, quality), and then work backwards, thinking "what would high quality look like?" – fewer production complaints, less support time, less application downtime, less developer time spent fixing bugs, etc… Then figure out how to measure those things.

Wednesday, August 3, 2011

Whatever requirements we're given tomorrow, we got to get that done

I've seen a thousand hacks justified with "We got to get it done". You know the drill – copy and paste 200 lines of code, hard-code data that should be configurable, skip any automated testing, etc… Such hacks come at the expense of future flexibility (i.e. good design).
However, ironically, given the continual feature change, scope creep, and unknowns in software development, the real mandate becomes "Whatever requirements we're given tomorrow, we got to get that done."
This second mandate, the more realistic one for long-term departments, carries completely opposite connotations from the first. Instead of cranking out a feature now with no concern for maintenance costs or flexibility tomorrow, developers need to prepare – i.e. ensure they have automation, builds, reuse, etc…
Besides technical debt, the other problem I have with the "just get it done now" crowd is the false sense of nobility. Often these devs insist that they're doing a good thing (putting out a fire), but really it's just punting the problem down the road for someone else to pay while they boast how quickly they've solved it.

Wednesday, July 27, 2011

Why you’re in trouble if you rely on 30-page SOPs

Every organization wants to have its development processes documented into Standard Operating Procedures (SOPs) for the obvious reasons – faster onboarding, standardization, auditing, etc… The Holy Grail is the potential to hire a bunch of contractors (or outsource), tell them to read a novel worth of documentation, and then they’re fully up to speed a week later. SOPs also imply that the team knows what it is doing and has a plan, which is one indicator of a mature organization. SOPs are also a prerequisite for outsourcing, an appealing option for large organizations.
While documentation has its benefits, you can’t rely solely on large documents to communicate process and onboard new people. Here are four common problems, each of which results in a frustrated and confused team:
  1. The doc itself will be wrong (or outdated), such as skipping steps or assuming institutional knowledge. This is especially common if you hire outside consultants (with no institutional knowledge of your systems) to document your process.
  2. It’s easier to bluff a doc – a busy tech writer will hurry the doc, thinking it looks done ("I’ve written 50 pages!"), but the content won’t be correct or specific enough.
  3. Screens will vary (example: the software is upgraded or the OS doesn’t match).
  4. People simply won’t read the docs – they’ll skim and miss details.
Several ways to communicate an SOP instead of just 30-page MSWord docs:
  • Favor automation over documentation where possible. The best document is an automated script. A script is usually kept up to date (compared to a doc) because developers need the script to work. It’s also much faster (and less error-prone) for a new guy to kick off the script than to tediously step through 20 pages of instructions.
  • Lower the cost of documenting by leveraging a wiki. Developers are far more likely to update or correct a wiki than a big MSWord doc on SharePoint.
  • Favor simplifying the process so the doc itself is simpler.
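To make the first bullet concrete, here's a minimal, hypothetical sketch of "the best document is an automated script" – replacing a "check your prerequisites" page of an SOP with a script that checks them itself. The tool names are illustrative, not from any real SOP:

```python
# A sketch of "automation over documentation": instead of a doc saying
# "make sure you have X, Y, Z installed", the script checks and reports
# each prerequisite, and fails loudly like a build would.
import shutil
import sys

REQUIRED_TOOLS = ["git", "python"]  # illustrative prerequisites

def check_prerequisites(tools):
    """Return (tool, found) pairs -- no skimming 20 pages required."""
    return [(tool, shutil.which(tool) is not None) for tool in tools]

if __name__ == "__main__":
    results = check_prerequisites(REQUIRED_TOOLS)
    for tool, found in results:
        print(f"{'OK     ' if found else 'MISSING'} {tool}")
    sys.exit(0 if all(found for _, found in results) else 1)
```

A new hire runs one command and either proceeds or sees exactly what's missing – and because the team itself relies on the script, it stays current in a way a 30-page doc never does.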

Monday, July 25, 2011

If you're going to fail, fail big – example with dirty towels

After showering the other day, I needed a towel to dry off.  I saw a bunch of towels in a nearby laundry basket, but wasn't sure if they were dirty laundry going downstairs, or clean laundry coming upstairs. I was too lazy to walk down the hall to get a towel from the closet that I knew would be clean, and the towels in the basket looked clean, so I used the ones from the basket because they were immediately available. But it didn't feel "right". I started feeling bits of lint and loose hair on me, so I did the most thorough test I could find – I phoned my wife and asked her if the towels in the laundry basket were clean or dirty. She informed me that they were "of course" dirty towels used as floor mats.

Like every other aspect of normal life, I see this as directly analogous to software coding. The dirty towel is like a defect. If it had been obviously dirty, I never would have used it, saving myself the grossness of drying off with a floor mat. Applied to software: better to have a defect explode in your face so that you're forced to fix it, as opposed to a "half-bug" that continually bites you.

Wednesday, July 20, 2011

Ghosts and Time Bombs

Having bugs that are reproducible on your local machine is a luxury. Enterprise production apps are often devoid of such luxuries. Indeed, often the reason a bug gets past developers, code reviews, QA, UAT, regression, and every other quality control measure is because it is not acting in an obviously deterministic way. Two common types of such bugs are "Time Bombs" and "Ghosts".

A time bomb works perfectly, only to explode at some point in the future because it depends on some external dependency that eventually changes. These are usually deterministic, and can be reproduced if you know what you're looking for, but it's very hard to trace the exact cause. The temptation with the time bomb is that it's working perfectly right now, and everyone is always so busy, so they move on.
Examples of time bombs are:
·         Dependency on the clock – Y2K was the most famous case. Other examples include code that doesn't account for the new year (say it sorts by month, and doesn't realize that January 2012 is greater than December 2011), or storing total milliseconds in an Int32 (which overflows after about 25 days).
·         Growing data – Say your system logs to a database table, and it works perfectly on local and even QA tests, where the database is constantly refreshed. But then in 6 months (after the developers have rolled off the project and no one even knows about the log) the log table becomes so bloated that performance slows to a crawl and all the connections time out.
·         Memory leak – Similar to the growing data.
·         Service contract changes or expires – In today's interconnected systems, it is common to have external data dependencies on third parties. What if a business owner or manager forgets to renew one of these contracts, or the schema of the contract changes, and hence the service constantly fails? Even worse – say you shell out to a third-party tool (with System.Diagnostics, hiding the window so there's no popup) that displays a visual EULA after such an expiration, and all you see is that the process appears frozen because it's waiting on the (hidden) EULA?
·         Expiring Cache – What if you store critical data in the cache on startup, but that data eventually expires without any renewal policy and the app crashes without it?
·         Rare events with big impact – What if there's an annual refresh of an external data table? I've seen apps that work perfectly in prod for 8 months, processing some external file, and then unexpectedly explode because they're given an "annual refresh" file that is either too big or has a different schema.
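Two of the clock-related bombs above are small enough to sketch in code. This illustrative Python fragment (not from any real system) shows the month-only sort and the 32-bit millisecond counter; Int32.MaxValue milliseconds is only about 24.8 days:

```python
# Sketches of two clock-related time bombs from the list above.
import datetime

def sort_by_month(dates):
    """Buggy on purpose: sorts by month only, ignoring the year, so
    January 2012 sorts before December 2011 -- the 'new year' bomb."""
    return sorted(dates, key=lambda d: d.month)

INT32_MAX = 2**31 - 1  # 2,147,483,647 ms is only ~24.8 days of uptime

def add_ms_int32(total_ms, delta_ms):
    """Simulates accumulating elapsed milliseconds in a signed Int32:
    a long-running process wraps negative in under a month."""
    result = (total_ms + delta_ms) & 0xFFFFFFFF
    return result - 2**32 if result >= 2**31 else result
```

Both functions pass any test run entirely inside one month of one year – which is exactly how most local and QA testing happens, and why these bombs survive until production.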

General ways to test for time bombs:
·         Leave the app running for days on end.
·         Forcibly kill the cache in the middle of running – will it recover?
·         Do load testing on the database tables.
·         Make sure you have archive and cleanup routines.
·         Set the system clock to various values.
·         Test for annual events.
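Setting the system clock is the blunt way to run these tests; a gentler pattern is to inject the clock as a dependency, so any date can be simulated without touching the machine. A minimal sketch (the report labeller is a made-up example, not from the post):

```python
# A sketch of clock injection: code that accepts 'now' as a parameter
# can be tested at New Year's, leap days, etc., without resetting the
# machine clock.
import datetime

def make_report_label(now=None):
    """Hypothetical report labeller; 'now' is injectable for tests."""
    now = now or datetime.datetime.now()
    return f"report-{now.year}-{now.month:02d}"

# Simulate the New Year rollover in a test instead of on the server:
dec = datetime.datetime(2011, 12, 31, 23, 59)
jan = datetime.datetime(2012, 1, 1, 0, 1)
print(make_report_label(dec), make_report_label(jan))
```

Code written this way makes "set the system clock to various values" an ordinary unit test rather than an operation on the whole machine.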

Ghosts are bugs that work perfectly in every environment that you can control, but randomly show up in environments you can't. You can't reproduce them, and you don't have access to the environment, so it's ugly. Ghosts are tempting to ignore because they seem to go away. The problem is that if you can't control the ghost, then you can't control your own application, and that looks really bad to senior management. Examples of ghosts include:

·         Concurrency, threading, and deadlocks – Because 99% of devs test their code as a single user stepping through the debugger, they'll almost never see concurrency issues.
·         Environmental issues – For unknown reasons, the network has hiccups (limited bandwidth, so sometimes your app gets kicked out), or the database occasionally runs significantly slower, causing your performance-tuned application to time out.
·         Another process overwrites your data – Enterprise apps are not a closed system – there could be other services, database triggers, or batch jobs randomly interfering with your data.
·         Hardware failures – What if the network is temporarily down, or the load balancer has a routing error (it was manually configured wrong during the last deploy?), or a disk is corrupt?
·         Different OS or Windows updates – Sometimes devs create (and debug) on one OS version, but the app actually runs on another. This is especially common with client apps where you could create it on Windows 7 Professional, but it runs on Windows Vista. Throw in service packs and even windows updates, and there can be a lot of subtle differences.
·         Load balancing – What if you have a web farm with 10 servers, and 9 work perfectly, but the last one is broken (or was deployed to incorrectly)? The app appears to work perfectly 90% of the time. Realistically, say it's a compound issue where that server only fails 10% of the time – then your app appears to work 99% of the time.
·         Tedious logic with too many inputs – A complex HR application could have hundreds of test cases that work perfectly, but say everyone missed the obscure case that only occurs when a twice-terminated user logs in and tries to change their email address.

General ways to test for ghosts:
·         Increase load and force concurrency (You can easily use a homegrown threading tool to make many web service or database calls at once, forcing a concurrency test).
·         Simulate hardware failures – unplug a network cable or temporarily turn off IIS in your QA environment. Does the app recover?
·         Allow developers some means for QA and Prod debug access – if you can finally reproduce that bug in prod (and nowhere else), the cheapest solution is to allow devs some means to troubleshoot it there. Perhaps they need to sit with a support specialist to use their security access, but find a way.
·         Have tracers and profilers on all servers, especially web and database servers.
·         Have a diagnostic check for your own app. How do you know your app is healthy? Perhaps a tool that pings every web service (on each machine in the web farm), or ensures each stored proc is correctly installed?