timstall

Thursday, August 20, 2009

10 tips to integrate CodeSmith into your processes

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/10_tips_to_integrate_codesmith_into_your_processes.htm]

Say you've theoretically seen why code generation is so profitable, so you've downloaded a free trial of CodeSmith, and banged out a few templates. In other words, you've got code generation working on a single developer machine. That's great, but it's even better to have it adopted by the entire department. Here are some practical tips on how to integrate CodeSmith into your processes.

Aim for active regeneration - There are two kinds of generation, Active and Passive. Active means that the code is actively regenerated on a regular basis. Passive means it was generated just once, and then modified manually thereafter. The problem with Passive generation is that it lets developers create tons of code upfront, but then people get trigger-happy and use the generator to produce even more code, and you're stuck now maintaining it all. It's like a trip with no return ticket. It also misses out on many of the other benefits of codeGen - like mass updating code with some new change.
Always have a batch script - Yes, people can integrate into VS, or use the CodeSmith IDE. But to enforce uniformity, ensure that the right properties are passed in, and hook into your Continuous Integration (CI) build, you'll need a batch script.
Run the codeGen from your CI build - This enforces active regeneration.
Consider not checking generated code into source control - This prevents synchronization errors between local developers and the build server. Yes, all code should ultimately be checked into source control, which is why we still check the templates themselves in, from which you can deterministically recreate all the target code. Your automated checkout script, which gets the latest from source control, can then run the templates and recreate the target code. NOTE - this only works if you're not using merge regions (which mix generated and custom code in the same file). If you use merge regions, then you need to check in the generated files.
Avoid merge regions where possible - CodeSmith has this powerful feature called "merge regions" which lets you mix both generated and custom code in the same file. Sometimes you need this, but if you have a choice - always opt to put generated code in its own, dedicated file. This prevents synchronization issues, is less likely to break, makes overwriting files easier in active generation, and is easier for most developers to understand and maintain.
Ensure that code generation can be run on every developer's machine - Because you'll want to actively re-generate the code, you'll need each developer to be able to run those generation batch scripts locally. That means each developer will need a license for CodeSmith. This is absolutely not the place to be stingy. If developers cannot simply make a change and have the code re-generated, they will revolt against using code generation.
Clearly identify the generated files - Make sure an average developer can quickly identify that a given file is code-generated. You could name the file with a "*.CodeGen.cs" extension, put a comment disclaimer at the top ("//This file is code-generated. Any changes will be overwritten"), and not check the target code into source control so that it doesn't have any overlay icons (like what SVN offers).
Know your overwrite strategy - If the target file already exists (because you're actively re-generating), make sure you know the expected behavior. If you don't use merge regions, you can simply overwrite the file. Source control should be smart enough to see that the file has the exact same content, and hence it shouldn't be a burden. Worst case, you can have your generator, before it writes out the generated content, detect if the target is the same or not, and handle appropriately (not write anything, have your build server throw a synchronization error if they're different, etc.).
Don't output the DateTime or user info into the code - When someone first uses a code generator, it can be tempting to add as many "free" details as possible into the target file - like displaying "//This code was generated by Homer, on August 20, 2009 at 11:34 pm". You'd never maintain that by hand, so it initially looks cool to see all that crisp information in your file. However, the problem is that every time the code is regenerated, that kind of information changes, and the code continually appears to be updated. Furthermore, such comments don't give you anything that you couldn't already get from source control.
Have a backup developer - Make sure that at least one other developer on the team can use the generating tool.
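
To make the overwrite and "no DateTime" tips concrete, here is a minimal sketch of a write-only-if-changed step. It's in Python rather than a CodeSmith template, and the names (write_if_changed, HEADER, the Customer.CodeGen.cs file) are made up for illustration - this is not CodeSmith's API, just one way a wrapper script might do it.

```python
import os
import tempfile

# Hypothetical disclaimer header. It deliberately contains no date or user
# name, so regenerating from unchanged input gives byte-identical output.
HEADER = "// This file is code-generated. Any changes will be overwritten.\n"


def write_if_changed(path: str, body: str) -> bool:
    """Write HEADER + body to path only if the content differs.

    Returns True if the file was written, False if it was already up to date.
    """
    new_content = HEADER + body
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8", newline="") as f:
            if f.read() == new_content:
                return False  # identical content: leave the file untouched
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(new_content)
    return True


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "Customer.CodeGen.cs")
    print(write_if_changed(target, "public class Customer { }\n"))  # first run
    print(write_if_changed(target, "public class Customer { }\n"))  # unchanged rerun
```

A build server could treat a True result on a file that was supposed to be current as a synchronization error instead of silently overwriting it.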

Monday, August 17, 2009

When should you use Code Generation?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/when_should_you_use_code_generation.htm]

Like everything, using code generation is a tradeoff. Basically, you want codeGen when the cost of writing the templates is less than the cost of writing what the templates generate. For example, codeGen rocks at creating and maintaining tedious plumbing code - data access plumbing is the canonical example.

You should probably consider code generation when the target code:

Has similar patterns and deterministic rules, and little deviation from those rules. If there's an MS word doc or wiki page providing detailed instructions on how to write certain code, you may be able to send those instructions to the code generator instead.
Is large and brittle, and it can be described easily (e.g. 1 line in an XML config file to describe = 20 lines of C# and SQL coding).
Requires syncing with external data sources (like ADO.Net plumbing based on the database schema, such as var type, size, parameter order, count).
Requires multiple files to be kept in sync (like a SQL stored proc being in sync with C# ADO.Net wrapper).
Requires in-depth knowledge of a problem domain (like ADO.Net for data-access plumbing).
Is continually updated (like adding new classes to your DAL).
May need to be expanded in the future (like adding a whole new layer of webservices, or Audit triggers, to your DAL. Or, perhaps even migrating to a new language).
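
As a rough illustration of a short description expanding into repetitive target code, here is a toy generator. It's a sketch in Python using string.Template, not CodeSmith's template syntax, and the spec format and names (SPEC, generate_class) are hypothetical.

```python
from string import Template

# Hypothetical one-line description: (class name, [(column, C# type), ...]).
SPEC = ("Customer", [("Id", "int"), ("Name", "string"), ("ZipCode", "string")])

CLASS_TEMPLATE = Template(
    "public class $name\n"
    "{\n"
    "$properties"
    "}\n"
)
PROPERTY_TEMPLATE = Template("    public $type $column { get; set; }\n")


def generate_class(spec) -> str:
    """Expand a short spec into C# source text using the templates above."""
    name, columns = spec
    properties = "".join(
        PROPERTY_TEMPLATE.substitute(type=cs_type, column=column)
        for column, cs_type in columns
    )
    return CLASS_TEMPLATE.substitute(name=name, properties=properties)


if __name__ == "__main__":
    print(generate_class(SPEC))
```

The payoff only shows up when the same template is run over many specs; for a single class, writing the file by hand is cheaper.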

Code Generation is not a "Golden Hammer" - while it's great, it's not the perfect solution for everything. codeGen may not be the best solution if the target code:

Is very custom with no general pattern. If you can't abstract a pattern out of the target code, then you won't be able to write a generation template.
Is too small and trivial - In general, codeGen should decrease your total lines of code. So if you're writing a 50 line template to produce a single 30 line C# file, it's probably a bad ROI.
Can be refactored away, and doesn't need to exist in the first place.

Like everything else, in some contexts, it just isn't profitable - but in other contexts it's awesome. Some problems are cheaper to solve using other techniques, like unit testing, automation, open-source, or DSLs; but every advanced developer should have code generation in their tool belt.

Monday, August 10, 2009

How Zip Codes can get complicated

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/how_zip_codes_can_get_complicated.htm]

I mentioned in a previous post that while states (in an address) seem simple - indeed most developers have made a dropdown to "pick your state" - in legacy apps they can quickly get complicated. Same thing applies to zip codes. It sounds like a secondary afterthought - "Oh, just add a field to the application so we can store numbers like 20500." However, it can quickly snowball:

Do you store the 4 digits at the end, like "20500-0003"?
If you remove the spaces and dashes, you're left with just numbers - which seems easier to store and search on. So do you store it as an integer (205000003)? This might work if you're only looking at cities in the midwest or west coast, but some east coast states use zip codes that start with a "0", which would get truncated if stored as a number. Personally, I prefer to store them as a varchar, and then have a UI validation (for new) and backend scrubbing process (for existing) to standardize the format in the database.
Do you enforce valid zip codes only? Not every 9-digit combination of numbers is a valid zip code - i.e. there are not 1 billion distinct codes that are actually used.
Many applications assume that one zip code belongs to one state - but there are scenarios where a single zip code is shared by multiple states (seriously).
Do you allow your zip code field to store international postal codes? Many US applications start off small, and only worry about the US market. Then some business sponsor says "we're missing out on the Country XYZ market, quick, update the app to handle foreign cities". This often causes changes to an address screen, and the quickest way to change it may be providing an "Out-Of-Country" option for the state dropdown, and allow the zip code to store international postal codes.
And do you handle all this in the UI with a rich control, or just use a "flexible" 10-character textbox?
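
Here is a minimal sketch of the "store as text, standardize the format" approach, written in Python. The function name normalize_zip and the choice of "NNNNN-NNNN" as the standard format are my assumptions; it only checks the shape of the value, not whether the zip code actually exists.

```python
import re
from typing import Optional

# Accepts 5 digits, or 5+4 digits with an optional dash or space between them.
_ZIP_PATTERN = re.compile(r"([0-9]{5})(?:[-\s]?([0-9]{4}))?")


def normalize_zip(raw: str) -> Optional[str]:
    """Return 'NNNNN' or 'NNNNN-NNNN', or None if the shape is not recognized.

    This only validates the format; it does not verify that the code is in use.
    """
    match = _ZIP_PATTERN.fullmatch(raw.strip())
    if match is None:
        return None
    five, plus_four = match.groups()
    return f"{five}-{plus_four}" if plus_four else five


if __name__ == "__main__":
    for value in ["20500", "205000003", "20500 0003", "02134", "2134"]:
        print(repr(value), "->", normalize_zip(value))
    # Storing as an integer drops the leading zero:
    print(int("02134"))  # 2134
```

The same function could back both the UI validation for new entries and a scrubbing pass over existing rows.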

Thursday, August 6, 2009

What can go wrong with changing a text label?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/what_can_go_wrong_with_changing_a_text_label.htm]

What could go wrong if the business sponsor just wants to change some little text on a label? It sounds like the simplest thing. And while it should be a simple change, it can sometimes get complicated (i.e. expensive).

Globalization - If the app needs to support multiple languages, then the new text needs to be translated into all those target languages.
Special Characters - The new text could contain special characters that need to be encoded (like & or < or >).
Non-supported characters - The new label introduces a special character not in the original character set (such as some foreign language), and the application only supports basic ASCII.
Changes flow layout - The new label is longer or shorter, which changes the flow layout. For example, the new text is long enough to force wrapping (which pushes a row too high) or it doesn't wrap so it pushes a column too wide.
It's an image - Perhaps the label is actually an image (for fancier looking designs), not just text.
Text is not determined in the presentation layer - Perhaps the label is set dynamically through code or configs, such as pulling from a meta-data dictionary.
Text was dynamically built - Perhaps the label was dynamically concatenated via some existing logic (for example, it pluralizes the text if some count > 1), so you're not just setting a literal string anymore.
You don't own the text - Perhaps the label comes from an embedded, third-party component that you don't own and cannot easily change.
The label appears in multiple places - Perhaps the label appears in multiple places, and so all places need to be updated. For example, it also appears in a UI Report writer where you select the UI-friendly name instead of the database schema column/table.
(Bad design) - other code depends on the label text - Perhaps the application has bad design, and there's actually code that reads the value of the label to determine if it should do some other action. For example, if the label contains some word X, then hide section Y (as opposed to whatever method sets label to word X also hides section Y).
New font - Perhaps the new text actually introduces a new font that isn't available on the client (this isn't merely changing text, but it's associated with that kind of request).
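
For the special-characters case, here is a small sketch of encoding label text before it goes into markup. It's Python using the standard html.escape function; render_label and the span markup are hypothetical stand-ins for whatever your UI framework does.

```python
import html


def render_label(text: str) -> str:
    """Return an HTML span for a label, with the text HTML-encoded."""
    return f'<span class="label">{html.escape(text)}</span>'


if __name__ == "__main__":
    # Without encoding, the "<draft>" part would be parsed as a tag.
    print(render_label("Terms & Conditions <draft>"))
```

If the new label text is pasted straight into a template with no encoding step like this, a harmless-looking "&" or "<" can break the page.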

Wednesday, July 22, 2009

Thinking that you can learn it all

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/thinking_that_you_can_learn_it_all.htm]

I think .Net is too huge for one person to learn it all, and it's just getting bigger - like a galaxy. However, sometimes an optimistic developer may get the temporary delusion that they can learn it all, or at least the parts that matter. How could someone become so optimistic?

Unexpected free time.
Something came easier than expected.
You're on a prestigious fast track project.
A really good teacher explained something very well (and quickly).
You're kidding yourself - you're just skimming, or only looking at buzzwords, not really digging into the tech.

Basically, if things are temporarily going well (i.e. you're absorbing new concepts really fast), it may be tempting to think that "ah ha, this learning thing has finally clicked, and it will always go fast from now on!" Oh, how I wish...

Sunday, July 19, 2009

Would you still write unit tests even if you couldn't automatically re-run them tomorrow?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/would_you_still_write_unit_tests_even_if_you_couldnt_automa.htm]

I am constantly amazed at how difficult it is to encourage software engineering teams to adopt unit testing. Everyone knows (wink wink) that you should test your own code, and we all love automation, and all the experts are pushing for it, and we all know how expensive bug fixes are, etc... Yet, there are still many experienced and good-hearted developers who simply don't write unit tests.

I think a critical question may be "Would you still write unit tests even if you couldn’t automatically re-run them tomorrow?"

Here's why - most managers who push unit tests do so saying something like "Yeah, it's a lot of extra work to write all that testing code right now, but you'll sure be glad in a month when you can automatically re-run them. Oh, and by the way, you can't go home today until you fix these three production issues."

The problem is this demotes unit testing to yet another "invest now; reward later" methodology. This is a crowded field, so it's easy to ignore a newcomer like "unit testing". Obviously, most devs live in the here and now, and they're just trying to survive today, so they care much more about "invest now, reward now".

The "trick" with unit testing - at least with basic unit testing to at least get your foot in the door - is that it adds immediate value today. Even if you can't automate those tests tomorrow, it can often still help get the current code done faster and better. How is this possible?

Faster to develop - Unit testing is faster to develop with because it stubs out the context. Say you have some static method buried deep within your web application. If it takes you 5 minutes to set up the data, recompile the host app, navigate to the page, and do whatever action triggers your method being called - that's a huge lag time. If you could write a unit test that directly calls that method, such that you can bypass all that rigmarole and run the static method in 5 seconds - and now you need to test 10 different boundary cases - you've just saved yourself a good chunk of time.
Think through your own code - Unit testing forces you to dog food your own code (especially for class-library APIs). It also forces you to think out boundary conditions - per the previous point, if it takes several minutes to test one usage of a function, and that function has many different boundary cases, a time-pressed developer simply won't test all the cases.
Better design - Testable code encourages a more modular design that is more flexible to change, and easier to debug. Think of it like this: in order to write the unit test, you've got to be able to call the code from a context-free, class library; i.e. if a unit test can call it, then so could a windows service, web service, console app, windows app, or anything else. Every external dependency (i.e. the things that usually break in production due to bad configuration) has been accounted for.
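
Here's a sketch of what "directly call the method and hit the boundary cases" can look like. It's Python's unittest rather than a .NET test framework, and page_count is a hypothetical stand-in for the kind of small static method that ends up buried in a web app.

```python
import unittest


def page_count(total_items: int, page_size: int) -> int:
    """Hypothetical helper: number of pages needed to show total_items."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if total_items < 0:
        raise ValueError("total_items must not be negative")
    return (total_items + page_size - 1) // page_size


class PageCountTests(unittest.TestCase):
    def test_boundary_cases(self):
        # (total_items, page_size, expected pages)
        cases = [(0, 10, 0), (1, 10, 1), (10, 10, 1), (11, 10, 2), (99, 10, 10)]
        for total, size, expected in cases:
            with self.subTest(total=total, size=size):
                self.assertEqual(page_count(total, size), expected)

    def test_rejects_non_positive_page_size(self):
        with self.assertRaises(ValueError):
            page_count(5, 0)

    def test_rejects_negative_total(self):
        with self.assertRaises(ValueError):
            page_count(-1, 10)
```

Saved as a module, this can be run with "python -m unittest", and every boundary case runs in well under a second - no host app, no navigating to a page.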

Even if you could never re-run those unit tests after the code was written, they are still a good ROI. The fact that you can automatically re-run them, and get all the additional benefits, is what makes unit testing such a winner for most application development.

Sunday, July 12, 2009

The address's State field may contain more than just the 50 states

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/the_addresss_state_field_may_contain_more_than_just_the_50_.htm]

Most business applications eventually ask the user to enter an address. There's the user's address, shipping address, their company's address, maybe an emergency contact's address, address history, travel-related addresses, financial addresses, etc... Most addresses have a city, state, and zip. While city and zip seem simple (more on that later), many devs initially expect the "State" field to be simple - perhaps a two-character column that can store the 50 US states. However, it can quickly balloon to something much more complicated (especially if you're troubleshooting some legacy app). Besides the standard states, it could contain:

Military codes (reference)

Armed Forces Africa	AE
Armed Forces Americas (except Canada)	AA
Armed Forces Canada	AE
Armed Forces Europe	AE
Armed Forces Middle East	AE
Armed Forces Pacific	AP

US Possessions (reference)

AMERICAN SAMOA	AS
DISTRICT OF COLUMBIA	DC
FEDERATED STATES OF MICRONESIA	FM
GUAM	GU
MARSHALL ISLANDS	MH
NORTHERN MARIANA ISLANDS	MP
PALAU	PW
PUERTO RICO	PR
VIRGIN ISLANDS	VI

Perhaps Canadian provinces? (reference)

Alberta	AB
British Columbia	BC
Manitoba	MB
New Brunswick	NB
Newfoundland and Labrador	NL
Northwest Territories	NT
Nova Scotia	NS
Nunavut	NU
Ontario	ON
Prince Edward Island	PE
Quebec	QC
Saskatchewan	SK
Yukon	YT

Generic codes to indicate international use?

Foreign Country	FC
Out of Country	OC
Not Applicable	NA

Or, specific applications may try their own proprietary international mapping, like "RS" = Russia. This might work if you're only doing business with a handful of countries, but it doesn't scale well to the 200+ (?) existing other countries (i.e. I would not recommend this. Use a "Country" field instead if feasible).

Special codes to indicate an unknown, or empty state?

Perhaps, for some reason, the application developer isn't storing just 2-char codes, but rather integer ids that map to another "States" table, so you see numbers like "32" instead of "NY" ("New York")?

Or, even worse, they're shoving non-state related information into the state column as a hack that "made something else easier".

How many distinct entries could you have?

With 26 letters, you've only got 26 ^ 2 = 676 options.
If you use numbers too, you've got (26 + 10) ^ 2 = 1296 options.
If you start using lower case letters (SQL string comparisons are often case-insensitive, depending on the collation, but case may still matter in managed code), then you've got (2 * 26 + 10) ^ 2 = 3844 options.
Add in some special characters (such as spaces, periods, asterisks, hyphens, underscores, etc...), maybe 10 of them (if the column isn't validated as strictly alphanumeric), and you've got (2 * 26 + 10 + 10) ^ 2 = 5184 options.

That's potentially 100 times more than just the 50 US states. Of course, for new development, we'd all prefer some clearly-defined schema with referential integrity and a business-sensible range of values. However, the real world of enterprise applications is messy, and you have to be prepared to see messy things.
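
The counts above are easy to double-check with a few lines of Python. The "10 special characters" figure is just the rough guess from the list, not a real limit.

```python
# Number of distinct two-character codes for each alphabet size.
alphabets = {
    "uppercase letters": 26,
    "plus digits": 26 + 10,
    "plus lowercase letters": 2 * 26 + 10,
    "plus ~10 special characters": 2 * 26 + 10 + 10,
}
counts = {label: size ** 2 for label, size in alphabets.items()}

for label, count in counts.items():
    print(f"{label}: {count}")

# Ratio of the largest count to the 50 US states.
print(counts["plus ~10 special characters"] / 50)
```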

It sounds simple, but that innocent "state" field can quickly get very complex.