Tuesday, September 29, 2009

A quick overview of enterprise object caching

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/a_quick_overview_of_enterprise_object_caching.htm]

Caching is a performance mechanism where you store an expensive-to-create value for future consumption. For example, you may cache dropdown values to spare yourself expensive database hits. Caching is one of those buzzwords that everyone knows your application should have, but surprisingly few really do.

Books have been written on caching, so this is just a quick brain dump based on my personal experiences. Also note that I refer to "the database" a lot because it's the main data dependency that most developers can relate to, but really, it could be anything (web service, external file, etc...).

Frontend and Backend Caching

Frontend UI cache
  • Where is it located? Local (in process).
  • Pro: Faster because it's local - no remote hit.
  • Con: Does not handle updating values - data may be stale. Limited to just HttpContext.
  • Example: Asp.Net HttpContext.Cache

Backend cache
  • Where is it located? Remote (external machines).
  • Pro: Handles updating stale data in distributed systems. Handles any CLR serializable object, independent of the UI layer.
  • Con: Slower because it's remote - you still need to pay for the remote hit.
  • Example: Memcache, Velocity, others...

Obvious follow-up question: "Would you double-cache something, taking data from the backend cache and saving it to the frontend cache?" Sure, if it benefits your specific scenario. Ideally the backend cache is totally encapsulated anyway, so your frontend UI developer wouldn't even know whether the data they're working with came from a cache or not.

What is a good candidate for caching?

Any data that:

  • Takes a lot of time to create, either due to a remote hit (to a database or web service), or a large calculation time (like querying a million rows).
  • Has a small final result - you query a million rows just to return a single value.

  • Does not change - the problem with caching is stale data.

  • Has minimal dependencies. If your object touches 10 tables for creation, then there's a much greater chance of it becoming stale. This is one benefit of loosely coupled (and then batched) objects, instead of spaghetti code.

  • Has many reads, but very few writes. For example, system-level data (that everyone constantly requests) is good for caching, but employee-level data may not be.

  • Is retrieved externally and requires high uptime - for example you make a web service call to get some value, you call the service again 5 minutes later, the service is down, and you really wish you had even a "stale" copy of that data.

The canonical example is something like city-state dropdowns. The data initially requires a remote database hit to some "City/State" tables, it returns a small amount of data, and it changes infrequently (the US has had 50 states for the last half-century) - so it's read many times but isn't prone to going stale.
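For instance, here's a minimal cache-aside sketch over the frontend (in-process) ASP.Net cache; the class, method, cache key, and 12-hour duration are illustrative assumptions, not a prescription.

using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

public static class StateLookup
{
    // Hypothetical DAL method: check the cache first, and only pay for the
    // expensive database hit when the cached copy isn't there.
    public static List<string> GetStates()
    {
        const string cacheKey = "Lookup.States";
        Cache cache = HttpRuntime.Cache;

        List<string> states = cache[cacheKey] as List<string>;
        if (states == null)
        {
            states = LoadStatesFromDatabase();   // the expensive remote hit
            cache.Insert(cacheKey, states, null,
                         DateTime.UtcNow.AddHours(12), Cache.NoSlidingExpiration);
        }
        return states;
    }

    private static List<string> LoadStatesFromDatabase()
    {
        // Stub standing in for the real "City/State" query.
        return new List<string> { "IL", "WI", "IN" };
    }
}

Because the caching happens inside the data access method, the UI code just calls GetStates() and never knows (or cares) whether the data came from the cache.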

What is a bad candidate for caching? Pretty much, the opposite of the good criteria.

Pitfalls with caching

Merely creating the dictionary isn't the problem. The problem will be integrating it (seamlessly) into your data persistence layer, and then ensuring the cache doesn't become stale - especially across a distributed environment - and then making sure it actually improves your performance instead of degrading it.

Stale data is the bane of caching. There are at least a few ways to deal with stale data:

  • Apply a time-out policy such that all data expires after N minutes, but that won't be acceptable for most scenarios.
  • If your app is changing the data (such as updating a details page), then have the DAL method (that calls the update SQL) also ping the cache to clear the stale object - see the sketch after this list. This requires that your application somehow keeps track of which objects depend on which piece of data. It works great with a domain model, which requires more design upfront but can have huge payoffs.
  • If someone else is directly changing the database, such as a DBA running ad-hoc SQL in production, then consider providing some admin console that lets them clear segments of the cache. For example, if the DBA did a mass-update of all salaries, then have the admin page allow you to flush all salary-related objects out of the cache. This requires some infrastructure for tagging each cached object (perhaps a master config file), and discipline from the DBA to check that admin page.
  • Beware of "database cache dependencies" (ASP.Net 2.0?), that claim to let you apply triggers to the database table, and then automatically clear the appropriate cache items when a specific row/column is updated. I've personally never gotten this to successfully work, have heard lots of horror stories (especially when deploying it across a DMZ), and it forces the domain design into the database instead of the middle tier. Although I'm all ears to anyone who's had a success story here.

Some other things to keep in mind:

  • Where should my cache tie in? Ideally, you'd want the backend cache abstracted behind your data access layer, which becomes very feasible with code generation. Whether data is pulled from the cache or the live database is then just a tuning option. Just like you don't want in-line SQL scattered throughout your code-behind pages, you probably don't want to tightly couple all your UI code to your cache provider. The frontend cache can be referenced in your UI, but again, be wary of too much plumbing code that "designs you into a corner".

  • Only temporary - A cache is not a persistent data store, nor is it merely a "mirror" to scale out your database; you must account for the cache being cleared at any time. A data access method should always have a means to recreate the cached data.

  • Why use a remote cache? If both the cache and database servers require remote hits, what's the difference? In its simplest form, the cache helps scale out the database (which is usually the bottleneck): every hit on the cache is one less hit on the already-overloaded database. And even after the remote hit, the cache can have a much faster lookup time, because the cache stores an already-created object while the database may still need to query a million rows to collect the data.
  • Control Panel - You'll want to provide an easy way to flush the entire cache, or even segments of the cache, without restarting any servers. It's also great moral support to have a statistics page showing how many thousand (million) database calls have been spared.

  • Configuration - You'll want to provide a way to configure almost everything: the cache durations, what category of duration (short, medium, long), which objects get cached, and perhaps even turn off the entire cache for emergency troubleshooting. Caching is something that you want to tune based on actual production results. Ideally control of what gets cached is all abstracted to a few easy-to-manage config files. You do not want these config values hard-coded throughout your app.

  • Beware of over-caching - If done the wrong way, caching can actually screw your performance. Say you cache something that is continually going stale: instead of just doing the normal database hit, you now continually pay for the extra cache lookup on top of it.

There's an endless list of further reading on caching, and a ton more that could be said - again, this is just a quick brain dump.

Tuesday, September 22, 2009

ConnectionTimeout vs. CommandTimeout

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/connectiontimeout_vs_commandtimeout.htm]

SQL timeouts can be very annoying, especially for internal development tools where performance isn't critical.

A lot of developers try to fix this by appending "Connect Timeout=300" to the connection string. Normally that's easy to do because the connection string is stored in some config file.

However, it still usually fails. This is because there's a big difference between SqlConnection.ConnectionTimeout and SqlCommand.CommandTimeout.

If you're running a command, like a snippet of SQL or a stored proc, then your code needs to set the CommandTimeout. Something like so:

using System.Data;
using System.Data.SqlClient;

SqlConnection con = new SqlConnection(strDbCon);
SqlCommand cmd = con.CreateCommand();
cmd.CommandType = CommandType.Text;
cmd.CommandText = strText;
cmd.CommandTimeout = 10000; // in seconds - no relation to con.ConnectionTimeout

Obviously, that's very minimalist code, but that's the general idea. For a robust data access layer, you'd make the timeout configurable.
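For example, here's a minimal sketch that reads the timeout from appSettings; the "CommandTimeoutSeconds" key and the helper name are illustrative assumptions, not from the original post.

using System;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;

public static class DbHelper
{
    // Pull the command timeout from config so it can be tuned without recompiling.
    public static SqlCommand CreateTextCommand(SqlConnection con, string sql)
    {
        SqlCommand cmd = con.CreateCommand();
        cmd.CommandType = CommandType.Text;
        cmd.CommandText = sql;

        string setting = ConfigurationManager.AppSettings["CommandTimeoutSeconds"];
        cmd.CommandTimeout = String.IsNullOrEmpty(setting) ? 30 : Int32.Parse(setting);
        return cmd;
    }
}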

Sunday, September 20, 2009

I don't have time to put on the parachute

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/i_dont_have_time_to_put_on_the_parachute.htm]

If you're jumping out of a plane at 5000 feet - you take the time to put on the parachute. Sure, you could save yourself a few seconds, and for the first 1000 feet having that parachute doesn't matter yet when you're still free-falling ("I saved schedule time by skipping the 'put-on-parachute' task!"). However, by the time you hit the ground, that parachute is life-saving.

Same thing with software projects and best practices. The project is like jumping out of a plane, and the best practices are like the parachute.  Sure, some "best practices" are just useless marketing buzzwords. However, others - like unit testing - are the real deal. And a developer or manager who rejects unit testing (for new .Net code in the middle tier) because they "don't have time" is like jumping out of the plane without putting on your parachute. You save a bit of time upfront, but when the project goes to production and "crashes" into reality, the maintenance costs and untested boundary cases will kill it.

"I don't have time" sounds much more noble and business-like than "I don't understand that idea", or "I just don't feel like doing something new." But within the (very common) context of new .Net class-library development - given that testing frameworks are free (NUnit, MSTest), and that any developer can at least write their tests locally - regardless of management support, and that in certain scenarios it's actually faster to develop code by writing unit tests (because it stubs out the context), the "I don't have time" reply to unit testing certain scenarios is misguided. Sure there are always exceptions, and reasons that good devs may not write unit tests. Testing databases, or legacy code, or the UI, or integration is a different beast. However, in general, testing public methods in a C# class library, without external data dependencies, should be as reasonable as putting a parachute on while jumping out of a plane.

Wednesday, September 16, 2009

LCNUG - Sept 24 - C# 4.0

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/lcnug__sept_24__c_40.htm]

Mike Stall (MSFT) will present on the new features in C# 4.0, including named and optional parameters, dynamic support, scripting, office interop and No-PIA (Primary-Interop-Assemblies) support.
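As a tiny taste of two of those features, here's a quick sample (mine, not from the talk):

using System;

class CSharp4Teaser
{
    // Optional parameters with defaults, callable with named arguments (new in C# 4.0).
    static void Email(string to, string subject = "(no subject)", bool urgent = false)
    {
        Console.WriteLine("{0} | {1} | urgent={2}", to, subject, urgent);
    }

    static void Main()
    {
        Email("team@example.com", urgent: true);  // skip "subject", name "urgent"

        dynamic n = 5;                            // member resolution deferred to runtime
        Console.WriteLine(n + 2);                 // 7
    }
}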

This is a great event for those interested in C# 4.0.

Tuesday, August 25, 2009

What is a number?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/what_is_a_number.htm]

Having an application collect a number from the user seems simple. Just whip out a textbox, maybe add an "integer validator", and ta da! Sounds great, but endless things can go wrong (things can always go wrong, like with states, zip codes, or even labels). I'm certainly not saying that every textbox in every line of business app should handle all of this, but here are some things to be aware of:

  • Do you allow for special characters, like dollar signs "$", percents "%", or financial negative numbers "( )"? [Seriously, try formatting a negative dollar amount in Excel]
  • Do you allow commas?
  • Do you allow decimals?
  • Do you handle globalization? Some countries use the "," as the decimal point and the "." as the thousands separator - the inverse of the US convention (see the parsing sketch after this list).
  • Do you allow both "-" (for negative) and "+" for positive, or is the "+" just implied by default?
  • Do you allow leading and trailing zeros, like "001.00"? Keep in mind that the zeros on the left are mathematically redundant, but the zeros on the right may count as "significant digits" and denote a degree of precision. The extra zeros in "1.25000" imply a much more precise number than just "1.25". However, most numeric types in programming languages will truncate the trailing zeros.
  • Do you allow exponents for large numbers, like "1E+12" (that's 1 with 12 zeros after it: "1,000,000,000,000")?
  • Do you allow abbreviations, like "10M" for "10,000,000"? (FWIW, I'm not sure of any app that actually does this.)
  • Do you have a clearly defined range? For example, a smallint (Int16) stores much smaller numbers than a long (Int64). Most business values can be stored with a double. However, if you're extending a legacy system that only provides a byte or smallint, but it expects occasionally huge values, then you could get into trouble.
  • Is your validation lenient? For example, if the user enters multiple commas, like "23,5,6" - do you simply strip all the commas and accept "2356", or is that an error?
  • How do you handle nulls? Do you use a sentinel value (like Int32.MinValue), the newer Nullable data types, or something else? If using a sentinel, does it "convert up" - i.e. Int32.MinValue is null if the type is Int32, but it would be a valid non-null value if the type were Int64. If multiple systems read the same value, and one uses Int32 while the other uses Int64, will your system still work? (This can happen when you start letting users dynamically create their own pages or reports.)
  • Not to be smart-alecky about it, but I'd assume that unless otherwise told, the input is base-10 - i.e. "A5" and "FF" are invalid values - although there may be apps where hex is legit (like specifying a color in HTML for some blog or content-management software).
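As a small illustration of how much of this .Net's built-in parsing will handle for you, here's a minimal sketch; the NumberStyles and culture choices are assumptions, and your app's rules may well be stricter.

using System;
using System.Globalization;

class NumberParsingDemo
{
    static void Main()
    {
        // NumberStyles.Any accepts currency symbols, thousands separators, decimals,
        // leading signs, parentheses for negatives, and exponents.
        decimal value;
        bool ok = Decimal.TryParse("($4,345.00)", NumberStyles.Any,
                                   CultureInfo.GetCultureInfo("en-US"), out value);
        Console.WriteLine("{0}: {1}", ok, value.ToString(CultureInfo.InvariantCulture));
        // True: -4345.00

        // The same characters mean the opposite things under a German culture:
        // "." is the thousands separator and "," is the decimal point.
        Decimal.TryParse("-23.456,00", NumberStyles.Any,
                         CultureInfo.GetCultureInfo("de-DE"), out value);
        Console.WriteLine(value.ToString(CultureInfo.InvariantCulture));
        // -23456.00
    }
}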

Also, does your system handle transformations? That's where you collect the input, store the raw data as something else, and display a newly formatted value. Perhaps the most common example is percents: a user may type "50" into a textbox (with the "%" in the label next to it), you may save it in the database as "0.5", and you may format it back as "50%" on a report. Or a user may enter "+0" in a textbox, and you simply store and display it as "0".
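A minimal sketch of that percent round-trip (the exact formatting rule here is an assumption - your reports may differ):

using System;

class PercentDemo
{
    static void Main()
    {
        // User types "50" into the textbox (the "%" lives in the label next to it).
        decimal stored = Decimal.Parse("50") / 100m;               // persist as 0.5
        string display = (stored * 100m).ToString("0.##") + "%";   // report shows "50%"
        Console.WriteLine("{0} -> {1}", stored, display);          // 0.5 -> 50%
    }
}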

So, here are some numerical inputs to test with (besides the obvious invalid inputs, like "xyz"):

  • -23,456.000
  • -23.456,00    //Globalization
  • $45,345.00
  • $(4.00)
  • The max/min values for each of the numeric types.
  • Fill the entire textbox with 9's.
  • 000000000
  • +23
  • 23     //try the simple case
  • -0
  • +0
  • 1.25E+12
  • 1.25 E + 12    //note the spaces

The things to look for are (1) will these prompt validation messages, (2) how will each value be stored in the database, and (3) how will the value be re-formatted when it's displayed.

Perhaps the best standard is to see how Excel handles it, as Excel is probably the most famous application in the world that handles numbers.

Thursday, August 20, 2009

10 tips to integrate CodeSmith into your processes

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/10_tips_to_integrate_codesmith_into_your_processes.htm]

Say you've theoretically seen why code generation is so profitable, so you've downloaded a free trial of CodeSmith, and banged out a few templates. In other words, you've got code generation working on a single developer machine. That's great, but it's even better to have it adopted by the entire department. Here are some practical tips on how to integrate CodeSmith into your processes.

  1. Aim for active regeneration - There are two kinds of generation, Active and Passive. Active means the code is actively regenerated on a regular basis. Passive means it was generated just once and then modified manually thereafter. The problem with Passive generation is that it lets developers create tons of code upfront, but then people get trigger-happy and use the generator to produce even more code, and now you're stuck maintaining it all by hand. It's like a trip with no return ticket. It also misses out on many of the other benefits of codeGen - like mass-updating code with some new change.
  2. Always have a batch script - Yes, people can integrate into VS or use the CodeSmith IDE. But to enforce uniformity, ensure the right properties are passed in, and hook into your Continuous Integration (CI) build, you'll need a batch script.
  3. Run the codeGen from your CI build - This enforces active regeneration.
  4. Consider not checking generated scripts into source control - This prevents synchronization errors between local developers and the build server. Yes, all code should ultimately be checked into source control, which is why we still check the templates themselves in, from which you can deterministically recreate all the target code. Your automated checkout script, which gets the latest from source control, can then run the templates and recreate the target code. NOTE - this only works if you're not using merge regions (which mix generated and custom code in the same file). If you use merge regions, then you need to check in the generated files.
  5. Avoid merge regions where possible - CodeSmith has a powerful feature called "merge regions" which lets you mix both generated and custom code in the same file. Sometimes you need this, but if you have a choice, always opt to put generated code in its own dedicated file. This prevents synchronization issues, is less likely to break, makes overwriting files during active generation easier to handle, and is easier for most developers to understand and maintain.
  6. Ensure that code generation can be run on every developer's machine - Because you'll want to actively re-generate the code, you'll need each developer to be able to run those generation batch scripts locally. That means each developer will need a license for CodeSmith. This is absolutely not the place to be stingy. If developers cannot simply make a change and have the code re-generated, they will revolt against using code generation.
  7. Clearly identify the generated files - Make sure an average developer can quickly identify that a given file is code-generated. You could name the file with a "*.CodeGen.cs" extension, put a comment disclaimer at the top ("//This file is code-generated. Any changes will be overwritten"), and not check the target code into source control so that it doesn't have any overlay icons (like what SVN offers).
  8. Know your overwrite strategy - If the target file already exists (because you're actively re-generating), make sure you know the expected behavior. If you don't use merge regions, you can simply overwrite the file; source control should be smart enough to see that the file has the exact same content, so it shouldn't be a burden. Worst case, you can have your generator check whether the generated content differs from the target before writing it out, and handle that appropriately (write nothing, have your build server throw a synchronization error if they differ, etc.) - see the sketch after this list.
  9. Don't output the DateTime or user info into the code - When someone first uses a code generator, it can be tempting to cram as much "free" detail into the target file as possible - like displaying "//This code was generated by Homer, on August 20, 2009 at 11:34 pm". You'd never maintain that by hand, so it initially looks cool to see all that crisp information in your file. However, the problem is that every time the code is regenerated, that kind of information changes, and the code continually appears to be updated. Furthermore, such comments don't give you anything that you couldn't already get from source control.
  10. Have a backup developer - Make sure that at least one other developer on the team can use the generating tool.
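For tip 8, here's a minimal sketch of such an overwrite check; the class and method names are hypothetical, and this is generic driver code around the generator rather than anything CodeSmith-specific.

using System;
using System.IO;

public static class GeneratedFileWriter
{
    // Write the generated content only if it actually changed, so source control
    // and the CI build don't see spurious modifications on every regeneration.
    public static bool WriteIfChanged(string path, string generatedContent)
    {
        if (File.Exists(path) && File.ReadAllText(path) == generatedContent)
            return false;                        // identical - leave the file alone
        File.WriteAllText(path, generatedContent);
        return true;                             // caller can log or fail the build
    }
}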

 

Monday, August 17, 2009

When should you use Code Generation?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/when_should_you_use_code_generation.htm]

Like everything, using code generation is a tradeoff. Basically, you want codeGen when the cost of writing the templates is less than the costs of writing what the template generates. For example, codeGen rocks at creating and maintaining tedious plumbing code - data access plumbing is the canonical example.

You should probably consider code generation when the target code:

  • Has similar patterns and deterministic rules, and little deviation from those rules. If there's an MS word doc or wiki page providing detailed instructions on how to write certain code, you may be able to send those instructions to the code generator instead.
  • Is large and brittle, but can be described easily (e.g. 1 line in an XML config file can describe 20 lines of C# and SQL).
  • Requires syncing with external data sources (like ADO.Net plumbing based on the database schema - column types, sizes, parameter order and count).
  • Requires multiple files to be kept in sync (like a SQL stored proc being in sync with C# ADO.Net wrapper).
  • Requires in-depth knowledge of a problem domain (like ADO.Net for data-access plumbing).
  • Is continually updated (like adding new classes to your DAL).
  • May need to be expanded in the future (like adding a whole new layer of webservices, or Audit triggers, to your DAL. Or, perhaps even migrating to a new language).

Code Generation is not a "Golden Hammer" - while it's great, it's not the perfect solution for everything. codeGen may not be the best solution if the target code:

  • Is very custom with no general pattern. If you can't abstract a pattern out of the target code, then you won't be able to write a generation template.
  • Is too small and trivial - In general, codeGen should decrease your total lines of code. So if you're writing a 50 line template to produce a single 30 line C# file, it's probably a bad ROI.
  • Can be refactored away, and doesn't need to exist in the first place.

Like everything else, in some contexts it just isn't profitable, but in other contexts it's awesome. Some problems are cheaper to solve with other approaches - unit testing, automation, open source, or DSLs - but every advanced developer should have code generation in their tool belt.