Thursday, October 1, 2009

What makes something Enterprise?

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/what_makes_something_enterprise.htm]

There's a world of difference between a prototype hammered out over a weekend, and an enterprise app ready for the harsh world of production. Here's a somewhat random brainstorm. In general (there's always an exception), Enterprise apps:

  • Are scalable - they handle large loads and can be called many times.
  • Have a retry strategy - for example, trying to ping an external service three times before failing (see the retry sketch after this list).
  • Have a failover strategy, like an active-passive machine cluster for maximum uptime, and a disaster recovery site.
  • Send notifications.
  • Handle invalid data (like states, zip codes, and numbers).
  • Can integrate with other systems (perhaps providing web service wrappers, command line APIs, or publicly accessible data repositories that other apps can modify).
  • Are deployable - "it works on my machine" absolutely does not cut it.
  • Have logging - especially useful for debugging in production, or for measuring how many errors (and which types) are thrown.
  • Have long-running processes (hours, days, or even weeks) - not just a single thread in memory.
  • Have async processes, which usually means concurrency and threading problems.
  • Support multiple instances of the app running - you can open two copies of Word, or run two MSBuild scripts at the same time.
  • Handle product versioning.
  • Care about the hardware they run on (enough CPU and memory).
  • Have a pluggable architecture - You may need to switch data providers (Oracle/SQL/Xml).
  • Have external data sources (web services, ftp file dumps, external databases).
  • Can scale out, such as adding more web servers, or splitting the database across multiple servers.
  • Have security constraints (both protection against attacks, and functional authorization).
  • Have processes that are documented (not just for training, but also for legal auditing and compliance).
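
Here's a minimal sketch of the retry bullet above - the helper method and attempt count are illustrative, not from a specific library:

    using System;
    using System.Threading;

    public static class Retry
    {
        // Try the call up to maxAttempts times before letting the failure bubble up.
        public static T WithRetry<T>(Func<T> call, int maxAttempts)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    return call();
                }
                catch (Exception)
                {
                    if (attempt >= maxAttempts)
                        throw; // out of retries - now it really "fails"
                    Thread.Sleep(1000 * attempt); // simple linear backoff between tries
                }
            }
        }
    }

    // Usage: var response = Retry.WithRetry(() => externalService.Ping(), 3);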

Much of this code isn't the fun, "glamorous" stuff. However, it's this kind of robustness that separates the "toys" from the enterprise workhorses.

See also: Enterprise Data Access, Enterprise Caching

Tuesday, September 29, 2009

A quick overview of enterprise object caching

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/a_quick_overview_of_enterprise_object_caching.htm]

Caching is a performance mechanism where you store an expensive-to-create value for future consumption. For example, you may cache dropdown values to spare yourself expensive database hits. Caching is one of those buzzwords that everyone knows their application should have, but surprisingly few applications really do.

Books have been written on caching, so this is just a quick brain dump based on my personal experiences. Also note that I refer to "the database" a lot because it's the main data dependency that most developers can relate to, but really, it could be anything (web service, external file, etc...).

Frontend and Backend Caching

Frontend (UI) cache:

  • Where is it located? Local (in process).
  • Pro: Faster because it's local - no remote hit.
  • Con: Does not handle updating values - data may be stale. Limited to just HttpContext.
  • Example: Asp.Net HttpContext.Cache.

Backend cache:

  • Where is it located? Remote (external machines).
  • Pro: Handles updating stale data in distributed systems. Handles any CLR serializable object, independent of the UI layer.
  • Con: Slower because it's remote - you still need to pay for the remote hit.
  • Example: Memcache, Velocity, others...

Obvious follow-up question: "Would you double cache something, taking data from the backend cache and saving it to the frontend cache?" Sure, if it benefited your specific scenario. Ideally the backend cache is totally encapsulated anyway, so your frontend UI developer wouldn't even know whether the data they're working with came from a cache or not.

What is a good candidate for caching?

Any data that:

  • Takes a lot of time to create, either due to a remote hit (to a database or web service), or a large calculation time (like querying a million rows).
  • Has a small final result - you query a million rows just to return a single value.

  • Does not change - the problem with caching is stale data.

  • Has minimal dependencies. If your object touches 10 tables for creation, then there's a much greater chance of it becoming stale. This is one benefit of loosely coupled (and then batched) objects, instead of spaghetti code.

  • Has many reads, but very few writes. For example, system-level data (that everyone constantly requests) is good for caching, but employee-level data may not be.

  • Is retrieved externally and requires high uptime - for example, you call a web service to get some value; five minutes later the service is down, and you really wish you had even a "stale" copy of that data.

The canonical example is something like city-state dropdowns. It's initially a remote database hit to some "City/State" tables; it returns a small amount of data; it changes infrequently (the US has had 50 states for the last half-century); and it's read many times but not prone to stale data. (See the cache-aside sketch below.)
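
Here's a minimal cache-aside sketch of that dropdown example, assuming the Asp.Net cache (the cache key, duration, and query method are hypothetical):

    using System;
    using System.Collections.Generic;
    using System.Web;
    using System.Web.Caching;

    public static class StateList
    {
        // Check the cache first; only pay the database hit on a miss.
        public static IList<string> GetStates()
        {
            const string key = "dropdown.states"; // hypothetical cache key
            var states = HttpRuntime.Cache[key] as IList<string>;
            if (states == null)
            {
                states = QueryStatesFromDatabase(); // the expensive remote hit
                HttpRuntime.Cache.Insert(key, states, null,
                    DateTime.UtcNow.AddHours(12), // absolute expiration - states rarely change
                    Cache.NoSlidingExpiration);
            }
            return states;
        }

        private static IList<string> QueryStatesFromDatabase()
        {
            return new List<string> { "AL", "AK" /* ...the remaining 48 */ };
        }
    }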

What is a bad candidate for caching? Pretty much, the opposite of the good criteria.

Pitfalls with caching

Merely creating the cache (at heart, a glorified dictionary) isn't the problem. The problem is integrating it seamlessly into your data persistence layer, ensuring the cache doesn't become stale - especially across a distributed environment - and making sure it actually improves your performance instead of degrading it.

Stale data is the bane of caching. There are at least a few ways to deal with stale data:

  • Apply a time-out policy such that all data expires after N minutes, but that won't be acceptable for most scenarios.
  • If your app is changing the data (such as updating a details page), then have the DAL method (that calls the update SQL) also ping the cache to clear the stale object (see the invalidation sketch after this list). This requires that your application somehow keep track of which objects depend on which pieces of data. It works great with a domain model, which requires more design upfront but can have huge payoffs.
  • If someone else is directly changing the database, such as a DBA running ad-hoc SQL in production, then consider providing some admin console that lets them clear segments of the cache. For example, if the DBA did a mass-update of all salaries, then have the admin page allow you to flush all salary-related objects out of the cache. This requires some infrastructure for tagging each cached object (perhaps a master config file), and discipline from the DBA to check that admin page.
  • Beware of "database cache dependencies" (like ASP.Net 2.0's SqlCacheDependency), which claim to let you apply triggers to the database table and then automatically clear the appropriate cache items when a specific row/column is updated. I've personally never gotten this to work successfully, I've heard lots of horror stories (especially when deploying it across a DMZ), and it forces the domain design into the database instead of the middle tier. Although I'm all ears to anyone who's had a success story here.
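
Here's a minimal sketch of the second strategy above - the DAL class, cache keys, and update method are hypothetical:

    using System.Web;

    public class EmployeeDal
    {
        // Update the row, then evict the now-stale cached objects.
        public void UpdateEmployee(int empId, string firstName)
        {
            ExecuteUpdateSql(empId, firstName); // the real UPDATE statement lives here

            // The app must track which cached objects depend on this data.
            HttpRuntime.Cache.Remove("employee." + empId);
            HttpRuntime.Cache.Remove("employeeList.all");
        }

        private void ExecuteUpdateSql(int empId, string firstName)
        {
            // Placeholder for the ADO.Net call.
        }
    }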

Some other things to keep in mind:

  • Where should my cache tie in? Ideally, you'd want the backend cache abstracted via your data access layer. This becomes very feasible with code generation. Whether data is pulled from the cache or the live database is then just a tuning option. Just like you don't want in-line SQL throughout your codebehind pages, you probably don't want to tightly couple all your UI code to your cache provider (see the provider-interface sketch after this list). The frontend cache can be referenced in your UI, but again, be aware of too much plumbing code that "designs you into a corner".

  • Only temporary - A cache is not a persistent data store, and it is not merely a "mirror" to scale out your database; you must account for the cache being cleared at any time. A data access method should always have a means to recreate the cached data.

  • Why use a remote cache? If hitting the cache and hitting the database are both remote calls, what's the difference? In its simplest form, this helps scale out the database (which is usually the bottleneck). Every hit on the cache is one less hit on the already-overloaded database. And even with the remote hit, the cache can have a much faster lookup time: the cache stores an already-created object, while the database may still need to query a million rows to collect the data.
  • Control Panel - You'll want to provide an easy way to flush the entire cache, or even segments of the cache, without restarting any servers. It's also great for morale to have a statistics page showing how many thousand (or million) database calls have been spared.

  • Configuration - You'll want to provide a way to configure almost everything: the cache durations, what category of duration (short, medium, long), which objects get cached, and perhaps even turn off the entire cache for emergency troubleshooting. Caching is something that you want to tune based on actual production results. Ideally control of what gets cached is all abstracted to a few easy-to-manage config files. You do not want these config values hard-coded throughout your app.

  • Beware of over-caching - Done the wrong way, caching can actually screw your performance. Say you cache something that is continually obsolete: on top of the normal database hit, you now continually pay for the extra cache query as well.
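
Here's a minimal sketch of keeping the cache pluggable behind the data access layer - this ICacheProvider interface is hypothetical, not a specific product's API:

    using System;

    // The DAL codes against this interface, so swapping Asp.Net's cache
    // for Memcache/Velocity is a config change, not a rewrite.
    public interface ICacheProvider
    {
        object Get(string key);
        void Set(string key, object value, TimeSpan duration);
        void Remove(string key);
        void Flush(); // the "control panel" hook: clear everything without a server restart
    }

A config file could then map object types to providers and durations, keeping the hard-coded cache calls out of your codebehinds.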

Yes, there's a ton more that can be said about caching. Again, this is just a quick brain dump.

Sunday, June 28, 2009

23 features of an enterprise data access layer

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/23_features_of_an_enterprise_data_access_layer_1.htm]

Most line of business applications will die unless they have a strong data access strategy. Enterprise apps simply cannot afford to hard-code thousands of in-line SQL calls to an aspx code behind; the maintenance and lack of reuse and testability will kill you. I realize entire books are written on data-access strategy (Fowler, Dino/Andrea), and by much smarter men than I, so I only offer this blog post as a summary and braindump. I'm sure I've inevitably missed several important aspects. I also realize that developers take their Data Access Layers (DAL) very seriously and personally, and may consider some features more or less important than others.

Must-have features - This will get you started.

  1. CRUD - Gives you at least the basic CRUD (Create, Read, Update, Delete) functionality
  2. Sorted paging and filtering - Provide a simple way to handle sorted-paging and filtering
  3. Automatically generated - For the love of all that is good, do NOT write tons of data-access plumbing code by hand. Either code-generate it, or use a dynamic ORM (like NHibernate)
  4. Serializable objects - Domain objects should be serializable so you can persist them across the wire (such as store them in a cache). Sometimes this is solved as easily as slapping on attributes to your objects.
  5. Handles concurrency - Even a where-clause check that simply compares a version (or datetime) stamp helps (see the optimistic-concurrency sketch after this list).
  6. Transactions - Supports transactions across multiple tables, whether via the SQL transaction keywords, ADO.Net transactions, or something else
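
Here's a minimal sketch of that where-clause concurrency check (the table, columns, and version scheme are hypothetical):

    using System.Data;
    using System.Data.SqlClient;

    public static class EmployeeUpdater
    {
        public static void UpdateFirstName(SqlConnection conn, int empId,
            string firstName, int expectedVersion)
        {
            const string sql =
                @"UPDATE EmpInfo
                  SET FirstName = @name, Version = Version + 1
                  WHERE EmpId = @id AND Version = @version"; // the concurrency check

            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@name", firstName);
                cmd.Parameters.AddWithValue("@id", empId);
                cmd.Parameters.AddWithValue("@version", expectedVersion);

                // Zero rows affected means someone else changed the row since we read it.
                if (cmd.ExecuteNonQuery() == 0)
                    throw new DBConcurrencyException("Row is stale - reload and retry.");
            }
        }
    }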

Good-to-have features - When you start scaling up, you're really going to want these.

  1. FK and unique-index lookups - Provide those extra automatically generated FK and unique-index lookups on your tables.
  2. Meta-data driven - Perhaps you define your entity model in xml files, and your process generates the rest from that (tables, SP, DAL, entity C# classes, etc...)
  3. Mocked / Isolation-Framework-friendly - It could provide support for a mock database, or at least create interfaces for all the appropriate classes so you program against interfaces instead of concrete classes.
  4. Batching (includes transaction management) - Remote calls are relatively expensive, so if you can't batch two DAL calls together, you'll inevitably start squishing unrelated calls into single spaghetti-blobs for performance reasons (see the TransactionScope sketch after this list).
  5. Insert an entire grid at once - This could be done via batching, or perhaps SQL Server 2008's new table-valued parameters.
  6. Handle database validation errors - Ability to capture database validation errors and return them to the business tier. For example, checking that a code must be unique. (See: Why put logic in SP?)
  7. Caching - for performance reasons, you'll eventually want to cache certain types of data. Ideally your DAL reads some cache-object config file and abstracts this all away, so you don't litter your codeBehinds with hard-coded cache calls. [LINK: thoughts on caching]
  8. Multiple types of databases - Access multiple different types of databases, such as main, historical, reporting, etc...
  9. Scales out to multiple, partitioned databases - For example, your main application data store may be partitioned by user SSN, so you can spread the load across multiple databases instead of having one giant bottleneck.
  10. Integrate with a validation framework - Perhaps by applying attributes to the entity objects (like the Enterprise Library Validation Block does). You may want your generator to read both database schema info and external override values from an xml file. For example, say you have an Employee object with a "FirstName" property that maps to the EmpInfo table's FirstName column; the generator could pull the varchar length and required flag from the database column, and then possibly pull a required expression pattern from the override xml file.
  11. Audit trail for changes made - The business sponsors are going to want to see change history of certain fields, especially security and financial related ones.
  12. Create UI admin pages - Provide the ability to create the admin UI pages for easy maintenance of each table. Even if you don't actually deploy these, they're a great developer aid.
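
Here's a minimal batching/transaction sketch using System.Transactions (the two DAL classes and their methods are hypothetical):

    using System.Transactions;

    public class EmployeeService
    {
        private readonly EmployeeDal employeeDal = new EmployeeDal(); // hypothetical DAL classes
        private readonly AuditDal auditDal = new AuditDal();

        public void TransferEmployee(int empId, int newDeptId)
        {
            // Both calls commit or roll back together; ADO.Net connections
            // opened inside the scope enlist in the ambient transaction.
            using (var scope = new TransactionScope())
            {
                employeeDal.UpdateDepartment(empId, newDeptId);
                auditDal.LogChange(empId, "Moved to dept " + newDeptId);
                scope.Complete(); // without this, everything rolls back on Dispose
            }
        }
    }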

Wow - These are more advanced

  1. Partial update of an object - Say you have a reusable Employee object with 30 fields, but you only need half of those fields in some specific context; it can be beneficial to have a DAL that is "smart" and updates only the fields you used in that context. Perhaps you could add a csv list to the base domain object (that Employee inherits from), and every time a property in Employee is set, it adds that field to the csv list. Then it passes the csv list to the data access strategy, which only updates the fields in that list (see the dirty-field sketch after this list).
  2. Provide a data dictionary so it integrates into other processes. Building off the meta-data approach (where you can automatically generate lots of extra plumbing to assist with integration and abstraction layers), you can start doing some really fancy things:
    1. See every instance in the UI where a DB field was ultimately used
    2. Provide clients a managed abstraction layer that lets them write their own reports given the UI views - not the backend tables.
    3. Provide clients a managed abstraction layer that lets them do mass updates of their own data (this is a validation and security nightmare).
  3. N-level undo - I've never personally implemented or needed this, but I hear CSLA.Net has it.
  4. Return deep object graphs - Having a domain model is great, but there's the classic OOP/relational data mismatch, which ORM mappers explicitly help solve. Without some sort of ORM mapper, most applications inevitably "settle" for a transaction script or table module/active record approach. A deep object graph also requires lazy loading.
  5. Database independence - Configure your database layer for an easy switch from SQL Server to Oracle. You could do this at compile time by re-writing your code-generator templates. I've heard some architects insist that you should be able to do this at run time as well, via a provider model and updating some information in the config file (I've never personally done this).
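
Here's a minimal sketch of the dirty-field idea from item 1 (the base class and property names are hypothetical):

    using System.Collections.Generic;

    // Each setter records its column name, so the DAL can generate an
    // UPDATE for only the fields that were actually touched.
    public abstract class DomainObject
    {
        private readonly List<string> dirtyFields = new List<string>();

        protected void MarkDirty(string fieldName)
        {
            if (!dirtyFields.Contains(fieldName))
                dirtyFields.Add(fieldName);
        }

        public IList<string> DirtyFields
        {
            get { return dirtyFields; }
        }
    }

    public class Employee : DomainObject
    {
        private string firstName;
        public string FirstName
        {
            get { return firstName; }
            set { firstName = value; MarkDirty("FirstName"); }
        }
        // ...29 more properties following the same pattern
    }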

Data access is a recurring problem, so the community has evolved a lot of different solutions. Consider some of these:

  • ORM mappers (like NHibernate)
  • CodeSmith-related (generates code at compile time):
    • CSLA.Net - Rockford Lhotka's super enterprise framework, which has CodeSmith support.
    • .NetTiers - A set of CodeSmith templates to create the entire DAL and UI admin pages.
    • Your own custom-built thing (via CodeSmith)
  • Microsoft solutions
  • Other frameworks I've heard about, but never personally used
  • In-line SQL from your Aspx codebehind - ha ha, just kidding. Don't even think about it. Seriously... don't.

Thursday, June 18, 2009

BOOK: Microsoft .NET: Architecting Applications for the Enterprise

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/book_microsoft_net_architecting_applications_for_the_ente.htm]

As the .Net platform matures (almost version 4.0!), I'm seeing more and more good .Net architecture books coming out. One such book is Microsoft .NET: Architecting Applications for the Enterprise, by Dino Esposito and Andrea Saltarello.

The first section focused heavily on architectural principles. The book was worth getting just for Chapter 3 alone (Design Principles and Patterns), which provided a survey of the various concepts required for high-level architecture, such as OOP, Design Patterns, Structured Design, Separation of Concerns, Dependency Injection, Testability, Security, and AOP.

I also liked their chapter on data access. They made a well-reasoned plug for NHibernate and the maintenance benefits of auto-generated dynamic SQL for the data access layer. I admit that I personally have "grown up" with a bias for code-generated stored procedures, but I can see the changing winds.

Their book is very focused on the standard N-tier layers: DataAccess, BusinessFacade, Service, and Presentation. Here's the table of contents:

  • Chapter 1: Architects and Architecture Today

  • Chapter 2: UML Essentials

  • Chapter 3: Design Principles and Patterns

  • Chapter 4: The Business Layer

  • Chapter 5: The Service Layer

  • Chapter 6: The Data Access Layer

  • Chapter 7: The Presentation Layer

  • Final Thoughts

  • Appendix: The Northwind Starter Kit

The book didn't discuss much about messaging, caching, validation, logging, system integration, configuration, or other architectural components. However, most applications make or break on their data access strategy, so I can see the focus there. And you'd need an encyclopedia to cover every aspect of enterprise architecture.

I found it interesting to compare the book to Fowler's landmark Patterns of Enterprise Application Architecture. Indeed, Dino and Andrea continually refer back to patterns in Fowler. The Dino/Andrea book almost seems intended as a sequel to Fowler's - it adds value by specializing in .Net, having the benefit of almost six years of hindsight, and providing constant web references and practical tools (many of which didn't exist when Fowler wrote his book). Overall, it's a good read for any .Net architect or aspiring developer. It's an especially good read for those who grew up as architects in a single company, and therefore may have exposure to only one way of doing architecture.

Tuesday, May 12, 2009

Automated Code Governance

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/automated_code_governance.htm]

There are lots of ways for a tech lead to encourage standardization. However, any policy that requires manual enforcement will continually face an uphill battle. The problems with a human enforcer are that:

  1. Enforcing policy is seen as being the "bad guy", and no one wants to always be the bad guy
  2. The human will not have time - they'll be pulled onto other features
  3. The human will be accused of "ivory tower" antics that just slow down real work
  4. The human cannot possibly monitor everyone's code every day

The optimal way is to have an automated build policy as part of your continuous integration. This policy could check for many objective metrics, such as (DISCLAIMER: I haven't personally implemented all of these yet - it's just a brainstorm based on various research):

  • Code Coverage - Enforces unit testing by demanding that the tests provide X% code coverage (see the coverage-gate sketch after this list).
  • Code metrics (like NDepend) - Runs static metrics like LineCount (discourages large methods that have multiple responsibilities) and cyclomatic code complexity (including checks for dependencies, which are often the #1 culprit that prevents testability).
  • Code duplication (like Simian) - Encourages refactoring by checking for chunks of duplicate code. Ideally, this covers not just C#, but all languages like HTML, JS, and SQL.
  • Static code analysis (like FxCop) - Runs static rules to check for bad or risky code, kind of like compiler warnings on steroids.
  • Stored Procedure scans - Creates a test database and runs all the stored procs to check their query execution plans for performance bottlenecks (like table or index scans) or too many dependencies.
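
Here's one minimal sketch of the coverage policy as a build step. The report path and XML shape are hypothetical - real coverage tools each have their own formats - but the idea is the same: parse the report and fail the build below a threshold:

    using System;
    using System.Xml.Linq;

    public static class CoverageGate
    {
        public static int Main()
        {
            const double threshold = 70.0; // tune per team; grandfather old code via excludes
            var report = XDocument.Load(@"build\coverage-summary.xml"); // hypothetical report

            // Assumes the report root carries a "coveragePercent" attribute.
            double actual = double.Parse(report.Root.Attribute("coveragePercent").Value);

            if (actual < threshold)
            {
                Console.Error.WriteLine(
                    "Coverage {0:F1}% is below the {1:F1}% policy - failing the build.",
                    actual, threshold);
                return 1; // non-zero exit code fails the CI build
            }

            Console.WriteLine("Coverage {0:F1}% meets policy.", actual);
            return 0;
        }
    }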

While policies sound cool, in the trenches, many devs view them as just a nuisance that slows down "real" work. Here are some problems to anticipate:

  • Devs don't want to do it - it's not fun to write high-quality code.
  • Devs complaining that they don't have time
  • Management pulling the rug out from under you (they don't have time, or they don't want to be the "bad" guy)
  • Makes build take too long

Given these types of problems, here are ideas to minimize any riots as you try to roll these out.

  • Set up policy first - without it failing the build yet, so everyone can see results for a few weeks.
  • Ensure that people can run all policy checks locally first, and verify that they pass locally.
  • Create an exclude list so any developer can register exceptions.
  • Grandfather all existing code by using this exclude list.
  • Minimize the scope of what is checked (start with just 1 core assembly, and gradually expand to others).
  • Roll out 1 policy at a time.
  • Ramp up your build servers. Consider a distributed build, such as using CruiseControl's project trigger feature.