Tuesday, July 22, 2008

Screen scraping the easy way with .Net

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/screen_scraping_the_easy_way_with_net.htm]

Sometimes you want to collect large amounts of data from many web pages, and the easiest way is to just screen-scrape it. For example, perhaps a site doesn't provide any other data export mechanism, or it only lets you look up one item at a time when you really want to look up 1000 items. Screen scraping means having an application request the html page, then parse through the response to get the data you want. The need for this is becoming rarer as RSS feeds and data exports become more popular. However, when you need to screen scrape, you really need to screen scrape. .Net makes it very easy:

WebClient w = new WebClient();
string strHtml = w.DownloadString(strUrl);

Using the WebClient class (in the System.Net namespace), you can simply call the DownloadString method, pass in the url, and get back the page's html as a string. From there, you can parse it with regular expressions, or perhaps with an open-source html parser. It's almost too easy. Note that you don't need to call this from an ASP.Net web app; you could call it from any .Net app (console, service, Windows Forms, etc...). Scott Mitchell wrote a very good article about screen-scraping back in the .Net 1.0 days, but I think new features since then have made it easier.

You could also use this for a crude form of web functional testing (if you didn't use MVC, and you didn't have VS Testers edition with MSTest function tests), or to write other web analysis tools (is the rendered html valid, are there any broken links, etc...).

Thursday, July 17, 2008

.Net is like the galaxy, they're both big and getting bigger

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/net_is_like_the_galaxy_theyre_both_big_and_getting_bigger.htm]

Like the galaxy, .Net is big, and it's only getting bigger. It stretches as far as the eye can see, or better yet, as far as the mind can think. We're now on the 5th release of .Net (1.0, 1.1, 2.0, 3.0, 3.5), each one adding more to the previous. This includes not just a bigger API, but fundamentally new technologies and techniques - Ajax, WPF (with Xaml), Silverlight, WCF, WF, etc... The .Net ecosystem is growing too - with open source, guidance, blogs, and vendors. It is expanding across all aspects of development (including games, mobile devices, enterprise apps, rich media, hobbyist apps, etc...).

I see at least three practical consequences of this:

  1. It's too big for one person to "know it all". This is why prescriptive guidance and community consensus are so important. It also gives hope to younger developers. I've gotten to work with several new hires who initially think that they'll never make an innovative contribution to the team. I explain to them that because .Net is so big, as long as they keep trying, it's inevitable that they'll eventually reach a new frontier that no one else on the team has seen: a new tool, a new trick, or a new technology that they're the first to pick up.

  2. How does someone keep up? There are plenty of ways to learn about .Net. However, the vastness of it all does force a normal person to pick a niche. It helps to pick, or work towards, a niche that you enjoy. By making learning a lifestyle, a developer can continually pick up new things. It also helps that .Net is growing in a good direction...

  3. It's growing in a good direction. .Net isn't expanding into chaos; rather, it's growing more and more powerful. Part of this is retiring older technologies, either by making them obsolete (who uses COM?) or by wrapping them with a more convenient technique (a Domain Specific Language, an easier tool or API). The new enhancements aren't making us developers dumber, but rather freeing us up to focus on more interesting problems.

I see these as good things. Software engineering's continual expansion is one of the things that most fascinates me about the field.

Wednesday, July 16, 2008

Book: Beyond Bullet Points

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/book_beyond_bullet_points.htm]

The corporate world is filled with endless PowerPoint presentations. Many of these are just templated slide after slide of bullet points, which can be boring. A recent book I read, Beyond Bullet Points, by Cliff Atkinson, explains an alternative technique to make PowerPoint more interesting. His idea (as best I understand it) is to mimic what other successful media (like Hollywood) do by telling a story with pictures instead of using bullet points. The end result emphasizes the speaker's own words rather than endless PowerPoint text. I had the opportunity to attend the USNAF back in 2006, where several presentations used this technique, and they were indeed more lively.

Tuesday, July 15, 2008

The difference between projects, namespaces, assemblies, and physical source code files.

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/the_difference_between_projects_namespaces_assemblies_and.htm]

When creating simple applications, the project, namespace, assembly, and physical source code file usually are all related. For example, if you create a project "ClassLibrary1", it compiles to an assembly "ClassLibrary1.dll", creates a default class in namespace "ClassLibrary1", creates a folder "ClassLibrary1", and places the source code within that folder. Everything is consistent and simple.

However, simple is not always enough. These four things can all be independent.

  • Project - The Visual Studio project that contains all the source code (and other build types like embedded resources), which gets compiled into the assembly. A project can reference any file - including files outside of its folder structure. By opening the project file in Notepad, you can manually edit a file's include path so that it points to an external location. The file icon will then look like a shortcut.

  • Assembly - The physical dll that your code gets compiled to. One assembly can have many namespaces.

  • Namespace - The namespace is used to organize your classes. You can change the namespaces to anything you want using the namespace keyword. It does not need to match the assembly or folder structure.

  • Source Code - This does not need to be stored in the same directory as the project. So, you could have several projects all reference the same source code file. For example, you may have one master AssemblyInfo file that stores the main version, and then all your projects reference that file.

So, if you have an aspx page referencing "ClassLibrary1.Class1.DoStuff()", it doesn't care whether that class lives in assembly "ClassLibrary1.dll" or "ClassLibrary1Other.dll", as long as the page can reach the assembly that contains it and the namespace is the same.

This can be useful for deployment, or sharing global files across multiple projects, or just neat-to-know trivia.

Sunday, July 13, 2008

Ideas to encourage your boss to invest in Silverlight

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/ideas_to_encourage_your_boss_to_invest_in_silverlight.htm]

Silverlight has a lot of benefits, but as a new technology, it also has problems. It is inevitably riskier because many of the kinks haven't been worked out yet. Managers, who want to avoid unnecessary risk, may shy away from such a technology. However, there are ways to encourage a manager to at least consider Silverlight:

  • Show an actual demo of what Silverlight can do (such as on the gallery). Talk is cheap, but seeing Silverlight in action is powerful.

  • Where feasible, consider developing simple internal tools with Silverlight. Managers almost expect devs to always insist on using the latest technology, regardless of its business value. But if you believe enough in the tech to invest your own time learning it and applying it to a simple business problem that your department faces - that carries a lot of weight.

  • Emphasize the aspects of Silverlight that would benefit your team - perhaps a rich UI with animating charts, or drag and drop, or rich media, or C# on the client, or cross-browser, etc...

  • If all else fails, consider a little fear-mongering: "Our competitors will be using this". If not Silverlight, at least a Silverlight-competitor like Flash.

Some managers were hesitant when JS came out ("it's got cross-browser problems", "not all clients support it"), when .Net came out ("J2EE is the established enterprise platform"), when Ajax came out ("it will have security holes"), etc... There's understandably going to be some skepticism with Silverlight too, but that's ok. I personally believe that Silverlight can deliver, and that before long, instead of developers trying to encourage managers to adopt it, managers will be recruiting developers who know it.

Wednesday, July 9, 2008

Persisting data in a web app by using frames

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/persisting_data_in_a_web_app_by_using_frames.htm]

A basic problem with developing web applications is that their foundation protocol, HTTP, is stateless. That means that you constantly need to jump through hoops in order to pass data from page1 to page2. Of course there are ways to solve this, such as using ASP.Net session state, querystrings, cookies, or persisting to a database. There is another way that may work for simple data if your app is hosted in a frame.

Say you have your main page, which is just a frameset. All the navigation occurs within that frameset, such that going from page1 to page2 merely updates the frame's url; it doesn't re-create the host page. This leaves the host page intact, including its JavaScript state. Therefore, you could have a JavaScript variable persist data between pages.

<html>
  <head>
    <title>My App</title>
    <script language="javascript" type="text/javascript">
      var _javaScriptVar = null;
    </script>
  </head>
  <frameset>
      <frame src="Page1.aspx" id="mainFrame">
  </frameset>
</html>

You could then reference this variable from your child pages via the DOM:

window.parent._javaScriptVar = "someValue";

This means that page1 could set the value, and page2 could retrieve that value. To the end user, it looks like data has been persisted across pages. You could also expand this using JavaScript hashtables to store name-value pairs of data, and then add wrapper methods for an easy API. This is a surprisingly simple approach, and it has pros and cons:

Pro:

  • Very easy to implement for new apps

  • Scalable - as it stores data on the client, instead of on the server (like session state)

  • Can store structured data. This saves to a JavaScript variable, which can hold complex objects as opposed to just strings (although you could just use JSON to serialize most complex objects to a string and back)

  • It avoids cookies, which have their own limits and problems.

Con:

  • It messes up your URLs, as the user only sees the URL for the host page, not the child pages. (But this may be a good thing)

  • It is absolutely not secure, as any hacker could modify the JavaScript variables.

  • It does not persist across sessions - it's only good for convenience data on the UI.
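
The name-value wrapper idea mentioned above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: createPageStore, pageStore, and the "customerId" key are hypothetical names made up for illustration (they are not part of any library), and the browser is assumed to support Object.create.

```javascript
// Hypothetical helper for the host (frameset) page; all names are illustrative.
function createPageStore() {
  // Object.create(null) gives a bare object with no inherited keys,
  // so a name like "toString" is not reported as already stored.
  var items = Object.create(null);

  return {
    // Store any value (string, number, object, ...) under a name.
    set: function (name, value) {
      items[name] = value;
    },
    // Return the stored value, or defaultValue if the name is unknown.
    get: function (name, defaultValue) {
      return name in items ? items[name] : defaultValue;
    },
    // Report whether a value has been stored under this name.
    has: function (name) {
      return name in items;
    },
    // Forget the value stored under this name.
    remove: function (name) {
      delete items[name];
    }
  };
}

// Created once on the host page, so it outlives navigation inside the frame.
var pageStore = createPageStore();

// A child page would reach it through window.parent, for example:
//   window.parent.pageStore.set("customerId", 42);             // in Page1.aspx
//   var id = window.parent.pageStore.get("customerId", null);  // in Page2.aspx
```

Keeping the data behind set/get methods means child pages never touch the raw variable directly, so the storage could later be swapped for something else without changing every page.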

Overall, it's a cute trick for certain apps. Although, I'd rather use Silverlight if I could.

Monday, July 7, 2008

Two limits with Silverlight (Beta 2)

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/two_limits_with_silverlight_beta_2.htm]

Silverlight has several fundamental benefits. However, there's always a flip-side, and it has some shortcomings too. There are at least two major limits that I see:

  1. Silverlight requires a separate plug-in. Although Flash also requires a plug-in, Flash has something like 98% market share, and is essentially as available as JavaScript. For Silverlight though, this separate plug-in will make many business sponsors take a second look. Of course, MS knows this and is actively working on it - they'll use the full dominance of MS sites (hotmail, msn, etc...) to prompt you to download Silverlight, they'll make it an automatic update so system admins can easily install it across the enterprise, they'll include it in future products, and they'll convince popular sites to use it, which will encourage all those extra viewers to download it. This separate plug-in is a limit, but not a show-stopper, especially for private or intranet apps.

  2. Silverlight is still a very young technology. After the JS release, and a 2.0 alpha, beta1, and beta2, it still doesn't even have a combo box! However, I'd expect that the Microsoft ecosystem will rush to fill in these gaps via open source and the Silverlight community. Silverlight is young, but I'd expect the Microsoft faithful and developer community will make it grow fast.

As a developer, I realize that Silverlight has its problems, and an uphill climb, but I'm optimistic. I think that soon its strengths will outweigh its weaknesses.