SQL Good Idea/Bad Idea

30 October 2009

The following is a SQL function that splits the string provided as an argument into a table of values on the comma character:

CREATE FUNCTION Split
    (@Argument NVARCHAR(MAX))
RETURNS
    @List TABLE (Item NVARCHAR(20))
AS
BEGIN
    -- While a comma still exists in the argument
    WHILE (CHARINDEX(',', @Argument) > 0)
    BEGIN
        -- Add substring from start of argument to comma to result list (trim string to remove spaces)
        INSERT INTO @List
            VALUES (RTRIM(LTRIM(SUBSTRING(@Argument, 1, CHARINDEX(',', @Argument) - 1))))

        -- Remove the item from argument
        SET @Argument = SUBSTRING(@Argument, CHARINDEX(',', @Argument) + 1, LEN(@Argument))
    END

    -- Add last item to the list
    INSERT INTO @List VALUES (RTRIM(LTRIM(@Argument)))

    RETURN
END
GO

So if passed the argument ‘1, 2, 3, 4′, the function will return a table containing ‘1′, ‘2′, ‘3′ and ‘4′. The reason I wrote the function was because I had to wrap a set of queries in SQL stored procedures to provide an interface. The queries were pretty complex but for this example, I will only use a single table, let’s say a table containing employee information:

CREATE TABLE Employee
(
    ID INT PRIMARY KEY,
    Name NVARCHAR(200),
    Salary INT
)
GO

We need to wrap a query that returns the salary for a given list of employees, the list being provided as a comma-separated string of IDs. There are multiple ways to do this, I will just give a good idea and a bad idea.

Good Idea

CREATE PROCEDURE GetSalaries
    (@Employee_IDs NVARCHAR(MAX))
AS
    SELECT ID, Salary
        FROM Employee
        WHERE ID IN (SELECT CAST (Item AS INT) FROM Split(@Employee_IDs))
GO

Using the initial function, the input is split into individual IDs and cast as INT. ID and Salary are extracted from the table with the condition that ID is in the list. So for the sample data

INSERT INTO Employee VALUES (1, 'John Doe', 10000)
INSERT INTO Employee VALUES (2, 'Jane Doe', 20000)
INSERT INTO Employee VALUES (3, 'John Doe the 2nd', 30000)
GO

Calling

GetSalaries '1, 2, 4'

should return

1 10000
2 20000

(as ID 4 doesn’t exist in the table and ID 3 is not in the argument).

Bad Idea

CREATE PROCEDURE GetSalaries
    (@Employee_IDs NVARCHAR(MAX))
AS
    EXEC ('SELECT ID, Salary FROM Employee WHERE ID IN (' + @Employee_IDs + ')')
GO

The above procedure, using dynamic SQL, should behave similar to the previous procedure:

GetSalaries '1, 2, 4'

should still return

1 10000
2 20000

It’s even shorter. And it doesn’t require a Split function to parse the input. So why I call this a bad idea? The stored procedure will be provided as an interface, meaning the input will be beyond our control. And input could also be

GetSalaries '1); DROP TABLE Employee --'

Run it and see what happens ;)


The Null Object Pattern

18 June 2009

The null object represents an object that does not exist. I wanted to write a post about this pattern for some time now but I just didn’t have the inspiration to come up with a good example. Fortunately, I wrote some code that needed to use just this pattern at work. Basically we had a map file which stored some additional information which was needed for reporting. Let’s call it customer map file. Thought not used by the system itself, which just uses a CustomerID, the map file contained information on the customer which had to be used in the reports. Each row in the map file contains the following tab separated values: CustomerId, CustomerName, CustomerAddress. So I wrote a class to represent a row in the file:

class MapRow
{
    private int customerId;
    private string customerName;
    private string customerAddress;

    public MapRow(string lineOfText)
    {
        // Parse a line from map file and set private members
        ...
    }

    public int CustomerId
    {
        get
        {
            return customerId;
        }
    }

    public string CustomerName
    {
        get
        {
            return customerName;
        }
    }

    public string CustomerAddress
    {
        get
        {
            return customerAddress;
        }
    }
}

Then I wrote another class to open and read the file, which wraps a dictionary and returns MapRow objects based on CustomerId:

class MapFile
{
    private Dictionary mapRows;

    public MapFile(string fileName)
    {
        // Read map file and for each row create a new MapRow then add it to the dictionary (key is CustomerId)
        ...
    }

    public MapRow this[int customerId]
    {
        get
        {
            return mapRows[customerId];
        }
    }
}

In case you are wondering, the last block of code implements an indexer on the class and can be used like:

MapFile mapFile = new MapFile("CustomerMapFile.tsv");

// Get map data for CustomerId 1
MapRow row = mapFile[1];

So, using the customerId from the system, I started generating reports by calling mapFile[customerId].CustomerName and mapFile[customerId].CustomerAddress when I realized that, for whatever reason, some customers don’t appear in the map file.

In the current implementation, the dictionary throws an exception in this case – key not found. The bad way to go about it is this:

public MapRow this[int customerId]
{
    get
    {
        if (mapRows.ContainsKey(customerId))
        {
            return mapRows[customerId];
        }

        return null;
    }
}

Why would this be the bad way? Because all code that indexes MapRow will have to check that the return object is not null. Let’s say that reporting does something like

reportStream.WriteLine("Name: {0}, Address: {1}",
    mapFile[customerId].CustomerName,
    mapFile[customerId].CustomerAddress);

Now we would have to refactor to this:

MapRow row = mapFile[customerId];

if (row != null)
{
    reportStream.WriteLine("Name: {0}, Address: {1}", row.CustomerName, row.CustomerAddress);
}
else
{
    reportStream.WriteLine("Name: Not Available, Address: Not Available");
}

And just imagine how the code gets if the map is used in multiple places. The solution is to use a null object which represents unavailable customer data. We can add it to the MapRow class as follows:

private static MapRow nullRow = new MapRow("-1\tNot Available\tNot Available");

public static MapRow NullRow
{
    get
    {
        return nullRow;
    }
}

And the indexer becomes:

public MapRow this[int customerId]
{
    get
    {
        if (mapRows.ContainsKey(customerId))
        {
            return mapRows[customerId];
        }

        return MapRow.NullRow;
    }
}

Now we can again use the simple

reportStream.WriteLine("Name: {0}, Address: {1}", row.CustomerName, row.CustomerAddress);

Whenever you find yourself checking again and again for nulls, you should consider using the null object pattern instead. Implementations of it appear throughout the .NET Framework. For example, String.Empty represents an empty string, while StreamWriter.Null represents a strem to which one can write but the written data doesn’t go anywhere.


The Singleton Pattern in C#

10 May 2009

What

The singleton pattern is used to restrict a class to a single instance. This is useful when some global state is required across the application or when a single instance of the class is sufficient and multiple instances are not desired – for example the class consumes a lot of resources. The singleton is implemented with a private constructor to prevent outside instantiation and a static method (or property) that creates and returns the unique instance when called.

Why

When implementing a singleton in C#, a natural question arises: why use a singleton instead of a static class? It is true that a static class accomplishes the same thing – provides a single instance of the class. The difference comes from the fact that static classes cannot implement any interfaces and can only derive from the base Object class.

Sometimes it is needed for the class to implement a certain interface. In these cases, singletons become useful because they don’t have these limitations.

staticvssingleton

It is true that a static class implementation is cleaner and easier to use than a singleton, as it can be seen above. We can use Processor.Process() with the static class while the singleton is called using Processor.Instance.Process(), not to mention that the static class doesn’t need the private attribute and doesn’t have to explicitly declare a private constructor. Still, the following can’t be implemented using static classes:

derivedsingletons

Both OnlineProcessor and OfflineProcessor unique instances can be passed as Processor objects.

How

A simple C# implementation of a singleton looks like this:

public sealed class SingletonClass
{
    private static readonly SingletonClass instance = new SingletonClass();

    private SingletonClass() { }

    public static SingletonClass Instance
    {
        get
        {
            return instance;
        }
    }
}

For a more detailed explanation, visit MSDN.

Although the constructor doesn’t do anything, it is important to declare it as private to hide it from the outside code. When the constructor is declared as private, the class cannot be instantiated from outside itself.

The static readonly attribute holds the unique instance. Note that declaring it this way ensures lazy instantiation – the object will be created at runtime only when it is first referenced. The static Instance property is used to retrieve the instance by the outside code (SingletonClass.Instance).

Finally, the class is declared as sealed. On MSDN it is stated that this will prevent other classes to derive from this class and possibly duplicate the instance. The truth is that as long as all class constructors are declared as private, other classes won’t be able to derive from it. This happens because during instantiation constructors have to be called for all classes in a hierarchy starting from the top-most class and this won’t be possible from a derived class as long as all its parent’s constructors are private. It is still good practice to seal the class to make it obvious that it is not meant to be derived from (plus there is a gain in performance when calling virtual methods on a sealed class).

Singletons can be implemented based on this skeleton. Additional methods and properties shouldn’t be static, as they can only be accessed through the Instance property. This is the mechanism by which singletons can implement interfaces or be derived from any class.

Note that this simple implementation is not thread-safe as two threads could simultaneously reference the Instance property and, because of this, receive different objects. Again, look on MSDN for thread-safe implementation considerations.


The Abstract Factory Pattern

11 April 2009

Recently I came across a design problem where refactoring to the abstract factory pattern provided a really elegant solution. Of course, I will again talk about emergent design and evolution. Things start out really simple – as they always do – with a class which I will call the API class. The API class is actually a complex class which contains a bunch of other internal objects and handles execution. Think of it as the core of the application.

Another requirement is to initialize some parameters from command line arguments. The initial design was really simple:

initial

The Parser class parses the command line and creates an instance of Foo. This instance is then passed to the API class as a parameter. Really simple.

static void Main(string[] args)
{
    API api = new API();
    Parser parser = new Parser();


    parser.Parse(args);
    api.DoSomeStuff(parser.Foo);
}

Up to this point, things look great. The Parser class handles argument parsing and creates the required objects. The API class then uses these objects to perform the core operation.

Then, things started to get complicated. First, the API class got extended to perform a larger set of operations. Now I won’t go into the details of the class but suffice to say it’s not feasable to break apart the class into smaller ones. So we end up with the new class:

complexapi

The new API exposes more methods, each of these taking different parameters. We want to be able to call each one of these methods based on command line arguments. For example, if the application is called with dosomestuff /foovalue=5, we know we have to create the Foo object with the value 5 and call DoSomeStuff. If the application is called with dootherstuff /foovalue=5 /barvalue=10, we know we have to create the Foo object with value 5, Bar object with value 10 and call DoOtherStuff.

Changing the Parser class to accomodate the new API parameters, would leave us with a mamoth class:

bigparser

This is bad design. If we call the application with dosomestuff, the Bar and string attributes of the parser are never used. Same thing for the other modes. It’s obvious we need to split this class into smaller pieces. I came up with the following design:

parsers

This could seem a bit complex at first, but it makes perfect sense. The abstract base class Parser implements generic argument parsing logic. The BasicParser just looks at the first command line argument and, based on it, creates the appropriate StuffParser. So if the first command line argument is dosomestuff, it creates a SomeParser then delegates the rest of the parsing to it. If the first argument is dootherstuff, it creates an OtherParser and delegates to it. Each StuffParser maps to one of the API calls, meaning it creates the required objects for that call from the command line. As for the StuffParser class itself, it was added because, in practice, there are some common elements for parsing all modes, I just simplified thing a bit in the example.

As I said, I was content with the design of the parser classes. From the outside, parsing looked as easy as before:

...
BasicParser parser = new BasicParser();
parser.Parse(args);
...

The problem appeared when trying to retrieve the newly created parameters for passing them to the API. The BasicParser can return an instance of StuffParser, but how do we know which one? We can use

StuffParser stuffParser = parser.GetStuffParser();

but the StuffParser class doesn’t create the objects we are interesetd in for each execution mode – only the classes that derive from it do.

Reaching this point, I started to think which is the best way to solve this problem. My first thought was to add the API calls in the parser themselves. This would be the easiest way to go. Have Parser declare an abstract Execute method. Have BasicParser delegate the Execute call to the StuffParser and have each one of the derived parser make the appropriate API call. For example, SomeParser can call API.DoSomeStuff since it has the required Foo object. But I didn’t like this idea. I didn’t like it because it violated the Single Responsibility Principle, which states that every object should have a single responsiblity. The parser is parsing command line arguments. It is too much for it to also call API methods, handle possible exception etc. So this was not the way to go.

So I came up with another solution: I used downcasting.

API api = new API();
BasicParser parser = new BasicParser();

parser.Parse(args);

if (parser.GetStuffParser() is SomeParser)
{
    SomeParser someParser = parser.GetStuffParser() as SomeParser;
    api.DoSomeStuff(someParser.Foo);
}
else if (parser.GetStuffParser() is OtherParser)
{
    OtherParser otherParser = parser.GetStuffParser() as OtherParser;
    api.DoOtherStuff(otherParser.Foo, otherParser.Bar);
}
else if ...

So the parsers only do parsing now, and we determine which mode was called by checking the type of the StuffParser created. This worked fine but I hated the code. Downcasting is a code smell. When you have to downcast, as I did above, most probably the design is wrong. If an interface is provided – either as an actual interface or a base class – and you receive this interface from somewhere – another class, a method, a property etc. – you shouldn’t downcast. That’s exactly the reason the interface is provided, so you don’t care which actual implementation is used. So the code was smelly. I had to refactor. But how? I liked the simple interface of the parser, so I didn’t want to change that. I liked the fact that the parsers created required objects from command line arguments. I had to find a better way of linking API calls to the parsers. And here comes the Abstract Factory Pattern:

abstractfactory

The abstract StuffDoer class encapsulates an API instance and exposes an abstract DoStuff method which takes no parameters. Each one of the derived classes implements the method so it performs the appropriate API call. SomeStuffDoer calls API.DoSomeStuff. OtherStuffDoer calls API.DoOtherStuff etc. I also moved the parameters from the parser classes to the stuff-doing classes. Now the parser classes have the same, single responsibility – to parse the command line and create appropriate objects, the difference being that they setup a StuffDoer instance.

SomeParser for example will parse /foovalue=5, create a SomeStuffDoer object and set its Foo attribute to 5. Each StuffDoer wraps an API call. SomeStuffDoer.DoStuff() calls API.DoSomeStuff(param1). The DoStuff method now provides a common interface to all API calls.

We call the StuffParser the abstract factory, the classes deriving from it – the concrete factories, the StuffDoer class is called the abstract product and the classes deriving from it – the concrete products.

And the previously smelly code now looks like this:

BasicParser parser = new BasicParser();

parser.Parse(args);
parser.GetStuffDoer().DoStuff();

That’s it. The Single Responsibility Principle holds, the implementations of interfaces remain hidden, the code stops smelling.


The Bridge Pattern

27 March 2009

The bridge pattern is a great design pattern and an interesting thing about it is that most of the time the need for it doesn’t exist from the beginning and only appears along with new requirements. The initial design may be very good and not implement it, then other requirements appear and, if the proper refactoring isn’t done, the code can easily become a mess.

For example, working on a compiler, we end up with an Assembly base type which represents any assembly. Let’s also assume it has an abstract Compile() method used to generate code.

Going further, we make a distinction between an application (.exe file) and a library (.dll file) but since they are both assemblies and, of course, both can be compiled, we derive them from the base Assembly type and make sure each one implements its own Compile() operation. Now we have the following structure:

bridge1

As I said, a lot of times the need to implement a bridge pattern arises with new requirements, not being evident from the start. I would say the design is great the way it is. For now. Consider a new requirement of having separate debug and release builds – debug build means embedding some sort of debug information in the compiled assembly, release means optimizing code for best performance.
Having all compilation logic in the Application.Compile and Library.Compile methods and considering there isn’t that much difference between a debug and release build – they do have a lot of common processing, one might be tempted to continue deriving from the already existing classes:

bridge2

Now DebugApplication is a really small class, its Compile method just adds debug information and calls base.Compile(). Same thing with ReleaseApplication – perform some optimization and call base.Compile(). For the same resons, DebugLibrary and ReleaseLibrary are also two really small classes. We end up with four classes but one might consider they are not hard to maintain since they don’t contain a lot of code themselves, they just override one method. Responsibilities are clear and developers are happy.

Until a new variable is required. Let’s say that after some time, the need for a build targeting Mono instead of .NET arises. For simplicity’s sake, let’s assume that only one kind of build is required for Mono assemblies. We would need to create two additional classes MonoApplication and a MonoLibrary. Then the need for a new type of assembly appears. We would need to create 4 additional classes – the base type and the Debug, Release and Mono children. As you can see, things start to grow exponentially.

The solution is to refactor to the bridge pattern as soon as the first new requirement (Debug and Release builds) makes its appearance. Create a new Compiler base type that handles all compiling behavior. Derive DebugCompiler and ReleaseCompiler from it. Have Assembly keep a reference to a Compiler instance and use that in deriving classes to compile the code. Have the Application.Compile() and Library.Compile() delegate to that compiler instance.

bridge32

Let’s say that after some time, the need for a build targeting Mono instead of .NET arises. We would need to create a MonoCompiler class. Then the need for a new assembly appears. We would need to derive a new class from Assembly. That’s it.

Notice that going with a bridge from the start, it would’ve been overdesigning – having the following five classes just for compiling an Application and a Library seems a bit too much:

bridge4

I know the pattern might seem pretty obvious when presented in a simplified example, but a lot of times it’s rather difficult to spot in real-life. The first sign of something going wrong is, as above, when you notice you must create more than one class in the same hierarchy to introduce a single new varying thing.


Inginerie software

3 December 2008

Mă gândeam azi câte lucruri trebuie să știe un programator. Mă refer la un programator bun, care poate produce cod de calitate, nu WTF-uri.

Un programator bun trebuie să știe cel puțin un limbaj de programare, concepte de programare orientată obiect, design patterns, cum se scrie cod curat și ordonat, cum se comentează codul, Unified Modeling Language, cum se scrie documentație clar și cuprinzător, arhitectura calculatoarelor și, în mare, cum se execută un program, cel puțin o platformă și cam ce pune ea la dispoziție, structuri de date și algoritmi, despre internaționalizare, concepte de securitate, expresii regulate, despre unit testing, metode de debugging, metode de testare, despre life cycle-ul unui proiect, un IDE și tool-urile aferente, cum se utilizează o aplicație de source control, cum se utilizează o aplicație de bug tracking. Nu ar strica nici cunoștințe despre XML, baze de date și SQL, multithreading, network programming și lista poate continua.

Disclaimer: cu lista asta nu vreau să mă laud câte știu eu – de fapt mai am și eu puncte de acolo de acoperit/aprofundat.

Adevărul, extrem de interesant, este că în momentul în care cunoști toate cele de mai sus, ai toate instrumentele de care ai nevoie pentru a produce cod de calitate. Desigur, doar cu acestea nu poți face nimic. Trebuie să studiezi/înțelegi/aprofundezi și domeniul pentru care vrei să dezvolți, fie el Business Intelligence, scrisul de drivere, procesarea limbajului natural sau construirea jocurilor. Deci trebuie să ai capul mare :)


Nu uitați de expresiile regulate!

10 November 2008

Cred că dacă m-ar întreba un boboc de la informatică ce ar trebui să învețe, i-aș recomanda, printre primele lucruri, să învețe despre expresii regulate. Pentru că ai nevoie de ele aproape peste tot și totuși se insistă atât de puțin asupra studierii lor. Sunt funcții care fac te miri ce procesare de text în te miri câte linii de cod (cu bug-urile aferente) care pot fi reformulate cu o simplă expresie regulată.

Am dat astăzi pentru o funcție care să extragă ultimele n extensii ale unui fișier. Pentru exemplu.cu.trei.extensii, dacă se dorea extragerea ultimelor două extensii, rezultatul era .trei.extensii. Dacă fișierul nu avea nicio extensie, nu trebuia întors nimic. Simplu, nu? Să vezi funcție scrisă (în C#) cu IndexOf(”.”), cu un șir de caractere la care se tot adună extensii și te miri ce verificări pentru cazuri speciale. Vreo 15 linii de cod pentru asta.

Cum se face cu expresii regulate? Păi în primul rând extensiile au forma .extensie deci trebuie căutat ”.” și orice alt caracter de oricâte ori. Asta înseamnă \.[^\.]* – punct este un caracter special, deci trebuie folosit escape character-ul ”\”. [ ] este o clasă de caractere. ^ neagă clasa de caractere, deci orice în afară de ce este conținut între [ ]. * înseamnă de zero sau mai multe ori. Dacă este nevoie ca  ceva să se repete de un număr fix de ori, se folosește {i} unde i e numărul de repetiții dorit. Deci pentru 3 extensii, {3}. Și pentru a face matching-ul la capătul care trebuie – ultimele n extensii, nu primele – se folosește $ la capătul expresiei, pentru a o lega de capătul textului.

Pe scurt, în C# se poate înlocui o funcție ca cea despre care vorbeam la început, cu următoarea linie de cod:

Regex.Match(input, @"(\.[^\.]*){" + i + "}$") unde input este numele fișierului și i este numărul de extensii dorit.

Dacă susții că te pricepi cât de cât la programare și trebuie să cauți sau să extragi sau să validezi sau să înlocuiești sau să faci ceva, orice cu un șir de caractere sau un text, primul lucru la care trebuie să te gândești este cum faci asta cu expresii regulate. Primul lucru. Cel mai probabil că ceea ce încerci să faci cu o combinație de IndexOf, Split, Trim etc., băgate și într-un loop pe deasupra, poți face cu o singură linie care conține un Regex.Match. Trust me!