Object Mother pattern with databases

Here is an article on it, in case you are not familiar: https://martinfowler.com/bliki/ObjectMother.html

I want to use this pattern to create test data. However, there is some disagreement on the team about how exactly to use it.

The contention is about something I am going to refer to as reference data.
Reference data are tables in the database that are referenced by other tables but are immutable.

For example, let’s take a State table which has a column for the State name, like “Florida”, and another column for an abbreviation, like “FL”. Maybe there are a bunch of other columns too.

Now I have another table, “Bill”. This represents some kind of bill, with line items and a bunch of other columns. One of these is a State foreign key, just so you know what state the Bill is from.

As a tester, I may want to make sure that the total on the bill is the sum of the Bill’s line items. I do not care about the State column at all. It’s incidental.

Now, using the Object Mother pattern, I want to create a system that will fabricate a bill, which will then be persisted to the database; the tests will run, and then I’ll remove the bill.
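To make that concrete, here is a rough sketch of the lifecycle I mean, assuming JUnit 5 and plain JDBC. The table and column names (bill, bill_line_item, total, amount, state_id), the BillMother class, and the connection URL are all invented for illustration, and it assumes the system under test (a trigger or application code) maintains bill.total:

```java
import org.junit.jupiter.api.*;

import java.math.BigDecimal;
import java.sql.*;
import java.util.List;

// Object Mother: builds a "typical" bill with known line items and persists it.
class BillMother {
    static long billWithLineItems(Connection conn, long stateId, List<BigDecimal> amounts)
            throws SQLException {
        long billId;
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO bill (state_id) VALUES (?)", Statement.RETURN_GENERATED_KEYS)) {
            ps.setLong(1, stateId);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                billId = keys.getLong(1);
            }
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO bill_line_item (bill_id, amount) VALUES (?, ?)")) {
            for (BigDecimal amount : amounts) {
                ps.setLong(1, billId);
                ps.setBigDecimal(2, amount);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        return billId;
    }
}

class BillTotalTest {
    static Connection conn;
    long billId;

    @BeforeAll
    static void open() throws SQLException {
        // Connection details are placeholders.
        conn = DriverManager.getConnection("jdbc:postgresql://localhost/testdb", "test", "test");
    }

    @BeforeEach
    void fabricateBill() throws SQLException {
        // stateId = 1 is a stand-in; how the State row is obtained is exactly
        // what Method 1 and Method 2 below disagree about.
        billId = BillMother.billWithLineItems(conn, 1L,
                List.of(new BigDecimal("10.00"), new BigDecimal("2.50")));
    }

    @Test
    void totalIsSumOfLineItems() throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("SELECT total FROM bill WHERE id = ?")) {
            ps.setLong(1, billId);
            try (ResultSet rs = ps.executeQuery()) {
                Assertions.assertTrue(rs.next());
                Assertions.assertEquals(0,
                        new BigDecimal("12.50").compareTo(rs.getBigDecimal("total")));
            }
        }
    }

    @AfterEach
    void removeBill() throws SQLException {
        // Clean up the fabricated bill and its line items.
        try (PreparedStatement ps = conn.prepareStatement(
                "DELETE FROM bill_line_item WHERE bill_id = ?")) {
            ps.setLong(1, billId);
            ps.executeUpdate();
        }
        try (PreparedStatement ps = conn.prepareStatement("DELETE FROM bill WHERE id = ?")) {
            ps.setLong(1, billId);
            ps.executeUpdate();
        }
    }

    @AfterAll
    static void close() throws SQLException {
        conn.close();
    }
}
```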

Now to get to my question.
There are two schools of thought on the team.

Method 1: Create the bill, but get all reference data from the database. It is not fabricated, because it is immutable and there is a lot of it (lots of tables, some of them quite large). The Mother pattern will still be used, but when a bill is created, the State will be an object with a real ID and data from the database instead of being fabricated.
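With Method 1, the mother for the State would simply read a real row instead of inserting one. A rough sketch (same invented table/column names as before, and assuming a database that supports LIMIT):

```java
import java.sql.*;

// Simple holder for a reference-data row (field names are illustrative).
class StateRef {
    final long id;
    final String name;
    final String abbreviation;

    StateRef(long id, String name, String abbreviation) {
        this.id = id;
        this.name = name;
        this.abbreviation = abbreviation;
    }
}

class StateMother {
    // Returns an existing State row from the shared reference data; nothing is created.
    static StateRef anyState(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, name, abbreviation FROM state ORDER BY id LIMIT 1")) {
            if (!rs.next()) {
                throw new IllegalStateException("Reference data missing: state table is empty");
            }
            return new StateRef(rs.getLong("id"), rs.getString("name"), rs.getString("abbreviation"));
        }
    }
}
```

BillMother would then be handed anyState(conn).id as the foreign key.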

The advantage is that you do not need to create, destroy, or keep reference data up to date. If some table gets new items, you are fine because you just pull from the real tables. The same goes for deletions and updates.

The disadvantage that I can see with this method is that you need a real database (which our current testing requires right now anyway). Also, the reference data could become corrupt (someone drops a table, deletes records, or just does something stupid).

Method 2: All data is fabricated, even reference data. This is the most “pure” approach.

There are some big drawbacks to this approach. You need to write code to create all of the reference data, and you will need to keep that code up to date with important changes. You will need to populate all this data for each test case and then clean it up. You also can’t run this against a database with any actual data in it; you could easily run into issues such as duplicate keys. For example, let’s say that the State name in our State table had to be unique. If there was already a State of “Florida” and you tried to create this reference item, it would blow up in your face.
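Under Method 2, the mother fabricates the State as well. One way to soften the duplicate-key problem in the “Florida” example is to get-or-create instead of blindly inserting. This is only a sketch (same invented schema), and it still leaves cleanup and parallel test runs as open problems:

```java
import java.sql.*;

class FabricatedStateMother {
    // Creates a Florida row if it does not exist yet, otherwise reuses the existing one.
    static long florida(Connection conn) throws SQLException {
        try (PreparedStatement find = conn.prepareStatement(
                "SELECT id FROM state WHERE name = ?")) {
            find.setString(1, "Florida");
            try (ResultSet rs = find.executeQuery()) {
                if (rs.next()) {
                    return rs.getLong("id"); // already present, reuse it
                }
            }
        }
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO state (name, abbreviation) VALUES (?, ?)",
                Statement.RETURN_GENERATED_KEYS)) {
            insert.setString(1, "Florida");
            insert.setString(2, "FL");
            insert.executeUpdate();
            try (ResultSet keys = insert.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);
            }
        }
    }
}
```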

The really big advantage of this method is that you can use an in-memory database to run all your tests. That would let you run these tests easily regardless of environment. You also have no fear of your test data being corrupted.
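For what it’s worth, a sketch of what that could look like with H2 (the DDL here is a toy stand-in for the real schema, and the URL keeps the database alive for the whole JVM):

```java
import java.sql.*;

class InMemoryDb {
    // Opens a throwaway in-memory database and creates a minimal schema in it.
    static Connection open() throws SQLException {
        Connection conn = DriverManager.getConnection("jdbc:h2:mem:billtest;DB_CLOSE_DELAY=-1");
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS state (" +
                    "id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, " +
                    "name VARCHAR(100) UNIQUE, abbreviation VARCHAR(2))");
            st.execute("CREATE TABLE IF NOT EXISTS bill (" +
                    "id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, " +
                    "state_id BIGINT REFERENCES state(id), total DECIMAL(12,2))");
            st.execute("CREATE TABLE IF NOT EXISTS bill_line_item (" +
                    "id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, " +
                    "bill_id BIGINT REFERENCES bill(id), amount DECIMAL(12,2))");
        }
        return conn;
    }
}
```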

Anyway, I would like to know what people think is the right way. Also, if you can provide links to support your point of view, it would be greatly appreciated. We have tried Googling this, but so far we have not found any definitive answer to this question.


It seems like a “should we do this right or quickly” question. There is no way you can categorize Method 1 as the right way, in my opinion. That doesn’t mean it’s not the best fit for your environment and data, though. And nobody can answer that but your own team.

Could you quantify “lots” and “large”?

Lots of tables: 100+
Large: rows in the hundreds of thousands, or perhaps millions. While a table with millions of rows isn’t unusual, the fact that it is reference data does make it a bit more extreme to deal with, especially in the scenarios above.

For reference tables with that many rows, I would not intend to load/create all of the rows, just the ones we need.

We are looking for advice or if there is a best practice. I know how I lean in this argument, but I am not saying anything because I do not want to taint any opinions.

The best option, to me, seems to be to replicate the production data (or a reasonable portion of it) into a dev/QA database that you’ll use for tests, and to keep the dump around to reset the database as needed.

Generating test data is fine but, as you mentioned, it can be a lot of work, especially if the data model keeps changing. It can also lead to artificial “safeties”: the database will only ever contain whatever data you make it generate, and not necessarily all the things that could happen in the real system, data-wise. On the other hand, if you have a tight design/system and really good control over data patterns/interactions, it might be viable. But: lots of work for sure. Work that you’ll do once (and refine/iterate on as needed), but still lots of work.

I would extract the necessary subset of this data as a series of SQL INSERT statements, so you can recreate the data if it gets wiped. Having a limited set of reference data available for tests is fine IMO, as long as it’s clear what it contains.

That way you can recreate the data if it gets messed up or wiped, and then just run tests against a test database with this subset of data available.
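If it helps, a sketch of replaying such a script before the tests run (the file name reference-data.sql is made up, and the naive split on “;” assumes no semicolons inside the data; a proper migration tool such as Flyway or Liquibase, or the database’s own CLI, would be more robust):

```java
import java.nio.file.*;
import java.sql.*;

class ReferenceDataSeeder {
    // Replays an extracted SQL script (e.g. reference-data.sql) against the test database.
    static void seed(Connection conn, Path script) throws Exception {
        String sql = Files.readString(script);
        try (Statement st = conn.createStatement()) {
            for (String statement : sql.split(";")) {
                if (!statement.isBlank()) {
                    st.execute(statement);
                }
            }
        }
    }
}
```

You would call this from a @BeforeAll (or a CI step) whenever the reference data needs to be reset.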

Ideally I would want to extract the DDL as well, so I could create a local database from scratch, but with a database of that size it does not seem practical.

We are still thinking about it. Thanks for the input so far.