Database question

playingwithknives · October 20, 2017, 10:56pm

I have a SQL database a work with a hierarchy I designed from a business perspective.

Companies have records
Individuals have records
These are joined by 20 different types of relationships. i.e director or shareholder.
Groups of companies will be connected by relationships showing shareholdings or control, with the parent company at the top.
The hierarchies can be complex but aren’t circular.
75,000 active entities, around 45,000 relationships iirc.

This is a fairly normal example, it can get a lot more complex. Imagine 300 helicopters and 600 jets and 50 billionaire owners.

Now give me a list of all Sikorskys 57s where the ultimate owners have a “corrupt” flag in his record.

I have been trying to get the dev team in charge of this database to write me reports. They cannot deliver. They cannot report on indirect relationships and return me lists of companies based on criteria. i.e all companies owned by company X, or individual Y. Or all individuals with a flag/value attached indirectly to certain criteria i.e companies based in USA.

Im of the feeling this is due to them being rubbish. Does this sound like an impossible task to anyone? I use third party databases that have hierarchies in databases with millions of records. I can manually do these queries level by level i.e I can use the inbuilt user “advanced find” interface to find all direct relationships, and then filters and vlookups in excel. I can draw/picture the logic required and express this in diagrams and text. I give examples of large tables I would like to see. I do struggle with doing indirect relationships without painfully going through it one by one. Ive also given them an example of what a flat single table would look like holding all the data because i could work off that too but they cant do that.

but my guys cant do it. They tried using Power BI, Tableau, Ive had 3 devs fail to give me a working report of a fairly easy query involving a less than 100 entities in total. They talked about a mirrored server to put something on too.
We outsourced our IT to a giant outsource company. Our dev team is transitional and its difficult to explain how company ownership and control works to those not in my line of work. There may be language barriers too.

The database (built by someone else) can display the hierarchy of a single entity so I know this can be done. There used to be a working group structure report but that was broken in a patch post-outsourcing.

Why cant this be coded? Why cant they deliver my reports? Is it me? Ive been the business side of IT projects for years now, I know what clueless requests from business look like but i dont feel this is it, but perhaps I am wrong.

and well, we all know what its like dealing with outsourced services. It could be they are just crap.

Fishbreath · October 20, 2017, 11:23pm

This is my vote. I’m not a database guy first and foremost (except I’m kind of becoming one on my company’s current project), and while the schema you’ve described sounds complicated, I don’t think it sounds unworkably so.

Clay · October 20, 2017, 11:28pm

I do a lot of database work. Either there are fundamental design flaws in the schema or your dev team is terrible. I’d be happy to take a look at the schema if you’d like.

CLWheeljack · October 20, 2017, 11:59pm

What they said. With those kinds of numbers (< 100K), you should absolutely be able to build a working model to give you those answers, even if you have to brute force it by tracking each entity to their eventual owner.

Now, it might be hard or even impossible the to do in SQL, but if you have access to the data, they should absolutely be able to do a one off analysis by, say, ETL’ing it somewhere. But, if they’re database guys, that probably isn’t their job.

Basically, it sounds like it’s easy to do for one record, but maybe hard to do for all records at once because of the data structure. More than hard, maybe prohibitively expensive. But there are few enough records (i.e. not, like, millions), that you could probably do it in reasonable-ish time frames (which might means dozens of hours) even just running on desktop hardware.

Guap · October 21, 2017, 12:01am

Yeah, it’s all about how the data is schema is and the key structure. When you want a report you are defining the filters, essentially. There will probably be grouping too, and probably sorting.

Either the relationships in between these tables is messed up (the key structure) or your guys are not getting a clear enough picture of the groupings/filters you want, or the team is bad. I’m guessing the team who built the database is not the same ones writing the reports.

arrendek · October 21, 2017, 12:20am

Are you sure it’s SQL and not something like NoSQL? Either way it’s doable but one is more difficult.

Clay · October 21, 2017, 12:34am

Either way, it’s a small database and could be reloaded into a better schema in less than a day, most likely.

playingwithknives · October 21, 2017, 12:53am

OK there’s enough in replies to give me an idea that im probably missing out some essential stuff from my question.

me: not in IT, not sure of exact tech. Advanced user only. Basic Access knowledge.
schema: not as good i bet i think it is
front end: MS Dynamics.
database: SQL
size: what if it was 500,000 entities in 5 years…
original devs: some kind of MS reseller consultancy
reports: run against the live database on a regular basis. could accept a mirror or data dump done overnight
budget: tiny
outsource co: the lowest tender offer
PBIs growing daily
Bugs: growing daily
complex integration projects with downstream systems taking up resources: lots
The bugfix that went in to fix the bugfix that went in to fix the bug: yeah a few
the guy who knew what he was doing: left
analysts creating records and relationships: recently replaced by offshore workers costing tenth what an experienced analyst makes.
data quality: going downhill

Yes, I think its also more than just the team thanks to replies, which i really appreciate. The brute force comments tie in with their lack of mirror server budget/sign offs from a maze of committees/god knows what else and my view that this was easier than perhaps it might be.

Thinking about it more if i had time i could probably export to access, run a make table query for direct relationships, then run a query on the new table for indirect level 2 , then a query on that for level 3 etc but its still manual and time consuming but maybe thats what im stuck with.

If i had visio i could lay out something but i’d have to do something that doesnt resemble my companies property too closely but I’m only trying to describe what actually exists, the underlying structures/concepts are public domain and we see in things like the Panama Papers.

They have more data and companies than me by a magnitude. I do note the web interface only does direct relationships and for single entities at a time. Perhaps i just need to fuck off to another of the big players and get to play with a 8 figure budget and a few million clients.

playingwithknives · October 21, 2017, 1:18am

I’m sure if i could re-do from scratch with someone knowledgeable this is the solution. I wonder how much could be changed in terms of relationship management and joins without amending the records or changing the system?

but huge company + labyrinth bureaucracy + outsourced + reduced budgets + its thoroughly integrated with other systems + its also a workflow tool + does global KPIs + this hierarchy is just a part of it+ im not the decision maker + urgh well i guess you get it

Clay · October 21, 2017, 1:19am

So – I routinely run live queries to update dashboards against tables in relational database that have > 20 million rows. With the proper indexing, it’s extremely fast. The size of the data you have isn’t the issue. There’s nothing wrong with MS SQL Server – it should be just fine. It sounds like either the people who run reports don’t know how to write SQL and can’t figure out how to work with the schema through the front end application they use (which is surprisingly common) or maybe they’re trying to upset you on another product.

SQL is pretty easy to learn to write. If you have direct access to the database, you could try it yourself. Otherwise, if you have a friend who knows SQL and you can have them take a look.

Do you have an interface for writing SQL queries against the database?

playingwithknives · October 21, 2017, 1:57am

I don’t, but the team do, I’ve watched them use it (MS SQL Something). They can all give me SQL queries that pull data from lots of tables and show direct relationships based on multiple criteria. I dont understand why anything beyond direct relationships is a struggle, I can describe the query in plain English quite easily. Ive amended SQL in the data link section in Excel and in Access in the past (changing field names etc) but might struggle beyond the basics.

In the long ruin would like it to be even more complicated, i,e return records based on the maths around indirect and direct shareholding percentages but for the meantime Im prepared to even accept a flat table showing all relationships that i could just manipulate in Power BI or Excel.

barstein · October 21, 2017, 2:37am

Based on the little I know about the various MS Dynamics products, they’re pretty inflexible in the reporting department to begin with, and require help from a Microsoft-approved reseller network developer for customized reports. I’m guessing that your dev team is simply not qualified/permitted to create the custom reports you need. I don’t know what your budget is but my recommendation would be to shop around for another MS-approved dev and get an estimate on a reporting solution.

baren · October 21, 2017, 4:06pm

It’s been a long time since I dealt with real database programming, but what you’re asking for sounds like it should be easy to do. If the request can be encoded as SQL, the results can be fed into something like Crystal Reports and made pretty slick. So the question is, how hard is it to make the query?

Answering that question really depends on the schema. There are things that can be done “wrong” in a schema that will make interrogating the data harder. I remember an example from my time working with IBM. There, the project lead was a general software guy, with little database experience. He modeled all the 1-to-many relationships as many-to-many relationships, with little join tables, arguing that these were more flexible than 1-to-many table setups. I pointed out to him that this this choice was increasing the costs of all the development around this thing, made queries take longer to run, and had opened the door to bad data that the apps kept having to take into account.

He was inflexible in this position. When the project went past deadline, he was removed and I was put in charge. I promptly fixed the schema, and things started moving much faster.

Here are some of the things I can think of that would make your request harder (but not impossible).

There’s a kind of soft limit on the number of joins in a query. Crossing that limit will not just slow down the query, it starts thrashing the disk with temporary tables, and could result in slower access for everyone.
It’s possible there aren’t indices on one or more of the relationships in your query. SQL will happily let you express that, but it runs dog slow. Fixing it requires modifying the schema, which typically requires a DBA to get involved, and will consume additional disk space.

cicobuff · October 21, 2017, 11:10pm

Have you explored Graph Databases?

Based on the information you have shown, it may serve you better than a traditional SQL based schema and storage architecture.

It heavily depends on the types of reports/queries that you are trying to build though. As in what are you trying to find out from the data/relationships you have.

I’m not an expert in Graph Databases, but just letting you know they exist and may fit your particular use case. Because “there are 20 different types of relationships” with “complex hierachies” screams bloody murder with traditional SQL queries when building reports and analysing data.

Clay · October 21, 2017, 11:33pm

I’ve used graph databases a lot. Neo4j is free, easy to query, has a nice browser-based GUI, and might help.

jsnell · October 22, 2017, 2:05am

This particular task would indeed be very painful when written in a single query with just joins. The SQL construct your developers need to learn about are Common Table Expressions (CTEs). It’s not a commonly taught feature, so it’s somewhat excusable for your developers to not be aware of it.

Here’s an example (in Postgres syntax; but MSSQL has the same concepts with just minor differences). Let’s simplify things to the core problem, with just one kind of entity and one kind of relationship. You’d have something like this:

create table company (
  id int primary key,
  is_corrupt boolean);
create table relationship (
  owner int references company(id),
  owned int references company(id));

Let’s say we have companies 1-8, with just 1 being marked as corrupt. Then define ownership relationships that look like the following:

jsnell=# select * from relationship;
 owner | owned 
-------+-------
     1 |     3
     2 |     4
     1 |     5
     2 |     5
     3 |     6
     4 |     7
     6 |     8
     7 |     8

To find all the directly corrupt companies (i.e. 1), you’d do something like:

select * from company where is_corrupt;

To find all companies that are corrupt, or directly controlled by corrupt companies (i.e. 1, 3, 5) you’d do:

select * from company
  where is_corrupt or
             id in (select owned from relationship
                      where owner in
                        (select id from company where is_corrupt));

Obviously this kind of approach doesn’t scale if you have more than a couple of layers of indirection. The query needs to be generic. That’s what a recursive CTE allowed. To find any company that’s transitively owned by a corrupt company you’d write something like:

with recursive controlled_by_corrupt as (
    select id from company where is_corrupt
  union all
    select owned from relationship as r,
           controlled_by_corrupt c
      where r.owner=c.id)
select * from company
  where id in (select * from controlled_by_corrupt);

And the result would be:

 id | is_corrupt 
----+------------
  1 | t
  3 | f
  5 | f
  6 | f
  8 | f