I don’t know the name of this field of study, or if there is such a field, so I am sending out a call to my fellow math nerds for help. It smacks of combinatorics, permutations, and set/ring theory but I’m coming up empty. Answer the call, fellow pi mu epsilons and others! Ah-oooga! Ah-oooga!

I’ve got a set of arbitrary numbers, say customer account values, and I’m trying to determine if any combination of these accounts, summed up, will arrive at a given number (this given number is not a member of the set of customer numbers, aka this is not a straightforward additive ring problem).

For instance, let’s say I get a report from another group saying we have $139,827.91 in assets for California clients. My exhaustive list of client account values does not list their states, so I am trying to test every combination of accounts to “extract” which clients live in California, and how many clients are in California. I have no information beforehand except the individual account values and the total target value.

Any help? Is the problem NP-hard, such that brute force is the only real solution? There are certainly some easy optimizations to a brute-force algo, but still.
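For what it’s worth, those “easy optimizations” usually amount to sorting the values and pruning dead branches. A sketch of my own (not anyone’s production code), assuming the dollar amounts have been converted to integer cents first, with a made-up `accounts` list standing in for real data:

```python
def find_subset(values, target):
    """Return one subset of `values` summing to `target`, or None.

    Brute-force depth-first search with two standard prunings:
    stop when the running sum overshoots the target, and stop when
    even taking every remaining value cannot reach it.
    Assumes all values are positive integers (e.g. cents).
    """
    values = sorted(values, reverse=True)
    # suffix_sums[i] = sum of values[i:], used for the "can't reach" prune
    suffix_sums = [0] * (len(values) + 1)
    for i in range(len(values) - 1, -1, -1):
        suffix_sums[i] = suffix_sums[i + 1] + values[i]

    def search(i, remaining, chosen):
        if remaining == 0:
            return chosen
        if i == len(values) or remaining < 0 or suffix_sums[i] < remaining:
            return None  # overshot, exhausted, or unreachable
        # Branch 1: take values[i]; Branch 2: skip it.
        return (search(i + 1, remaining - values[i], chosen + [values[i]])
                or search(i + 1, remaining, chosen))

    return search(0, target, [])

# Tiny demo with invented account values, in cents:
accounts = [12500, 33010, 9999, 150000, 42001]
print(find_subset(accounts, 64500))  # prints [42001, 12500, 9999]
```

Even with the pruning this is still exponential in the worst case; it just dies much later in practice.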

No, I’ve got about 3800 accounts on report one I’m trying to add up to fit into about 40 categories on report two, and neither of these reports is under my purview so I can’t determine how either sliced up the data.

Hmm, thanks for the leads guys. Both of those seem to be more in line with best fit or min/max solutions, but I think that will help get me going down the right path. Thanks!

edit to Mike Jamieson: yeah that seems to confirm Talisker’s idea. Thanks.

On a more serious note, you might be able to formulate it as an integer linear program, relax it to a plain LP, and then search from there for feasible integer-only solutions (branch and bound, essentially).
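Spelled out, that formulation is a 0/1 feasibility program over the account values and the reported total (writing the account values as a_i and the target as T):

```latex
\text{find } x_1, \dots, x_n
\quad \text{subject to} \quad
\sum_{i=1}^{n} a_i x_i = T,
\qquad x_i \in \{0, 1\}.
```

The LP relaxation replaces x_i in {0,1} with 0 <= x_i <= 1; if even the relaxed program is infeasible, no subset of accounts can hit the total, which makes it a cheap screening test before any exact search.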

Before worrying about the processing required to come up with these values, you’re aware this problem isn’t perfectly solvable, even setting aside the NP-hardness, right? Unless the 3,800 values are spread across several orders of magnitude, which I’m guessing they are not, you’re very likely to have many possible combinations that MIGHT be the ones that make up each of the 40 totals.
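To put a number on that worry: a standard counting variant of the subset-sum dynamic program tallies how many distinct subsets hit a given total. Even a toy example with invented values shows how quickly collisions pile up when the values sit on similar scales:

```python
def count_subsets(values, target):
    """Count the subsets of positive-integer `values` that sum to
    `target`, via the classic O(n * target) dynamic program."""
    ways = [0] * (target + 1)
    ways[0] = 1  # the empty subset
    for v in values:
        # Iterate downward so each value is used at most once.
        for t in range(target, v - 1, -1):
            ways[t] += ways[t - v]
    return ways[target]

# Eight small, similar-sized values already give many collisions:
print(count_subsets([1, 2, 3, 4, 5, 6, 7, 8], 18))  # prints 14
```

Fourteen different subsets all hitting the same total, from only eight values; 3,800 real account balances will be far worse unless the totals happen to pin things down.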

Mike Jamieson: Thanks for the further insight. I think I can get away with the {0,1} case, however, since the 0/1 knapsack problem with all values > 0 can be solved in pseudo-polynomial time. Fortunately I don’t need to generalize this problem at all for future use; I’m working with a specifically defined set of unique, non-negative accounts. :)
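In case it helps anyone else, that pseudo-polynomial routine is a dynamic program over reachable totals. A sketch, again assuming everything is converted to integer cents up front (so the table stays integral and floating-point never enters); the account values here are invented:

```python
def subset_summing_to(values, target):
    """Return one subset of positive-integer `values` summing to
    `target`, or None.  Dynamic program over reachable totals,
    O(n * target) time; `values` in integer cents avoids rounding."""
    # parent[t] = (previous_total, value_used) for one way to reach t
    parent = {0: None}
    for v in values:
        # Snapshot the keys so each value is used at most once.
        for t in list(parent):
            nt = t + v
            if nt <= target and nt not in parent:
                parent[nt] = (t, v)
    if target not in parent:
        return None
    subset, t = [], target
    while parent[t] is not None:
        t, v = parent[t]
        subset.append(v)
    return subset

accounts = [12500, 33010, 9999, 150000, 42001]  # cents
print(subset_summing_to(accounts, 43009))  # prints [9999, 33010]
```

The running time scales with the target in cents, so a $139,827.91 total means a table of about fourteen million entries per category, which is heavy but entirely doable for a one-off reconciliation.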

CCZ: That’s a good point. I expect that the probability of multiple combinations adding up to the same final number is extremely small, but my plan is to take this into account by comparing solutions as each month goes by. After two or three months, I should have eliminated any question as to which accounts go into which set for the future.

I should mention I get to make some pretty helpful assumptions for this over-time system: in particular that an account, once assigned to a set, never changes to another set. In my example from the OP, it would be like assuming that a California resident never moves to another state. Because of the nature of my real sets here, I can make this assumption. I also don’t have to give a crap about processing time or space either since the total domain of accounts is relatively small; were I working on millions of accounts or something, that would be a challenge. Double woot!
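That never-changes assumption makes the month-over-month elimination mechanical: each month, enumerate every subset of account IDs that hits the reported total, then intersect the candidate sets across months. A sketch with invented two-month data (exhaustive enumeration, which the small domain permits):

```python
from itertools import combinations

def candidate_subsets(accounts, target):
    """All subsets of account IDs whose values sum to `target`.
    accounts: dict of id -> value (integer cents).  Exhaustive,
    so only sensible for a small number of accounts."""
    ids = list(accounts)
    return [frozenset(combo)
            for r in range(len(ids) + 1)
            for combo in combinations(ids, r)
            if sum(accounts[i] for i in combo) == target]

def narrow_over_months(monthly):
    """monthly: list of (accounts, target) pairs, one per month.
    Since accounts never switch category, the true membership set
    must appear as a candidate every month; keep only those."""
    surviving = None
    for accounts, target in monthly:
        cands = set(candidate_subsets(accounts, target))
        surviving = cands if surviving is None else surviving & cands
    return surviving

# Month 1 is ambiguous ({A,B} and {C} both fit); month 2 resolves it.
m1 = ({'A': 1000, 'B': 2000, 'C': 3000}, 3000)
m2 = ({'A': 1100, 'B': 2200, 'C': 3500}, 3300)
print(narrow_over_months([m1, m2]))
```

Account values drift between months while memberships stay fixed, which is exactly what kills the false candidates.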

The complexity (difficulty of solution) of subset sum can be viewed as depending on two parameters, N, the number of decision variables, and P, the precision of the problem (stated as the number of binary place values that it takes to state the problem). (Note: here the letters N and P mean something different than what they mean in the NP class of problems.)
The complexity of the best known algorithms is exponential in the smaller of the two parameters N and P. Thus, the problem is most difficult if N and P are of the same order. It only becomes easy if either N or P becomes very small.
If N (the number of variables) is small, then an exhaustive search for the solution is practical. If P (the number of place values) is a small fixed number, then there are dynamic programming algorithms that can solve it exactly.
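The “exponential in the smaller of N and P” behavior can be improved somewhat on the N side: the classic meet-in-the-middle trick drops exhaustive search from 2^N subsets to roughly 2^(N/2) per half. A sketch of the decision version, with invented values:

```python
def subset_sum_mitm(values, target):
    """Meet-in-the-middle subset sum: enumerate every achievable sum
    of each half of the list (2^(n/2) each), then check whether some
    left-half sum has a complementary right-half sum.  Returns True
    if any subset of `values` sums to `target`."""
    half = len(values) // 2
    left, right = values[:half], values[half:]

    def all_sums(vals):
        sums = {0}  # sum of the empty subset
        for v in vals:
            sums |= {s + v for s in sums}
        return sums

    right_sums = all_sums(right)
    return any(target - s in right_sums for s in all_sums(left))

print(subset_sum_mitm([3, 34, 4, 12, 5, 2], 9))   # True  (4 + 5)
print(subset_sum_mitm([3, 34, 4, 12, 5, 2], 30))  # False
```

Still exponential, but 2^(N/2) instead of 2^N is the difference between hopeless and merely slow for mid-sized N.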

I had this very issue at a company I worked at two years ago. Faced with it, I agonized for two days trying to work out how exactly I would extrapolate the data.

Then they downsized me and it was someone else’s problem.

Good catch jellyfish, thanks. edit: Note, however, that while subset-sum stays NP-complete in general (even restricted to positive integers), with non-negative values and dollars-and-cents precision it’s solvable in pseudo-polynomial time by dynamic programming. I don’t hafta deal with the general case, nanny nanny boo boo. But thanks for the idea though. :)

That sucks bahimiron. In a sense though, it segues into why I can’t simply use the easiest solution to this problem: get the underlying methodologies from the two departments whose reports I am reconciling. Except in this “cost cutting” environment, when we’ve already had two rounds of layoffs with a third coming up, no one wants to give up any info they have for fear of giving away their visibility.

Fucking office politics. Visibility is more important than productivity. People are idiots.