Data crunching desktop

Man this is annoying. I need the fastest computer I can buy for around $1.2-1.5K for data crunching. I don’t need a graphics card. Just a fast CPU, fast SSD, and fast memory (32GB), I suppose. For the life of me I can’t find one to buy on Newegg. All the powerful computers are either for gaming, with nice graphics cards I don’t need, or are “workstations” which cost $3-4K for some weird reason… I really don’t have the time to build one myself. Anyone have any ideas? I don’t need to buy it from Newegg… they just seem to have really fast shipping around here.

Thanks!

Buy a Dell XPS 8700 (the new one with the refreshed Haswell i7 that runs at 4GHz base speed). They were just selling it for $700 not too long ago. Buy 32 GB of DDR3 on Newegg, and get a cheap 250GB or 500GB Samsung EVO. That’ll run you about $400-500 more. Voila.

Awesome, you just saved the taxpayers of CA at least $1,000!!!

Wait, you didn’t get my consulting bill of $500!

Seriously, looking at the Dell site, get the $799.99 XPS 8700. It’s got the Core i7 4790. It’s really $749.99 after the coupon.

I rock an XPS 8700 (highly upgraded), and I love it. It’s fast, very quiet (especially if you stick it under your desk), and it’ll crunch through data like a dream. See this thread

What type of data crunching? If you’re running Excel spreadsheets, it’s one thing. If you’re writing your own routines, those good graphics cards might give you a ton more performance, depending on the specifics of what you’re crunching, how competent your programmer is, and whether or not you need double precision…

Ahh, one can dream. That would be awesome, but I don’t have the programming skills to do it, and hiring someone would cost $$$. Federal grants these days have a funding rate of 3-5%, so the chance of me getting funding to do something like that is slim. Too bad, too; it could be very useful, and the software is all open source, so it’s certainly possible.

3-5%? Ick. What field are you in?

Any chance you can describe where your data comes from (database, spreadsheets, printouts), what it looks/tastes like (numbers, text, pictures), and what kind of crunching you’re doing with them, without giving any state secrets away?

Biological sciences.
Funding comes from NIH and NSF mostly. NIH R01 grant funding rates (the good-sized grants most people get to fund their labs) were about 16% in 2013. Other agencies seem much lower, although it’s really hard to get a definite number. Talk among my colleagues seems to center on 6%, but who knows, maybe everyone is just being pessimistic.

Lol no secrets here… The data comes from tandem mass spectrometers (Mass spectrometry - Wikipedia), which are probably the coolest machines on earth. After some fiddling the data ends up in an XML file, and then one of the things I do with this XML file is use software like this (X! Tandem) to match the tandem mass spectra (MS/MS spectra) to theoretical spectra from known peptide sequences. There is other software that does this, but this is the main one I use. These peptides are then reconstructed into proteins. We do this to determine what proteins are in something (a cancer cell, for example) and how they change. Anyway, that’s it in a nutshell. This software can take 15-60 minutes per XML file, and other software can take 1-2 hours. It becomes a bottleneck when you have 20-100 of these XML files to crunch. We have an SGE cluster which I normally use, but my nodes are old and it’s sometimes just faster to do the crunching on a fast desktop.

If this were a movie, this is where I go: “In English, Poindexter.”

Heh… I was going to try to explain it, but I realized it wouldn’t really be any more comprehensible.

Basically, he’s doing a deconvolution of a tremendously complex aggregate function by finding a best fit via sorting relative to probably hundreds of thousands of known single functions in arbitrary combinations.

Way back when I would’ve said something like “This is a cool problem. Maybe I’ll poke at it sometime to see how hard it really would be to adapt to GPUs.”

However, these days I’d have to get around to it in my Copious Amounts of Free Time™*, so good luck with it, Exactive. (Also, I presume your known signal database is pretty massive, which means GPUs would actually be pretty swamped by data thrash; they’re not particularly well suited for purposes like this due to memory issues. You wouldn’t happen to know the size of the theoretical spectra data set, would you, Exactive?)

*Copious Amounts of Free Time™ means when I have nothing else I need to be doing for my day job. I.e. when I’m either tenured, or retired/on my deathbed. And since I’ve pretty much given up pursuing a tenure track position…

It sounds like what you want is as many cores as possible, even if each core isn’t the fastest thing on earth. The cheapskate’s way to parallel-process this kind of problem is to just start multiple instances of your program, each working on a different input file. You need enough memory, so you need to know how much a single instance requires. Three basic (cheap) i7 boxes could be processing 20+ files at once (I usually leave one thread free so the OS can do stuff, but it could be that Linux is much smarter than Win7 about resource management).
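
Something like this is what I mean, just as a sketch. It needs only Python’s standard library; the “tandem” command name and the one-parameter-file-per-run invocation are my assumptions, so swap in whatever your search program actually expects:

[code]
# A minimal sketch of the "multiple instances" approach: run one copy of the
# search program per input file, a few at a time.
# NOTE: the "tandem" command and its single-argument invocation are assumed;
# substitute your actual program and arguments.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_JOBS = 7  # e.g. 8 hardware threads, leaving one free for the OS

def run_one(param_file):
    # Each job is a separate OS process, so the work spreads across cores.
    return subprocess.run(["tandem", param_file], check=False).returncode

param_files = sorted(glob.glob("runs/*.xml"))
with ThreadPoolExecutor(max_workers=MAX_JOBS) as pool:
    for f, rc in zip(param_files, pool.map(run_one, param_files)):
        print(f, "finished with exit code", rc)
[/code]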

[QUOTE=mouselock;3615503]Heh… I was going to try to explain it, but I realized it wouldn’t really be any more comprehensible.

You wouldn’t happen to know the size of the theoretical spectra data set, would you, Exactive?[/QUOTE]

Typically I have 50 thousand spectra I need to match per analysis. Usually each spectrum is compared to 5 thousand to 5 million theoretical spectra. Each spectrum has between 50-500 ions it needs to match, score, and calculate a statistical confidence for (here is a decent explanation). The median is probably around 20 thousand, but I’m just guessing. There has been talk about getting software like this to work on GPUs for years, but it has never happened. Maybe there is a technical reason, or maybe there has never been funding to do it… who knows.
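
Just as a back-of-the-envelope on those numbers (they’re all rough guesses on my end, so treat this as an order of magnitude at best):

[code]
# Rough scale of one analysis, using the numbers above (all guesses).
experimental_spectra = 50_000      # spectra to match per analysis
candidates_per_spectrum = 20_000   # median number of theoretical spectra compared
ions_per_comparison = 100          # somewhere in the 50-500 range

total = experimental_spectra * candidates_per_spectrum * ions_per_comparison
print(f"~{total:.0e} ion comparisons per analysis")  # ~1e+11
[/code]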

The size of your data set is likely a big problem. Even though your theoretical spectra probably contain a few dozen to maybe 100 delta functions for matching, the sheer size of the spectral space you have to map against is tremendous. And, I suspect, the algorithms are probably pretty “branchy” (that is, you find one potential spectrum that fits, and that narrows down your remaining sample space, then you find another and it narrows it down again, but at each step there are multiple spectra that could be chosen, and so you end up with multiple tree-like structures). If so, it’s a bad data access paradigm for GPUs. Not undoable, but not straightforward, and it requires some real “think outside the box” algorithm construction.

Pretty cool problem from my side of things. But it’s probably akin to sequence alignment and the like (although I believe there are GPU-based sequence alignment programs these days…) for implementation details/difficulty. The slides look like they’re roughly how I’d envisioned them, though the discarding of spectral data from the reference spectrum is unexpected. If it’s a large peak, and you’re going to say “Hey, this set of peaks is possibly from this compound here!”, you’d expect to find those peaks there. I’m guessing there are systematic reactions that can account for peak degradation in certain cases?

Stuff like this makes me wish I’d segued into bioinformatics/could feasibly still explore it. Nifty physical chemistry/math problem.

Heh… -I- thought my description was simpler.

Try this:

Think of a mathematical function. Any function. Say sin(x). You can represent sin(x) by adding other mathematical functions together. For example:

sin(x) = sin(x) (Trivial!)
sin(x) = x - (x^3)/6 + (x^5)/120 - (x^7)/5040 + … (Taylor series)

or, in general:

sin(x) = a·f1(x) + b·f2(x) + c·f3(x) + d·f4(x) + …

The -best- representation of a function is generally the one which has the smallest basis set (i.e. the one with the fewest terms on the right hand side). So that’d be, in this case, just sin(x) itself. (Trivial result, right?)
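
Just to show that the Taylor series above really does add up to sin(x), here’s a quick numeric check (nothing mass-spec specific, only the series itself):

[code]
# Partial sums of the Taylor series for sin(x) converge on the real value.
import math

x = 0.5
terms = [x, -x**3 / 6, x**5 / 120, -x**7 / 5040]

partial = 0.0
for n, t in enumerate(terms, start=1):
    partial += t
    print(f"{n} term(s): {partial:.10f}")

print(f"math.sin : {math.sin(x):.10f}")
[/code]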

In Exactive’s case, the left hand side is his mass spectrum (which is the output from an instrument that systematically takes fragments of stuff, in this case it sounds like biological samples, breaks them up under known conditions, and then produces some output based on the fragments these samples break into). The idea is that the process of breaking up the samples and collecting the data is reproducible: A 100% pure sample of substance A will always produce the same output function.

So then, the question becomes: what is actually in substance A? (He cares, because if he knows what is in it, he might know how to modify or otherwise interact with it. If substance A is the proteins generated inside a cancerous growth’s cells, this is probably something of interest, for example.)

What he’s trying to do, in this case, is figure out the equivalent of the right-hand side of the above equation, because knowing the pieces that go into his output spectrum tells him something about the proteins that were originally in the system. If he knows the proteins originally in the system, he can use that information to try to help understand what’s normal/different about the system and how to better interact with it.

So, anyway, back to the fitting. He has the left hand side above, and he basically has a huge book of potential functions for the right hand side. The problem is that the left hand side of his data is extremely complex: spikes all over the place, with different heights, often so close to each other that you get a function that may look like one shape but is actually multiple other shapes added together. So the basic approach is to pick a potential function for the right hand side and ask “Do I see points on my data where this function shows there are points?” If so, that function from the book is a potential fit, so you note it and move on to the next function. He’s flipping through a list of roughly 5k-5M functions for the right hand side for each of 50k functions on the left hand side. Each comparison is relatively easy (relatively!), but the sum total of all the comparisons is not. But the basic idea is:

For each real spectrum (left hand side):

  1. Pick a model spectrum (right hand side).
  2. Does it fit at all? If so, give it a score based on how well it fits(a).
  3. Go back to 1 for the next theoretical spectrum (a toy code sketch of this loop follows the footnote below).

a) “How well it fits” is non-trivial, and is where the algorithm for fitting he linked comes in. I have issues with the link he gave, because it seemingly throws out peaks that should be there for the model but may not be in the real spectrum. However, there may be reasons for this couched in the chemistry that I’m ignorant of.
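
If it helps, here’s a toy version of that loop in code. The scoring (just counting model peaks that land within a tolerance of some real peak) is something I made up for illustration; real tools like X! Tandem score far more carefully and attach a statistical confidence:

[code]
# Toy spectrum matching: for each real spectrum, score every model spectrum
# by counting model peaks that fall within a tolerance of some real peak.
def score(real_peaks, model_peaks, tol=0.5):
    return sum(
        any(abs(r - m) <= tol for r in real_peaks)
        for m in model_peaks
    )

def best_matches(real_spectra, model_spectra, min_score=3):
    results = {}
    for scan, real_peaks in real_spectra.items():            # left hand side
        hits = []
        for name, model_peaks in model_spectra.items():      # right hand side
            s = score(real_peaks, model_peaks)
            if s >= min_score:                               # "does it fit at all?"
                hits.append((s, name))
        results[scan] = sorted(hits, reverse=True)           # best fits first
    return results

# Tiny fake data: peak positions only (real spectra also carry intensities).
real = {"scan_1": [114.1, 228.2, 341.3, 455.9]}
models = {"PEPTIDE_A": [114.1, 228.2, 341.3],
          "PEPTIDE_B": [99.0, 200.5, 310.7]}
print(best_matches(real, models))   # {'scan_1': [(3, 'PEPTIDE_A')]}
[/code]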

Hopefully that helps some? I understand the generics of what he’s doing quite well (one of my tasks right now, albeit on the backburner, is to find better methods for functional decomposition of interaction functions for chemical simulation, which is a conceptually similar task to finding the base spectra from an aggregate spectrum), but not the specifics (since I don’t work in bio, and in general don’t have a good enough organic chemistry knowledge to understand how potential side reactions muck with the transition from theoretical to actual spectral components).

If that doesn’t help and you’re interested, holler, and I can try to clarify whichever part(s) you need more info on. If not, more cute dogs looking confused is a great signal! ;)

(Sorry, I’m a science geek of the worst kind, who assumes everyone else would also be if only they had complex things explained, so I’ll explain as long as folks would like. Unfortunately, I may not be good at breaking it down into appropriate sized pieces and assumptions for random audiences. And there’s no getting around at least a little bit of math knowledge; though if you’re not aware of functions in concept then that may be a hard limit. :) )

This is a nice observation; I don’t remember the original reason the software I use does this. There are more sophisticated matching algorithms out there for sure, but they can be 10-30x slower…

This is considered the best review out there, if you’re interested:

http://www.sciencedirect.com/science/article/pii/S1874391910002496

Thanks for the ref. I must say I like the fact that all the lit in your side of the creek seems to be freely available. Most any ref I gave you from my work would pop up a big disclaimer about needing to log in with credentials or pony up from $28 to $135 to download the article.

The paper he linked is in an Elsevier journal - they are the poster child for ‘super high subscription prices, everything behind a paywall’ commercial academic publishing. But even they have been forced by NIH rules to open access to papers based on federal funding a year after publication.