[Beowulf] Latency Sensitive Algorithms
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Michael T. Prinkey mprinkey at aeolusresearch.comSat May 7 12:52:22 PDT 2005
- Previous message: [Beowulf] CCL:Opteron or Nocona ? (fwd from cavallo@chemistry.unina.it)
- Next message: [Beowulf] CCL:Opteron or Nocona ? (fwd from cavallo@chemistry.unina.it)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi everyone, I have been doing some research for the development of a next-generation CFD code for multiphase flow. In the process, I have been investigating lots of algorithm options and trying to identify trade-offs. One of our primary tasks is to improve parallel scalability, even for fairly small problems. We can "solve" the scalability problem by running large cases or by using high speed interconnects, but that is not really a solution in my mind. As most of you are aware, CFD often comes down to solutions of large, sparse linear systems. Most are either Krylov space solvers or (algebraic) multigrid solvers...or some co-mingling of the two. After studying and using solvers in many of the large packages (PETSC, Hypre, Trilinos), it seems that all of them suffer from latency limitations. By that, I mean none provide a mechanism that allows overlap of communication and computation, at least not on the level of our problem space. In fact, by and large, the solver packages don't seem to try to do latency hiding at all. I don't think that is due to laziness. These latency limitations are algorithmic. Krylov methods in all of their forms use matrix/vector and vector/vector products. The matrix/vector product can use latency hiding, but the dot products are barriered by a global sum. For (relatively) high latency networks, that is a performance limit that can't be removed. In fact, global communication of some sort is probably necessary for any nonstationary iterative scheme. In multigrid, relaxation on the lowest levels can take advantage of latency hiding without mangling the algorithm too badly. Parallel multigrid typically uses Jacobi smoothing at parallel interfaces, so changing the update order to do latency hiding isn't too problematic. For the coarser levels, the amount of data to smooth becomes much smaller. While the amount of data to be communicated on coarser levels also drops, the network latency remains fixed. Eventually, it is better to just serialize the coarse mesh solution. The AMG solver in OpenFOAM, for example, just solves the equations with a Krylov solver once the coarse mesh reaches a user-specified size. This coarse grid treatment leads to a latency-induced parallel scaling problem as well. Is there any effort underway to build more latency tolerance sparse equation algorithms? I know of work done to reduce or clump the dot product calculations in Krylov methods by reorganizing computations. But these schemes are careful to retain the same underlying algorithm (though they still suffer stability issues). I haven't seen work done on new *parallel* algorithms that diverge from the classic serial algorithm. The only thing I have seen in this vein are the Schwartz decomposition techniques commonly used to procondition Newton-Krylov solvers. Perhaps there is a Krylov method that builds separate "local" spaces and "global" spaces. The local spaces use on-processor data allowing for fast dot-products. The "global" spaces might use older data that could be communicated while other work is being donw. Or perhaps there is a multigrid scheme that abandons the cycling strategy and instead allows all of the grids to be smoothed simultaneously. That would allow the interior calculations on the finer grids to hide the communication latency on all of the levels. I would think a latency-tolerant parallel linear equation algorithm/solver would be of great value to many projects that do not run on ASCI hardware. Maybe there are other common latency-sensitive algorithms that need similary attention. My CFD background makes me a bit myopic. So, have I missed something or is this area of research as "dead" as it seems? Have we accepts as writ that sparse linear equation solvers will always be suffer latency bottlenecks? There is a whole parallel discussion about memory bandwidth requirements of algorithms and what can be done to improve cache reuse, etc. In some ways, they are both asking the same question. Isn't there a better algorithm to solve sparse linear systems that preserves good convergence but better fits our available hardware? Thanks, Mike
- Previous message: [Beowulf] CCL:Opteron or Nocona ? (fwd from cavallo@chemistry.unina.it)
- Next message: [Beowulf] CCL:Opteron or Nocona ? (fwd from cavallo@chemistry.unina.it)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
