MPI & Parallelization
HyQMOM.jl is designed from the ground up for parallel execution using MPI (Message Passing Interface). This guide covers everything you need to know about running parallel simulations.
Overview
HyQMOM.jl uses domain decomposition parallelization optimized for the 35-moment kinetic system:
- xy-plane decomposition: The computational domain is divided among MPI ranks in the x-y plane
- z-direction replication: Each rank holds the complete z-dimension data
- Moment-aware halo exchange: All 35 moments synchronized between neighboring ranks
- Realizability-preserving communication: Ensures moment realizability across processor boundaries
- Rank 0 visualization: Interactive visualization runs only on the master rank (rank 0)
- Load balancing: Automatic distribution of grid points for optimal performance
Quick Start
Basic Parallel Execution
# Serial execution (1 process)
julia --project=. examples/run_3d_jets_timeseries.jl --Nx 40 --Ny 40
# Parallel execution (4 processes)
mpiexec -n 4 julia --project=. examples/run_3d_jets_timeseries.jl --Nx 100 --Ny 100
# High-resolution parallel run
mpiexec -n 16 julia --project=. examples/run_3d_jets_timeseries.jl --Nx 120 --Ny 120 --Nz 60Automatic Detection
All HyQMOM.jl examples automatically detect whether they're running in serial or parallel mode:
- Serial mode:
MPI.Comm_size() == 1 - Parallel mode:
MPI.Comm_size() > 1
No code changes are needed to switch between serial and parallel execution.
Domain Decomposition
Spatial Decomposition Strategy
The computational domain is divided as follows:
Original domain: Nx × Ny × Nz
With P MPI ranks: Each rank gets (Nx/√P) × (Ny/√P) × Nz
Example with 4 ranks (2×2 decomposition):
Rank 0: [1:Nx/2, 1:Ny/2, 1:Nz]
Rank 1: [Nx/2+1:Nx, 1:Ny/2, 1:Nz]
Rank 2: [1:Nx/2, Ny/2+1:Ny, 1:Nz]
Rank 3: [Nx/2+1:Nx, Ny/2+1:Ny, 1:Nz]Ghost Cells and Halo Exchange
Each rank maintains ghost cells (halo regions) that store data from neighboring ranks:
- Purpose: Enable finite difference stencils across rank boundaries
- Width: Typically 2-3 cells depending on the numerical scheme
- Synchronization: Automatic exchange after each time step
Load Balancing
For optimal performance, choose the number of MPI ranks such that:
- Perfect square decomposition: Use ranks = 1, 4, 9, 16, 25, ... for best load balancing
- Sufficient work per rank: Each rank should have at least 10×10 grid points
- Memory constraints: More ranks = less memory per rank
Performance Guidelines
Choosing the Number of Ranks
Good choices:
# 1 rank: Serial execution
julia --project=. examples/run_3d_jets_timeseries.jl --Nx 40 --Ny 40
# 4 ranks: 2×2 decomposition
mpiexec -n 4 julia --project=. examples/run_3d_jets_timeseries.jl --Nx 80 --Ny 80
# 9 ranks: 3×3 decomposition
mpiexec -n 9 julia --project=. examples/run_3d_jets_timeseries.jl --Nx 120 --Ny 120
# 16 ranks: 4×4 decomposition
mpiexec -n 16 julia --project=. examples/run_3d_jets_timeseries.jl --Nx 160 --Ny 160Avoid:
# Non-square numbers (poor load balancing)
mpiexec -n 6 julia ... # 6 doesn't factor nicely
mpiexec -n 10 julia ... # 10 doesn't factor nicely
# Too many ranks for resolution (communication overhead)
mpiexec -n 16 julia ... --Nx 40 --Ny 40 # Only 10×10 per rank (too small!)Scaling Guidelines
| Resolution | Recommended Ranks | Points per Rank | Use Case |
|---|---|---|---|
| 40×40×40 | 1-4 | 40×40 to 20×20 | Development |
| 80×80×80 | 4-9 | 20×20 to 13×13 | Production |
| 120×120×120 | 9-16 | 13×13 to 10×10 | High-res |
| 200×200×200 | 16-25 | 12×12 to 10×10 | HPC |
Memory Scaling
Memory usage per rank decreases approximately as:
Memory per rank ≈ Total memory / Number of ranksUse more ranks to fit larger problems in available memory:
# High memory usage (single rank)
julia ... --Nx 200 --Ny 200 --Nz 200 # 200×200×200 points on 1 rank
# Distributed memory (16 ranks)
mpiexec -n 16 julia ... --Nx 200 --Ny 200 --Nz 200 # 50×50×200 points per rankMPI Configuration
Installation and Setup
Ubuntu/Debian:
sudo apt-get install mpich libmpich-dev
# or
sudo apt-get install openmpi-bin libopenmpi-devmacOS (Homebrew):
brew install mpich
# or
brew install open-mpiVerify installation:
mpiexec --version
julia -e 'using MPI; println(MPI.versioninfo())'Julia MPI Configuration
HyQMOM.jl uses MPI.jl, which should automatically detect your system MPI:
using MPI
MPI.versioninfo() # Check detected MPI implementationIf you encounter issues, you may need to rebuild MPI.jl:
julia -e 'using Pkg; Pkg.build("MPI")'Visualization in Parallel
Automatic Rank 0 Visualization
Interactive visualization automatically runs only on rank 0:
if MPI.Comm_rank(MPI.COMM_WORLD) == 0
# Launch GLMakie viewer
interactive_3d_timeseries_streaming(filename, grid, params)
endBenefits:
- No display requirements on compute nodes
- Reduced memory usage on worker ranks
- Centralized output file management
Headless Parallel Execution
For HPC systems without display capabilities:
# Method 1: Environment variable
export HYQMOM_SKIP_PLOTTING=true
mpiexec -n 16 julia --project=. examples/run_3d_jets_timeseries.jl
# Method 2: Command-line flag
mpiexec -n 16 julia --project=. examples/run_3d_jets_timeseries.jl --no-viz true
# Method 3: CI detection (automatic)
export CI=true
mpiexec -n 16 julia --project=. examples/run_3d_jets_timeseries.jlPerformance Scaling
Memory and Communication Scaling
Each processor stores:
- 35 moments × local grid points × 8 bytes/float64
- Halo cells: Additional boundary data for neighbor communication
- Temporary arrays: For flux computation and realizability checking
Memory estimate: ~280 bytes × (Nx/√P) × (Ny/√P) × Nz per processor, where P is the number of processors.
Communication Patterns:
Halo Exchange Frequency:
- Every time step for spatial flux computation
- After realizability correction to maintain consistency
- During visualization data collection (rank 0 only)
Communication Volume:
- Boundary faces: 35 moments × 2 × (Nx/√P + Ny/√P) × Nz per time step
- Corner exchanges: 35 moments × 4 corners for diagonal neighbors
Scaling Efficiency
Strong scaling (fixed problem size, more processors):
- Ideal range: 4-64 processors for typical problems
- Communication overhead: Becomes significant with >100 processors
- Sweet spot: 10,000-100,000 grid points per processor
Weak scaling (proportional problem size increase):
- Linear scaling: Up to 1000+ processors for large problems
- Memory bandwidth: Often the limiting factor on modern HPC systems
HPC and Batch Systems
SLURM Example
#!/bin/bash
#SBATCH --job-name=hyqmom
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=02:00:00
#SBATCH --mem=32G
module load julia/1.10
module load openmpi/4.1
export HYQMOM_SKIP_PLOTTING=true
export JULIA_NUM_THREADS=1
mpiexec -n 16 julia --project=. examples/run_3d_jets_timeseries.jl \
--Nx 200 --Ny 200 --Nz 100 --tmax 1.0 --snapshot-interval 5PBS Example
#!/bin/bash
#PBS -N hyqmom_job
#PBS -l nodes=4:ppn=4
#PBS -l walltime=02:00:00
#PBS -l mem=64gb
cd $PBS_O_WORKDIR
module load julia openmpi
export HYQMOM_SKIP_PLOTTING=true
mpiexec -n 16 julia --project=. examples/run_3d_jets_timeseries.jl \
--Nx 160 --Ny 160 --tmax 0.5 --no-viz trueDebugging Parallel Runs
Common MPI Issues
MPI not found:
# Check MPI installation
which mpiexec
mpiexec --version
# Rebuild MPI.jl if needed
julia -e 'using Pkg; Pkg.build("MPI")'Rank mismatch errors:
# Ensure consistent Julia environments across nodes
julia --project=. -e 'using Pkg; Pkg.status()'
# Check MPI configuration
julia -e 'using MPI; MPI.Init(); println("Rank: ", MPI.Comm_rank(MPI.COMM_WORLD))'Communication timeouts:
# Try fewer ranks
mpiexec -n 2 julia ...
# Check network connectivity between nodes
mpiexec -n 4 hostnameDebugging Tools
Print rank information:
using MPI
MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
size = MPI.Comm_size(MPI.COMM_WORLD)
println("Rank $rank of $size on $(gethostname())")Profile parallel performance:
# Use Julia's built-in profiler on each rank
julia --project=. -e '
using Profile
@profile include("examples/run_3d_jets_timeseries.jl")
Profile.print()
'Advanced Topics
Custom Domain Decomposition
For non-standard domain shapes or specialized hardware, you can modify the decomposition in src/mpi/domain_decomposition.jl.
Communication Optimization
- Overlap computation and communication: Use non-blocking MPI calls
- Minimize synchronization points: Reduce
MPI.Barrier()calls - Optimize data layout: Ensure contiguous memory access patterns
Hybrid Parallelization
Combine MPI with Julia threading:
export JULIA_NUM_THREADS=4
mpiexec -n 4 julia --project=. examples/run_3d_jets_timeseries.jl
# Total parallelism: 4 MPI ranks × 4 threads = 16 parallel workersPerformance Monitoring
Timing Analysis
# Built into examples - check output for timing information
julia --project=. examples/run_3d_jets_timeseries.jl --Nx 80 --Ny 80
# Look for output like:
# "Simulation completed in 45.2 seconds"
# "Time per step: 0.12 seconds"
# "Communication overhead: 5.3%"Memory Usage
# Monitor memory during execution
mpiexec -n 4 julia --project=. examples/run_3d_jets_timeseries.jl &
watch -n 1 'ps aux | grep julia'Scaling Studies
Test parallel efficiency:
# Weak scaling (constant work per rank - 40×40 each)
mpiexec -n 1 julia ... --Nx 40 --Ny 40 # 40×40 total (1 rank)
mpiexec -n 4 julia ... --Nx 80 --Ny 80 # 40×40 per rank (2×2 decomp)
mpiexec -n 9 julia ... --Nx 120 --Ny 120 # 40×40 per rank (3×3 decomp)
# Strong scaling (constant total work - 120×120 total)
mpiexec -n 1 julia ... --Nx 120 --Ny 120 # 120×120 on 1 rank
mpiexec -n 4 julia ... --Nx 120 --Ny 120 # 60×60 per rank (2×2 decomp)
mpiexec -n 9 julia ... --Nx 120 --Ny 120 # 40×40 per rank (3×3 decomp)