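Just start the threads in a loop and pass the matrix (or a pointer to it) as the parameter to the thread function. You shouldn't even need mutexes or semaphores, as long as the main program doesn't touch the matrices once they are passed to the threads (until the threads have been joined) and each thread writes only to its own part of the result.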
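Here is a rough sketch of what that could look like with POSIX threads. I'm assuming pthreads, square matrices stored as flat row-major arrays, and a fixed thread count; `task_t`, `worker`, `multiply_threaded` and `NTHREADS` are names I made up for illustration, not anything from your code. Each thread gets its own range of result rows, so no two threads ever write to the same element and no locking is needed.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4  /* assumed worker count, chosen for illustration */

/* Hypothetical argument bundle: the shared matrices plus the range of
   result rows that this thread alone is responsible for. */
typedef struct {
    const double *a;   /* n x n input, row-major, read-only */
    const double *b;   /* n x n input, row-major, read-only */
    double *c;         /* n x n result, row-major */
    size_t n;
    size_t row_begin;  /* first row owned by this thread */
    size_t row_end;    /* one past the last row owned by this thread */
} task_t;

static void *worker(void *arg)
{
    task_t *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; i++) {
        for (size_t j = 0; j < t->n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < t->n; k++)
                sum += t->a[i * t->n + k] * t->b[k * t->n + j];
            t->c[i * t->n + j] = sum;  /* only this thread writes row i */
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 if a thread could not be created. */
int multiply_threaded(const double *a, const double *b, double *c, size_t n)
{
    pthread_t threads[NTHREADS];
    task_t tasks[NTHREADS];
    size_t started = 0;
    int rc = 0;

    /* Start the threads in a loop, handing each one its row range. */
    for (size_t t = 0; t < NTHREADS; t++) {
        tasks[t].a = a;
        tasks[t].b = b;
        tasks[t].c = c;
        tasks[t].n = n;
        tasks[t].row_begin = t * n / NTHREADS;
        tasks[t].row_end = (t + 1) * n / NTHREADS;
        if (pthread_create(&threads[t], NULL, worker, &tasks[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }

    /* Main does not touch the matrices until every thread has finished. */
    for (size_t t = 0; t < started; t++)
        pthread_join(threads[t], NULL);

    return rc;
}
```

Call it as `multiply_threaded(a, b, c, n)` and build with `-pthread`. The `tasks` array has to stay alive until the joins are done, which is why it lives in the same function as the `pthread_join` loop.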
As for matrix multiplication, I would have to look it up. It has been a while since I studied that.
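For reference, the textbook definition is that entry (i, j) of the product is the sum over k of A[i][k] * B[k][j], where A is n x m and B is m x p. A minimal serial sketch, assuming flat row-major arrays (the function name `multiply_serial` is made up for illustration):

```c
#include <stddef.h>

/* C = A * B for row-major matrices:
   A is n x m, B is m x p, C is n x p. */
void multiply_serial(const double *a, const double *b, double *c,
                     size_t n, size_t m, size_t p)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < p; j++) {
            double sum = 0.0;
            /* Dot product of row i of A with column j of B. */
            for (size_t k = 0; k < m; k++)
                sum += a[i * m + k] * b[k * p + j];
            c[i * p + j] = sum;
        }
    }
}
```

Every entry of the result depends only on the inputs, never on another entry of the result, which is what makes it safe to hand different rows to different threads.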