This is how I do it. Going from window to viewport coordinates is an affine transform (a scale plus a translation) of this kind (dunno if I can post LaTeX code here, it would look way better):
| Xs |   | eM11  0    | | X |   | eDx |
| Ys | = | 0     eM22 | | Y | + | eDy |
i.e. Xs = eM11 * X + eDx and Ys = eM22 * Y + eDy.
(Xs,Ys) are screen coordinates, (X,Y) are window/world coordinates.
This transform represents scaling in x (given by eM11), scaling in y (given by eM22) and translation (given by eDx and eDy). Scaling = zoom (eM11 and eM22 will always have the same value), translation = pan.
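To make the transform concrete, here is a minimal sketch of applying it to one point. Transform2D, Point2D and WorldToScreen are names I made up for illustration; only the field names (eM11, eM22, eDx, eDy) come from the description above.

```cpp
// Hypothetical stand-in for the transform structure; only the field
// names are taken from the post.
struct Transform2D {
    float eM11; // scale in x (zoom)
    float eM22; // scale in y (zoom, same value as eM11)
    float eDx;  // translation in x, in screen units (pan)
    float eDy;  // translation in y, in screen units (pan)
};

// Hypothetical helper type for a 2D point.
struct Point2D {
    float x;
    float y;
};

// Xs = eM11 * X + eDx, Ys = eM22 * Y + eDy
Point2D WorldToScreen(const Transform2D& t, Point2D world) {
    return { t.eM11 * world.x + t.eDx, t.eM22 * world.y + t.eDy };
}
```

With a zoom of 0.5 and a translation of (100, 50), the world point (20, 40) lands at screen position (110, 70).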
So, when you zoom, and assuming the structure which represents the transform is called m_TWorld2Screen, you have to update eM11 and eM22, but also eDx and eDy. The translation has to be rescaled about the viewport centre, otherwise the point at the centre of the view drifts when the zoom changes:
m_TWorld2Screen.eM11 = newVal;
m_TWorld2Screen.eM22 = newVal;
m_TWorld2Screen.eDx = (m_TWorld2Screen.eDx - m_iWidth/2.0f) * newVal / oldZoomFactor + m_iWidth/2.0f;
m_TWorld2Screen.eDy = (m_TWorld2Screen.eDy - m_iHeight/2.0f) * newVal / oldZoomFactor + m_iHeight/2.0f;
- newVal is the new zoom factor (between 0 and 1)
- m_iWidth and m_iHeight are the viewport's dimensions
- oldZoomFactor is the previous zoom factor
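The zoom update can be wrapped in one function, roughly like this. It is only a sketch: Transform2D and SetZoom are hypothetical names, and I read the previous zoom factor back from eM11 instead of passing oldZoomFactor separately, which assumes eM11 still holds the old value when the function is called.

```cpp
#include <cassert>

// Hypothetical stand-in for the transform structure (field names from the post).
struct Transform2D { float eM11, eM22, eDx, eDy; };

// Change the zoom while keeping the world point under the viewport
// centre at the same screen position.
void SetZoom(Transform2D& t, float newZoom,
             float viewportWidth, float viewportHeight) {
    assert(t.eM11 != 0.0f && newZoom != 0.0f);
    const float oldZoom = t.eM11;          // eM11 == eM22 by convention
    const float ratio = newZoom / oldZoom;
    const float cx = viewportWidth / 2.0f;
    const float cy = viewportHeight / 2.0f;

    // Rescale the translation about the viewport centre.
    t.eDx = (t.eDx - cx) * ratio + cx;
    t.eDy = (t.eDy - cy) * ratio + cy;

    // Only now overwrite the scale factors.
    t.eM11 = newZoom;
    t.eM22 = newZoom;
}
```

Quick check by hand: with an 800x600 viewport, zoom 0.5 and translation (100, 50), the world point under the centre is (600, 500). After SetZoom(t, 0.25f, 800.0f, 600.0f) the translation becomes (250, 175), and 0.25 * 600 + 250 = 400, so that point is still at the centre.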
And when you pan, you only need to update eDx and eDy:
m_TWorld2Screen.eDx = - m_TWorld2Screen.eDx / m_TWorld2Screen.eM11;
m_TWorld2Screen.eDy = - m_TWorld2Screen.eDy / m_TWorld2Screen.eM22;
The proof of these formulas is quite tedious to write without LaTeX, but this works for me.
You might need to define m_TScreen2World to do the inverse transform. If anything about the approach is unclear, ask away.
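For the inverse transform, solving Xs = eM11 * X + eDx for X gives X = Xs / eM11 - eDx / eM11, and the same for Y. A minimal sketch of building the inverse from the forward transform follows; Transform2D and Invert are hypothetical names, not part of any API.

```cpp
#include <cassert>

// Hypothetical stand-in for the transform structure (field names from the post).
struct Transform2D { float eM11, eM22, eDx, eDy; };

// Build the screen-to-world transform from the world-to-screen one.
// The scale is inverted and the translation becomes -eDx/eM11, -eDy/eM22.
Transform2D Invert(const Transform2D& t) {
    assert(t.eM11 != 0.0f && t.eM22 != 0.0f); // a zero zoom cannot be inverted
    return { 1.0f / t.eM11, 1.0f / t.eM22,
             -t.eDx / t.eM11, -t.eDy / t.eM22 };
}
```

For a forward transform with zoom 0.5 and translation (100, 50), the inverse has scale 2 and translation (-200, -100), so the screen point (110, 70) maps back to the world point (20, 40). The inverse has to be rebuilt whenever the forward transform changes.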