This paper applies recently developed cross-sectional and longitudinal propensity score matching estimators to data from the National Supported Work Demonstration that have been previously analyzed by LaLonde (1986) and Dehejia and Wahba (1998, 1999). We find little support for recent claims in the econometrics and statistics literatures that traditional, cross-sectional matching estimators generally provide a reliable method of evaluating social experiments (e.g., Dehejia and Wahba, 1998, 1999). Our results show that program impact estimates generated through propensity score matching are highly sensitive to the choice of variables used in estimating the propensity scores and to the choice of analysis sample. Among the estimators we study, the difference-in-differences matching estimator is the most robust. We attribute its better performance to the fact that it eliminates temporally invariant sources of bias, which may arise, for example, when program participants and nonparticipants are geographically mismatched or when their outcomes are measured with different survey questionnaires. Both are common sources of bias in evaluation studies.
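The mechanism behind the robustness of the difference-in-differences matching estimator can be illustrated with a minimal sketch. This is not the estimator implementation used in the paper: it assumes the propensity scores are already estimated, uses simple one-to-one nearest-neighbor matching with replacement, and runs on invented toy numbers. All function names, variable names, and values below are hypothetical. Participants' earnings carry a constant gap (for example, from a different local labor market) in both periods; cross-sectional matching absorbs that gap into the impact estimate, while differencing each unit's earnings over time removes it.

```python
import numpy as np


def nearest_neighbor_match(p_treated, p_comparison):
    """Index of the comparison unit whose propensity score is closest to each treated unit."""
    dist = np.abs(p_treated[:, None] - p_comparison[None, :])
    return dist.argmin(axis=1)


def cross_sectional_matching(y_post_t, y_post_c, match_idx):
    """Mean post-program outcome gap between treated units and their matches."""
    return float(np.mean(y_post_t - y_post_c[match_idx]))


def did_matching(y_pre_t, y_post_t, y_pre_c, y_post_c, match_idx):
    """Mean gap in before-after outcome changes between treated units and their matches."""
    change_t = y_post_t - y_pre_t
    change_c = y_post_c[match_idx] - y_pre_c[match_idx]
    return float(np.mean(change_t - change_c))


def baseline_earnings(p):
    """Hypothetical earnings level as a function of the propensity score."""
    return 5000.0 + 2000.0 * p


# Toy inputs (assumed, not taken from the NSW data).
p_t = np.array([0.20, 0.50, 0.80])              # treated propensity scores
p_c = np.array([0.10, 0.21, 0.49, 0.82, 0.95])  # comparison propensity scores

true_effect = 1000.0   # program impact built into the toy data
fixed_bias = -500.0    # time-invariant gap affecting treated units in both periods
common_growth = 300.0  # earnings growth shared by everyone

y_pre_c = baseline_earnings(p_c)
y_post_c = baseline_earnings(p_c) + common_growth
y_pre_t = baseline_earnings(p_t) + fixed_bias
y_post_t = baseline_earnings(p_t) + common_growth + fixed_bias + true_effect

match_idx = nearest_neighbor_match(p_t, p_c)
cs_est = cross_sectional_matching(y_post_t, y_post_c, match_idx)
did_est = did_matching(y_pre_t, y_post_t, y_pre_c, y_post_c, match_idx)

print("cross-sectional matching estimate:", cs_est)
print("difference-in-differences matching estimate:", did_est)
```

In this toy setup the cross-sectional estimate is pulled away from the built-in effect by roughly the size of the fixed gap, whereas the difference-in-differences estimate recovers it, because the gap appears in both the pre- and post-program outcomes and cancels. Biases that vary over time would not cancel in this way.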