Scheduling Data-Intensive Workflows onto Storage-Constrained Distributed Resources

In this paper we examine the issue of optimizing disk
usage and of scheduling large-scale scientific workflows
onto distributed resources where the workflows are data-intensive,
requiring large amounts of data storage, and
where the resources have limited storage capacity. Our
approach is two-fold: we minimize the amount of space a
workflow requires during execution by removing data files
at runtime when they are no longer required, and we schedule
the workflows in a way that ensures that the amount of
data required and generated by the workflow fits onto the
individual resources. For a workflow used by gravitational-wave
physicists, we were able to reduce the amount of
storage required by the workflow by up to 57%. We also
designed an algorithm that can not only find feasible solutions
for workflow task assignment to resources in disk-space-constrained
environments, but can also improve the
overall workflow performance.
Published in the Seventh IEEE International Symposium on Cluster Computing and the Grid — CCGrid 2007.
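
The first part of the approach, removing files at runtime once no remaining task needs them, can be illustrated with a minimal sketch. It is not the algorithm from the paper: the three-task workflow, the file sizes, and the function name peak_storage are all hypothetical, and tasks are assumed to run one at a time on a single resource. The sketch compares peak disk usage with and without cleanup.

```python
from typing import Dict, List, Tuple

# Hypothetical toy workflow: task name -> (input files, output files).
Workflow = Dict[str, Tuple[List[str], List[str]]]

def peak_storage(order: List[str], tasks: Workflow,
                 sizes: Dict[str, int], cleanup: bool) -> int:
    """Peak disk usage when tasks run sequentially in `order`."""
    # Index of the last task that reads each file.
    last_use: Dict[str, int] = {}
    for i, name in enumerate(order):
        for f in tasks[name][0]:
            last_use[f] = i

    on_disk = set()
    peak = 0
    for i, name in enumerate(order):
        inputs, outputs = tasks[name]
        # Inputs must be present and outputs are written while the task runs.
        on_disk.update(inputs)
        on_disk.update(outputs)
        peak = max(peak, sum(sizes[f] for f in on_disk))
        if cleanup:
            # Drop files whose last consumer just finished. Files that no
            # task reads (final results) are never removed.
            on_disk = {f for f in on_disk if last_use.get(f) != i}
    return peak

# Illustrative data only (sizes in arbitrary units).
tasks: Workflow = {
    "t1": (["a"], ["b"]),
    "t2": (["b"], ["c"]),
    "t3": (["c"], ["d"]),
}
sizes = {"a": 10, "b": 20, "c": 30, "d": 5}
order = ["t1", "t2", "t3"]

without_cleanup = peak_storage(order, tasks, sizes, cleanup=False)
with_cleanup = peak_storage(order, tasks, sizes, cleanup=True)
print(without_cleanup, with_cleanup)
```

In this toy case the peak footprint drops because each intermediate file is deleted as soon as its only consumer completes. A scheduler could then compare such a footprint against a resource's available disk space when deciding whether a set of tasks fits there.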
