I am interested in experimental research on large-scale distributed systems, such as clusters, grids, and clouds, with a particular focus on their performance and reliability.
Big Data Research
- Contributed to the design of a dynamic resource allocation mechanism that balances the allocations of multiple MapReduce framework instances to provide each instance similar levels of quality of service. (ACM SIGMETRICS'14)
- Designed and analyzed various machine learning-based performance models to automatically tune Hadoop MapReduce applications. (MASCOTS'13, Best Paper Award)
- Contributed to the design of a system to deploy multiple MapReduce framework instances to multi-cluster systems. (MTAGS'12 @ SuperComputing, Best Paper Award)
- Designed and assessed several heuristics for energy-efficient scheduling of Hadoop MapReduce workloads on heterogeneous clusters comprising low-power and high-performance processors. (GCM'11 @ ACM/IFIP/USENIX Middleware)
Cloud Computing Research
- Evaluated the variability of the performance delivered by production cloud services using the traces that we collected from Amazon Web Services and Google App Engine. (CCGRID'11)
- Evaluated the performance of four production clouds, Amazon Elastic Compute Cloud (EC2), Mosso, Elastic Hosts, and GoGrid, using scientific computing workloads. (IEEE TPDS, Biggest Impact Award)
- Developed a framework, C-Meter, for running performance tests on IaaS clouds. (Cloud'09)
Grid Computing Research
- Investigated various overload control techniques for multi-cluster computer systems through extensive experiments in the DAS-3 multi-cluster grid in the Netherlands. (GRID'11)
- Investigated the performance of dynamic workflow scheduling policies with both realistic simulations and real system experiments in the DAS-3 multi-cluster grid. (HPDC'10)
- Investigated the impact of static and dynamic overprovisioning techniques on the performance consistency of Bag-of-Tasks workloads. (Grid'10)
- Evaluated the performance and benefits of job runtime and queue wait time predictions in grids using traces gathered from various research and production grid environments. (HPDC'09)
Failure Characterization in Large-Scale Distributed Systems
- Proposed a model for groups of failures (space-correlated failures) in large-scale distributed systems using fifteen traces from the Failure Trace Archive. (EuroPar'10)
- Proposed a model for time-correlated failures in large-scale distributed systems, based on an analysis of the auto-correlation, periodicity, and peaks of failures in nineteen traces from the Failure Trace Archive. (Grid'10)
Incremental Placement of Interactive Perception Applications
- This research was conducted in the context of the SLIPStream project while I was an Intel/CMU Summer Research Fellow at Intel Research Labs, Pittsburgh.
- Designed and implemented algorithms for dynamic and incremental placement of perception applications to minimize the end-to-end latency subject to capacity, placement, and migration cost constraints. (HPDC'11)