Professional activities
I recently joined Splunk working on Product Analytics. This page will be updated soon with more information.
I was previously a member of the Performance Team at Cloudera. My work involved both internal and competitive performance analysis and optimization. I specialized in performance across multiple components of our big data platform, including Hadoop MapReduce, Impala, HBase, Search, and Hive - someone had to make sure the entire Hadoop ecosystem runs fast together.
I'm a lead author of the Statistical Workload Injector for MapReduce (SWIM), an open source tool for synthesizing and replaying MapReduce production workloads.
SWIM has become a standard MapReduce performance measurement tool used to certify many Cloudera partners.
You can learn more about SWIM in our MASCOTS 2011 and VLDB 2012 papers.
I contributed to the first generation of big data industry standard benchmarks within the Transaction Processing Performance Council (TPC) - TPC-DS 2.0, TPCx-BigBench, TPCx-HS. I also serve as a program committee member and reviewer for various conferences, publications, and NSF funding panels.
I hold a computer science PhD with an MBA minor from UC Berkeley. My dissertation is Workload-Driven Design and Evaluation of Large Scale Data Centric Systems. I worked with Professor Randy Katz at the AMP Lab. My dissertation committee also included Professors Vern Paxson and Ray Larson.
Recent work
- Data Quality: Experiences and Lessons from Operationalizing Big Data. A. Ganapathi, Y. Chen. IEEE Big Data. 2016.
- Scaling SQL-on-Hadoop for BI. Y. Chen, D. Kumar. Strata Hadoop World London. 2015. See also extended version on Cloudera Engineering Blog.
- Five Challenges for Energy Efficient Computing Research. Y. Chen. Invited talk. National Science Foundation Workshop on Sustainable Data Centers. 2015.
- The Truth About MapReduce Performance on SSDs. K. Kambatla, Y. Chen. Strata Hadoop World San Jose. 2015. See also extended version at the Large Installation System Administration Conference (LISA) 2014, and abbreviated version on Cloudera Engineering Blog. (This was a top-ten Cloudera Engineering Blog post of 2014.)
- Underhyped - Big Data as an Advance in the Scientific Method. Y. Chen. Cloudera Vision Blog. 2014.
- Five Pitfalls of Benchmarking Big Data Systems. Y. Chen, G. Shapira. Big Data Spain. 2014. See also talk at LISA 2014, post on Cloudera Engineering Blog, and featured article in IEEE Transactions on Services Computing Jan/Feb 2016.
- On Big Data benchmarks. Interview with Francois Raab and Yanpei Chen. F. Raab, Y. Chen. R. V. Zicari (Ed). ODBMS Industry Watch. 2014. Also appeared as Big Data Benchmarks: Toward Real-Life Use Cases on Cloudera Engineering Blog.
- Fine Tuning a Hadoop Cluster to Increase Performance. A. Acosta, B. Gowda, Y. Chen. Panel, Dell-Intel-Cloudera. Hadoop Summit. 2014. See also post-panel interview.
- Rigorous and Multi-Tenant HBase Performance Measurement. G. Kamat, Y. Chen. Hadoop Summit. 2014. Slides.
- Impala Performance Update: Now Reaching DBMS-Class Speed. J. Erickson, G. Rahn, M. Kornacker, Y. Chen. Cloudera Engineering Blog. 2014. (This was a top-ten Cloudera Engineering Blog post of 2014.)
- From TPC-C to Big Data Benchmarks: A Functional Workload Model. Y. Chen, F. Raab, R. Katz. Lecture Notes in Computer Science, Volume 8163, 2014. Extended proceedings from Workshop on Big Data Benchmarks, 2012.
- Configuring Impala and MapReduce for Multi-tenant Performance. Y. Chen, P. Gokhale, A. Singla. Cloudera Engineering Blog. 2013.
- Interactive Query Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads. Y. Chen, S. Alspaugh, R. Katz. International Conference on Very Large Data Bases (VLDB) 2012.
- We Don't Know Enough to Make a Big Data Benchmark Suite - An Academia-Industry View. Y. Chen. Workshop on Big Data Benchmarking. 2012.
- Understanding TCP Incast and Its Implications for Big Data Workloads. Y. Chen, R. Griffith, D. Zats, A. D. Joseph, R. Katz. USENIX ;login: Magazine. Vol. 37. No. 3. pp. 24-38. June 2012.
- Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis. Y. Chen, S. Alspaugh, D. Borthakur, R. Katz. European Conference on Computer Systems (EuroSys) 2012.
- Challenges and Opportunities for Managing Data Systems Using Statistical Models. Y. Chen, A. Ganapathi, R. Katz. IEEE Data Engineering Bulletin. Vol. 34. No. 4. pp. 53-60. December 2011.
- Hadoop and Performance. T. Lipcon, Y. Chen. Hadoop World. November 2011.
- Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis. Y. Chen, K. Srinivasan, G. Goodson, R. Katz. 23rd ACM Symposium on Operating Systems Principles (SOSP) 2011.
- The Case for Evaluating MapReduce Performance Using Workload Suites. Y. Chen, A. Ganapathi, R. Griffith, R. Katz. 19th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) 2011.
- Integrating Renewable Energy Using Data Analytics Systems: Challenges and Opportunities. A. Krioukov, C. Goebel, S. Alspaugh, Y. Chen, D. Culler, R. Katz. IEEE Data Engineering Bulletin. Vol. 34. No. 1. pp. 3-15. March 2011.
- What's New about Cloud Security. Y. Chen, V. Paxson, R. Katz. UC Berkeley EECS Tech Report. 2010.
Non-technical interests
I am interested in how technology affects society at large.
I believe we computer scientists should participate in relevant discussions and contribute our perspectives.
Half-presentable work from the past:
- Gender Balance in UC Berkeley EECS. Y. Chen & J. Nam. 2007.
Self-started research project. Results released in Spring 2007 to the Chair of EECS and the Diversity Director of the Department. Met with UC Berkeley Chancellor & the Assoc. Vice Provost for Faculty Equity to discuss results. Report forwarded to the Vice Chancellor for Equity and Inclusion, as well as other entities on campus.
Personal stuff
I was born in China, my whole family is in the United States, and I
am technically an Australian. I hold interests unique to each country - I
do Chinese calligraphy, I love college football (Go Bears!!!), and I still
itch to play cricket. I went to an all-guys public high school in
Australia, and apparently I still owe the teachers there some homework and
beer and lost bets. I had a wonderful time as both undergrad and grad student at UC
Berkeley. I'm working now, but I believe one should always be a student and constantly learning. My parents are
chemists/chemical engineers, and taught me how not to have fun with chemicals. They live and work in New York.
I used to play computer games a lot (most recently Civilization 4, Rome Total War, and Starcraft 2),
I once was addicted to Friends, I'm reading the Game of Thrones series and plan to re-read
100 Years of Solitude (didn't fully "get it" the first time but found it awesome),
and I love Clint Eastwood and Al Pacino movies. I play tennis/badminton/table-tennis,
I've made a home-recorded album for classical guitar and am halfway through recording a second one, I hike sporadically, I've stopped doing archery regularly (not enough physical exertion),
and I've lost $8 playing chess against homeless guys.
Recently I've taken up fencing - initially did foil, now transitioned to epee (French grip, old school, I know). I'm currently
focused on trying to lose more stylishly in fencing, but always looking for the next hobby too. I hardly ever refuse to eat chocolate.
The rest to be filled later ...
Past awards
National Science Foundation Graduate Research Fellow
UC Berkeley Regents' and Chancellor's Scholar
UC Berkeley Lipson Humanistic Values Scholar
Premier's Awards in Math, Physics, Chemistry (Victoria, Australia)
Australia Student Prize 2001