About this workshop:
This workshop is sponsored by the NSF's XSEDE (Extreme Science and Engineering Discovery Environment, https://www.xsede.org/) program. Staff members from the Texas Advanced Computing Center (https://www.tacc.utexas.edu/) will teach the workshop. The workshop is organized as four separate sessions covering various topics in big data analysis. Although participants are strongly encouraged to attend all sessions, the workshop is designed so that participants may attend only selected sessions based on their background, schedule, and needs.
Ruizhu Huang is a research associate in the data intensive computing group at TACC. He has years of experience in big data analytics, machine learning, and data visualization. He has been involved in various projects developing technologies that bridge the gap between traditional machine learning approaches and next-generation, data intensive computing methods involving High-Performance Computing (HPC) resources.
Amit Gupta is a Research Engineering/Scientist Associate III in the Data Mining and Statistics group at TACC. His research interests are in distributed systems and tools for scaling big data applications on HPC infrastructure, parallel programming, and information retrieval systems for text. He has extensive experience with applications ranging from scaling transportation simulations to text mining of biological literature. He earned an MS in Computer Science from the University of Colorado at Boulder, with thesis research in the area of operating systems.
Dr. Weijia Xu is a research scientist and manager of the Data Mining and Statistics group at TACC. He received his Ph.D. in Computer Science from The University of Texas at Austin. Dr. Xu has over 50 peer-reviewed conference and journal publications in similarity-based data retrieval, data analysis, and information visualization with data from various scientific domains. He has served on program committees for several workshops and conferences in the big data and high-performance computing areas.
Part One: Introduction to Hadoop and Spark [register here]
- basic concepts of the MapReduce programming model
- major components of a Hadoop cluster
- how to get started with Hadoop on your own computer and with computing resources at TACC
- an introduction to the Spark programming model and how Spark can work with a Hadoop cluster
- different ways to use Hadoop and Spark for analysis
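The MapReduce model covered in this session can be sketched in plain Python without a Hadoop cluster. The three phases below mirror what the framework does at scale; the function names and sample input are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big compute", "data analysis"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # → {'big': 2, 'data': 2, 'compute': 1, 'analysis': 1}
```

On a real cluster each phase runs in parallel across many nodes and the shuffle moves data over the network, but the data flow is exactly this pipeline.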
- a review of the Spark programming model
- basic introduction to the Scala programming language
- how to run a Spark application
- key features for building scalable applications
- how to get started development using Spark after the class
- running batch jobs with different cluster deployment modes
- running interactive jobs
- exploring existing libraries and applications, including Hadoop Streaming, MLlib, Spark SQL, and GraphX
- using Hadoop/Spark with R and Python
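Hadoop Streaming, listed above, lets any executables that read stdin and write stdout serve as the mapper and reducer. The word-count sketch below shows that contract in Python; the jar path in the comment and the sample input are illustrative only:

```python
from itertools import groupby

def mapper(lines):
    # Mapper: emit one "word<TAB>1" record per word, the Hadoop Streaming convention.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_records):
    # Reducer: records arrive sorted by key, so counts for each word are adjacent
    # and can be summed with a single pass using groupby.
    keyed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # On a cluster these would run as separate processes over stdin/stdout, e.g.:
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
    # Here the framework's sort step is simulated with sorted().
    records = sorted(mapper(["spark and hadoop", "spark at scale"]))
    for out in reducer(records):
        print(out)
```

Because the interface is just text on stdin/stdout, the same scripts can be tested locally with a shell pipeline before submitting to the cluster.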