# MAS3915: Statistical Inference & Big Data Analytics

• Offered for Year: 2021/22
• Module Leader(s): Dr Dennis Prangle
• Lecturer: Dr Pete Philipson
• Owning School: Mathematics, Statistics and Physics
• Teaching Location: Newcastle City Campus
##### Semesters
• Semester 1 Credit Value: 10
• Semester 2 Credit Value: 10
• ECTS Credits: 10

#### Aims

To gain an understanding of some of the principles of statistical inference and associated results in probability. This will deepen understanding of the fundamental precepts of inference and facilitate the assimilation of more advanced practical methodology, especially for the case when there are multidimensional parameters.

To develop an understanding of the statistical theory underpinning methods and models for the analysis of “big” and, in particular, multivariate data. To gain experience in the application of this theory to a large data set.

#### Module Summary

The course builds on the foundations of inference laid in MAS2901. A variety of inference methods for models with multiple parameters is established, including asymptotic methods for large samples, exact methods and computer-intensive approaches.

More data than ever before are being generated and stored in fields such as healthcare and e-commerce, and the term “big data” has emerged in acknowledgement of the vast amounts now available. By applying statistical analyses to these data sets, we can start to use them to answer important questions, for example, which genetic markers are associated with the incidence of a particular disease. Commonly the data sets that arise are multivariate, comprising a large number of observations on many variables.

In this module we study how we can learn from data sets of this form. We begin by considering their representation in R, and techniques for generating numerical and graphical summaries. We then turn to more formal techniques, often branded “unsupervised learning”, intended to summarise the relationships between variables or observations. Finally, we consider a collection of inferential procedures, so-called “supervised learning” techniques, where the goal is to predict a categorical or quantitative response variable on the basis of a collection of covariates. For a quantitative response, we study linear regression, focusing on overcoming the problems that arise when confronted with a very large number of covariates.

#### Outline Of Syllabus

The multivariate Normal distribution and its principal properties, especially as they relate to asymptotic likelihood methods. Maximum likelihood for multi-parameter models, including asymptotic methods for interval estimation and hypothesis testing. Revision of the idea of sufficiency and the factorisation theorem, with application to Cramér-Rao lower bounds and the Rao-Blackwell theorem. The bootstrap and its use to compute standard errors and interval estimates.

Introduction to big data, particularly multivariate data, data summaries and use of R data frames. Principal components and cluster analysis. Classification methods using discriminant analysis; use of cross-validation. Methods based on linear regression, including variable selection methods; shrinkage using ridge regression, the lasso and the elastic net; dimension reduction using principal components regression and partial least squares.

#### Teaching Methods

##### Teaching Activities
| Category | Activity | Number | Length | Student Hours | Comment |
|---|---|---|---|---|---|
| Scheduled Learning And Teaching Activities | Lecture | 9 | 1:00 | 9:00 | Present in Person |
| Scheduled Learning And Teaching Activities | Lecture | 9 | 1:00 | 9:00 | Synchronous On-Line Material |
| Structured Guided Learning | Lecture materials | 36 | 1:00 | 36:00 | Non-Synchronous Activities |
| Guided Independent Study | Assessment preparation and completion | 30 | 1:00 | 30:00 | Completion of in course assessment |
| Structured Guided Learning | Structured non-synchronous discussion | 18 | 1:00 | 18:00 | Non Synchronous Discussion of Lecture Material |
| Scheduled Learning And Teaching Activities | Drop-in/surgery | 4 | 1:00 | 4:00 | Office Hour or Discussion Board Activity |
| Guided Independent Study | Independent study | 94 | 1:00 | 94:00 | Lecture preparation, background reading, coursework review |
| Total | | | | 200:00 | |
##### Teaching Rationale And Relationship

Non-synchronous online materials are used for the delivery of theory and explanation of methods, illustrated with examples, and for giving general feedback on assessed work. Present-in-person and synchronous online sessions are used to help students develop their ability to apply the theory to solve problems, to identify and resolve specific queries raised by students, and to allow students to receive individual feedback on marked work. Students who cannot attend a present-in-person session will be provided with an alternative activity allowing them to access the learning outcomes of that session. In addition, office hours and discussion board activity will provide an opportunity for more direct contact between individual students and the lecturer: a typical student might spend a total of one or two hours on this over the course of the module, either individually or as part of a group.
Alternatives will be offered to students unable to be present in person due to the prevailing Covid-19 circumstances.
Students should consult their individual timetable for up-to-date delivery information.

#### Assessment Methods

The format of resits will be determined by the Board of Examiners.

##### Exams
| Description | Length | Semester | When Set | Percentage | Comment |
|---|---|---|---|---|---|
| Written Examination | 120 | 2 | A | 80 | Alternative assessment - class test |
##### Other Assessment
| Description | Semester | When Set | Percentage | Comment |
|---|---|---|---|---|
| Written exercise | 1 | M | 10 | written exercises |
| Written exercise | 2 | M | 10 | written exercises |
##### Assessment Rationale And Relationship

A substantial formal examination is appropriate for the assessment of the material in this module. The course assessments will allow the students to develop their problem-solving techniques, to practise the methods learnt in the module, to assess their progress and to receive feedback; these assessments have a secondary formative purpose as well as their primary summative purpose.