EBA REPORT ON BIG DATA AND ADVANCED ANALYTICS
JANUARY 2020
EBA/REP/2020/01

Contents
Abbreviations
Executive summary
Background
1. Introduction
1.1 Key terms
1.2 Types of advanced analytics
1.3 Machine-learning modes
2. Current landscape
2.1 Current observations
2.2 Current application areas of BD&AA

Executive summary

More complex models can bring better accuracy and performance but give rise to explainability and interpretability issues. Other issues such as accountability, ethical aspects and data quality need to be addressed to ensure responsible use of BD&AA.

Institutions see potential in the use of advanced analytics techniques, such as ML, on very large, diverse datasets from different sources and of different sizes. Figure 0.2 shows that institutions are using BD&AA.

1.1 Key terms

Advanced analytics
Advanced analytics is often based on ML and is used to discover deeper insights, make predictions or generate recommendations. Advanced analytics techniques include data/text mining, machine learning, pattern matching, forecasting, visualization, semantic analysis, sentiment analysis, network and cluster analysis, multivariate statistics, graph analysis, simulation, complex event processing and neural networks13.

Data science
Data science is an interdisciplinary field involving extracting information and insights from data available in both structured and unstructured forms, similar to data mining. However, unlike data mining, data science includes all steps associated with the cleaning, preparation and analysis of the data. Data science combines a large set of methods and techniques encompassing programming, mathematics, statistics, data mining and ML. Advanced analytics is a form of data science often using ML.

Artificial intelligence
The independent High-Level Expert Group on AI set up by the European Commission has recently proposed the following updated definition of AI14, which has been adopted for the purposes of this report:

Artificial intelligence (AI) systems are software (and possibly also hardware) systems designed by humans that, given a complex goal, act in the physical or digital dimension by perceiving their environment through data acquisition, interpreting the collected structured or unstructured data, reasoning on the knowledge, or processing the information, derived from these data and deciding the best action(s) to take to achieve the given goal. AI systems can either use symbolic rules or learn a numeric model, and they can also adapt their behaviour by analysing how the environment is affected by their previous actions. As a scientific discipline, AI includes several approaches and techniques, such as machine learning (of which deep learning and reinforcement learning are specific examples), machine reasoning (which includes planning, scheduling, knowledge representation and reasoning, search and optimisation) and robotics (which includes control, perception, sensors and actuators, as well as the integration of all other techniques into cyber-physical systems).

Currently, many AI applications, particularly in the financial sector, are augmented intelligence solutions, i.e. solutions focusing on a limited number of intelligent tasks and used to support humans in the decision-making process.

14 ec.europa.eu/futurium/en/ai-alliance-consultation/guidelines#Top
Machine learning
The standard on IT governance ISO/IEC 38505-1:2017 defines ML as a process using algorithms rather than procedural coding that enables learning from existing data in order to predict future outcomes. ML is one of the most prominent AI technologies at the moment, often used in advanced analytics due to its ability to deliver enhanced predictive capabilities. ML comes in several modes, and the main ones are described in Section 1.3.

1.2 Types of advanced analytics

Advanced analytics techniques extend beyond basic descriptive techniques and can be categorised under four headings:

Diagnostic analytics: this is a sophisticated form of backward-looking data analytics that seeks to understand not just what happened but why it happened. This technique uses advanced data analytics to identify anomalies based on descriptive analytics. It drills into the data to discover the cause of the anomaly, using inferential statistics combined with other data sources to identify hidden associations and causal relationships.

Predictive analytics: this forward-looking technique aims to support the business in predicting what could happen by analysing backward-looking data. This involves the use of advanced data mining and statistical techniques such as ML. The goal is to improve the accuracy of predicting a future event.

Prescriptive analytics: this technique combines both backward- and forward-looking analytical techniques to suggest an optimal solution based on the data available at a given point in time. Prescriptive analytics uses complex statistical and AI techniques to allow flexibility to model different business outcomes based on future risks and scenarios, so that the impact of the decision on the business can be optimised.

Autonomous and adaptive analytics: this technique is the most complex and uses forward-looking predictive analytics models that automatically learn from transactions and update results in real time using ML. This includes the ability to self-generate new algorithmic models with suggested insights for future tasks, based on correlations and patterns in the data that the system has identified and on growing volumes of Big Data.

1.3 Machine-learning modes

As mentioned in Section 1.1, ML is a subcategory of AI that uses algorithms able to recognise patterns in large amounts of data via a learning process in order to make predictions based on similar data. For this reason, ML is very often used in predictive analytics solutions. The learning is done by means of suitable algorithms, which are used to create predictive models representing what the algorithm has learnt from the data in order to solve the particular problem. Their performance improves as more data are available to learn from (to train the model). ML algorithms can be grouped based on the learning mode.

In supervised learning, the algorithm learns from a set of training data (observations) that have labels (e.g. a dataset composed of past transactions with a label indicating whether or not the transaction is fraudulent). The algorithm will learn a general rule for the classification (the model), which will then be used to predict the labels when new data are analysed (e.g. data on new transactions).
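To make the supervised mode concrete, the following minimal sketch (an editorial illustration, not taken from the report) trains a classifier on synthetic labelled transactions using Python and scikit-learn, tools that banks mention later in this report; all feature names, data and parameters are invented for the example.

```python
# Minimal, illustrative sketch only: supervised learning on synthetic labelled
# transactions (fraudulent or not). Feature names and data are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Synthetic "past transactions": amount, hour of day, merchant risk score.
n = 5_000
X = np.column_stack([
    rng.lognormal(mean=3.5, sigma=1.0, size=n),  # transaction amount
    rng.integers(0, 24, size=n),                 # hour of day
    rng.random(size=n),                          # merchant risk score
])
# Toy labelling rule standing in for historical fraud labels.
y = ((X[:, 0] > 80) & (X[:, 2] > 0.7)).astype(int)

# The algorithm learns a general classification rule from the labelled data...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# ...which is then used to predict the label of new, unseen transactions.
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```

A random forest is chosen here because, as noted in Section 2.1, institutions often favour tree-based models for their relative explainability; in unsupervised learning, by contrast, the same transactions would be processed without the fraud labels.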
Unsupervised learning refers to algorithms that will learn from a dataset that does not have any labels. In this case, the algorithm will detect patterns in the data by identifying clusters of similar observations (data points with common features). Important problems addressed using unsupervised learning algorithms are clustering, anomaly detection and association.

In reinforcement learning, rather than learning from a training dataset, the algorithm learns by interacting with the environment. In this case, the algorithm chooses an action starting from each data point (in most cases the data points are collected via sensors analysing the environment) and receives feedback indicating whether the action was good or bad. The algorithm is therefore trained by receiving rewards and punishments; it adapts its strategy to maximise the rewards.

Furthermore, regardless of the mode adopted, some complex ML solutions can use a deep-learning approach. Deep learning means learning using deep neural networks. Neural networks are a particular type of ML algorithm that generates models inspired by the structure of the brain. The model is composed of several layers, with each layer being composed of units (called neurons) interconnected with each other. Deep-learning algorithms are neural networks that have many hidden layers (the number of layers can vary from tens to thousands), which can make their structure very complicated, so much so that they can easily become black boxes.

2. Current landscape

2.1 Current observations

In the context of its ongoing monitoring of financial innovation, and through its interactions with the competent authorities and stakeholders, the EBA has made a number of observations in the area of BD&AA.

New skills in data science are required and a gap has appeared between business and IT experts.

Institutions appear to recognise the importance of explainability and its possible consumer protection implications, and they seem to be working towards addressing these issues. Although no simple and unique approach appears to exist at this stage (academic research is ongoing), institutions seem to prefer the implementation of relatively simple algorithms, which tend to be more explainable, in an effort to avoid black box issues (e.g. a preference for decision trees and random forests rather than deep-learning techniques). The modelling process may be rather iterative to ensure a balance between explainability and accuracy.

Data protection and data sharing

Today, more than ever before, personal data protection brings new concerns to be addressed, from regulatory, institutional and customer perspectives. Both the General Data Protection Regulation (GDPR)15 and the Principles for effective risk data aggregation and risk reporting of the Basel Committee on Banking Supervision16 […]

The background seems more diverse for data analytics tools and data visualisation tools, where no dedicated tool appears to prevail and in-house solutions are used combined with ad hoc tools as needed.17

15 eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:02016R0679-20160504
16 Principles for effective risk data aggregation and risk reporting, January 2013, BCBS (bis/publ/bcbs239.pdf).
17 A non-exhaustive list of tools mentioned by banks responding to the EBA's questionnaire is as follows: Python, R and Scala for programming languages; R, Scikit-Learn, Pandas, TensorFlow and Keras functions for data science libraries; Git, Spark and Hadoop for big data storage and management; KNIME, H2O and Elastic/Kibana for data analytics; and R Shiny and JavaScript for data visualisation.

Moreover, it appears that it is not always the case that the aforementioned tools support the entire data science process that leads to a specific output in a reproducible way, as in some institutions only the source code is recoverable while in other institutions all relevant events are reproducible.

2.2 Current application areas of BD&AA

Figure 2.1: Current use of Big Data Analytics for risk management purposes (areas: fraud detection, customer on-boarding process, other AML/CFT processes, data quality improvement, process optimisation; status categories: in use/launched, pilot testing, under development, under discussion, no activity). Source: EBA risk assessment questionnaire (spring 2019)

Moreover, similar processes, such as know your customer processes, can involve leveraging BD&AA. However, the rationale is broadly the same. The institution relies on a predictive model previously trained with backward-looking data on customers' behaviour, cross-referenced with supplementary data, such as transactional data, for greater accuracy. Some extra features can be set up to enrich the model, such as rules that would highlight an obvious fraud pattern (e.g. a speed feature combining, for one given credit card, the timestamp and retailer location of successive payment transactions: the higher the value of the speed feature, the more likely it is that fraudulent copied credit cards are in use). Predictive models may rely on supervised ML algorithms (fed by training data labelled as fraudulent or not) that can learn the fraudulent patterns based on past frauds and consequently detect a potential fraud case. Unsupervised ML algorithms, aiming to detect anomalies in behaviour (reflecting rare or unusual patterns), can also be used, in combination with predictive models, to ensure sufficient predictive capability and accuracy in the fraud detection process.

In operational processes, when it comes to detecting fraud, predictive models can be applied in real time with the purpose of preventing fraudulent transactions. As part of the business process, the model receives as input the flow of business data to be checked and gives as a result a score assessing the potential for fraud for each entry in the flow. When the score given by the model for a particular entry reaches a predefined threshold, the entry is considered suspicious, i.e. potentially fraudulent. An alert is then triggered and the entry (i.e. financial transaction) is quarantined until a compliance officer can manually check it. If the model is accurate, the compliance officer should have fewer cases to check and consequently be able to perform a more efficient assessment of the cases flagged as potentially fraudulent. The compliance officer makes a decision based on the explainable output provided by the predictive model and on the ad hoc investigation that he or she carries out.
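The operational flow just described (an enriched speed feature, a score from a previously trained model, and a threshold that routes suspicious entries to a compliance officer) can be sketched as follows. This is an editorial illustration rather than the report's own implementation: the Transaction structure, the stand-in training data, the 0.8 threshold and the choice of LogisticRegression are all assumptions.

```python
# Illustrative sketch of the real-time fraud-scoring flow described above
# (not from the report). Feature set, thresholds and data are invented.
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

import numpy as np
from sklearn.linear_model import LogisticRegression

@dataclass
class Transaction:
    card_id: str
    timestamp: datetime
    lat: float        # retailer location
    lon: float
    amount: float

def speed_feature(prev: Transaction, curr: Transaction) -> float:
    """Implied travel speed (km/h) between two successive payments on one card:
    the higher the value, the more likely a copied card is in use."""
    dlat, dlon = radians(curr.lat - prev.lat), radians(curr.lon - prev.lon)
    a = sin(dlat / 2) ** 2 + cos(radians(prev.lat)) * cos(radians(curr.lat)) * sin(dlon / 2) ** 2
    distance_km = 2 * 6371 * asin(sqrt(a))
    hours = max((curr.timestamp - prev.timestamp).total_seconds() / 3600, 1e-6)
    return distance_km / hours

# Stand-in for a predictive model previously trained on labelled transactions
# (features: amount, speed). A real institution would load its trained model.
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.uniform(1, 500, 2000), rng.uniform(0, 2000, 2000)])
y_train = (X_train[:, 1] > 900).astype(int)           # toy fraud labels
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

SCORE_THRESHOLD = 0.8   # assumed alerting threshold

def check_entry(prev: Transaction, curr: Transaction) -> dict:
    """Score one incoming transaction and decide whether to quarantine it."""
    speed = speed_feature(prev, curr)
    fraud_score = model.predict_proba([[curr.amount, speed]])[0, 1]
    return {
        "fraud_score": round(float(fraud_score), 3),
        "speed_kmh": round(speed, 1),
        # Entries above the threshold are quarantined for a compliance officer.
        "action": "quarantine_for_manual_review" if fraud_score >= SCORE_THRESHOLD else "release",
    }

# Two payments 30 minutes apart in cities roughly 800 km apart: implausible speed.
t0 = datetime(2019, 6, 1, 12, 0)
print(check_entry(
    Transaction("card-1", t0, 48.85, 2.35, 120.0),                        # Paris
    Transaction("card-1", t0 + timedelta(minutes=30), 41.39, 2.17, 80.0)  # Barcelona
))
```

Confirmed outcomes from the manual reviews could then feed the feedback loop described below, with the model retrained periodically on the newly labelled cases.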
To further improve the efficiency of the model, new recognised patterns resulting from the fraud detection process can be collected to retrain the model on a regular basis (a feedback loop).

However, the use of BD&AA […] Therefore, data quality risks need to be identified and incorporated into an overall risk management framework. The concept of data quality is overarching and needs to be considered at each step shown in the advanced analytics methodology presented in Figure 3.1. Like data security, data quality needs to be considered throughout the whole BD&AA […] Institutions cannot outsource responsibility to external providers and thus they remain accountable for any decisions made. Moreover, adequate scrutiny of and due diligence on data obtained from external sources, in terms of quality, bias and ethical aspects, could be included in the risk management framework.

3.3.2 Skills and knowledge

[…]

Similarly, the need for explainability is also strong when a human is involved after the AI/ML model and is required to take the final decision based on the results produced by the model, and therefore needs to understand why a particular result was generated. That explanation will be even more important when the decision impacting a consumer is taken fully automatically by the machine: in that case, regulations such as the GDPR have introduced the right of the data subject (i.e. the person who is the subject of the data being processed) to receive meaningful information about the logic involved.
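The explainable output that such a human decision-maker receives can take a very simple form when tree-based models are used, which is one reason the report notes that institutions tend to prefer them. The sketch below (an editorial illustration, not from the report; data, feature names and model settings are invented) prints the decision path behind a single fraud score:

```python
# Minimal sketch (not from the report): tracing why a decision tree produced a
# particular fraud score, as a simple form of explainable output for a reviewer.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

feature_names = ["amount", "speed_kmh", "merchant_risk"]

rng = np.random.default_rng(1)
X = rng.random((1000, 3)) * [500, 1200, 1]                                # synthetic transactions
y = ((X[:, 1] > 900) | ((X[:, 0] > 400) & (X[:, 2] > 0.8))).astype(int)   # toy labels

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def explain(sample: np.ndarray) -> None:
    """Print the sequence of splits followed for one transaction."""
    node_indicator = tree.decision_path(sample.reshape(1, -1))
    for node_id in node_indicator.indices:
        feat = tree.tree_.feature[node_id]
        if feat < 0:                                   # leaf node reached
            proba = tree.predict_proba(sample.reshape(1, -1))[0, 1]
            print(f"=> fraud score {proba:.2f}")
            return
        threshold = tree.tree_.threshold[node_id]
        direction = "<=" if sample[feat] <= threshold else ">"
        print(f"{feature_names[feat]} = {sample[feat]:.1f} {direction} {threshold:.1f}")

explain(np.array([450.0, 950.0, 0.9]))   # a suspicious-looking transaction
```

For deep-learning models with many hidden layers, producing an equally direct trace is considerably harder, which is the black-box concern raised in Section 2.1.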