Big Data and AI technologies are certainly the driving force behind a variety of technological innovations, from personal assistants through chatbots, autonomous machines, self-driving cars, and unmanned aircraft systems to natural language interaction. With all the benefits, however, come substantial risks. We have asked Cédric Mauny, Cybersecurity Lead, and Ralf Hustadt, HPC and Big Data Lead at Telindus, to share their insights and explore the implications.
BIG DATA PRESENTS A GREAT OPPORTUNITY NOT ONLY FOR BUSINESSES BUT FOR CYBER CRIMINALS: THE BIGGER THE DATA, THE HIGHER THE REWARD FOR THE MALICIOUS INTRUDERS. WHAT ARE THE MAIN SECURITY CHALLENGES FOR BUSINESSES AND ORGANIZATIONS AS REGARDS BIG DATA?
Ralf Hustadt: One of the main threats posed by emerging technologies is the fact that most of this is fairly new and is developed in a very agile way. Assuming you are coming from the academic field with a state-of-the-art AI technology that you want to sell, you’ll be focusing on solving the problem, while the surrounding environment is not your top priority. One of the key aspects that typically doesn’t get consideration from day one is how to make the system itself secure. The aim is to develop an algorithm or a system that works and solves a given problem. That does not necessarily mean that it is secure by default. We have also seen this in high-performance computing, especially with platforms hosted in the academic world. Why do large corporations operate their own high-performance computing platforms? Simply because they would never entrust confidential data to a university where a student can more or less walk in with a USB stick. That applies to physical security, to access rights, and to all the other security-related topics.
Cédric Mauny: That also counts for privacy. Big Data is a consolidation of lots of different data and this creates new risks for personal data, because inferences could be derived from such large volumes of data, for instance.
Ralf Hustadt: Let me give you an example to illustrate this. Let’s say you found a way to gather a lot of data and you have developed a new algorithm or process to do something with it. You create a system, set it up, give access to the different customers and then you find out that there is no clear separation of access rights. So, you end up with a system where someone, with a little bit of knowledge and malice, can access somebody else’s data. The key thing here is that this is not done on purpose. It’s simply that a data scientist is not a security or compliance expert and might not be aware of these issues. The same goes for the code. If you code something and create a system that works, maybe you forgot to change the default passwords, failed to use secure coding techniques, or have not put the right protection mechanisms in place. All of this constitutes a major security challenge in a Big Data environment.
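The missing separation of access rights can be illustrated with a minimal Python sketch. The `Record` type, the `TenantStore` class and the customer names are hypothetical and exist only for this illustration; the point is that every read is checked against the owner of the data instead of trusting each customer to request only its own records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One stored data item together with the customer (tenant) that owns it."""
    record_id: str
    owner: str
    payload: str


class TenantStore:
    """Hypothetical store that checks ownership on every read."""

    def __init__(self):
        self._records = {}  # maps record_id -> Record

    def put(self, record):
        self._records[record.record_id] = record

    def get(self, caller, record_id):
        record = self._records.get(record_id)
        # Same error for "missing" and "not yours", so a caller cannot
        # probe which record ids exist for other customers.
        if record is None or record.owner != caller:
            raise PermissionError(f"{caller} may not read {record_id}")
        return record


store = TenantStore()
store.put(Record("r1", owner="customer_a", payload="A's sales figures"))
store.put(Record("r2", owner="customer_b", payload="B's sales figures"))

# A customer can read its own data ...
own = store.get("customer_a", "r1")

# ... but an attempt to read another customer's data is refused.
try:
    store.get("customer_a", "r2")
    cross_tenant_read_blocked = False
except PermissionError:
    cross_tenant_read_blocked = True
```

A real platform would enforce this in the database or the identity layer rather than in application code, but the principle is the same: the ownership check is part of the system, not an afterthought.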
AND WHAT ABOUT ARTIFICIAL INTELLIGENCE? LET’S IMAGINE THAT CRIMINALS, ROGUE STATE AGENTS, UNSCRUPULOUS COMPETITORS, OR BLACKMAILED INSIDERS DECIDE TO MANIPULATE AN ORGANIZATION’S AI SYSTEMS, FOR EXAMPLE. ANOTHER CONCERN IS THE POSSIBILITY THAT ATTACKERS USE AI IN A VARIETY OF WAYS TO EXPLOIT VULNERABILITIES IN THEIR VICTIMS’ DEFENSES.
Cédric Mauny: Today, Artificial Intelligence is used to strengthen security, but it is also utilized by attackers and this is something that we can’t ignore. Our view is that AI, by accelerating the processing of massive volumes of data, will help detect attack patterns. The whole security community is counting on that trend continuing and improving. However, we must not forget that attackers actually use the same technologies – and sometimes even better than us – to carry out their criminal activities.
Ralf Hustadt: I’ll give you a very simple example that is not exactly about using AI; it is rather a case of subverting AI. Large companies today use AI for filtering CVs. The HR department puts the CV in PDF format into the machine and the machine selects the resumes of the candidates to be interviewed. What do you do if you know that they are using such a system? And how do you actually bypass the system? It’s quite simple. If you have some keywords in your CV that signal a certain expertise or a high potential, the likelihood that you will be selected for an interview is much higher. Of course, you cannot simply lie on your CV by pretending that you hold a PhD from Harvard. Since a human reader is expecting black characters on white paper and a machine cannot distinguish between the real text and the subtitles, you can simply type “PhD Harvard” in white characters on white paper in a subtitle. So, your keywords are not visible to a human reader and, as the machine doesn’t always give the reasons for its choice, your chances of being screened in are increased.
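The white-on-white trick can be reproduced with a toy keyword screener in Python. This is only a sketch under stated assumptions: the `TextSpan` type, the keyword list and the sample CV are hypothetical, and a real system would first need a PDF parser to obtain the text and its color. The second call shows one possible countermeasure, which is to ignore text that a human reader could not see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSpan:
    """A run of text extracted from a document, with its rendering color."""
    text: str
    color: str  # for example "black" or "white"


KEYWORDS = {"phd", "harvard"}  # hypothetical screening keywords


def keyword_hits(spans, page_color=None):
    """Return the screening keywords found in the spans.

    If page_color is given, spans drawn in that color are skipped,
    because a human reader would not see them.
    """
    words = set()
    for span in spans:
        if page_color is not None and span.color == page_color:
            continue
        words.update(w.strip(".,").lower() for w in span.text.split())
    return KEYWORDS & words


cv = [
    TextSpan("BSc in Computer Science, five years of Java.", "black"),
    TextSpan("PhD Harvard", "white"),  # invisible on a white page
]

naive = keyword_hits(cv)                              # matches the hidden keywords
visible_only = keyword_hits(cv, page_color="white")   # matches nothing
```

The naive screener is fooled because it treats all extracted text alike, whereas the second variant only scores what is actually visible on the page.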
Cédric Mauny: In this particular case, you can mislead the AI because the selection process is only based on keywords. But special security measures are probably foreseen to address such situations, such as submitting the outcomes produced by the algorithm to human verification. It is important to keep in mind that automated individual decision-making, in other words deciding by automated means without any human involvement, is framed by the GDPR.
There are also ways of manipulating supervised Machine Learning. Some consist in forcing the machine to learn from manipulated behavior, so that it considers bad behavior to be the normal baseline. Consequently, subsequent attacks may go undetected because they are identified as falling within the scope of normal behavior.
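A small Python sketch shows how a poisoned baseline can hide an attack. It assumes a deliberately simple detector, a z-score on daily outbound traffic, and hypothetical traffic figures; it is not a description of any particular product. When the learning period is clean, a large transfer stands out. When an attacker has already injected large transfers during the learning period, the same transfer looks normal.

```python
from statistics import mean, stdev


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations above the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return (value - mu) / sigma > threshold


# Hypothetical daily outbound traffic in MB during a clean learning period.
clean_baseline = [100, 105, 98, 102, 95, 101, 99, 103]

# The same period, but an attacker already inside keeps injecting large transfers.
poisoned_baseline = clean_baseline + [400, 450, 500, 480]

exfiltration = 450  # MB sent out on the day of the actual attack

detected_with_clean = is_anomalous(exfiltration, clean_baseline)        # True
detected_with_poisoned = is_anomalous(exfiltration, poisoned_baseline)  # False
```

The poisoned data raises both the mean and the spread of the baseline, so the attack falls inside what the detector has learned to accept.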
YES, BUT WE CAN PRESUME THAT THIS IS UNLIKELY TO HAPPEN. DON’T YOU NEED AN INSIDE MAN TO PULL OFF SUCH A FRAUD?
Cédric Mauny: In fact, you don’t know the status of your information system when you set up such a Machine Learning process. Therefore, you assume that your information system is secure and trust it. But what if an attacker is already in place? There is no silver bullet solution to secure such a system. You must rely on composite measures that combine usual-behavior baselines, rule-based analysis, pattern matching, variance from a company standard, etc.
IN THIS REGARD, BIG DATA ANALYTICS SOLUTIONS, BACKED BY ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, GIVE HOPE THAT BUSINESSES AND PROCESSES CAN BE KEPT SECURE IN THE FACE OF A CYBERSECURITY BREACH AND HACKING. HOW CAN THESE TECHNOLOGIES BE USED TO IMPROVE DATA PROTECTION TECHNIQUES AND CYBER THREAT DETECTION MECHANISMS?
If your company is a big consulting firm, users are probably behaving differently than in a company like ours, for example. The profile of a consulting firm is probably different from that of an ICT company where we more or less stick to regular working times. For a consulting firm, on the other hand, it seems completely normal that people access files in the middle of the night, because they are working in a global organization with different time zones, or because they are in a rush to complete an important project.
So, how do you define what unusual behavior is? You could probably discover it, but you simply don’t have the time to explore such volumes of data. Unusual behavior could be an employee who is accessing specific folders at odd hours and is trying to copy the content of those files, or who is attempting to delete them. It seems legitimate to have a short conversation with this employee to clarify the issue. But you can’t, because of a lack of time and resources. The big advantage of AI is that it enables you to define a normal status and let the machine watch for deviations from expected behavior, twenty-four hours a day and seven days a week.
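The idea of a company-specific "normal status" can be sketched in a few lines of Python. The access histories, the `learn_profile` and `is_unusual` helpers and the 5% cut-off are assumptions made for illustration only. Each organization learns its own profile of usual access hours, so the same access at 03:00 is flagged for one company and accepted for the other.

```python
from collections import Counter


def learn_profile(access_hours, min_share=0.05):
    """Return the set of hours (0-23) that make up at least `min_share` of past accesses."""
    counts = Counter(access_hours)
    total = len(access_hours)
    return {hour for hour, n in counts.items() if n / total >= min_share}


def is_unusual(hour, profile):
    """An access is unusual if its hour was rare or absent in the learned profile."""
    return hour not in profile


# Hypothetical history: an ICT company with regular office hours ...
ict_history = [9, 10, 11, 14, 15, 16, 9, 10, 13, 17, 11, 15]
# ... and a consulting firm whose staff work across time zones.
consulting_history = [9, 14, 23, 2, 3, 10, 22, 1, 15, 3, 2, 11]

ict_profile = learn_profile(ict_history)
consulting_profile = learn_profile(consulting_history)

night_access = 3  # a file access at 03:00

unusual_for_ict = is_unusual(night_access, ict_profile)
unusual_for_consulting = is_unusual(night_access, consulting_profile)
```

A production system would profile far more than the hour of day (folders touched, volumes copied, deletions), but the principle of learning a baseline per organization or per user stays the same.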
Cédric Mauny: I believe that AI, Machine Learning, and other emerging technologies could be used to bridge the skills gap and to allow humans to focus on the most important information. AI systems, for instance, are able to process large volumes of data and extract trends from this data, or to highlight specific aspects to focus on. Conversely, human beings are able to make decisions on the basis of small quantities of data, which is very difficult for AI. To put it simply, Artificial Intelligence is highly relevant at the very beginning of the process, when extensive sets of data are involved, while humans can focus on small sets of data. This is precisely the way we proceed at Telindus’ CyberSecurity Intelligence and Operations Center (CSIOC), i.e. our SOC.
One of the outcomes of this approach is a set of tools called SOAR – Security Orchestration, Automation and Response. SOAR is a software stack that allows an organization to collect data about security threats from multiple sources and respond to low-level security events without human assistance.
Ralf Hustadt: Typically, humans can naturally perceive if something is not normal. A standard system cannot do that. Therefore, you have to teach an AI everything by rules, whereas man uses his brain. For example, if a computer in your system is accessing a file server, that is standard, normal behavior. If the computer in question suddenly starts addressing all systems in the middle of the night, it’s definitely not normal. But unless you have told your surveillance system to look for that specific behavior, it won’t recognize it.
One of the SOAR platforms mentioned by Cédric is Splunk Phantom. We have a sister company in the Netherlands, Umbrio. Among other things, this solution is capable of going through large data files on a flow basis, in near real time. That’s the key asset of that system, because it doesn’t really help if you collect data but need three days to discover that somebody breached your system three days earlier.
Cédric Mauny: The same approach applies to Security Operations Centers. Our CSIOC, for instance, relies on different sets of detection rules: based on known patterns – someone establishing connections on his laptop in the middle of the night – on data volume, on variance compared with average values, on AI-based pattern detection, and of course on human analysis.
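The way several sets of detection rules can work side by side may be sketched as follows in Python. The `Event` type, the rule names and all thresholds are hypothetical; the sketch covers only the known-pattern, data-volume and variance rules, and leaves out AI-based pattern detection. Anything the rules flag is meant to be handed to a human analyst.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class Event:
    """A simplified, hypothetical log event."""
    user: str
    hour: int          # hour of day, 0-23
    megabytes: float   # data volume moved by the event


def night_connection(event):
    """Known-pattern rule: activity between midnight and 05:00."""
    return 0 <= event.hour < 5


def high_volume(event, limit_mb=1000.0):
    """Volume rule: a fixed, hypothetical threshold."""
    return event.megabytes > limit_mb


def far_from_average(event, history_mb, max_sigma=3.0):
    """Variance rule: volume far above the historical average."""
    mu, sigma = mean(history_mb), stdev(history_mb)
    return sigma > 0 and (event.megabytes - mu) / sigma > max_sigma


def triage(event, history_mb):
    """Return the names of all rules that fired; an analyst reviews any non-empty result."""
    fired = []
    if night_connection(event):
        fired.append("night_connection")
    if high_volume(event):
        fired.append("high_volume")
    if far_from_average(event, history_mb):
        fired.append("far_from_average")
    return fired


history = [40.0, 55.0, 48.0, 60.0, 52.0, 45.0]  # past volumes in MB
routine = Event("alice", hour=10, megabytes=50.0)
suspicious = Event("bob", hour=2, megabytes=1500.0)

routine_alerts = triage(routine, history)        # no rule fires
suspicious_alerts = triage(suspicious, history)  # all three rules fire
```

Keeping the rules independent means each one can be tuned or replaced on its own, and the list of fired rules gives the analyst a reason for every alert.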
AI is a very useful tool for processing large data volumes and detecting patterns, but in the end, we rely on human experts. This also goes for our penetration testing projects, for which we use AI-based tools to perform the first level of analysis. After that, as a second step, a human consultant, backed by his experience and knowledge, takes over and decides whether to dig further in one direction or another. We recently conducted an automated penetration testing project for the European Space Agency that illustrates this point. It’s important to understand that these emerging technologies are utilized by both defenders and attackers. Attackers know that defenders are working with AI-based tools. But we have to keep in mind that they are using the same means. We need to know our enemy so as to fight him with the best weapons.
Ralf Hustadt: I think that the big challenge with AI-based automated threat detection solutions is that you need to teach the system. And in order to teach it, you need to have, let’s say, good and bad examples. It’s not enough to tell the system what the normal state is. You also have to show the system what an attack looks like. You need to have both data sets. Let me use a simple metaphor. If you want to teach a system what a cow looks like, you need pictures of cows but you also need pictures of other animals, or else you won’t be able to tell your system what the difference is. Having only data on your network when it is not under attack is similar to trying to find a cow in a herd of animals without knowing what other animals look like. It is hard to teach a system what an attack looks like if you cannot show any examples. That’s currently the main challenge that we are facing.
Cédric Mauny: It’s a paradox. We need to teach AI good and bad examples, but for that we must decide what good and bad examples are. And we could end up with a situation where we would need another AI to teach the AI good and bad examples.
Ralf Hustadt: We are facing the same issues in the financial sector. You cannot teach the system to detect fraudulent transactions if you only have patterns of regular behaviors. And moreover, these behaviors are likely to adapt and change over time.
Cédric Mauny: If a given behavior changes, it is difficult for us to detect the new behavior. Let’s take an ICT company running a new service as an example. A new service triggers new connections, which represent a huge variance from the data of the day before and can therefore be interpreted as an abnormal event by the surveillance system. This is why it is very important to keep on learning. And continuous learning means sorting between good and bad examples. In the end, this is a question of data quantity vs data quality.
SO, THIS UNDERLINES THE IMPORTANCE OF THE ROLE PLAYED BY HUMAN ACTORS?
Ralf Hustadt: Yes, the machine is always a support. The only situation where a machine is much better than a human being is in processing massive data sets. And since it can be trained to recognize the knowns, AI is perfectly geared towards an initial triage.
AND WHAT ABOUT PRIVACY AND PERSONAL DATA PROTECTION?
Cédric Mauny: I believe that Big Data may raise privacy issues that we are not yet aware of. For the time being, we do not have any idea of what will arise from the aggregation of those tremendous amounts of data. Just think of China’s rapidly expanding networks of surveillance cameras. We also leave many traces behind us on the internet that may raise serious issues if they are correlated with one another. Take the Cambridge Analytica scandal, for example.
Ralf Hustadt: I think that the best protection in this case is knowledge and awareness. There are a lot of cases where people are not aware of these issues. For instance, as a customer, you may be involved in one or several loyalty programs. A loyalty program is an agreement between a business and its customers where customers agree to allow the business to track purchases and in return, the business offers rewards such as coupons, cashback, lower prices or a free product or service. Today we have moved from simple paper punch cards to electronic scannable cards and smartphone applications. These programs are important for marketers and data miners, because they allow the business to track the customer’s purchasing behavior in detail.
Some uses of your shopping information can seem safe enough: you give those businesses access to what you buy, and they give you discounts or freebies for spending money with them. But what is not immediately apparent is the hidden cost. When you enroll in the loyalty program of your favorite store, you may also agree, more or less voluntarily, to enroll in a targeted advertising program that shares your personal details and buying habits with other companies, which can then use them to target you as a potential customer. And on top of that, those companies can also enrich the data they have collected about you by buying customer data from third-party companies, known as data brokers.
Cédric Mauny: It’s a matter of balance between usability and privacy. And I think that we are well advanced in this regard. I have heard that retail platforms have obtained a patent for shipping products to consumers with autonomous flying delivery drones before they even order them. The technology or process involved is called anticipatory shipping. This is the next step in supply-chain optimization: to cut down on processing and delivery times, the items will not be sent from a central warehouse directly to customers, but rather to a nearby shipment hub. The anticipatory shipping process will leverage predictive analytics tools along with the massive volume of customer data accumulated from previous orders, basket or shopping cart contents, or wish-lists.
There is one more thing that people are not aware of. Did you know that, even if you don’t have a Facebook account, Facebook has probably created one on your behalf, a shadow profile, based on the data that the social network collects from other users? The term shadow profile refers to all the information that you have not communicated to Facebook but that it still holds about you, whether you have an account or not. These shadow profiles are mainly fed by information from your friends, be they online or in real life. It is enough that one of your contacts has shared his phone contacts with Facebook, or his email address book, or that he has mentioned information about you on Messenger, for Facebook to gather that information and keep it in a kind of virtual file about you. This practice clearly raises questions around data collection, consent, and personal data protection.
Ralf Hustadt: The nice thing is that you can also use it the other way round. It has been done before by the police forces in the U.S. who had identified a network of criminals but didn’t know who was pulling the strings in the shadows. So, they analyzed the criminals’ Facebook accounts and found that there were almost no relationships between them but that they all had connections with one particular individual who turned out to be the godfather of the organization.
Cédric Mauny: To conclude, I would like to underline some important points. AI, Big Data and emerging technologies are the future, not only for enterprises, but for all humankind. They come with enormous opportunities, but also threats that are difficult to predict. The reality is that any innovation might be used for both beneficial and harmful purposes. The development of such technologies needs to be audited, mapped, governed and prevented when necessary.