NEW YORK -- Organizations planning to use Hadoop to aggregate and analyze data from multiple sources need to consider potential security issues beforehand, according to IT professionals at the Hadoop World conference here.
Hadoop makes it easier for organizations to get a handle on the large volumes of data being generated each day, but it can also create problems related to security, data access, monitoring, high availability and business continuity, Larry Feinsmith, managing director of IT operations at banking giant JPMorgan Chase, said in a keynote speech at Hadoop World on Nov. 8.
Data is growing faster than ever, thanks to blogs, social media networks, machine sensors and location-based data from mobile devices. Companies can analyze the data to gain insights into customers and industry trends that were previously out of reach. However, organizations now face the prospect of managing and securing many petabytes of data, Richard Clayton, a software engineer with Berico Technologies, said in a security panel at the conference.
The data is not monolithic, as there may be mixed classifications and varying levels of security sensitivity, Clayton said. As an IT services contractor for federal agencies, Berico Technologies had to consider varying encryption technologies, retention policies and access requirements for individual pieces of data.
Most organizations don't have the visibility they need to understand what they have and to properly secure it, Ken Cheney, vice president of business development and marketing at storage management software vendor Likewise, told eWEEK before the conference. The visibility is essential to "know who owns the data, and who has access to it," Cheney said.
Enterprises need to implement appropriate security controls to enforce role-based access to the data, according to Clayton. However, he said the security features built into the Hadoop Distributed File System (HDFS), such as access control lists and Kerberos, are not adequate to meet enterprise needs.
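To illustrate what role-based access means in practice, here is a minimal sketch in Python. It is not how HDFS or any Hadoop component enforces permissions; the role names, the permission sets and the `is_allowed` helper are all hypothetical and exist only to show the idea of checking an action against a user's roles.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission mapping; the names are illustrative only.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "data_engineer": {"read", "write"},
    "admin": {"read", "write", "delete"},
}


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)


def is_allowed(user: User, action: str) -> bool:
    """Return True if any of the user's roles grants the requested action."""
    return any(
        action in ROLE_PERMISSIONS.get(role, set()) for role in user.roles
    )


alice = User("alice", {"analyst"})
bob = User("bob", {"data_engineer"})
```

In this sketch, `is_allowed(alice, "read")` succeeds while `is_allowed(alice, "write")` does not, and a user with no roles is denied everything by default.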
Many organizations tie the data being stored to identity management systems, such as Active Directory or LDAP, as the "source of truth," according to Cheney. By linking the data with an actual identity, IT departments can track what is being done with the data and by whom, he said.
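The tracking Cheney describes can be pictured as an audit trail keyed by identity. The sketch below is a simplified assumption, not a description of any vendor's product: the `record_access` and `actions_by_user` helpers are hypothetical, the log lives in memory, and the user IDs are assumed to come from a directory such as Active Directory or LDAP (the lookup itself is not shown).

```python
from datetime import datetime, timezone

# In-memory audit trail; a real deployment would use durable storage.
audit_log = []


def record_access(user_id: str, dataset: str, action: str) -> dict:
    """Append one entry tying an action on a dataset to a known identity."""
    entry = {
        "user_id": user_id,  # assumed to be resolved from the directory
        "dataset": dataset,
        "action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(entry)
    return entry


def actions_by_user(user_id: str) -> list:
    """Return every logged action performed by the given identity."""
    return [entry for entry in audit_log if entry["user_id"] == user_id]
```

Because every entry carries a user ID, a question such as "what has this person done with the data?" becomes a simple filter over the log.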
Another big concern for organizations using Hadoop is that analyzing the data within the environment creates new datasets that also need to be protected, Clayton said. Aggregating the data in one place also increases the risk of data theft or accidental disclosure, he said. An effective data security approach in many Hadoop environments would be to encrypt the data at the individual record level, whether it is in transit or being stored, according to Clayton.
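One way to picture the derived-dataset concern is a rule under which any new dataset inherits the most sensitive label among its inputs. That rule is an assumption made here for illustration, not something Clayton prescribed, and the sensitivity levels and the `derived_label` helper in this Python sketch are hypothetical.

```python
# Hypothetical sensitivity levels, ordered from least to most sensitive.
LEVELS = ["public", "internal", "confidential", "restricted"]


def derived_label(input_labels: list) -> str:
    """Return the most sensitive label among the inputs to an analysis job."""
    if not input_labels:
        raise ValueError("a derived dataset needs at least one input label")
    # LEVELS.index raises ValueError for a label that is not defined above.
    return max(input_labels, key=LEVELS.index)
```

Under this rule, a report built from one public and one confidential source would itself be labeled confidential, so the output of an analysis never ends up less protected than the data that produced it.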
Many government agencies are putting Hadoop-stored data into separate "enclaves," or network segments, to ensure that only people with the proper level of security clearance can view the information, he said. Others are building firewalls that protect Hadoop environments and restrict access, Clayton said.
Some agencies have opted out of using Hadoop altogether because of these data access concerns, according to Clayton.
Technology companies such as IBM, Yahoo and Google have been using Hadoop for years, but only recently have other large enterprises started looking at Hadoop to rein in their out-of-control data.
JPMorgan Chase has been using the open-source storage and data analysis framework for almost three years in various applications, such as fraud detection, IT risk management and self-service, Feinsmith said. Chase relies on Hadoop to collect and store Web logs, transaction data and social media information on a common platform, and it runs data mining and analytics applications to gather intelligence, according to Feinsmith.