Big data tools typically seek to combine the functions of storage and analytics/BI. This is because the model of gathering data in one repository and sending it to an external analytics engine tends to break down as the volume of data soars.
With so much unstructured data available from so many channels, it makes sense to combine storage and analytics into one system. Accordingly, analytics tools have added storage capabilities; conversely, traditional storage vendors have expanded into the realm of analytics.
The market therefore includes a hodgepodge of products. Some are focused on analytics, others on storage with analytics capabilities. More than a few tools perform one of the two functions, but integrate closely with another vendor to provide full big data analytics functionality.
Big Data Tools Key Features
Due to the way the market has evolved, the feature sets vary widely. The following core features are found in most systems:
- Large-scale storage: Most big data analytics tools provide substantial storage capacity. Capacities vary, and some tools rely on larger storage repositories provided by partners, but regardless of the architecture, these tools must be able to handle very large volumes of data.
- Analytics engine: Gathered data needs to be analyzed. Some analytics engines are more mature than others, having been refined over more years, but even relative newcomers to this space need solid built-in analytics capabilities.
- Data cleansing: Analytics results depend on the quality of the data. Veteran analytics companies are skilled at data cleansing. Those newer to the field are still finding their feet in this arena.
- In memory: To speed up results, some data sets or subsets can be held in memory. Analyzing data there yields insights in a fraction of the time required for disk-based analysis.
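To make the cleansing step concrete, here is a minimal, vendor-neutral sketch in plain Python: missing values are filled with the median, and outliers are dropped using a median-absolute-deviation rule. The threshold `k` and the sample data are illustrative assumptions, not any product's defaults.

```python
from statistics import median

def clean(values, k=5.0):
    """Fill missing values (None) with the median, then drop outliers.

    A generic illustration of the cleansing steps most big data tools
    automate; not any specific vendor's API. Assumes the data has some
    spread (a median absolute deviation of zero would drop everything
    off-median).
    """
    present = [v for v in values if v is not None]
    med = median(present)
    filled = [med if v is None else v for v in values]
    mad = median(abs(v - med) for v in filled)
    return [v for v in filled if abs(v - med) <= k * mad]

readings = [10.0, 12.0, None, 11.0, 9.0, 250.0]  # 250.0 is an outlier
print(clean(readings))  # [10.0, 12.0, 11.0, 11.0, 9.0]
```

Real tools apply the same ideas at much larger scale, typically with richer imputation and segmentation rules.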
Top Big Data Tools and Software
CIO Insight evaluated the various vendors in big data analytics. Here are our top picks, in no particular order:
SAS Enterprise Miner is one of many analytics tools within the SAS arsenal. It aims to dramatically shorten model development time for data miners and statisticians. An interactive, self-documenting process flow diagram environment efficiently maps the entire data mining process to produce the best results. The company claims this big data tool offers more predictive modeling techniques than any other commercial data mining package.
- Batch processing
- Data preparation, summarization, and exploration
- Preparation tools address missing values, filter outliers, and develop segmentation rules
- Predictive and descriptive modeling
- Suite of statistical, data mining, and machine-learning algorithms
- Open source integration with R
- High-performance data mining nodes
- SAS Rapid Predictive Modeler steps nontechnical users through data mining tasks
- Model comparisons, reporting, and management
- Automated scoring in SAS, C, Java, and PMML
- Scoring code is deployable in SAS, on the web, or directly in relational databases or Hadoop.
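The modeling-and-scoring flow in the list above can be sketched generically: fit a model on historical data, then emit a self-contained scoring function that can be deployed on its own. This toy least-squares example is illustrative only and is not SAS code; SAS Enterprise Miner generates equivalent scoring code in SAS, C, Java, or PMML.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b.

    A stand-in for the predictive-modeling step; illustrative only,
    not SAS Enterprise Miner's algorithms.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return a, my - a * mx

# "Training" on hypothetical historical data
a, b = fit_linear([1, 2, 3, 4], [2, 4, 6, 8])

# "Scoring code": a self-contained function that could be deployed
# separately from the training environment
def score(x, a=a, b=b):
    return a * x + b

print(score(5))  # 10.0 for this toy data
```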
IBM Db2 Big SQL is an enterprise-grade, hybrid ANSI-compliant SQL-on-Hadoop engine, delivering massively parallel processing and advanced data querying. This data virtualization tool is for accessing, querying, and summarizing data across the enterprise. It offers a single database connection or query for disparate sources, such as Hadoop HDFS and WebHDFS, RDBMS, NoSQL databases, and object stores.
- Can be integrated with Cloudera Data Platform, or accessed on IBM Cloud Pak for Data
- Enterprise-grade SQL-on-Hadoop performance using elastic boost technology
- Low latency with support for ad-hoc and complex queries
- High performance, security, and federation capabilities
- Hybrid Hadoop engine exploits Hive, HBase, and Apache Spark concurrently
- Role-based access control, row-based dynamic filtering, column-based dynamic masking, and Apache Ranger integration
- Standards-compliant Open Database Connectivity and Java Database Connectivity
- Allows database access from products or tooling that support only Open Database Connectivity or Java Database Connectivity
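The ODBC/JDBC bullets above boil down to this: any standards-compliant client can open a connection and issue SQL. A minimal Python DB-API sketch, using sqlite3 as a stand-in for the connection (against Db2 Big SQL you would open the connection through an ODBC or JDBC driver instead; the table and data here are invented):

```python
import sqlite3

# Stand-in for an ODBC/JDBC connection; with Db2 Big SQL you would
# connect through a driver and connection string instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (channel TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("web", 120), ("mobile", 300), ("web", 80)],
)

# A single SQL query summarizes the data, regardless of where it lives
rows = conn.execute(
    "SELECT channel, SUM(clicks) FROM events GROUP BY channel ORDER BY channel"
).fetchall()
print(rows)  # [('mobile', 300), ('web', 200)]
conn.close()
```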
Cloudera Enterprise Data Hub (EDH) delivers an integrated suite of analytic engines ranging from stream and batch data processing to data warehousing, operational database, and machine learning (ML). It works in conjunction with Cloudera SDX, which applies security and governance. This enables users to share and discover data for use across workloads.
- Accelerates ML from research to production
- Built-in data warehouse delivers a cloud-native, self-service analytic experience
- Data warehouse is integrated with streaming, data engineering, and ML analytics
- Governance for all data and metadata on private, public, or hybrid clouds
- Operational database-as-a-service brings flexibility to Apache HBase
- Database management capabilities like auto-scale, auto-heal, and auto-tune
- Integrations with Cloudera Data Platform and services
- CDP Data Engineering built on Apache Spark for automation with Apache Airflow, advanced pipeline monitoring, visual troubleshooting, and management tools
Oracle Big Data Service is a Hadoop-based managed service that includes a data lake, a data warehouse, and more. Among its components is the Oracle Autonomous Data Warehouse, a cloud data warehouse designed to eliminate the complexities of operating a data warehouse, securing data, and developing data-driven applications.
- Automates provisioning, configuring, securing, tuning, scaling, and backing up of data warehouse
- Tools for self-service data loading, data transformations, business models, and automatic insights
- Converged database capabilities enable simpler queries across multiple data types and ML analysis
- Oracle big data services delivered via a lake house architecture
- Object storage and Hadoop-based data lakes for persistence and Spark for processing
- Analysis through Oracle Cloud SQL or the analytical tool of choice
- Cloud Infrastructure Data Flow is a managed Apache Spark service with no infrastructure to deploy or manage
- Cloud Infrastructure Object Storage enables storage of data in its native format
- Cloud Infrastructure Data Catalog helps search, explore, and govern data
- Cloud Infrastructure Data Integration extracts, transforms, and loads data
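The extract-transform-load step that Cloud Infrastructure Data Integration performs can be sketched in miniature (the CSV feed and field names here are invented for illustration; the service itself operates on enterprise sources at scale):

```python
import csv
import io

# Invented raw feed; a miniature extract-transform-load flow of the
# kind an ETL service automates against real sources.
raw = "region,sales\nus,100\neu,80\nus,40\n"

def etl(text):
    rows = csv.DictReader(io.StringIO(text))  # extract
    totals = {}
    for row in rows:                          # transform: cast and aggregate
        totals[row["region"]] = totals.get(row["region"], 0) + int(row["sales"])
    # load: emit records ready for the destination table
    return [{"region": r, "sales": s} for r, s in totals.items()]

print(etl(raw))  # [{'region': 'us', 'sales': 140}, {'region': 'eu', 'sales': 80}]
```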
Qubole offers a secure and open data lake platform to accelerate machine learning, streaming, and ad hoc analytics. Its customer base includes Expedia, Disney, Lyft, and Adobe. By avoiding proprietary formats, proprietary SQL extensions, and a proprietary metadata repository, and by providing programmatic access to data, the Open Data Lake eliminates vendor lock-in while supporting a diverse range of analytics.
- Author, save, template, and share reports and queries via Workbench
- Build data pipelines combining multiple streaming or batch data sources via Assisted Pipeline Builder
- Offline editing, multi-language interpreter, and version control capabilities
- Qubole Notebook monitors application status and job progress
- Secure access with encryption and RBAC controls
- Build and manage metadata, explore data dependencies, and provide indices and statistics
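A core building block behind streaming pipelines like these is windowed aggregation. A toy sketch of the shape, not Qubole's API (the events and window size are invented):

```python
def tumbling_windows(events, size):
    """Sum (timestamp, value) events into fixed-size time windows.

    A generic illustration of streaming aggregation; real pipeline
    builders wire this to live streaming and batch connectors.
    """
    windows = {}
    for ts, value in events:
        windows[ts // size] = windows.get(ts // size, 0) + value
    return windows

events = [(0, 1), (1, 2), (5, 3), (6, 4), (11, 5)]
print(tumbling_windows(events, 5))  # {0: 3, 1: 7, 2: 5}
```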
Scality RING protects data at scale. Its distributed architecture is fully redundant, with no bottlenecks, and scales out to dozens of petabytes of capacity in a single system. It integrates file and object storage for workloads focused on high-capacity unstructured data, and geo-distribution protects data even against data center outages.
- Scalable data lake that can be accessed by applications such as Hadoop and Spark
- Runs on-premises on commodity hardware and extends into the public cloud
- Certified with over 100 ISV solutions
- Enterprise-grade data durability, self-healing, security, encryption, and multi-tenancy
- Integrated hybrid-cloud data management to AWS, Azure and Google via XDM
- Built-in cloud archiving, bursting, and disaster recovery solutions
- POSIX compatible file system, with standard NFS v4/v3 and SMB 3.0 file interfaces
- Policy-based data replication and erasure-coding for up to eleven 9s data durability
- Multi-cloud namespaces, native Azure object storage support, and bidirectional compatibility with S3
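The erasure-coding bullet can be illustrated with the simplest possible code, a single XOR parity fragment: lose any one data fragment and it can be rebuilt from the survivors. Production systems like RING use far more sophisticated codes to reach eleven-nines durability; this only shows the underlying idea, with invented data.

```python
def xor_parity(fragments):
    """Byte-wise XOR of equal-length fragments.

    A toy illustration of erasure coding, not RING's implementation.
    """
    parity = bytearray(len(fragments[0]))
    for frag in fragments:
        for i, byte in enumerate(frag):
            parity[i] ^= byte
    return bytes(parity)

data = [b"bigd", b"ata!", b"rock"]   # three equal-length fragments
parity = xor_parity(data)

# Lose fragment 1; rebuild it from the survivors plus the parity
recovered = xor_parity([data[0], data[2], parity])
print(recovered == data[1])  # True
```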
The SAP Business Technology Platform (BTP) is an integrated offering comprising four technology portfolios: database and data management, application development and integration, analytics, and intelligent technologies. The SAP Database and Data Management portfolio enables control of the data landscape with an end-to-end view of all data through a single gateway. SAP databases securely provide transactional and analytical processing across on-premises, hybrid, and multi-cloud environments. SAP is used by a huge number of organizations around the world to manage data, ranging from global businesses to data-centric SMEs.
- In-memory analytics and processing in near-real time
- Huge collection of database tools, such as SAP HANA, SAP HANA Cloud, SAP IQ, and more
- Data management tools, including SAP Information Steward, SAP PowerDesigner, and more
- Cloud database management
- Governance of data including compliance and privacy
- Consistently highly ranked in Gartner Magic Quadrants (MQ) and Forrester Wave analyses
Tintri offers an intelligent storage platform featuring AI-driven autonomous operations, app-level visibility, and analytics that the company says can drive down storage management costs by up to 95%. Set up groups, then apply data protection and service-level policies that take effect with no manual intervention required.
- Cloud-based SaaS solution
- Tintri Global Center (TGC) automatically controls apps
- Set policies for service levels, cloning, snapshots, and replication
- Federates up to 64 VMstore all-flash systems
- Crunches one million stats about apps every ten minutes
- Troubleshoot latency in seconds across host, network, and storage
- ML algorithms model every storage and compute need for up to 18 months into the future
- Historical metadata is used to predict capacity, performance, working set, and compute needs
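The capacity-prediction idea in the last two bullets can be sketched with a trivial trend extrapolation. The usage history below is invented, and Tintri's actual models are proprietary ML, not this:

```python
def forecast(history, months_ahead):
    """Project future capacity from average monthly growth.

    A toy stand-in for ML-driven capacity prediction; real models
    weigh performance, working set, and compute needs as well.
    """
    growth = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + growth * months_ahead

# Hypothetical used-capacity readings in TB, one per month
usage_tb = [10.0, 12.0, 14.0, 16.0]
print(forecast(usage_tb, 18))  # 52.0 TB projected 18 months out
```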