Overview
Apache Spark: Lightning-Fast Distributed Computing with a Learning Curve
Apache Spark is the next generation of big data computing.
Product Demos
- Spark Project | Spark Tutorial | Online Spark Training | Intellipaat
- Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn
- Apache Spark Full Course | Apache Spark Tutorial For Beginners | Learn Spark In 7 Hours | Simplilearn
- Apache Spark Architecture | Spark Cluster Architecture Explained | Spark Training | Edureka
- Introduction to Databricks
- Apache Spark Tutorial | Spark Tutorial for Beginners | Spark Big Data | Intellipaat
Product Details
What is Apache Spark?
Apache Spark Technical Details
| Attribute | Value |
|---|---|
| Operating Systems | Unspecified |
| Mobile Application | No |
Reviews and Ratings (158)
Community Insights
- Business Problems Solved
- Pros
- Cons
- Recommendations
Apache Spark is an incredibly versatile tool that has been widely adopted across various departments for processing very large datasets and generating summary statistics. Users have found it particularly useful for creating simple graphics when working with big data, making it a valuable asset for analytics departments. It is also used extensively in the banking industry to calculate risk-weighted assets on a daily and monthly basis for different positions. The integration of Apache Spark with Scala and Apache Spark clusters enables users to load and process large volumes of data, implementing complex formulas and algorithms. Additionally, Apache Spark is often utilized alongside Kafka and Spark Streams to extract data from Kafka queues into HDFS environments, allowing for streamlined data analysis and processing.
One of the key strengths of Apache Spark lies in its ability to handle large volumes of retail and eCommerce data, providing cost and performance benefits over traditional RDBMS solutions. This makes it a preferred choice for companies in these industries. Furthermore, Apache Spark plays a crucial role in supporting data-driven decision-making by digital data teams. Its capabilities allow these teams to build data products, source data from different systems, process and transform it, and store it in data lakes.
Apache Spark is highly regarded for its ability to perform data cleansing and transformation before inserting it into the final target layer in data warehouses. This makes it a vital tool for ensuring the accuracy and reliability of data. Its faster data processing capabilities compared to Hadoop MapReduce have made Apache Spark a go-to choice for tasks such as machine learning, analytics, batch processing, data ingestion, and report development. Moreover, educational institutions rely on Apache Spark to optimize scheduling by assigning classrooms based on student course enrollment and professor schedules.
Overall, Apache Spark proves itself as an indispensable product that meets the needs of various industries by offering efficient distributed data processing, advanced analytics capabilities, and seamless integration with other technologies. Its versatility allows it to support a wide range of use cases, making it an essential tool for anyone working with big data.
Great Computing Engine: Apache Spark is praised by many users for its capabilities in handling complex transformative logic and sophisticated data processing tasks. Several reviewers have mentioned that it is a great computing engine, indicating its effectiveness in solving intricate problems.
Valuable Insights and Analysis: Many reviewers find Apache Spark to be useful for understanding data and performing data analytical work. They appreciate the valuable insights and analysis capabilities provided by the software, suggesting that it helps them gain deeper understanding of their data.
Extensive Set of Libraries and APIs: The extensive set of libraries and APIs offered by Apache Spark has been highly appreciated by users. It provides a wide range of tools and functionalities to solve various day-to-day problems, making it a versatile choice for different data processing needs.
Challenging to Understand and Use: Some users have found Apache Spark to be challenging to understand and use for modeling big data. They struggle with the complexity of the software, leading to a high learning curve.
Lack of User-Friendliness: The software is considered not user-friendly, with a confusing user interface and graphics that are not of high quality. This has resulted in frustration among some users who find it difficult to navigate and work with.
Time-Consuming Processing: Apache Spark can be time-consuming when processing large data sets across multiple nodes. This has been reported by several users who have experienced delays in their data processing tasks, affecting overall efficiency.
When using Spark for big data tasks, users commonly recommend familiarizing yourself with the documentation and gaining experience. They emphasize investing time in reading and understanding the documentation to overcome any initial challenges. As users gain experience, they find working with Spark becomes easier and more efficient.
Users also suggest utilizing Spark specifically for true big data problems, where its capabilities and performance shine. They highlight that Spark is well-suited for tackling large-scale data processing tasks.
Additionally, users find value in leveraging the R and Python APIs in Spark. These APIs allow them to work with Spark using familiar programming languages such as R and Python, making it easier to analyze and process data.
Overall, users advise diving into the documentation, utilizing Spark for big data challenges, and leveraging the R and Python APIs to enhance their experience with Spark.
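For readers starting with the Python API that reviewers recommend, the core pattern behind Spark's classic word-count example can be sketched in plain, single-process Python. The function names below are invented for illustration; they only mimic what Spark's `map` and `reduceByKey` operations do in parallel across a cluster.

```python
from collections import defaultdict
from functools import reduce

# Toy, single-process sketch of the map -> shuffle -> reduce pattern that
# Spark's RDD API runs in parallel across a cluster. Names are illustrative.

def map_phase(lines):
    # map: emit (word, 1) pairs, akin to flatMap(split) followed by map(w -> (w, 1))
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # shuffle: group values by key, as Spark does between stages
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: combine each key's values, akin to reduceByKey(lambda a, b: a + b)
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}

lines = ["big data spark", "spark streaming", "big spark"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 1, 'spark': 3, 'streaming': 1}
```

On a real cluster the map and reduce steps run on many nodes at once, and the shuffle moves data between them; that distribution is exactly what Spark handles for you.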
Attribute Ratings
Reviews (1-23 of 23)
- Fault-tolerant systems: in most cases, no node fails; if one does, processing still continues.
- Scalable to any extent.
- Has built-in machine learning library called - MLlib
- Very flexible - data from various data sources can be used. Usage with HDFS is very easy
- It is not fully backward compatible.
- It is memory-consuming for heavy and large workloads and datasets
- Support for advanced analytics is limited - MLlib offers only minimal analytics.
- Deployment is a complex task for beginners.
Less appropriate: We worked on a recommender system whose music dataset was over 300 GB. We ran into memory issues and occasionally got out-of-memory errors. The MLlib library also lacks support for advanced analytics and deep-learning frameworks, and understanding Apache Spark's internals is very hard for beginners.
Lightning Fast In-Memory Cluster Computing Framework
- Realtime data processing
- Interactive Analysis of data
- Trigger Event Detection
- Machine Learning
- GraphX Lib
- True Realtime Streaming
Without the Apache Spark cluster, it would have been very hard for us to implement such a big system handling a large volume of data calculations daily. After the system was deployed to production, we have been able to provide capital risk control reports to regulation/compliance controllers in different regions of the global financial world.
- DataFrame as a distributed collection of data: easy for developers to implement algorithms and formulas.
- Calculation in-memory.
- Cluster to distribute large data of calculation.
- It would be great if Apache Spark provided a native catalog to manage the file metadata of saved Parquet output.
Apache Spark in Telco
- Machine learning on big data
- Stream processing
- Lakehouse with Delta
- Indexing
- MLlib
- Streaming
good solution for long and narrow data
- quick
- utilizes CPU cores well
- trendy
- lack of support
- memory hungry
- slow on wide data
- We are using Apache Spark in Digital - Data teams to build data products and help business teams to take data-driven decisions.
- We use Apache Spark to source data from different source systems, process it, and store it in the data lake.
- Once the data is in the data lake, we use Spark for data cleansing and data transformation as per business requirements.
- Once the data is transformed, we insert it into the final target layer in the data warehouse.
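The flow this reviewer describes (source, cleanse, transform, load into the warehouse) can be sketched as composable stages. This is a hypothetical single-machine sketch in plain Python; in a real Spark job each stage would be a chained DataFrame transformation, and every function and field name here is invented.

```python
# Hypothetical single-machine sketch of the source -> cleanse -> transform -> load
# pipeline; in Spark each stage would be a DataFrame transformation.

def source(rows):
    # Stand-in for reading raw records from source systems into the data lake.
    return list(rows)

def cleanse(rows):
    # Drop records with a missing key and strip whitespace, per business rules.
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
        if row.get("id") is not None
    ]

def transform(rows):
    # Derive a business field (here: revenue = qty * price).
    return [dict(row, revenue=row["qty"] * row["price"]) for row in rows]

def load(rows, warehouse):
    # Insert into the final target layer in the warehouse.
    warehouse.extend(rows)
    return warehouse

raw = [
    {"id": 1, "name": " widget ", "qty": 2, "price": 5.0},
    {"id": None, "name": "bad row", "qty": 1, "price": 1.0},
]
warehouse = load(transform(cleanse(source(raw))), [])
print(warehouse[0]["revenue"])  # 10.0
```

Keeping each stage a pure function over records is what lets Spark distribute the same pipeline across a cluster without changes to the business logic.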
- Spark is very fast compared to other frameworks because it works in cluster mode and uses distributed processing and computation internally.
- Robust and fault tolerant
- Open source
- Can source data from multiple data sources
- No Dataset API support in the Python version of Spark
- The Apache Spark job run UI could show more meaningful information
- Spark errors could provide more meaningful information when a job fails
1. real-time processing of streaming data
2. processing unstructured data, semi-structured data, and structured data from multiple sources
3. avoid vendor lock-in and cloud platform lock-in while developing products
Epic Computation Engine Framework
- Great computing engine for solving complex transformative logic
- Useful for understanding data and doing data analytical work
- Gives us a great set of libraries and api to solve day-to-day problems
- High learning curve
- Complexity
- Needs more documentation
- Needs more developer support
- Needs more educational videos
A powerhouse processing engine.
- Speed: Apache Spark has great performance for both streaming and batch data
- Easy to use: the object-oriented operators make it easy and intuitive.
- Multiple language support
- Fault tolerance
- Cluster management
- Supports DF, DS, and RDDs
- Hard to learn; the documentation could be more in-depth.
- Due to its in-memory processing, it can consume a large amount of memory.
- Poor data visualization, too basic.
Apache Spark -- The best big data solution
The main problem we identified in our existing approach was that it took a long time to process the data, and the statistical analysis of the data was not up to the mark. We wanted a sophisticated analytical solution that was easy and fast to use. Using Apache Spark, processing became 5 times faster than before, giving rise to pretty good analytics. With Spark, data abstraction across a cluster of machines was achieved using RDDs.
- DataFrames, DataSets, and RDDs.
- Spark has in-built Machine Learning library which scales and integrates with existing tools.
- The data processing done by Spark comes at the price of memory pressure, as its in-memory processing can lead to large memory consumption.
- Caching is not automatic in Spark; we need to set up the caching mechanism manually.
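On the manual-caching point: Spark re-runs a lazy lineage on each action unless the dataset is explicitly cached (for example via `cache()` or `persist()`). The recompute-versus-cache behavior can be illustrated with a toy pure-Python analog; the class and method names below are invented and this is not Spark's implementation.

```python
# Toy analog of Spark's explicit caching: a lazy dataset recomputes its
# lineage on every action unless cached first. All names are invented.

class LazyDataset:
    def __init__(self, compute):
        self._compute = compute   # the "lineage": how to produce the data
        self._cached = None
        self.compute_count = 0    # track how often the lineage actually runs

    def cache(self):
        # Materialize once up front (in Spark, caching is marked lazily and
        # the data is materialized by the next action).
        self._cached = self._compute()
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached   # served from cache, lineage not re-run
        self.compute_count += 1   # no cache: recompute the whole lineage
        return self._compute()

ds = LazyDataset(lambda: [x * x for x in range(5)])
ds.collect()
ds.collect()
print(ds.compute_count)  # 2: the lineage ran once per action

ds.cache()
ds.collect()
print(ds.compute_count)  # still 2: this action was served from the cache
```

This is why reviewers mention setting up caching manually: reused intermediate results must be marked for caching, or each downstream action pays the full recomputation cost.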
2. When you want big data to be processed at a very fast pace.
3. For large datasets, Spark is a viable solution.
4. When you need precise fault tolerance, go for Spark.
Spark is not suitable:
1. If you want your data to be processed in real-time, then Spark is not a good solution.
2. When you need automatic optimization, then Spark fails at that point.
Great open source tool for data processing
- Cluster management for ETL.
- Data processing engine for our data lake.
- You still need Hive or another storage layer such as HDFS to store information.
- Security is behind compared to MapReduce.
- Very good tool to process big datasets.
- Inbuilt fault tolerance.
- Supports multiple languages.
- Supports advanced analytics.
- A large number of libraries available -- GraphX, Spark SQL, Spark Streaming, etc.
- Very slow with smaller amounts of data.
- Expensive, as it stores data in memory.
Apache Spark Review
- Customizable, it integrates with Jupyter notebooks which was really helpful for our team.
- Easy to use and implement.
- It allows us to quickly build microservices.
- Release cycles can be faster.
- Sometimes it kicked some of the users out due to inactivity.
- You are working with big data, preprocessing data before machine learning
- Building simple microservices and creating PoC. It makes it easier to create REST and simple web APIs.
- If you need great customer service, Apache Spark would be a great choice since they provide it 24/7.
Apache Spark - defacto for big data processing/analytics
- In-memory data engine, and hence faster processing
- Lays well on top of the Hadoop file system for big data analytics
- Very good tool for streaming data
- Could do a better job with analytics dashboards that provide insights on a data stream, so users would not have to rely on separate data visualization tools alongside Spark
- There is also room for improvement in the area of data discovery
Very useful application for Big Data processing and excellent for large volume production workflows
- It falls back to a conventional disk-based process when data sets are too large to fit into memory, which is very useful because data of any size can still be processed.
- It has great speed and the ability to join multiple types of databases and run different types of analysis applications. This functionality is super useful as it reduces working time.
- More information and training material should come with the application, especially for debugging, since the process is difficult to understand.
- More user tutorials would reduce the learning curve.
- There should be more clustering algorithms.
Apache Spark: One stop shop for distributed data processing, machine learning and graph processing
- Rich APIs for data transformation, making it very easy to transform and prepare data in a distributed environment without worrying about memory issues
- Faster execution times compared to Hadoop MapReduce and Pig Latin
- Easy SQL interface to the same data set for people who are comfortable exploring data in a declarative manner
- Interoperability between SQL and the Scala/Python style of munging data
- Documentation could be better, as I usually end up going to other sites/blogs to understand the concepts
- More algorithms should be ported to MLlib; only a few are available, at least in the clustering segment
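The SQL/programmatic interoperability this reviewer praises can be illustrated with a small stand-in: the same question answered declaratively in SQL and imperatively in Python. Here the stdlib `sqlite3` module plays the role of Spark SQL over a registered temp view, and the table and data are invented.

```python
import sqlite3

# Stand-in for Spark's SQL/DataFrame interoperability: answer the same
# question declaratively (SQL) and imperatively (Python munging).

rows = [("alice", 34), ("bob", 19), ("carol", 52)]

# Declarative: SQL over the data set (sqlite3 standing in for spark.sql()).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.executemany("INSERT INTO people VALUES (?, ?)", rows)
sql_result = [r[0] for r in
              con.execute("SELECT name FROM people WHERE age > 30 ORDER BY name")]

# Imperative: the same query as Python-style munging of the raw rows.
py_result = sorted(name for name, age in rows if age > 30)

print(sql_result == py_result, sql_result)  # True ['alice', 'carol']
```

In Spark, both styles run on the same distributed data set, so analysts comfortable with SQL and engineers writing Scala or Python transformations can share one pipeline.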
My Apache Spark Review
- Easy ELT Process
- Easy clustering on cloud
- Amazing speed
- Batch & real time processing
- Debugging is difficult as it is new for most people
- There are fewer learning resources
Apache Spark, the Be-All End-All.
- Machine Learning.
- Data Analysis
- WorkFlow process (faster than MapReduce).
- SQL connector to multiple data sources
- Memory management. Very weak on that.
- PySpark is not as robust as Scala with Spark.
- Spark master HA is needed; it is not as highly available as it should be.
- Data locality should not be a necessity, though it does help performance; we would prefer not to depend on locality.
Spark Streaming is scalable for close-to-real-time data workflow processing. What it's not good for is processing smaller subsets of data.
Use Apache Spark to Speed Up Cluster Computing
- We use it to make our batch processing faster; Spark is faster at batch processing than MapReduce thanks to its in-memory computing.
- Spark will run along with other tools in the Hadoop ecosystem including Hive and Pig
- Spark supports both batch and real-time processing
- Apache Spark has Machine Learning Algorithms support
- Consumes more memory
- Difficult to address issues around memory utilization
- Expensive - In-memory processing is expensive when we look for a cost-efficient processing of big data
1. Data can be integrated from several sources including click stream, logs, transactional systems
2. Real-time ingestion through Kafka, Kinesis, and other streaming platforms
Apache Spark Should Spark Your Interest
- Ease of use, the Spark API allows for minimal boilerplate and can be written in a variety of languages including Python, Scala, and Java.
- Performance, for most applications we have found that jobs are more performant running via Spark than other distributed processing technologies like Map-Reduce, Hive, and Pig.
- Flexibility, the framework comes with support for streaming, batch processing, SQL queries, machine learning, etc. It can be used in a variety of applications without needing to integrate a lot of other distributed processing technologies.
- Resource heavy, jobs, in general, can be very memory intensive and you will want the nodes in your cluster to reflect that.
- Debugging, it has gotten better with every release but sometimes it can be difficult to debug an error due to ambiguous or misleading exceptions and stack traces.
Apache Spark - Simple Syntax, Huge Data Handling, Best Optimization, Parallel processing
It is used by one department, but the data consists of information about students and professors across the whole organization.
It addresses the problem of assigning classrooms to specific times in the week based on student course enrollment and professors' teaching schedules.
This is just one aspect of the application; there are various other data transformation scenarios for different departments across the organization.
- Spark uses Scala, which is a functional and easy-to-use programming language. The syntax is simple and human-readable.
- It can run transformations on huge data across a cluster in parallel, and it automatically optimizes the process to produce output efficiently in less time.
- It also provides a machine learning API for data science applications, and Spark SQL for fast queries in data analysis.
- I also use the Zeppelin notebook tool for fast queries; it is very helpful for BI analysts who want to visualize query outputs.
- Data visualization.
- Waiting for web development support so that small apps can be built with Spark as backbone middleware and HDFS as the data retrieval file system.
- The available transformations and actions are limited, so the API must be extended to support more features.
For best optimization
For parallel processing
For machine learning on huge data, because presently available machine learning software, like RapidMiner, is limited by data size, whereas Spark is not
Apache Spark is great for high volume production workflows
- Great APIs and tools.
- Scale.
- Speed for iterative algorithms.
- No true streaming.
- Lack of strongly typed yet convenient APIs.
Sparkling Spark
- It makes the ETL process very simple when compared to SQL SERVER and MYSQL ETL tools.
- It's very fast and has many machine learning algorithms which can be used for data science problems.
- It is easily implemented on a cloud cluster.
- The initialization and SparkContext setup procedures.
- Running applications on a cluster is not well documented anywhere; some applications are hard to debug.
- Debugging and Testing are sometimes time-consuming.
A useful replacement for MapReduce for Big Data processing
- Scale from local machine to full cluster. You can run a standalone, single cluster simply by starting up a Spark Shell or submitting an application to test an algorithm, then it quickly can be transferred and configured to run in a distributed environment.
- Provides multiple APIs. Most people I know use Python and/or Java as their main programming language. Data scientists who are familiar with NumPy and SciPy can quickly become comfortable with Spark, while Java developers would be best served using Java 8 and the new features that it provides. Scala, on the other hand, is a mix between the Java and Python styles of writing Spark code, in my opinion.
- Plentiful learning resources. The Learning Spark book is a good introduction to the mechanics of Spark although written for Spark 1.3, and the current version is 2.0. The GitHub repository for the book contains all the code examples that are discussed, plus the Spark website is also filled with useful information that is simple to navigate.
- For data that isn't truly that large, Spark may be overkill when the problem could likely be solved on a computer with reasonable hardware resources. There don't seem to be many examples of how a Spark task would otherwise be implemented with a different library, for instance scikit-learn and NumPy rather than Spark MLlib.
As described in the Cons - Spark may not be needed unless there is truly a large amount of data to operate on. Other libraries may be better suited for the same task.