Overview
Apache Spark: Lightning-Fast Distributed Computing with a Learning Curve
Apache Spark is the next generation of big data computing.
Product Demos
- Spark Project | Spark Tutorial | Online Spark Training | Intellipaat
- Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn
- Apache Spark Full Course | Apache Spark Tutorial For Beginners | Learn Spark In 7 Hours | Simplilearn
- Apache Spark Architecture | Spark Cluster Architecture Explained | Spark Training | Edureka
- Introduction to Databricks
- Apache Spark Tutorial | Spark Tutorial for Beginners | Spark Big Data | Intellipaat
Product Details
What is Apache Spark?
Apache Spark Technical Details
| Attribute | Value |
|---|---|
| Operating Systems | Unspecified |
| Mobile Application | No |
Reviews and Ratings (158)
Community Insights
- Business Problems Solved
- Pros
- Cons
- Recommendations
Apache Spark is an incredibly versatile tool that has been widely adopted across various departments for processing very large datasets and generating summary statistics. Users have found it particularly useful for creating simple graphics when working with big data, making it a valuable asset for analytics departments. It is also used extensively in the banking industry to calculate risk-weighted assets on a daily and monthly basis for different positions. The integration of Apache Spark with Scala and Apache Spark clusters enables users to load and process large volumes of data, implementing complex formulas and algorithms. Additionally, Apache Spark is often utilized alongside Kafka and Spark Streams to extract data from Kafka queues into HDFS environments, allowing for streamlined data analysis and processing.
One of the key strengths of Apache Spark lies in its ability to handle large volumes of retail and eCommerce data, providing cost and performance benefits over traditional RDBMS solutions. This makes it a preferred choice for companies in these industries. Furthermore, Apache Spark plays a crucial role in supporting data-driven decision-making by digital data teams. Its capabilities allow these teams to build data products, source data from different systems, process and transform it, and store it in data lakes.
Apache Spark is highly regarded for its ability to perform data cleansing and transformation before inserting it into the final target layer in data warehouses. This makes it a vital tool for ensuring the accuracy and reliability of data. Its faster data processing capabilities compared to Hadoop MapReduce have made Apache Spark a go-to choice for tasks such as machine learning, analytics, batch processing, data ingestion, and report development. Moreover, educational institutions rely on Apache Spark to optimize scheduling by assigning classrooms based on student course enrollment and professor schedules.
Overall, Apache Spark proves itself as an indispensable product that meets the needs of various industries by offering efficient distributed data processing, advanced analytics capabilities, and seamless integration with other technologies. Its versatility allows it to support a wide range of use cases, making it an essential tool for anyone working with big data.
Great Computing Engine: Apache Spark is praised by many users for its capabilities in handling complex transformative logic and sophisticated data processing tasks. Several reviewers have mentioned that it is a great computing engine, indicating its effectiveness in solving intricate problems.
Valuable Insights and Analysis: Many reviewers find Apache Spark to be useful for understanding data and performing data analytical work. They appreciate the valuable insights and analysis capabilities provided by the software, suggesting that it helps them gain deeper understanding of their data.
Extensive Set of Libraries and APIs: The extensive set of libraries and APIs offered by Apache Spark has been highly appreciated by users. It provides a wide range of tools and functionalities to solve various day-to-day problems, making it a versatile choice for different data processing needs.
Challenging to Understand and Use: Some users have found Apache Spark to be challenging to understand and use for modeling big data. They struggle with the complexity of the software, leading to a high learning curve.
Lack of User-Friendliness: The software is considered not user-friendly, with a confusing user interface and graphics that are not of high quality. This has resulted in frustration among some users who find it difficult to navigate and work with.
Time-Consuming Processing: Apache Spark can be time-consuming when processing large data sets across multiple nodes. This has been reported by several users who have experienced delays in their data processing tasks, affecting overall efficiency.
When using Spark for big data tasks, users commonly recommend familiarizing yourself with the documentation and gaining experience. They emphasize investing time in reading and understanding the documentation to overcome any initial challenges. As users gain experience, they find working with Spark becomes easier and more efficient.
Users also suggest utilizing Spark specifically for true big data problems, where its capabilities and performance shine. They highlight that Spark is well-suited for tackling large-scale data processing tasks.
Additionally, users find value in leveraging the R and Python APIs in Spark. These APIs allow them to work with Spark using familiar programming languages such as R and Python, making it easier to analyze and process data.
Overall, users advise diving into the documentation, utilizing Spark for big data challenges, and leveraging the R and Python APIs to enhance their experience with Spark.
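For readers starting with the Python API that reviewers recommend, the core pattern behind Spark's classic word-count example can be sketched in plain, single-process Python. The function names below are invented for illustration; they only mimic what Spark's `map` and `reduceByKey` operations do in parallel across a cluster.

```python
from collections import defaultdict
from functools import reduce

# Toy, single-process sketch of the map -> shuffle -> reduce pattern that
# Spark's RDD API runs in parallel across a cluster. Names are illustrative.

def map_phase(lines):
    # map: emit (word, 1) pairs, akin to flatMap(split) followed by map(w -> (w, 1))
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # shuffle: group values by key, as Spark does between stages
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: combine each key's values, akin to reduceByKey(lambda a, b: a + b)
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}

lines = ["big data spark", "spark streaming", "big spark"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 1, 'spark': 3, 'streaming': 1}
```

On a real cluster the map and reduce steps run on many nodes at once, and the shuffle moves data between them; that distribution is exactly what Spark handles for you.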
Attribute Ratings
Reviews (1-23 of 23)
- Fault-tolerant systems: in most cases, no node fails; if one does, processing still continues.
- Scalable to any extent.
- Has built-in machine learning library called - MLlib
- Very flexible - data from various data sources can be used. Usage with HDFS is very easy
- It is not fully backward compatible.
- It is memory-consuming for heavy and large workloads and datasets
- Support for advanced analytics is limited - MLlib offers only minimal analytics.
- Deployment is a complex task for beginners.
Less appropriate: We worked on a recommender system whose music dataset was over 300 GB. We ran into memory issues and occasionally got out-of-memory errors. The MLlib library also lacks support for advanced analytics and deep-learning frameworks, and understanding Apache Spark's internals is very hard for beginners.
Lightning Fast In-Memory Cluster Computing Framework
- Realtime data processing
- Interactive Analysis of data
- Trigger Event Detection
- Machine Learning
- GraphX Lib
- True Realtime Streaming
Without the Apache Spark cluster, it would have been very hard for us to implement such a big system handling a large volume of data calculations daily. After the system was deployed to production, we have been able to provide capital risk control reports to regulation/compliance controllers in different regions of the global financial world.
- DataFrame as a distributed collection of data: easy for developers to implement algorithms and formulas.
- Calculation in-memory.
- Cluster to distribute large data of calculation.
- It would be great if Apache Spark provided a native catalog to manage the file metadata of saved Parquet output.
Apache Spark in Telco
- Machine learning on big data
- Stream processing
- Lakehouse with Delta
- Indexing
- MLlib
- Streaming
good solution for long and narrow data
- quick
- utilizes CPU cores well
- trendy
- lack of support
- memory hungry
- slow on wide data
- We are using Apache Spark in Digital - Data teams to build data products and help business teams to take data-driven decisions.
- We use Apache Spark to source data from different source systems, process it, and store it in the data lake.
- Once the data is in the data lake, we use Spark for data cleansing and data transformation as per business requirements.
- Once the data is transformed, we insert it into the final target layer in the data warehouse.
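The flow this reviewer describes (source, cleanse, transform, load into the warehouse) can be sketched as composable stages. This is a hypothetical single-machine sketch in plain Python; in a real Spark job each stage would be a chained DataFrame transformation, and every function and field name here is invented.

```python
# Hypothetical single-machine sketch of the source -> cleanse -> transform -> load
# pipeline; in Spark each stage would be a DataFrame transformation.

def source(rows):
    # Stand-in for reading raw records from source systems into the data lake.
    return list(rows)

def cleanse(rows):
    # Drop records with a missing key and strip whitespace, per business rules.
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
        if row.get("id") is not None
    ]

def transform(rows):
    # Derive a business field (here: revenue = qty * price).
    return [dict(row, revenue=row["qty"] * row["price"]) for row in rows]

def load(rows, warehouse):
    # Insert into the final target layer in the warehouse.
    warehouse.extend(rows)
    return warehouse

raw = [
    {"id": 1, "name": " widget ", "qty": 2, "price": 5.0},
    {"id": None, "name": "bad row", "qty": 1, "price": 1.0},
]
warehouse = load(transform(cleanse(source(raw))), [])
print(warehouse[0]["revenue"])  # 10.0
```

Keeping each stage a pure function over records is what lets Spark distribute the same pipeline across a cluster without changes to the business logic.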
- Spark is very fast compared to other frameworks because it works in cluster mode and uses distributed processing and computation internally.
- Robust and fault tolerant
- Open source
- Can source data from multiple data sources
- No Dataset API support in the Python version of Spark
- The Apache Spark job run UI could show more meaningful information
- Spark errors could provide more meaningful information when a job fails
1. real-time processing of streaming data
2. processing unstructured data, semi-structured data, and structured data from multiple sources
3. avoid vendor lock-in and cloud platform lock-in while developing products
Epic Computation Engine Framework
- Great computing engine for solving complex transformative logic
- Useful for understanding data and doing data analytical work
- Gives us a great set of libraries and api to solve day-to-day problems
- High learning curve
- Complexity
- Needs more documentation
- Needs more developer support
- Needs more educational videos
A powerhouse processing engine.
- Speed: Apache Spark has great performance for both streaming and batch data
- Easy to use: the object-oriented operators make it easy and intuitive.
- Multiple language support
- Fault tolerance
- Cluster management
- Supports DF, DS, and RDDs
- Hard to learn; the documentation could be more in-depth.
- Due to its in-memory processing, it can consume a large amount of memory.
- Poor data visualization, too basic.
Apache Spark -- The best big data solution
The main problem we identified in our existing approach was that it took a long time to process the data, and the statistical analysis of the data was not up to the mark. We wanted a sophisticated analytical solution that was easy and fast to use. Using Apache Spark, processing became 5 times faster than before, giving rise to pretty good analytics. With Spark, data abstraction across a cluster of machines was achieved using RDDs.
- DataFrames, DataSets, and RDDs.
- Spark has in-built Machine Learning library which scales and integrates with existing tools.
- The data processing done by Spark comes at the price of memory pressure, as its in-memory processing can lead to large memory consumption.
- Caching is not automatic in Spark; we need to set up the caching mechanism manually.
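On the manual-caching point: Spark re-runs a lazy lineage on each action unless the dataset is explicitly cached (for example via `cache()` or `persist()`). The recompute-versus-cache behavior can be illustrated with a toy pure-Python analog; the class and method names below are invented and this is not Spark's implementation.

```python
# Toy analog of Spark's explicit caching: a lazy dataset recomputes its
# lineage on every action unless cached first. All names are invented.

class LazyDataset:
    def __init__(self, compute):
        self._compute = compute   # the "lineage": how to produce the data
        self._cached = None
        self.compute_count = 0    # track how often the lineage actually runs

    def cache(self):
        # Materialize once up front (in Spark, caching is marked lazily and
        # the data is materialized by the next action).
        self._cached = self._compute()
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached   # served from cache, lineage not re-run
        self.compute_count += 1   # no cache: recompute the whole lineage
        return self._compute()

ds = LazyDataset(lambda: [x * x for x in range(5)])
ds.collect()
ds.collect()
print(ds.compute_count)  # 2: the lineage ran once per action

ds.cache()
ds.collect()
print(ds.compute_count)  # still 2: this action was served from the cache
```

This is why reviewers mention setting up caching manually: reused intermediate results must be marked for caching, or each downstream action pays the full recomputation cost.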
2. When you want big data to be processed at a very fast pace.
3. For large datasets, Spark is a viable solution.
4. When you need precise fault tolerance, go for Spark.
Spark is not suitable:
1. If you want your data to be processed in real-time, then Spark is not a good solution.
2. When you need automatic optimization, then Spark fails at that point.
Great open source tool for data processing
- Cluster management for ETL.
- Data processing engine for our data lake.
- You still need Hive or another storage layer such as HDFS to store information.
- Security is behind compared to MapReduce.
- Very good tool to process big datasets.
- Inbuilt fault tolerance.
- Supports multiple languages.
- Supports advanced analytics.
- A large number of libraries available -- GraphX, Spark SQL, Spark Streaming, etc.
- Very slow with smaller amounts of data.
- Expensive, as it stores data in memory.
Apache Spark Review
- Customizable, it integrates with Jupyter notebooks which was really helpful for our team.
- Easy to use and implement.
- It allows us to quickly build microservices.
- Release cycles can be faster.
- Sometimes it kicked some of the users out due to inactivity.
- You are working with big data, preprocessing data before machine learning
- Building simple microservices and creating PoC. It makes it easier to create REST and simple web APIs.
- If you need great customer service, Apache Spark would be a great choice since they provide it 24/7.
Apache Spark - defacto for big data processing/analytics
- In-memory data engine, and hence faster processing
- Lays well on top of the Hadoop file system for big data analytics
- Very good tool for streaming data
- Could do a better job with analytics dashboards that provide insights on a data stream, so users would not have to rely on separate data visualization tools alongside Spark
- There is also room for improvement in the area of data discovery
Very useful application for Big Data processing and excellent for large volume production workflows
- It falls back to a conventional disk-based process when data sets are too large to fit into memory, which is very useful because data of any size can still be processed.
- It has great speed and the ability to join multiple types of databases and run different types of analysis applications. This functionality is super useful as it reduces working time.
- More information and training material should come with the application, especially for debugging, since the process is difficult to understand.
- More user tutorials would reduce the learning curve.
- There should be more clustering algorithms.
Apache Spark: One stop shop for distributed data processing, machine learning and graph processing
- Rich APIs for data transformation, making it very easy to transform and prepare data in a distributed environment without worrying about memory issues
- Faster execution times compared to Hadoop MapReduce and Pig Latin
- Easy SQL interface to the same data set for people who are comfortable exploring data in a declarative manner
- Interoperability between SQL and the Scala/Python style of munging data
- Documentation could be better, as I usually end up going to other sites/blogs to understand the concepts
- More algorithms should be ported to MLlib; only a few are available, at least in the clustering segment
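The SQL/programmatic interoperability this reviewer praises can be illustrated with a small stand-in: the same question answered declaratively in SQL and imperatively in Python. Here the stdlib `sqlite3` module plays the role of Spark SQL over a registered temp view, and the table and data are invented.

```python
import sqlite3

# Stand-in for Spark's SQL/DataFrame interoperability: answer the same
# question declaratively (SQL) and imperatively (Python munging).

rows = [("alice", 34), ("bob", 19), ("carol", 52)]

# Declarative: SQL over the data set (sqlite3 standing in for spark.sql()).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.executemany("INSERT INTO people VALUES (?, ?)", rows)
sql_result = [r[0] for r in
              con.execute("SELECT name FROM people WHERE age > 30 ORDER BY name")]

# Imperative: the same query as Python-style munging of the raw rows.
py_result = sorted(name for name, age in rows if age > 30)

print(sql_result == py_result, sql_result)  # True ['alice', 'carol']
```

In Spark, both styles run on the same distributed data set, so analysts comfortable with SQL and engineers writing Scala or Python transformations can share one pipeline.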
My Apache Spark Review
- Easy ELT Process
- Easy clustering on cloud
- Amazing speed
- Batch & real time processing
- Debugging is difficult as it is new for most people
- There are fewer learning resources
Apache Spark, the Be-All End-All.
- Machine Learning.
- Data Analysis
- WorkFlow process (faster than MapReduce).
- SQL connector to multiple data sources
- Memory management. Very weak on that.
- PySpark is not as robust as Scala with Spark.
- Spark master HA is needed; it is not as highly available as it should be.
- Data locality should not be a necessity, though it does help performance; we would prefer not to depend on locality.
Spark Streaming is scalable for close-to-real-time data workflow processing. What it's not good for is processing smaller subsets of data.
Use Apache Spark to Speed Up Cluster Computing
- We use it to make our batch processing faster; Spark is faster at batch processing than MapReduce thanks to its in-memory computing.
- Spark will run along with other tools in the Hadoop ecosystem including Hive and Pig
- Spark supports both batch and real-time processing
- Apache Spark has Machine Learning Algorithms support
- Consumes more memory
- Difficult to address issues around memory utilization
- Expensive - In-memory processing is expensive when we look for a cost-efficient processing of big data
1. Data can be integrated from several sources including click stream, logs, transactional systems
2. Real-time ingestion through Kafka, Kinesis, and other streaming platforms
Apache Spark Should Spark Your Interest
- Ease of use, the Spark API allows for minimal boilerplate and can be written in a variety of languages including Python, Scala, and Java.
- Performance, for most applications we have found that jobs are more performant running via Spark than other distributed processing technologies like Map-Reduce, Hive, and Pig.
- Flexibility, the framework comes with support for streaming, batch processing, SQL queries, machine learning, etc. It can be used in a variety of applications without needing to integrate a lot of other distributed processing technologies.
- Resource heavy, jobs, in general, can be very memory intensive and you will want the nodes in your cluster to reflect that.
- Debugging, it has gotten better with every release but sometimes it can be difficult to debug an error due to ambiguous or misleading exceptions and stack traces.
Apache Spark - Simple Syntax, Huge Data Handling, Best Optimization, Parallel processing
It is used by one department, but the data consists of information about students and professors across the whole organization.
It addresses the problem of assigning classrooms to specific times in the week based on student course enrollment and professors' teaching schedules.
This is just one aspect of the application; there are various other data transformation scenarios for different departments across the organization.
- Spark uses Scala, which is a functional and easy-to-use programming language. The syntax is simple and human-readable.
- It can run transformations on huge data across a cluster in parallel, and it automatically optimizes the process to produce output efficiently in less time.
- It also provides a machine learning API for data science applications, and Spark SQL for fast queries in data analysis.
- I also use the Zeppelin notebook tool for fast queries; it is very helpful for BI analysts who want to visualize query outputs.
- Data visualization.
- Waiting for web development support so that small apps can be built with Spark as backbone middleware and HDFS as the data retrieval file system.
- The available transformations and actions are limited, so the API must be extended to support more features.
For best optimization
For parallel processing
For machine learning on huge data, because presently available machine learning software, like RapidMiner, is limited by data size, whereas Spark is not
Apache Spark is great for high volume production workflows
- Great APIs and tools.
- Scale.
- Speed for iterative algorithms.
- No true streaming.
- Lack of strongly typed yet convenient APIs.
Sparkling Spark
- It makes the ETL process very simple when compared to SQL SERVER and MYSQL ETL tools.
- It's very fast and has many machine learning algorithms which can be used for data science problems.
- It is easily implemented on a cloud cluster.
- The initialization and SparkContext setup procedures.
- Running applications on a cluster is not well documented anywhere; some applications are hard to debug.
- Debugging and Testing are sometimes time-consuming.
A useful replacement for MapReduce for Big Data processing
- Scale from local machine to full cluster. You can run a standalone, single cluster simply by starting up a Spark Shell or submitting an application to test an algorithm, then it quickly can be transferred and configured to run in a distributed environment.
- Provides multiple APIs. Most people I know use Python and/or Java as their main programming language. Data scientists who are familiar with NumPy and SciPy can quickly become comfortable with Spark, while Java developers would be best served using Java 8 and the new features that it provides. Scala, on the other hand, is a mix between the Java and Python styles of writing Spark code, in my opinion.
- Plentiful learning resources. The Learning Spark book is a good introduction to the mechanics of Spark although written for Spark 1.3, and the current version is 2.0. The GitHub repository for the book contains all the code examples that are discussed, plus the Spark website is also filled with useful information that is simple to navigate.
- For data that isn't truly that large, Spark may be overkill when the problem could likely be solved on a computer with reasonable hardware resources. There don't seem to be many examples of how a Spark task would otherwise be implemented with a different library, for instance scikit-learn and NumPy rather than Spark MLlib.
As described in the Cons - Spark may not be needed unless there is truly a large amount of data to operate on. Other libraries may be better suited for the same task.