Glossary

What is a transaction? In the context of databases and data storage systems, a transaction is any operation that is treated as a single unit of work, which either completes fully or does not complete at all, and leaves the storage system in a consistent state. The classic example o {. . .}
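The all-or-nothing behavior described in the transaction entry can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a reference implementation: the accounts table, the names, and the transfer helper are hypothetical and chosen only for the example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one unit of work (hypothetical helper)."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Abort: the debit above is undone by the rollback.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical in-memory database used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

A failed transfer leaves both balances exactly as they were, which is the "does not complete at all" half of the definition.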
Gradient descent is the most commonly used optimization method in machine learning and deep learning algorithms. It is used to train a machine learning model. Types of Gradient Descent {. . .}
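As a rough sketch of the idea behind the gradient descent entry, the loop below repeatedly steps against the gradient of a one-dimensional function. The function name, the learning rate, and the example objective f(x) = (x - 3)^2 are assumptions made for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Minimize a 1-D function given its gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Step in the direction opposite to the gradient.
        x = x - learning_rate * grad(x)
    return x

# Example objective (an assumption): f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With these settings the result converges toward 3, the minimizer of the example objective.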
Alternative data is information gathered from non-traditional sources that others are not using. Analysis of alternative data can provide insights beyond those that an industry’s regular data sources can provide. However, what exactly {. . .}
Anomaly Detection is the technique of identifying rare events or observations which can raise suspicions by being statistically different from the rest of the observations. Such “anomalous” behavior typically translates to some kind of a problem like credit card fraud, a failing machine, or a cy {. . .}
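One simple way to flag observations that are "statistically different from the rest" is a z-score test. The sketch below is only one of many possible techniques and is not necessarily what the full entry describes; the function name, threshold, and sample readings are assumptions.

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` sample standard deviations."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
outliers = zscore_anomalies(readings, threshold=2.0)
```

Note that a single extreme value also inflates the standard deviation, so on small samples the threshold usually needs to be lower than the textbook value of 3.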
What is Apache Hive? Apache Hive™ is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL on top of Apa {. . .}
Apache Kudu is a free and open source columnar storage system developed for Apache Hadoop. It is an engine intended for structured data that supports low-latency, millisecond-scale random access to individual rows together with great analytical access pattern {. . .}
Apache Kylin is a distributed open source online analytical processing (OLAP) engine for interactive analytics on Big Data. Apache Kylin is designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark. In addition, it easily integrates with BI tools via ODBC driver, JDBC {. . .}
What Is Apache Spark? Apache Spark is an open source analytics engine used for big data workloads. It can handle both batch and real-time analytics and data processing workloads. Apache Spark started in 2009 as a research project at the University of California, Berkeley. {. . .}
Apache Spark is an open source cluster computing framework for fast, large-scale, real-time data processing. Since its inception in 2009 at UC Berkeley’s AMPLab, Spark has seen major growth. It is currently rated as one of the largest open source communities in big data and it features over 200 contributo {. . .}
An artificial neural network (ANN) is a computing system patterned after the operation of neurons in the human brain. How Do Artificial Neural Networks Work? Artificial Neural Networks can best be viewed as weighted directed graphs that are commonly organized in layers. These {. . .}
What is Automation Bias? Automation bias is an over-reliance on automated aids and decision support systems. As the availability of automated decision aids is increasing, additions to critical decision-making contexts such as intensive care units or aircraft cockpits are beco {. . .}
Bayesian Neural Networks (BNNs) extend standard networks with posterior inference in order to control over-fitting. From a broader perspective, the Bayesian approach uses statistical methodology so that everything has a probability distribution attached to it, including model par {. . .}
The Difference Between Data and Big Data Analytics Prior to the invention of Hadoop, the technologies underpinning modern storage and compute systems were relatively basic, limiting co {. . .}
Bioinformatics is a field of study that uses computation to extract knowledge from large collections of biological data. {. . .}
At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasi quotes) in a novel way to build an extensible query optimizer. Catalyst is based on functional programming constructs in Scala and designed with t {. . .}
Complex event processing [CEP], also known as event, stream, or event stream processing, is the use of technology for querying data before storing it within a database or, in some cases, without it ever being stored. Complex event processing is an organizational tool that helps to aggregate a lot of {. . .}
A continuous application is an end-to-end application that reacts to data in real time. In particular, developers would like to use a single programming interface to support the facets of continuous applications that are currently handled in separate systems, such as query serving or interaction wit {. . .}
In deep learning, a convolutional neural network (CNN or ConvNet) is a class of deep neural network typically used to recognize patterns in images, but also used for spatial data analysis, computer vision, natural language processing, signal processing, and various other p {. . .}
A data analytics platform is an ecosystem of services and technologies for analyzing voluminous, complex, and dynamic data. It allows you to retrieve, combine, interact with, explore, and visualize data from the various sources a company might have. A comprehensive data analys {. . .}
A data lake is a central location that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data. Compared to a hierarchical data warehouse, which stores data in files or folders, a data lake uses a different approach; it uses a flat arc {. . .}
A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data. Data Lakeho {. . .}
A data warehouse is a system that pulls together data derived from operational systems and external data sources within an organization for reporting and analysis. A data warehouse is a central repository of information that provides users with current and historical decision support information {. . .}
Databricks Runtime is the set of software artifacts that run on the clusters of machines managed by Databricks. It includes Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics. The primary differentiations a {. . .}
What is a DataFrame? A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of s {. . .}
Datasets are a type-safe version of Spark’s structured API for Java and Scala. This API is not available in Python and R, because those are dynamically typed languages, but it is a powerful tool for writing large applications in Scala and Java. Recall that DataFrames are a distributed {. . .}
Deep Learning is a subset of machine learning that applies algorithms inspired by the structure and function of the human brain to large amounts of data, which is why deep learning models are often referred to as deep neural networks. It is part of a broader family of machine {. . .}
Dense tensors store values in a contiguous sequential block of memory where all values are represented. Tensors or multi-dimensional arrays are used in a diverse set of multi-dimensional data analysis applications. There are a number of software products that can perform tensor computations, s {. . .}
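To make "a contiguous sequential block of memory where all values are represented" concrete, the sketch below stores every element of a tensor in one flat array and maps a multi-dimensional index to an offset in row-major order. The DenseTensor class and its methods are hypothetical, written only for this illustration.

```python
from array import array
from math import prod

class DenseTensor:
    """Toy dense tensor: every element lives in one flat, contiguous buffer."""

    def __init__(self, shape, fill=0.0):
        self.shape = tuple(shape)
        # One slot per element, including zeros: nothing is left implicit.
        self.data = array("d", [fill]) * prod(self.shape)

    def _offset(self, index):
        """Convert a multi-dimensional index to a row-major flat offset."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        offset = 0
        for i, dim in zip(index, self.shape):
            if not 0 <= i < dim:
                raise IndexError("index out of range")
            offset = offset * dim + i
        return offset

    def __getitem__(self, index):
        return self.data[self._offset(index)]

    def __setitem__(self, index, value):
        self.data[self._offset(index)] = value

t = DenseTensor((2, 3, 4))
t[1, 2, 3] = 7.0  # last element of the flat buffer
```

Because every position is stored, memory use depends only on the shape, not on how many values are non-zero.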
DNA sequencing is the process of determining the exact sequence of nucleotides in DNA (deoxyribonucleic acid). Sequencing DNA determines the order in which the four chemical building blocks (adenine, guanine, cytosine, and thymine, also known as bases) occur within the DNA molecule. The first methods for sequ {. . .}
Elasticsearch is a NoSQL, distributed database that stores, retrieves, and manages document-oriented and semi-structured data. Furthermore, it is an open source, RESTful search engine built on top of Apache Lucene and released under the terms of the Apache License. It is Java-based, thus available f {. . .}
Genomics is an area within genetics that concerns the sequencing and analysis of an organism’s genome. Its main task is to determine the entire sequence of DNA or the composition of the atoms that make up the DNA and the chemical bonds between the DNA atoms. The field of genomics is interested {. . .}
What is Hadoop? Apache Hadoop is an open source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads t {. . .}
What Is a Hadoop Cluster? Apache Hadoop is an open source, Java-based, software framework and parallel data processing engine. It enables big data analytics processing tasks to be broken down into smaller tasks that can be performed in parallel by using an algorithm (like the MapRe {. . .}
What is HDFS? HDFS stands for Hadoop Distributed File System. The function of HDFS is to operate as a distributed file system designed to run on commodity hardware. HDFS is fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to applica {. . .}
What is the Hadoop Ecosystem? Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software library; it includes open source projects as well as a complete range of complementary tools. Some of the most well-known tools of the Hadoop ecosystem incl {. . .}
In computing, a hash table [hash map] is a data structure that provides virtually direct access to objects based on a key [a unique String or Integer]. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. Here are the {. . .}
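The mechanism in the hash table entry (a hash function computes an index into an array of buckets) can be sketched as follows. This toy version resolves collisions with separate chaining, which is an assumption of the example rather than something the entry specifies; the class and method names are hypothetical.

```python
class HashTable:
    """Toy hash table using separate chaining for collisions."""

    def __init__(self, num_buckets=8):
        self._buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # The hash function maps the key to a bucket index.
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

table = HashTable()
table.put("spark", 2009)
table.put(42, "answer")
```

Lookups only scan the one bucket the key hashes to, which is what makes access "virtually direct" when buckets stay short.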
Hive provides many built-in functions to help us in the processing and querying of data. Some of the functionalities provided by these functions include string manipulation, date manipulation, type conversion, conditional operators, mathematical functions, and several others. Types of Buil {. . .}
Apache Spark is a fast and general cluster computing system for Big Data built around speed, ease of use, and advanced analytics that was originally built in 2009 at UC Berkeley. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation grap {. . .}
Keras is a high-level library for deep learning, built on top of Theano and TensorFlow. It is written in Python and provides a clean and convenient way to create a range of deep learning models. Keras has become one of the most used high-level neural networks APIs when it comes to developing and te {. . .}
Lambda architecture is a way of processing massive quantities of data (i.e. “Big Data”) that provides access to batch-processing and stream-processing methods with a hybrid approach. Lambda architecture is used to solve the problem of computing arbitrary functions. The lambda architecture its {. . .}
Apache Spark’s Machine Learning Library (MLlib) is designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, and speed of Spark, data scientists can focus on their data problems and models instead of solving the complexities surround {. . .}
A managed Spark service lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. By using such automation, you can quickly create clusters on demand, manage them with ease, and turn them off when the task is complete. Users c {. . .}
What is MapReduce? MapReduce is a Java-based, distributed execution framework within the Apache Hadoop Ecosystem. It takes away the complexity of distributed programming by exposing two processing steps that developers implem {. . .}
Running machine learning algorithms typically involves a sequence of tasks, including pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and cleaning, extracting features, and training a class {. . .}
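The two processing steps in the MapReduce entry are conventionally called map and reduce. The sketch below imitates them in a single Python process for a word count; it does not use the Hadoop API, and the function names and the in-memory shuffle are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key (the framework does this between steps)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["Spark and Hadoop", "hadoop and Hive"])
```

In a real cluster the map and reduce calls run on different nodes, and only the two functions are written by the developer.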
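The idea of chaining stages such as cleaning and feature extraction can be sketched as a small pipeline of functions. This is a plain-Python illustration, not the MLlib Pipeline API; the Pipeline class and the two stage functions are hypothetical.

```python
from collections import Counter

def clean(text):
    """Segmentation and cleaning stage: lowercase and split into tokens."""
    return text.lower().split()

def extract_features(tokens):
    """Feature extraction stage: bag-of-words counts."""
    return Counter(tokens)

class Pipeline:
    """Run a fixed sequence of stages, feeding each output to the next."""

    def __init__(self, *stages):
        self.stages = stages

    def run(self, data):
        for stage in self.stages:
            data = stage(data)
        return data

features = Pipeline(clean, extract_features).run("Spark spark MLlib")
```

A model-fitting stage would be appended in the same way, taking the extracted features as its input.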
Model risk management refers to the supervision of risks from the potential adverse consequences of decisions based on incorrect or misused models. The aim of model risk management is to employ techniques and practices that will identify, measure and mitigate model risks i.e. the possibility of mode {. . .}
A neural network is a computing model whose layered structure resembles the networked structure of neurons in the brain. It features interconnected processing elements called neurons that work together to produce an output function. Neural networks are made of input and output layers/dimensions, an {. . .}
What is Orchestration? Orchestration is the coordination and management of multiple computer systems, applications and/or services, stringing together multiple tasks in order to execute a larger workflow or process. These processes can consist of multiple tasks that are automated and can i {. . .}
Pandas is an open source, BSD-licensed library written for the Python programming language that provides fast and adaptable data structures and data analysis tools. This easy-to-use data manipulation tool was originally written by Wes McKinney. It is built on the NumPy package and its key data str {. . .}
Parquet is an open source file format available to any project in the Hadoop ecosystem. Apache Parquet is designed as an efficient and performant flat columnar storage format, in contrast to row-based files like CSV or TSV. Parquet uses the record shr {. . .}
Predictive analytics is a form of advanced analytics that uses both new and historical data to determine patterns and predict future outcomes and trends. How Does Predictive Analytics Work? Predictive analytics uses many techniques such as statistical analysis techniques, analy {. . .}
PyCharm is an integrated development environment (IDE) used in computer programming, created for the Python programming language. When using PyCharm on Databricks, by default PyCharm creates a Python Virtual Environment, but you can configure it to create a Conda environment or use an existing one. {. . .}
Apache Spark is written in the Scala programming language. PySpark was released to support the collaboration of Apache Spark and Python; it is a Python API for Spark. In addition, PySpark helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark and Python pr {. . .}
RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed {. . .}
If you are working with Spark, you will come across three APIs: DataFrames, Datasets, and RDDs. What are Resilient Distributed Datasets? An RDD, or Resilient Distributed Dataset, is a distributed collection of records that is fault tolerant, immutable in natur {. . .}
Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or {. . .}
Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can als {. . .}
What is Spark Streaming? Apache Spark Streaming is a scalable fault-tolerant stream processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-ti {. . .}
What is Spark Performance Tuning? Spark Performance Tuning refers to the process of adjusting settings for memory, cores, and instances used by the system. This process helps Spark perform well and also prevents bottlenecking of resources in S {. . .}
Sparklyr is an open-source package that provides an interface between R and Apache Spark. You can now leverage Spark’s capabilities in a modern R environment, due to Spark’s ability to interact with distributed data with little latency. Sparklyr is an effective tool for interfacing with large da {. . .}
SparkR is a tool for running R on Spark. It follows the same principles as all of Spark’s other language bindings. To use SparkR, we simply import it into our environment and run our code. It’s all very similar to the Python API except that it follows R’s syntax instead of Python’s. For the most {. . .}
Python offers a widely used library called NumPy to manipulate multi-dimensional arrays. The organization and use of this library is a primary requirement for developing the pytensor library. {. . .}
How Does Stream Analytics Work? Streaming analytics, also known as event stream processing, is the analysis of huge pools of current and “in-motion” data through the use of continuous queries, called event streams. These streams are triggered by a specific event that happens {. . .}
Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark’s structured APIs, and run them in a streaming fashion. This can reduce latency and allow {. . .}
In November of 2015, Google released its open-source framework for machine learning and named it TensorFlow. It supports deep-learning, neural networks, and general numerical computations on CPUs, GPUs, and clusters {. . .}
Estimators represent a complete model but also look intuitive enough to less experienced users. The Estimator API provides methods to train the model, to judge the model’s accuracy, and to generate predictions. TensorFlow provides a programming stack consisting of multiple API layers like in the below image {. . .}
In Spark, the core data structures are immutable, meaning they cannot be changed once created. This might seem like a strange concept at first: if you cannot change it, how are you supposed to use it? In order to “change” a DataFrame you will have to instruct Spark how you would like to {. . .}
Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware. Tungsten Project Includes Th {. . .}
Unified Artificial Intelligence, or UAI, was announced by Facebook during F8 this year. This brings together two specific deep learning frameworks that Facebook created and open-sourced - PyTorch focused on research assuming access to large-scale compute resources while Caffe focused on model deployment o {. . .}
Unified Data Analytics is a new category of solutions that unify data processing with AI technologies, making AI much more achievable for enterprise organizations and enabling them to accelerate their AI initiatives. Unified Data Analytics makes it easier for enterprises to build data pipelines acro {. . .}
Databricks' Unified Data Analytics Platform helps organizations accelerate innovation by unifying data science with engineering and business. With Databricks as your Unified Data Analytics Platform, you can quickly prepare and clean data at massive scale with no limitations. The pl {. . .}
A unified database, also known as an enterprise data warehouse, holds all the business information of an organization and makes it accessible all across the company. Most companies today have their data managed in isolated silos while different teams of the same organization use various data manag {. . .}