What is Big Data?

Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it's not the amount of data that's important. 

It's what organizations do with the data that matters. 

Big data can be analyzed for insights that lead to better decisions and strategic business moves. 

Cluster computing

It's a fancy term for computing with a 'cluster' of resources pooled from multiple servers. 

Getting more technical, we might be talking about nodes, a cluster management layer, load balancing, parallel processing, etc. 
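As a rough illustration of parallel processing, the sketch below splits a job into chunks and hands them to a pool of workers. It runs on one machine and uses threads as stand-ins for cluster nodes, which is an assumption made only to keep the example self-contained; the names process_chunk, chunks and partial_sums are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Work done by one 'node': sum the squares of its share of the data."""
    return sum(x * x for x in chunk)

numbers = list(range(1, 11))

# Split the job into chunks of three numbers each.
chunks = [numbers[i:i + 3] for i in range(0, len(numbers), 3)]

# Threads stand in for cluster nodes here (an assumption for illustration).
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_chunk, chunks))

# Combine the partial results into the final answer.
total = sum(partial_sums)
```

A real cluster would add a management layer that schedules the chunks and balances the load across servers; the pool plays that role here in a very simplified way.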

Data mining

Data mining is about finding meaningful patterns and deriving insights in large sets of data using sophisticated pattern recognition techniques. It is closely related to the term analytics. 

To derive meaningful patterns, data miners use statistics, machine learning algorithms, and artificial intelligence. 

Dark Data

This refers to all the data that enterprises gather and process but do not use for any meaningful purpose. Hence it is 'dark' and may never be analyzed. 

It could be social network feeds, call center logs, or meeting notes. 

ETL = Extract, Transform, Load

It refers to the process of 'extracting' raw data, 'transforming' it by cleaning and enriching it so that it is fit for use, and 'loading' it into the appropriate repository for the system's use. 

Even though it originated with data warehouses, ETL processes are also used in big data systems when 'ingesting' data, i.e. taking in data from external sources. 
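The three ETL steps can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the sample CSV text, the payments table and the function names are all hypothetical, and an in-memory SQLite database stands in for the target repository.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from an external source.
RAW_CSV = """name,amount
 alice ,10.5
BOB,not_a_number
carol,7
"""

def extract(text):
    """Extract: read raw rows from CSV text."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows that cannot be parsed
        clean.append((row["name"].strip().lower(), amount))
    return clean

def load(records, conn):
    """Load: write the cleaned records into the repository (SQLite here)."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
```

Keeping the three steps as separate functions makes it easy to swap the source or the repository without touching the cleaning rules.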

Distributed File Systems

Big data is too large to store on a single system. A distributed file system is a data storage system meant to store large volumes of data across multiple storage devices, which helps decrease the cost and complexity of storing large amounts of data. 
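One way to picture this is to cut a file into fixed-size blocks and spread the blocks over several machines. The sketch below is a toy model under stated assumptions: the node names, the 10-byte block size and the round-robin placement with two copies per block are all made up for illustration and do not describe any particular file system.

```python
def split_into_blocks(data, block_size):
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, nodes, replicas=2):
    """Assign each block to `replicas` distinct nodes in round-robin order."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replicas)
        ]
    return placement

data = b"0123456789" * 3 + b"ab"          # 32 bytes of sample data
blocks = split_into_blocks(data, block_size=10)
nodes = ["node-a", "node-b", "node-c"]    # hypothetical storage nodes
placement = place_blocks(len(blocks), nodes)
```

Joining the blocks back together in order gives the original data, which is what a client of such a system would see when reading the file.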


MapReduce

MapReduce can be a little confusing.

MapReduce is a programming model, and the best way to understand it is to note that Map and Reduce are two separate steps. 

In this model, the big data dataset is first broken up into pieces so that the pieces can be distributed across different computers in different locations and processed there, which is essentially the Map part. Then the model collects the results and 'reduces' them into one report. 
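A word count is the usual way to illustrate the two steps. The sketch below runs on a single machine and only mimics the idea; the function names and the two sample chunks are hypothetical, and the grouping step in the middle (often called a shuffle) is included because the reduce step needs all values for one key together.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse the values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# Each chunk stands in for a piece of the dataset held on a different computer.
chunks = ["big data is big", "data about data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
```

Because map_phase looks at one chunk at a time, the chunks could be handled by different machines independently; only the grouped results have to be brought together for the reduce step.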

Batch Processing

Even though batch data processing has been around since mainframe days, it gained additional significance with Big Data given the large data sets that it deals with. 

Batch data processing is an efficient way of processing high volumes of data where a group of transactions is collected over a period of time. 
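The idea of collecting transactions and processing them as a group can be sketched as follows. This is a simplified illustration: the BatchProcessor class is hypothetical, and it closes a batch after a fixed number of transactions rather than after a period of time, purely to keep the example deterministic.

```python
class BatchProcessor:
    """Collects transactions and processes them one batch at a time."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []   # transactions waiting for the next batch run
        self.totals = []    # one total per processed batch

    def submit(self, amount):
        """Queue a transaction; run the batch once enough have been collected."""
        self.pending.append(amount)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Process everything that is pending as a single batch."""
        if self.pending:
            self.totals.append(sum(self.pending))
            self.pending = []

processor = BatchProcessor(batch_size=3)
for amount in [10, 20, 30, 5, 5]:
    processor.submit(amount)
processor.flush()  # process the leftover transactions at the end of the period
```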

Data Scientist

A term used to describe an expert in extracting insights and value from data. 

It is usually someone who has skills in analytics, computer science, mathematics, statistics, creativity, data visualization and communication, as well as business and strategy. 

Stream processing

Stream processing is designed to act on real-time and streaming data with "continuous" queries. 

Combined with streaming analytics, i.e. the ability to continuously calculate mathematical or statistical analytics on the fly within the stream, stream processing solutions are designed to handle very high volumes in real time. 
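A running average is a small example of a statistic calculated on the fly within a stream. The sketch below is only an illustration: the running_mean function and the sample readings are hypothetical, and a Python generator stands in for a real streaming source.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, updated after each new value."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

# An iterator stands in for an unbounded stream of sensor readings.
readings = iter([2.0, 4.0, 6.0, 8.0])
means = list(running_mean(readings))
```

The function never stores the whole stream; it keeps only a count and a total, which is what lets this style of calculation keep up with data that never stops arriving.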

Source: Data Science App