#### Differentiate between Data Analytics and Data Science

| Data Analytics | Data Science |
| --- | --- |
| Data Analytics is a subset of Data Science. | Data Science is a broad technology that includes various subsets such as Data Analytics, Data Mining, Data Visualization, etc. |
| The goal of data analytics is to illustrate the precise details of retrieved insights. | The goal of data science is to discover meaningful insights from massive datasets and derive the best possible solutions to resolve business issues. |
| Requires just basic programming languages. | Requires knowledge in advanced programming languages. |
| It focuses on just finding the solutions. | Data Science not only focuses on finding the solutions but also predicts the future with past patterns or insights. |
| A data analyst’s job is to analyse data in order to make decisions. | A data scientist’s job is to provide insightful data visualizations from raw data that are easily understandable. |

#### What is the difference between the long format data and wide format data?

| Long Format Data | Wide Format Data |
| --- | --- |
| Long format data has one column for the possible variable types and one column for the values of those variables. | Wide format data has a separate column for each variable. |
| Each row represents one time point per subject, so each subject has many rows of data. | A subject's repeated responses are in a single row, with each response in its own column. |
| This format is most typically used in R analysis and for writing to log files at the end of each experiment. | This format is most widely used in data manipulation and in statistics programs for repeated-measures ANOVAs, and is seldom used in R analysis. |
| Values in the first column repeat. | Values in the first column do not repeat. |
| Use df.melt() to convert wide format to long format. | Use df.pivot().reset_index() to convert long format to wide format. |
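
The two conversions named in the last row can be sketched with pandas as follows. The column names (`subject`, `t1`, `t2`, `time`, `score`) and the values are made up for illustration only.

```python
import pandas as pd

# Wide format: one row per subject, one column per measurement (hypothetical data).
wide = pd.DataFrame({
    "subject": ["A", "B"],
    "t1": [5.0, 7.0],
    "t2": [6.0, 8.0],
})

# Wide -> long: one row per (subject, time) pair.
long_df = wide.melt(id_vars="subject", var_name="time", value_name="score")

# Long -> wide: spread the "time" values back out into columns.
wide_again = (
    long_df.pivot(index="subject", columns="time", values="score")
    .reset_index()
)
wide_again.columns.name = None  # drop the leftover "time" label on the column axis

print(long_df)
print(wide_again)
```

Note that `pivot` requires each (index, columns) pair to be unique; when the long data contains duplicates, an aggregating alternative such as `pivot_table` is needed instead.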

#### What is bias in Data Science?

Bias is a type of error that occurs in a Data Science model when the algorithm is too simple to capture the underlying patterns or trends in the data. In other words, the data is too complicated for the algorithm to represent, so it builds a model on overly simple assumptions. This leads to lower accuracy because of underfitting. Algorithms that tend toward high bias on nonlinear data include linear regression and logistic regression.
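
As a minimal sketch of underfitting, the snippet below fits a straight line and a quadratic to data generated from a quadratic curve. The data, the noise-free setup, and the helper name `mse` are illustrative assumptions, not part of any particular workflow.

```python
import numpy as np

# Synthetic data with a clearly nonlinear (quadratic) pattern, no noise.
x = np.linspace(-1, 1, 21)
y = x ** 2

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

linear_fit = np.polyfit(x, y, deg=1)      # too simple: high bias
quadratic_fit = np.polyfit(x, y, deg=2)   # matches the true pattern

linear_mse = mse(linear_fit, x, y)
quadratic_mse = mse(quadratic_fit, x, y)

print("linear MSE:   ", linear_mse)
print("quadratic MSE:", quadratic_mse)
```

The line cannot bend to follow the curve, so its error stays large even on the data it was trained on, which is the signature of high bias.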

#### Why is Python used for Data Cleaning in DS?

Data Scientists have to clean and transform huge data sets into a form that they can work with. For better results, it is important to deal with redundant and faulty data by removing nonsensical outliers, malformed records, missing values, inconsistent formatting, etc.

Python libraries such as Matplotlib, pandas, NumPy, Keras, and SciPy are extensively used for data cleaning and analysis. pandas and NumPy handle most of the loading and cleaning, while the others support visualization, modeling, and statistical analysis. For example, a CSV file named “Student” that holds information about the students of an institute (names, standard, address, phone number, grades, marks, etc.) can be loaded into a pandas DataFrame and cleaned before any analysis is run.
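
A minimal sketch of such a cleaning pass with pandas is shown below. The records, the column names, and the assumption that valid marks lie between 0 and 100 are all hypothetical; an in-memory string stands in for the CSV file so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical raw student records with typical problems:
# stray whitespace, inconsistent case, a duplicate, a malformed
# value, a missing value, and a nonsensical outlier.
raw_csv = io.StringIO(
    "name,standard,marks\n"
    "  alice ,10,88\n"
    "Bob,10,not available\n"
    "Alice,10,88\n"
    "Carol,11,\n"
    "Dave,11,940\n"
    "Eve,12,73\n"
)
students = pd.read_csv(raw_csv)

# Normalize inconsistent formatting in the name column.
students["name"] = students["name"].str.strip().str.title()

# Turn malformed marks into NaN instead of failing.
students["marks"] = pd.to_numeric(students["marks"], errors="coerce")

# Remove redundant rows, rows with missing marks, and out-of-range marks.
students = students.drop_duplicates()
students = students.dropna(subset=["marks"])
students = students[students["marks"].between(0, 100)]
students = students.reset_index(drop=True)

print(students)
```

Whether to drop, fill, or flag problematic rows depends on the analysis; dropping is used here only because it is the simplest choice to show.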

#### What is variance in Data Science?

Variance is a type of error that occurs in a Data Science model when the model is too complex and learns the noise in the data along with its genuine features. This kind of error can occur if the algorithm used to train the model has high complexity, even though the underlying patterns and trends in the data are quite easy to discover. The resulting model is very sensitive: it performs well on the training dataset but poorly on the testing dataset and on any other data it has not yet seen. High variance results in overfitting and generally leads to poor accuracy in testing.
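
The effect can be sketched by fitting an overly flexible polynomial to a few noisy points drawn from a simple linear trend. The trend, noise level, and polynomial degrees below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    """Simple underlying pattern (an assumption for this demo)."""
    return 2.0 * x + 1.0

# Small noisy training set and a separate noisy test set.
x_train = np.linspace(-1, 1, 10)
y_train = true_fn(x_train) + rng.normal(scale=0.3, size=x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = true_fn(x_test) + rng.normal(scale=0.3, size=x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple_fit = np.polyfit(x_train, y_train, deg=1)
# Degree 9 on 10 points passes through every training point, noise included.
complex_fit = np.polyfit(x_train, y_train, deg=9)

complex_train_mse = mse(complex_fit, x_train, y_train)
complex_test_mse = mse(complex_fit, x_test, y_test)

print("degree 1 train/test MSE:", mse(simple_fit, x_train, y_train),
      mse(simple_fit, x_test, y_test))
print("degree 9 train/test MSE:", complex_train_mse, complex_test_mse)
```

The degree-9 model reproduces the training points almost exactly, yet its error on the held-out points is far larger than its training error, which is the gap that characterizes high variance.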

#### What is pruning in a decision tree algorithm?

Pruning a decision tree is the process of removing the sections of the tree that are unnecessary or redundant. Pruning yields a smaller tree that is faster to evaluate and, because it is less prone to overfitting, often more accurate on unseen data.
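
One of several pruning strategies is reduced-error pruning, sketched below in plain Python: a subtree is replaced by a leaf whenever doing so does not lower accuracy on a validation set. The dictionary-based tree, its `majority` field, and the validation data are hypothetical and hand-built for the demo; real libraries use their own tree structures and often other criteria such as cost-complexity pruning.

```python
def predict(node, x):
    """Walk from the given node down to a leaf and return its label."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def accuracy(tree, X, y):
    return sum(predict(tree, x) == t for x, t in zip(X, y)) / len(y)

def count_nodes(node):
    if "label" in node:
        return 1
    return 1 + count_nodes(node["left"]) + count_nodes(node["right"])

def prune(node, root, X_val, y_val):
    """Bottom-up reduced-error pruning, modifying the tree in place."""
    if "label" in node:
        return
    prune(node["left"], root, X_val, y_val)
    prune(node["right"], root, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    saved = dict(node)
    node.clear()
    node["label"] = saved["majority"]  # try replacing the subtree with a leaf
    if accuracy(root, X_val, y_val) < before:
        node.clear()
        node.update(saved)             # pruning hurt accuracy: undo it

# Hand-built tree; the inner split on feature 1 is assumed to have fit noise.
tree = {
    "feature": 0, "threshold": 5, "majority": 0,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 2, "majority": 1,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}

X_val = [(1, 1), (2, 3), (6, 1), (7, 3), (8, 1)]
y_val = [0, 0, 1, 1, 1]

nodes_before = count_nodes(tree)
accuracy_before = accuracy(tree, X_val, y_val)
prune(tree, tree, X_val, y_val)
nodes_after = count_nodes(tree)
accuracy_after = accuracy(tree, X_val, y_val)

print(nodes_before, "->", nodes_after, "nodes")
print(accuracy_before, "->", accuracy_after, "validation accuracy")
```

Here the noisy inner split is collapsed into a single leaf, while the root split survives because removing it would reduce validation accuracy.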

#### What is an RNN (recurrent neural network)?

A recurrent neural network, or RNN for short, is a kind of Machine Learning algorithm built on the artificial neural network. RNNs are used to find patterns in sequential data, such as time series of stock prices or temperatures. Unlike a plain feedforward network, in which information flows in one direction from one layer to the next, an RNN has recurrent connections that feed the hidden state from one time step back into the network at the next step. This makes its operations temporal, i.e., RNNs store contextual information about previous computations in the network. It is called recurrent because it applies the same operations, with the same weights, to every element of the sequence. However, the output may differ based on past computations and their results.
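
The recurrence can be sketched with NumPy as a single-layer ("vanilla") RNN forward pass. The sizes, the random untrained weights, and the names `W_x`, `W_h`, `b`, and `rnn_forward` are assumptions for illustration; the update rule shown is the common form h_t = tanh(W_x x_t + W_h h_(t-1) + b).

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# The same weights are reused at every time step (untrained, random here).
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_forward(inputs):
    """Return the hidden state after each element of the input sequence."""
    h = np.zeros(hidden_size)  # initial hidden state: no context yet
    states = []
    for x_t in inputs:
        # New state depends on the current input AND the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

# Feed the identical input vector three times in a row.
x = np.array([1.0, 0.5, -0.5])
states = rnn_forward([x, x, x])
print(states)
```

Even though every step receives the same input and uses the same weights, the hidden states differ from step to step because each one carries context from the steps before it.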

#### What is a kernel function in SVM?

In the SVM algorithm, a kernel function is a special mathematical function. In simple terms, a kernel function takes data as input and converts it into the required form. It relies on the kernel trick: the function computes inner products in a higher-dimensional feature space directly from the original inputs, without ever constructing that space explicitly. Using a kernel function, data that is not linearly separable (cannot be separated using a straight line) can be mapped into a space in which it is linearly separable.
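
A small NumPy sketch of both ideas follows, using the degree-2 polynomial kernel as one example of a kernel function. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def feature_map(x):
    """Explicit feature map for 2-D inputs whose inner product equals poly_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

# Kernel trick: same value, but computed without building the feature space.
k_direct = poly_kernel(x, y)                          # (1*3 + 2*4)^2 = 121.0
k_explicit = float(feature_map(x) @ feature_map(y))
print(k_direct, k_explicit)

# Separability: in 1-D, no single threshold separates the outer points
# (label 1) from the inner points (label 0) ...
points = np.array([-2.0, -1.0, 1.0, 2.0])
labels = np.array([1, 0, 0, 1])

# ... but after mapping x -> x^2 a single threshold does.
mapped = points ** 2
predicted = (mapped > 2.5).astype(int)
print(predicted)
```

In practice an SVM never calls the explicit feature map; it only needs the kernel values between pairs of points, which is what makes high-dimensional mappings affordable.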

#### Explain bagging in Data Science.

Bagging is an ensemble learning method. It stands for bootstrap aggregating. In this technique, we generate data using the bootstrap method: starting from an existing dataset, we draw multiple samples of size *N* by sampling with replacement. Each bootstrapped sample is then used to train a separate model, and these models can be trained in parallel. Combining them makes the bagged model more robust than a single model.

Once all the models are trained, we make a prediction by running every trained model and combining the results. For regression, we average the predictions; for classification, we choose the class predicted most frequently by the models.
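
The procedure can be sketched with NumPy as below. The synthetic data, the use of straight-line fits as base models, the number of models, and the helper names `bootstrap_sample` and `majority_vote` are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical): y is roughly 3x plus noise.
n = 50
x = np.linspace(0, 1, n)
y = 3.0 * x + rng.normal(scale=0.2, size=n)

def bootstrap_sample(x, y, rng):
    """Draw N points with replacement from a dataset of size N."""
    idx = rng.integers(0, len(x), size=len(x))
    return x[idx], y[idx]

# Train one simple model (a straight-line fit) per bootstrap sample.
n_models = 25
models = []
for _ in range(n_models):
    xb, yb = bootstrap_sample(x, y, rng)
    models.append(np.polyfit(xb, yb, deg=1))

# Regression: average the predictions of all models.
x_new = np.array([0.25, 0.75])
all_preds = np.stack([np.polyval(m, x_new) for m in models])
bagged_regression = all_preds.mean(axis=0)
print(bagged_regression)

def majority_vote(votes):
    """Classification: pick the most frequent label per sample.

    `votes` has one row per model and one column per sample;
    ties go to the smallest label.
    """
    votes = np.asarray(votes)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Hypothetical class predictions from three models for three samples.
class_votes = [[0, 1, 1],
               [1, 1, 0],
               [1, 0, 0]]
bagged_classes = majority_vote(class_votes)
print(bagged_classes)  # -> [1 1 0]
```

The training loop is written sequentially for clarity; since the models are independent of one another, it could equally be run in parallel.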

#### Explain how Machine Learning is different from Deep Learning.

Machine Learning is a field of computer science, often treated as a subfield of Data Science, that uses existing data to help systems automatically learn to perform different tasks without the rules being explicitly programmed.

Deep Learning, on the other hand, is a field within Machine Learning that builds models using algorithms loosely inspired by how the human brain learns from information in order to attain new capabilities. In Deep Learning, we make heavy use of densely connected neural networks with many layers.