Data Science Notes for Computer Science Students

 

 

1. What is Data Science?

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. In simpler terms, data science is about obtaining, processing, and analyzing data to gain insights for many purposes.

2. Why is Data Science Important?

Data science has emerged as a revolutionary field that is crucial in generating insights from data and transforming businesses. It's not an overstatement to say that data science is the backbone of modern industries. But why has it gained so much significance?

a. Data volume. The rise of digital technologies has led to an explosion of data. Every online transaction, social media interaction, and digital process generates data. However, this data is valuable only if we can extract meaningful insights from it, and that is precisely where data science comes in.

b. Value creation. Data science is not just about analyzing data; it is about interpreting and using this data to make informed business decisions, predict future trends, understand customer behavior, and drive operational efficiency. This ability to drive decision-making based on data is what makes data science so essential to organizations.

3. The Data Science Lifecycle

The data science lifecycle refers to the various stages a data science project generally undergoes, from initial conception and data collection to communicating results and insights. Despite every data science project being unique—depending on the problem, the industry it's applied in, and the data involved—most projects follow a similar lifecycle. This lifecycle provides a structured approach for handling complex data, drawing accurate conclusions, and making data-driven decisions.

Here are the five main phases that structure the data science lifecycle:

a. Data collection and storage

This initial phase involves collecting data from various sources, such as databases, Excel files, text files, APIs, web scraping, or even real-time data streams. The type and volume of data collected largely depend on the problem you’re addressing.

Once collected, this data is stored in an appropriate format ready for further processing. Storing the data securely and efficiently is important to allow quick retrieval and processing.

b. Data preparation

Often considered the most time-consuming phase, data preparation involves cleaning and transforming raw data into a suitable format for analysis. This phase includes handling missing or inconsistent data, removing duplicates, normalization, and data type conversions. The objective is to create a clean, high-quality dataset that can yield accurate and reliable analytical results.
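As a rough illustration of these preparation steps, the sketch below uses NumPy (introduced in Unit 2) on a small made-up set of readings. The values and variable names are hypothetical, and dropping missing entries is only one of several possible strategies.

```python
import numpy as np

# Hypothetical raw readings: one missing value (nan) and one duplicate (8.0).
raw = np.array([4.0, np.nan, 8.0, 8.0, 6.0])

# Handle missing data: here we simply drop the nan entries.
cleaned = raw[~np.isnan(raw)]

# Remove duplicates; np.unique also returns the values sorted.
unique_values = np.unique(cleaned)

# Min-max normalization: rescale the values to the range [0, 1].
normalized = (unique_values - unique_values.min()) / (
    unique_values.max() - unique_values.min()
)

# Data type conversion: store the cleaned values as integers.
as_int = unique_values.astype(int)

print(normalized)
print(as_int)
```

In a real project the choice between dropping and filling missing values depends on the data and the problem being addressed.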

c. Exploration and visualization

During this phase, data scientists explore the prepared data to understand its patterns, characteristics, and potential anomalies. Statistical analysis and data visualization are used to summarize the data's main characteristics.

Visualization tools, such as charts and graphs, make the data more understandable, enabling stakeholders to comprehend the data trends and patterns better.

d. Experimentation and prediction

Data scientists use machine learning algorithms and statistical models to identify patterns, make predictions, or discover insights in this phase. The goal here is to derive something significant from the data that aligns with the project's objectives, whether predicting future outcomes, classifying data, or uncovering hidden patterns.

e. Data storytelling and communication

The final phase involves interpreting and communicating the results derived from the data analysis. It's not enough to have insights; you must communicate them effectively, using clear, concise language and compelling visuals. The goal is to convey these findings to non-technical stakeholders in a way that influences decision-making or drives strategic initiatives.

4. What is Data Science Used For?

Data science is used for an array of applications, from predicting customer behavior to optimizing business processes. The scope of data science is vast and encompasses various types of analytics.

Descriptive analytics. Analyzes past data to understand the current state and identify trends. For instance, a retail store might use it to analyze last quarter's sales or identify best-selling products.

Diagnostic analytics. Explores data to understand why certain events occurred, identifying patterns and anomalies. If a company's sales fall, it would identify whether poor product quality, increased competition, or other factors caused it.

Predictive analytics. Uses statistical models to forecast future outcomes based on past data, used widely in finance, healthcare, and marketing. A credit card company may employ it to predict customer default risks.

Prescriptive analytics. Suggests actions based on results from other types of analytics to mitigate future problems or leverage promising trends. For example, a navigation app advises the fastest route based on current traffic conditions.


Unit 2-Introduction to NumPy

“The goal is to turn data into information, and information into insight.”— Carly Fiorina

6.1 Introduction
NumPy stands for ‘Numerical Python’. It is a package for data analysis and scientific computing with Python. NumPy uses a multidimensional array object, and has functions and tools for working with these arrays. The powerful n-dimensional array in NumPy speeds up data processing. NumPy can be easily interfaced with other Python packages and provides tools for integrating with other programming languages like C and C++.

Installing NumPy
NumPy can be installed by typing the following command:
pip install numpy

6.2 Array
We have learnt about various data types like list, tuple, and dictionary. In this chapter we will discuss another data type, the array. An array is a data type used to store multiple values using a single identifier (variable name).
An array contains an ordered collection of data elements where each element is of the same type and can be referenced by its index (position).
The important characteristics of an array are:
• Each element of the array is of the same data type, though the values stored in them may be different.
• The entire array is stored contiguously in memory. This makes operations on arrays fast.
• Each element of the array is identified or referred to using the name of the array along with the index of that element, which is unique for each element. The index of an element is an integral value associated with the element, based on the element’s position in the array.
For example, consider an array with 5 numbers:
 [ 10, 9, 99, 71, 90 ]
Here, the 1st value in the array is 10 and has the index value [0] associated with it; the 2nd value in the array is 9 and has the index value [1] associated with it, and so on. The last value (in this case the 5th value) in this array has the index [4]. This is called zero-based indexing. This is very similar to the indexing of lists in Python. The idea of arrays is so important that almost all programming languages support it in one form or another.
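Zero-based indexing can be tried out with an ordinary Python list holding the same five numbers. This is only a small sketch; the variable names are chosen for illustration.

```python
# The five numbers from the example above, stored in a Python list.
values = [10, 9, 99, 71, 90]

first = values[0]                # 1st value, index 0
second = values[1]               # 2nd value, index 1
last = values[4]                 # 5th value, index 4
also_last = values[len(values) - 1]  # last index is always length - 1

print(first, second, last, also_last)
```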

6.3 NumPy Array
NumPy arrays are used to store lists of numerical data, vectors and matrices. The NumPy library has a large set of routines (built-in functions) for creating, manipulating, and transforming NumPy arrays. The Python standard library also has an array data structure (the array module), but it is not as versatile, efficient and useful as the NumPy array. The NumPy array is officially called ndarray but is commonly known as array. In the rest of the chapter, “array” refers to the NumPy array. The following are a few differences between a list and an array.

6.3.1 Difference Between List and Array
This is left as an assignment.
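As a starting point for the assignment, one visible difference is how the + and * operators behave. The sketch below is illustrative only, and the variable names are made up.

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array([1, 2, 3])

# On lists, + joins the sequences and * repeats them.
list_sum = numbers_list + numbers_list
list_times_two = numbers_list * 2

# On NumPy arrays, + and * work element by element.
array_sum = numbers_array + numbers_array
array_times_two = numbers_array * 2

print(list_sum, list_times_two)
print(array_sum, array_times_two)
```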

6.3.2 Creation of NumPy Arrays from List
There are several ways to create arrays. To create an array and to use its methods, we first need to import the NumPy library.
# NumPy is loaded as np (we can assign any name); numpy must be written in lowercase
>>> import numpy as np
NumPy’s array() function converts a given list into an array. For example,
# Create an array called array1 from the given list.
>>> array1 = np.array([10,20,30])
#Display the contents of the array
>>> array1
array([10, 20, 30])

• Creating a 1-D Array
An array with only a single row of elements is called a 1-D array. Let us try to create a 1-D array from a list which contains numbers as well as strings.
>>> array2 = np.array([5,-7.4,'a',7.2])
>>> array2
array(['5', '-7.4', 'a', '7.2'], dtype='<U32')

Observe that since there is a string value in the list, all integer and float values have been promoted to strings while converting the list to an array.

Note: <U32 denotes a Unicode string data type that can hold up to 32 characters per element (the < indicates little-endian byte order).

• Creating a 2-D Array
We can create a two-dimensional (2-D) array by passing nested lists to the array() function.

Example 6.1
>>> array3 = np.array([[2.4,3],
 [4.91,7],[0,-1]])
>>> array3
array([[ 2.4 , 3. ],
 [ 4.91, 7. ],
 [ 0. , -1. ]])
Observe that the integers 3, 7, 0 and -1 have been promoted to floats.

6.3.3 Attributes of NumPy Array

Some important attributes of a NumPy ndarray object are:
i) ndarray.ndim: gives the number of dimensions of the array as an integer value. Arrays can be 1-D, 2-D or n-D. In this chapter, we shall focus on 1-D and 2-D arrays only. NumPy refers to dimensions as axes (plural of axis). Thus, a 2-D array has two axes. The row-axis is called axis-0 and the column-axis is called axis-1. The number of axes is also called the array’s rank.

Example 6.2
>>> array1.ndim
1
>>> array3.ndim
2
ii) ndarray.shape: It gives the sequence of integers indicating the size of the array for each dimension.

Example 6.3
# array1 is a 1-D array, so nothing follows
# the comma in its shape tuple
>>> array1.shape
(3,)
>>> array2.shape
(4,)
>>> array3.shape
(3, 2)
The output (3, 2) means array3 has 3 rows and 2
columns.
iii) ndarray.size: It gives the total number of elements of the array. This is equal to the product
of the elements of shape.

Example 6.4
>>> array1.size
3
>>> array3.size
6
iv) ndarray.dtype: is the data type of the elements of the array. All the elements of an array are of
same data type. Common data types are int32, int64, float32, float64, U32, etc.

Example 6.5
>>> array1.dtype
dtype('int32')
>>> array2.dtype
dtype('<U32')
>>> array3.dtype
dtype('float64')

v) ndarray.itemsize: It specifies the size in bytes of each element of the array. Data type int32 and
float32 means each element of the array occupies 32 bits in memory. 8 bits form a byte. Thus, an
array of elements of type int32 has itemsize 32/8=4 bytes. Likewise, int64/float64 means each item
has itemsize 64/8=8 bytes.

Example 6.6
>>> array1.itemsize
4 # memory allocated to each int32 element
>>> array2.itemsize
128 # memory allocated to each string (32 characters x 4 bytes)
>>> array3.itemsize
8 # memory allocated to each float64 element
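The attributes above can be inspected together on a single array. In this sketch the dtype is set explicitly to int32, because the default integer type depends on the platform and NumPy version (it may be int64 rather than the int32 shown in Example 6.5). The array name grid is made up for illustration.

```python
import numpy as np

# A 2 x 3 array with an explicitly chosen element type.
grid = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(grid.ndim)      # number of axes
print(grid.shape)     # size along each axis
print(grid.size)      # total number of elements
print(grid.dtype)     # element data type
print(grid.itemsize)  # bytes per element
print(grid.nbytes)    # total bytes = size * itemsize
```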

6.3.4 Other Ways of Creating NumPy Arrays
1. We can specify the data type (integer, float, etc.) while creating an array by passing dtype as an argument to array(). This converts the data automatically to the mentioned type. In the following example, a nested list of integers is passed to the array() function. Since the data type has been declared as float, the integers are converted to floating point numbers.

>>> array4 = np.array( [ [1,2], [3,4] ],
 dtype=float)
>>> array4
array([[1., 2.],
 [3., 4.]])

2. We can create an array with all elements initialised to 0 using the function zeros(). By default, the
data type of the array created by zeros() is float. The following code will create an array with 3 rows
and 4 columns with each element set to 0.

>>> array5 = np.zeros((3,4))
>>> array5
array([[0., 0., 0., 0.],
 [0., 0., 0., 0.],
 [0., 0., 0., 0.]])

3. We can create an array with all elements initialised to 1 using the function ones(). By default, the
data type of the array created by ones() is float. The following code will create an array with 3 rows
and 2 columns.
>>> array6 = np.ones((3,2))
>>> array6
array([[1., 1.],
 [1., 1.],
 [1., 1.]])

4. We can create an array with numbers in a given range and sequence using the arange() function.
This function is analogous to the range() function of Python.
>>> array7 = np.arange(6)
# an array of 6 elements is created with
# start value 0 and step size 1
>>> array7
array([0, 1, 2, 3, 4, 5])
# Creating an array with start value -2, end
# value 24 (excluded) and step size 4
>>> array8 = np.arange( -2, 24, 4 )
>>> array8
array([-2, 2, 6, 10, 14, 18, 22])
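The dtype argument shown for array() can also be passed to zeros() and ones(), and arange() accepts a negative step just like range(). The following is a short sketch with made-up variable names.

```python
import numpy as np

# zeros() and ones() default to float; dtype=int requests integers instead.
int_zeros = np.zeros((2, 3), dtype=int)
int_ones = np.ones(4, dtype=int)

# A negative step counts downwards; the end value (0) is still excluded.
countdown = np.arange(10, 0, -2)

print(int_zeros)
print(int_ones)
print(countdown)
```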

6.4 Indexing and Slicing
NumPy arrays can be indexed, sliced and iterated over.

6.4.1 Indexing
We have learnt about indexing a single-dimensional array in section 6.2. For 2-D arrays, indexing for both dimensions starts from 0, and each element is referenced through two indexes i and j, where i represents the row number and j represents the column number.

Table 6.1 Marks of students in different subjects
Name      Maths   English   Science
Ramesh    78      67        56
Vedika    76      75        47
Harun     84      59        60
Prasad    67      72        54
Consider Table 6.1, which shows the marks obtained by students in three different subjects. Let us create an array called marks to store the marks given in this table. As there are 4 students (i.e. 4 rows) and 3 subjects (i.e. 3 columns), the array will have 4 rows and 3 columns, that is, shape (4, 3). This array can store 4*3 = 12 elements.

Here, marks[i,j] refers to the element at (i+1)th row and (j+1)th column because the index values start at 0. Thus marks[3,1] is the element in 4th row and second column which is 72 (marks of Prasad in English).

# accesses the element in the 1st row in
# the 3rd column
>>> marks[0,2]
56
>>> marks[0,4]
IndexError: index 4 is out of bounds for axis 1 with size 3
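The marks array itself is not constructed in the text. A minimal sketch of building it from Table 6.1 and repeating the lookups could look like this (the try/except is added only so the script keeps running after the out-of-bounds access):

```python
import numpy as np

# Rows: Ramesh, Vedika, Harun, Prasad. Columns: Maths, English, Science.
marks = np.array([[78, 67, 56],
                  [76, 75, 47],
                  [84, 59, 60],
                  [67, 72, 54]])

print(marks.shape)   # 4 rows, 3 columns
print(marks[3, 1])   # Prasad's English marks
print(marks[0, 2])   # Ramesh's Science marks

# Column index 4 does not exist, so NumPy raises an IndexError.
try:
    marks[0, 4]
except IndexError as err:
    error_message = str(err)
    print(error_message)
```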

6.4.2 Slicing
Sometimes we need to extract part of an array. This is done through slicing. We specify which part of the array is to be sliced by giving the start and end index values as [start : end] after the array name.
Example 6.7
>>> array8
array([-2, 2, 6, 10, 14, 18, 22])
# excludes the value at the end index
>>> array8[3:5]
array([10, 14])
# reverse the array
>>> array8[ : : -1]
array([22, 18, 14, 10, 6, 2, -2])

Now let us see how slicing is done for 2-D arrays. For this, let us create a 2-D array called array9 having 3 rows and 4 columns.
>>> array9 = np.array([[ -7, 0, 10, 20],
 [ -5, 1, 40, 200],
 [ -1, 1, 4, 30]])

# access all the elements in the 3rd column
>>> array9[0:3,2]
array([10, 40, 4])
Note that we are specifying rows in the range 0:3
because the end value of the range is excluded.
# access elements of 2nd and 3rd row from 1st
# and 2nd column
>>> array9[1:3,0:2]
array([[-5, 1],
 [-1, 1]])
If row indices are not specified, it means all the rows
are to be considered. Likewise, if column indices are
not specified, all the columns are to be considered.
Thus, the statement to access all the elements in the 3rd
column can also be written as:
>>> array9[:,2]
array([10, 40, 4])
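A few more slicing patterns on the same two arrays are sketched below: a step value inside the slice, a negative start index, and open-ended row and column ranges. The result names are made up for illustration.

```python
import numpy as np

array8 = np.arange(-2, 24, 4)
array9 = np.array([[-7, 0, 10, 20],
                   [-5, 1, 40, 200],
                   [-1, 1, 4, 30]])

# [start:end:step] picks every 2nd element from index 1 up to (not including) 6.
every_other = array8[1:6:2]

# A negative start counts from the end: the last three elements.
last_three = array8[-3:]

# First two rows, and every column from index 1 onwards.
top_right = array9[:2, 1:]

print(every_other)
print(last_three)
print(top_right)
```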

6.5 Operations on Arrays
Once arrays are declared, we can access their elements or perform certain operations on them. In the last section, we learnt about accessing elements. This section describes multiple operations that can be applied on arrays.
6.5.1 Arithmetic Operations
Arithmetic operations on NumPy arrays are fast and
simple. When we perform a basic arithmetic operation
like addition, subtraction, multiplication, division etc. on
two arrays, the operation is done on each corresponding
pair of elements. For instance, adding two arrays will
result in the first element in the first array to be added
to the first element in the second array, and so on.
Consider the following element-wise operations on two
arrays:
>>> array1 = np.array([[3,6],[4,2]])
>>> array2 = np.array([[10,20],[15,12]])

#Element-wise addition of two matrices.
>>> array1 + array2
array([[13, 26],
 [19, 14]])
#Subtraction
>>> array1 - array2
array([[ -7, -14],
 [-11, -10]])
#Multiplication
>>> array1 * array2
array([[ 30, 120],
 [ 60, 24]])
#Matrix Multiplication
>>> array1 @ array2
array([[120, 132],
 [ 70, 104]])
#Exponentiation
>>> array1 ** 3
array([[ 27, 216],
 [ 64, 8]], dtype=int32)
#Division
>>> array2 / array1
array([[3.33333333, 3.33333333],
 [3.75 , 6. ]])
#Element wise Remainder of Division
#(Modulo)
>>> array2 % array1
array([[1, 2],
 [3, 0]], dtype=int32)
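It is important to note that for element-wise operations between two arrays, the shapes of both arrays must be the same. That is, array1.shape must be equal to array2.shape. (The exception is NumPy's broadcasting, under which a scalar or an array of compatible shape is stretched to match the other operand.)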
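The sketch below shows two related cases: an operation between an array and a single number, which NumPy applies to every element (broadcasting), and an operation between arrays of incompatible shapes, which raises a ValueError. The names a and b are illustrative.

```python
import numpy as np

a = np.array([[3, 6], [4, 2]])

# A scalar is applied to every element of the array.
plus_ten = a + 10
doubled = a * 2

# Shapes (2, 2) and (3,) are not compatible, so the addition fails.
b = np.array([1, 2, 3])
try:
    a + b
except ValueError as err:
    error_message = str(err)
    print(error_message)

print(plus_ten)
print(doubled)
```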

6.5.2 Transpose
Transposing an array turns its rows into columns and
columns into rows just like matrices in mathematics.
#Transpose
>>> array3 = np.array([[10,-7,0, 20],
 [-5,1,200,40],[30,1,-1,4]])
>>> array3
array([[ 10, -7, 0, 20],
 [ -5, 1, 200, 40],
 [ 30, 1, -1, 4]])
# the original array does not change
>>> array3.transpose()
array([[ 10, -5, 30],
 [ -7, 1, 1],
 [ 0, 200, -1],
 [ 20, 40, 4]])
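The same result is available through the .T attribute, and the shapes confirm that the original array is left unchanged. A small sketch using the array from above under a made-up name:

```python
import numpy as np

matrix = np.array([[10, -7, 0, 20],
                   [-5, 1, 200, 40],
                   [30, 1, -1, 4]])

# .T is shorthand for calling transpose() with no arguments.
transposed = matrix.T

print(matrix.shape)      # original: 3 rows, 4 columns
print(transposed.shape)  # transposed: 4 rows, 3 columns
print(np.array_equal(transposed, matrix.transpose()))
```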

6.5.3 Sorting
Sorting arranges the elements of an array in order, either ascending or descending. By default, NumPy sorts in ascending order.
>>> array4 = np.array([1,0,2,-3,6,8,4,7])
>>> array4.sort()
>>> array4
array([-3, 0, 1, 2, 4, 6, 7, 8])
In 2-D array, sorting can be done along either of the
axes i.e., row-wise or column-wise. By default, sorting
is done row-wise (i.e., on axis = 1). It means to arrange
elements in each row in ascending order. When axis=0,
sorting is done column-wise, which means each column
is sorted in ascending order.
>>> array4 = np.array([[10,-7,0, 20],
 [-5,1,200,40],[30,1,-1,4]])
>>> array4
array([[ 10, -7, 0, 20],
 [ -5, 1, 200, 40],
 [ 30, 1, -1, 4]])
#default is row-wise sorting
>>> array4.sort()
>>> array4
array([[ -7, 0, 10, 20],
 [ -5, 1, 40, 200],
 [ -1, 1, 4, 30]])
>>> array5 = np.array([[10,-7,0, 20],
 [-5,1,200,40],[30,1,-1,4]])
#axis =0 means column-wise sorting
>>> array5.sort(axis=0)
>>> array5
array([[ -5, -7, -1, 4],
 [ 10, 1, 0, 20],
 [ 30, 1, 200, 40]])
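The examples above sort in place and in ascending order. One common way to get descending order, sketched here, is to sort ascending and then reverse the result with a slice. np.sort() is used instead of the sort() method so that the original array is left untouched.

```python
import numpy as np

values = np.array([1, 0, 2, -3, 6, 8, 4, 7])

# np.sort() returns a sorted copy; values itself is not modified.
ascending = np.sort(values)

# Reversing the ascending result gives descending order.
descending = np.sort(values)[::-1]

print(ascending)
print(descending)
print(values)
```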

6.6 Concatenating Arrays
Concatenation means joining two or more arrays. Concatenating 1-D arrays means appending the sequences one after another. The numpy.concatenate() function can also be used to concatenate two or more 2-D arrays either row-wise or column-wise. All the dimensions of the arrays to be concatenated must match exactly except for the dimension or axis along which they need to be joined. Any mismatch in the dimensions results in an error. By default, the concatenation of the arrays happens along axis=0.

Example 6.8
>>> array1 = np.array([[10, 20], [-30,40]])
>>> array2 = np.zeros((2, 3), dtype=array1.dtype)
>>> array1
array([[ 10, 20],
 [-30, 40]])
>>> array2
array([[0, 0, 0],
 [0, 0, 0]])
>>> array1.shape
(2, 2)
>>> array2.shape
(2, 3)
>>> np.concatenate((array1,array2), axis=1)
array([[ 10, 20, 0, 0, 0],
 [-30, 40, 0, 0, 0]])
>>> np.concatenate((array1,array2), axis=0)
Traceback (most recent call last):
 File "<pyshell#3>", line 1, in <module>
 np.concatenate((array1,array2), axis=0)
ValueError: all the input array dimensions
except for the concatenation axis must
match exactly
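For completeness, here is a sketch of the two cases not shown in Example 6.8: joining 1-D arrays, and a row-wise (axis=0) join that succeeds because both arrays have the same number of columns. The array names are made up.

```python
import numpy as np

# 1-D arrays are simply appended one after another.
a = np.array([1, 2, 3])
b = np.array([4, 5])
joined_1d = np.concatenate((a, b))

# Row-wise join: both arrays have 2 columns, so axis=0 works.
top = np.array([[10, 20], [-30, 40]])
bottom = np.zeros((1, 2), dtype=top.dtype)
stacked = np.concatenate((top, bottom), axis=0)

print(joined_1d)
print(stacked)
```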

6.7 Reshaping Arrays
We can modify the shape of an array using the reshape()
function. Reshaping an array cannot be used to change
the total number of elements in the array. Attempting
to change the number of elements in the array using
reshape() results in an error.
Example 6.9
>>> array3 = np.arange(10,22)
>>> array3
array([10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21])
>>> array3.reshape(3,4)
array([[10, 11, 12, 13],
 [14, 15, 16, 17],
 [18, 19, 20, 21]])
>>> array3.reshape(2,6)
array([[10, 11, 12, 13, 14, 15],
 [16, 17, 18, 19, 20, 21]])
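The error case mentioned above can be reproduced as follows. The sketch also shows that one dimension may be given as -1, in which case NumPy works out its size from the total number of elements.

```python
import numpy as np

values = np.arange(10, 22)   # 12 elements

# 5 x 3 would need 15 elements, so reshape() raises a ValueError.
try:
    values.reshape(5, 3)
except ValueError as err:
    error_message = str(err)
    print(error_message)

# -1 asks NumPy to infer the missing dimension: 12 / 4 = 3 columns.
inferred = values.reshape(4, -1)
print(inferred.shape)
```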

6.8 Splitting Arrays
We can split an array into two or more subarrays. numpy.split() splits an array along the specified axis. As parameters, we can either specify a sequence of index values at which the array is to be split, or an integer N that indicates the number of equal parts into which the array is to be split. By default, numpy.split() splits along axis=0. Consider the array given below:
>>> array4
array([[ 10, -7, 0, 20],
 [ -5, 1, 200, 40],
 [ 30, 1, -1, 4],
 [ 1, 2, 0, 4],
 [ 0, 1, 0, 2]])
# [1,3] indicates the row indices at which
# to split the array
>>> first, second, third = np.split(array4, [1, 3])
# the rows before index 1 (the first row)
# are stored in the sub-array first
>>> first
array([[10, -7, 0, 20]])
# the rows from index 1 up to (but not
# including) index 3 are stored in the
# sub-array second
>>> second
array([[ -5, 1, 200, 40],
 [ 30, 1, -1, 4]])
# the remaining rows of array4 are stored
# in the sub-array third
>>> third
array([[1, 2, 0, 4],
 [0, 1, 0, 2]])

# [1, 2], axis=1 gives the column indices
# at which to split
>>> firstc, secondc, thirdc = np.split(array4, [1, 2], axis=1)
>>> firstc
array([[10],
 [-5],
 [30],
 [ 1],
 [ 0]])
>>> secondc
array([[-7],
 [ 1],
 [ 1],
 [ 2],
 [ 1]])
>>> thirdc
array([[ 0, 20],
 [200, 40],
 [ -1, 4],
 [ 0, 4],
 [ 0, 2]])
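The integer form of split() described above is not demonstrated in the text. A sketch on the same array4: its 4 columns divide evenly into 2 parts, while its 5 rows do not, so the second call raises an error.

```python
import numpy as np

array4 = np.array([[10, -7, 0, 20],
                   [-5, 1, 200, 40],
                   [30, 1, -1, 4],
                   [1, 2, 0, 4],
                   [0, 1, 0, 2]])

# N=2 along axis=1: the 4 columns are split into two equal halves.
left, right = np.split(array4, 2, axis=1)
print(left.shape, right.shape)

# N=2 along axis=0: 5 rows cannot be divided into 2 equal parts.
try:
    np.split(array4, 2)
except ValueError as err:
    error_message = str(err)
    print(error_message)
```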

6.9 Statistical Operations on Arrays
NumPy provides functions to perform many useful
statistical operations on arrays. In this section, we will
apply the basic statistical techniques called descriptive
statistics that we have learnt in chapter 5.

Let us consider two arrays:
>>> arrayA = np.array([1,0,2,-3,6,8,4,7])
>>> arrayB = np.array([[3,6],[4,2]])
1. The max() function finds the maximum element
from an array.
# max element from the whole 1-D array
>>> arrayA.max()
8
# max element from the whole 2-D array
>>> arrayB.max()
6
# if axis=1, it gives the maximum of each row
>>> arrayB.max(axis=1)
array([6, 4])
# if axis=0, it gives the maximum of each column
>>> arrayB.max(axis=0)
array([4, 6])
2. The min() function finds the minimum element
from an array.
>>> arrayA.min()
-3
>>> arrayB.min()
2
>>> arrayB.min(axis=0)
array([3, 2])
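Other descriptive statistics follow the same pattern as max() and min(), including the axis argument. As a brief sketch on the same two arrays, sum() and mean() can be used like this:

```python
import numpy as np

arrayA = np.array([1, 0, 2, -3, 6, 8, 4, 7])
arrayB = np.array([[3, 6], [4, 2]])

# Over the whole array.
total = arrayA.sum()
average = arrayA.mean()

# axis=0 works down each column; axis=1 works across each row.
column_sums = arrayB.sum(axis=0)
row_means = arrayB.mean(axis=1)

print(total, average)
print(column_sums)
print(row_means)
```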

6.10 Loading Arrays from Files
Sometimes, we may have data in files and we may need to load that data into an array for processing. numpy.loadtxt() and numpy.genfromtxt() are the two functions that can be used to load data from text files. The most commonly used file type for handling large amounts of data is CSV (Comma Separated Values).

6.10.2 Using numpy.genfromtxt()
genfromtxt() is another function in NumPy to load data from files. Unlike loadtxt(), genfromtxt() can also handle missing values in the data file. Let us look at the following file dataMissing.txt with some missing values and some non-numeric data:
RollNo Marks1 Marks2 Marks3
1, 36, 18, 57
2, ab, 23, 45
3, 43, 51,
4, 41, 40, 60
5, 13, 18, 27
>>> dataarray = np.genfromtxt('C:/NCERT/dataMissing.txt',
 skip_header=1, delimiter=',')
The genfromtxt() function converts missing values and character strings in numeric columns to nan. But if we specify dtype as int, it converts the missing or other non-numeric values to -1. We can also convert these missing values and character strings in the data files to some specific value using the parameter filling_values.
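To try this without creating a file under C:/NCERT, the sketch below feeds the same text to genfromtxt() through io.StringIO, which behaves like an open file. The fill value 0 is an arbitrary choice for illustration.

```python
import io
import numpy as np

# Same contents as dataMissing.txt, held in memory so the sketch runs anywhere.
text = """RollNo Marks1 Marks2 Marks3
1, 36, 18, 57
2, ab, 23, 45
3, 43, 51,
4, 41, 40, 60
5, 13, 18, 27
"""

# Default dtype is float: entries that cannot be read come back as nan.
data = np.genfromtxt(io.StringIO(text), skip_header=1, delimiter=',')

# filling_values substitutes a chosen value for such entries instead.
filled = np.genfromtxt(io.StringIO(text), skip_header=1, delimiter=',',
                       filling_values=0)

print(data)
print(filled)
```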

6.11 Saving NumPy Arrays in Files on Disk
The savetxt() function is used to save a NumPy array
to a text file.
Example 6.11
>>> np.savetxt('C:/NCERT/testout.txt',
studentdata, delimiter=',', fmt='%i')
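Note: We have used the parameter fmt to specify the format in which the data are to be saved; '%i' writes each value as an integer. The default format is '%.18e', i.e. floating-point numbers in scientific notation.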
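A round trip can be sketched as follows: save an array with savetxt(), inspect the text that was written, and load it back with loadtxt(). The studentdata values here are a hypothetical stand-in, and a temporary directory replaces C:/NCERT so the example runs on any machine.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for the studentdata array used above.
studentdata = np.array([[1, 36, 18, 57],
                        [2, 22, 23, 45]])

# Write to a temporary location instead of C:/NCERT.
path = os.path.join(tempfile.mkdtemp(), 'testout.txt')
np.savetxt(path, studentdata, delimiter=',', fmt='%i')

# Read the raw text back to see what was written.
with open(path) as f:
    written = f.read()

# Load the file again into an integer array.
reloaded = np.loadtxt(path, delimiter=',', dtype=int)

print(written)
print(reloaded)
```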

