K-Means Clustering and Its Business Use Cases

Rohit Dhore
4 min read · Jul 19, 2021


* Unsupervised learning

Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and are left to find patterns in that data without any supervision.

The goal of unsupervised learning is to find the underlying structure of a dataset, group the data according to similarities, and represent the dataset in a compressed format.

Unsupervised learning problems can be further categorized into two types:

i) Clustering

ii) Association

Clustering: Clustering is a method of grouping objects into clusters such that the objects with the most similarities stay in one group and have few or no similarities with the objects of another group.

* What is k-means clustering?

Clustering is the task of dividing a population or set of data points into groups such that data points in the same group are more similar to each other than to those in other groups. In simple words, the aim is to find groups with similar traits and assign them to clusters.

The goal of the k-means algorithm is to find groups in the data, with the number of groups represented by the variable k. The algorithm works iteratively to assign each data point to one of k groups based on the features that are provided. In the reference image, k = 2, and two clusters are identified from the source dataset.

The outputs of running k-means on a dataset are:

· k centroids: one centroid for each of the k clusters identified in the dataset.

· Labels for the complete dataset: each data point is assigned to exactly one of the clusters.

Steps in K-Means algorithm:

1. Choose the number of clusters K.

2. Select K points at random to serve as the initial centroids (not necessarily from your dataset).

3. Assign each data point to the closest centroid → that forms K clusters.

4. Compute and place the new centroid of each cluster.

5. Reassign each data point to the new closest centroid. If any reassignment took place, go to step 4; otherwise, the model is ready.
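The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names, the sample points, the fixed random seed, and the choice to draw the initial centroids from the dataset itself are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_centroid(point, centroids):
    """Index of the centroid nearest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 2: pick K starting centroids (assumption: sampled from the data).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    # Step 3: assign each point to its closest centroid.
    labels = [closest_centroid(p, centroids) for p in points]

    for _ in range(max_iterations):
        # Step 4: move each centroid to the mean of its cluster.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
        # Step 5: reassign; stop once no assignment changes.
        new_labels = [closest_centroid(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
```

Here `centroids` holds the two cluster centers and `labels` holds one cluster index per input point, which correspond to the two outputs listed earlier. The `max_iterations` cap is a safety limit added for the sketch and is not part of the steps as described.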

* Business Uses

The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
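Assigning new data to a group comes down to finding the nearest centroid. The sketch below assumes Euclidean distance and uses hypothetical centroids and features (monthly visits and average basket value) purely for illustration.

```python
import math


def assign_to_cluster(point, centroids):
    """Return the index of the centroid nearest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical centroids from an earlier k-means run on
# (monthly visits, average basket value).
centroids = [(2.0, 15.0), (12.0, 80.0)]
new_customer = (10.0, 65.0)
cluster = assign_to_cluster(new_customer, centroids)
```

No retraining is needed for this step. In practice, though, the new point should be scaled the same way as the data the centroids were learned from.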

This is a versatile algorithm that can be used for many types of grouping. Some examples of use cases are:

*Behavioral segmentation:

  • Segment by purchase history
  • Segment by activities on application, website, or platform
  • Define personas based on interests
  • Create profiles based on activity monitoring

*Inventory categorization:

  • Group inventory by sales activity
  • Group inventory by manufacturing metrics

*Sorting sensor measurements:

  • Detect activity types in motion sensors
  • Group images
  • Separate audio
  • Identify groups in health monitoring

*Detecting bots or anomalies:

  • Separate valid activity groups from bots
  • Group valid activity to clean up outlier detection

~PROPOSED SYSTEM~

The configuration step most crucial to the operation of the system is the categorization of the most common security threats, such as DDoS attacks, malware, exploits, and vulnerabilities. This involves defining a set of search terms (keywords) associated with each class of threats. Each keyword must have an importance level (weight) attached to it, denoting how much one occurrence of the word contributes to the score of each document (Table I). The keyword lists must follow this format:

• Threat Class 1:

keywords: {[keyword1, weight], [keyword2, weight], … [keywordN, weight]}

• Threat Class 2:

keywords: {[keyword1, weight], [keyword2, weight], … [keywordN, weight]}

…

• Threat Class N:

keywords: {[keyword1, weight], [keyword2, weight], … [keywordN, weight]}
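As a rough illustration of how such weighted keyword lists could be used, the sketch below scores a document per threat class by adding a keyword's weight each time it occurs. The threat classes, keywords, weights, and the simple summation rule are all assumptions for the example. The text above only says that each occurrence contributes its weight to the document score.

```python
import re

# Hypothetical threat classes, keywords, and weights, for illustration only.
THREAT_KEYWORDS = {
    "ddos": {"ddos": 3.0, "botnet": 2.0, "flood": 1.0},
    "malware": {"malware": 3.0, "ransomware": 3.0, "trojan": 2.0},
}


def score_document(text, keywords=THREAT_KEYWORDS):
    """Return {threat_class: score}; each keyword occurrence adds its weight."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    scores = {}
    for threat_class, weights in keywords.items():
        scores[threat_class] = sum(weights.get(token, 0.0) for token in tokens)
    return scores


scores = score_document("Botnet launches DDoS flood; second DDoS wave expected.")
```

Scores like these could serve as the numeric features that a clustering step works on, one value per threat class for each document.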

$ WORKING/IMPLEMENTATION

K-means clustering is performed on a crime data set using the RapidMiner tool. The simulation is carried out in steps. First, a data set is obtained. Second, the data set is filtered according to the requirements, and a new data set is created containing only the attributes needed for the analysis. Third, the RapidMiner tool is opened and the Excel file is read. The “Replace the Missing value” operator is then applied and executed. Fourth, the “Normalize” operator is applied to the resulting data set and executed.

Finally, k-means clustering is performed on the normalized data set, and the resulting clusters are analyzed.
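The two preprocessing steps can be approximated outside a GUI tool as well. The sketch below is not the tool's own implementation. It assumes missing values are replaced by the column mean and that normalization means a z-score transformation, and the sample rows are hypothetical.

```python
from statistics import mean, pstdev


def replace_missing(rows):
    """Replace None entries with the mean of the known values in that column."""
    columns = list(zip(*rows))
    fills = [mean(v for v in col if v is not None) for col in columns]
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def normalize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # guard for constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]


# Hypothetical rows: (incidents per year, population in thousands).
# None marks a missing value.
raw = [[120, 50.0], [None, 75.0], [300, None], [180, 100.0]]
prepared = normalize(replace_missing(raw))
```

Normalizing before clustering matters because k-means relies on distances. Without it, an attribute measured in large units would dominate the grouping.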

Sample Data
Organized Data
