ISSN: 2229-6093
R.Santhanalakshmi, Dr.K.Alagarsamy, Int. J. Comp. Tech. Appl., Vol 2 (1), 193-198

An Innovative Approach in Text Mining

R.Santhanalakshmi (1), Research Scholar, Dept. of MCA, Computer Centre, Madurai Kamaraj University, Madurai. [email protected]

Dr.K.Alagarsamy (2), Associate Professor, Dept. of MCA, Computer Centre, Madurai Kamaraj University, Madurai. [email protected]

Abstract: Text mining builds classification and predictive models that, through bootstrapping techniques, re-use a source data set for a specific application, avoiding information overload and redundancy. The resulting classification and prediction output is compact compared with the original data source. Text mining examines text and data in order to draw conclusions about the structure of, and relationships between, the sets of information contained in the original collection, or to approximate expected values. In this paper we retrieve bovine disease information from the internet using k-means clustering and principal component analysis.

Keywords: Bovine Diseases, K-Means Clustering, Principal Component Analysis.

I. Introduction:

1.1 Bovine Diseases:

Bovine diseases are the common diseases of the cattle sector. They take a variety of forms and show many symptoms; we discuss some of them here. BVDV is one of the common causes of infectious abortion and is also associated with a wide range of conditions, from infertility to pneumonia, diarrhoea and poor growth. BVDV is normally the major viral cause of disease in cattle and belongs to the family of pestiviruses. Other diseases associated with pestiviruses include classical swine fever and border disease in sheep. Pestiviruses infect cloven-hoofed stock only; BVDV has been found in pigs and sheep. Because BVDV causes such a wide range of disease, it is rarely possible to diagnose it on clinical signs alone; testing the blood for antibodies and virus is the best method of diagnosis. A paired blood sample for antibodies is useful for pneumonia, diarrhoea and infertility: if the first sample is taken when the animal is ill and the second two to three weeks later, a rise in antibodies suggests active infection. BVD is a viral disease of cattle caused by a pestivirus. It has many different manifestations in a herd, depending on the herd's immune and reproductive status. Transient diarrhoea, mixed respiratory infection, infertility or abortion, and mucosal disease are the most common clinical signs and can be seen simultaneously in a herd. Because of its varied manifestations and subclinical nature in many herds, the significance of the disease was not well understood until recently, when diagnostic methods improved. Bovine herpesvirus 1 is a virus of the family Herpesviridae that causes several diseases in cattle worldwide, including rhinotracheitis,

vaginitis, balanoposthitis, abortion, conjunctivitis and enteritis. BHV-1 is also a contributing factor in shipping fever. Bovine leukemia virus (BLV) is a bovine virus closely related to HTLV-I, a human tumour virus. BLV is a retrovirus which integrates a DNA intermediate, as a provirus, into the DNA of B-lymphocytes of blood and milk. It contains an oncogene coding for a protein called Tax.

1.2 K-Means Clustering:

In statistics and machine learning, k-means clustering [4] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians, in that both attempt to find the centres of natural clusters in the data and both employ an iterative refinement approach.

Procedure: The algorithm is initiated by creating k different clusters, and the given sample set is first randomly distributed among them. Next, the distance from each sample within a cluster to its cluster centroid is calculated, and each sample is moved to the cluster whose centroid is nearest to it. As a first step of the analysis, the user decides on the number of clusters k; this parameter can take integer values with a lower bound of 1 and an upper bound equal to the total number of samples. The k-means algorithm is repeated a number of times, each run starting from a different random set of initial clusters, to obtain a good clustering solution.

1.3 Principal Component Analysis:

The main basis of PCA-based dimension reduction is that PCA picks out the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low-rank approximation of the data via the singular value decomposition. However, this noise-reduction property alone is inadequate to explain the effectiveness of PCA. PCA is also a basic method of social network mining, with applications to ranking and clustering that can be further deployed in marketing and in user segmentation, by selecting communities with desired or undesired properties. In particular, the friends list of a blog can be used for social filtering, that is, reading posts that friends write or have recently read. Principal component analysis is similar to the HITS ranking algorithm; in fact, the hub and authority rankings are defined by the first left and right singular vectors, and the use of higher dimensions has been suggested and analyzed in detail. Several authors use HITS for measuring authority in mailing lists or blogs, the latter work observing a strong correlation between HITS score and degree, indicating that the first principal axis carries no high-level information but simply orders nodes by number of friends. We demonstrate that HITS-








style ranking can be used, but with special care, because of the Tightly Knit Community (TKC) effect, which results in communities that are small on a global level grabbing the first principal axes. The TKC problem in the HITS algorithm has been identified before, but the proposed algorithmic solution turns out merely to compute in- and out-degrees. In contrast, we keep PCA as the underlying matrix method and filter the relevant high-level structural information by removing tightly knit communities.

II. Proposed Method (SAN Method):

In our method we combine k-means clustering and principal component analysis for effective clustering and an optimized solution. When searching for information on the internet, we must obtain the information we actually require, otherwise the search is of no use. Every clustering method has its own strategy and importance: no single clustering mechanism is sufficient for every kind of search, nor can we ensure that every clustering method returns the same result for the same key term. For this reason we combine both techniques into a new approach for optimizing search over a large database or the internet. The two techniques are related: k-means clustering groups the source data into groups called clusters based on a distance measure, while principal component analysis focuses on dimension reduction based on mathematical models. Our domain information is related to bovine disease, which is very specific, instead of searching across all domains. Even with a specific domain, we must search throughout the internet if working online, or otherwise through a large database. In our earlier work we used a modified HITS algorithm for searching, and in another we used a stemming algorithm with hierarchical clustering. Here we combine k-means and principal component analysis and evaluate the results. Our research ends with a comparison among all these techniques to determine which works best for our task.

The bovine-disease keyword is given as the search element, and using that keyword we first form the initial clusters. For example, suppose we have n sample feature vectors bv1, bv2, ..., bvn, all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i; for calculating distances we use the Euclidean distance formula, a standard and simple way to measure the distance between two elements. If the clusters are well separated, we can use a minimum-distance classifier to separate them: x is in cluster i if ||x - mi|| is the minimum of all the k distances. This suggests the following procedure for finding the k means:

  Make initial guesses for the means m1, m2, ..., mk
  Until there are no changes in any mean:
    Use the estimated means to classify the samples into clusters
    For i from 1 to k:
      Replace mi with the mean of all of the samples assigned to cluster i
    end_for
  end_until
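The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy 2-D data are made up for the example, and the squared Euclidean distance is used since it gives the same nearest-mean assignments.

```python
import random

def dist2(p, q):
    # Squared Euclidean distance between two feature vectors
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: classify each sample by its nearest mean, then
    replace each mean with the centroid of its samples, repeating
    until no assignment changes (the 'until there are no changes' loop)."""
    rng = random.Random(seed)
    means = list(rng.sample(points, k))   # initial guesses m1, ..., mk
    assign = None
    for _ in range(iters):
        # Use the estimated means to classify the samples into clusters
        new_assign = [min(range(k), key=lambda i: dist2(p, means[i]))
                      for p in points]
        if new_assign == assign:          # no change in any mean: stop
            break
        assign = new_assign
        # Replace mi with the mean of all samples assigned to cluster i
        for i in range(k):
            members = [p for p, a in zip(points, assign) if a == i]
            if members:
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assign

# Two well-separated groups of 2-D feature vectors (made-up data)
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
means, labels = kmeans(pts, 2)
```

On this toy data the two tight pairs end up in separate clusters, with means near (0, 0.5) and (10, 10.5); in practice the run is restarted from several random initializations, as described above.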

In addition, to improve the k-means algorithm while forming the cluster analysis, we include stemming: a brute-force (lookup) approach and suffix stripping. Brute-force stemmers maintain a lookup table which contains relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection; if one is found, the word is replaced by its associated root. Suffix-stripping algorithms do not use a lookup table; instead, a typically smaller list of rules is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of such rules: Rule 1) if the word ends in 'ed', remove the 'ed'; Rule 2) if the word ends in 'ing', remove the 'ing'; Rule 3) if the word ends in 'ly', remove the 'ly'. In this way some groups of clusters are formed at the final stage, but we cannot take these as the final optimized result, so we analyze them further by passing the final clusters into principal component analysis.

Principal component analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on. Define a data matrix, X^T, with zero sample mean, where each of the n rows represents a different repetition of the experiment, and each of the m columns gives a particular kind of datum. The singular value decomposition of X is X = W Σ V^T, where the m × m matrix W is the matrix of eigenvectors of X X^T, Σ is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal, and the matrix V is n × n. The PCA transformation that preserves dimensionality is then given by:

  Y^T = X^T W

V is not uniquely defined in the usual case when m < n - 1, but Y will usually still be uniquely defined. Since W is an orthogonal matrix, each row of Y^T is simply a rotation of the corresponding row of X^T. The first column of Y^T is made up of the scores of the cases with respect to the first principal component; the next column holds the scores with respect to the second principal component, and so on. If we want a reduced-dimensionality representation, we can project X down into the reduced space defined by only the first L singular vectors, W_L:

  Y = W_L^T X = Σ_L V_L^T

The matrix W of singular vectors of X is equivalently the matrix of eigenvectors of the observed covariance matrix C = X X^T:

  X X^T = W Σ Σ^T W^T
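The projection above can be sketched with numpy's SVD. This is an illustrative example, not the paper's code: the function name and the toy matrix are made up, and rows of Xt are the n zero-mean observations, matching the X^T convention used here.

```python
import numpy as np

def pca_reduce(Xt, L):
    """PCA via the SVD, in the notation above: with X = W S V^T,
    the full scores are Y^T = X^T W, and the reduced scores use only
    the first L singular vectors, Y_L^T = X^T W_L."""
    Xt = Xt - Xt.mean(axis=0)      # enforce zero sample mean per column
    X = Xt.T                       # m x n: columns are observations
    W, s, Vt = np.linalg.svd(X, full_matrices=False)
    Yt_L = Xt @ W[:, :L]           # n x L matrix of principal-component scores
    return Yt_L, s

# Made-up 2-D data whose variance lies mostly along the first axis
pts = np.array([[-2.0, -0.2], [-1.0, 0.1], [1.0, -0.1], [2.0, 0.2]])
scores, sing = pca_reduce(pts, 1)
```

Because the singular values come back in decreasing order, keeping the first L columns of W retains the directions of largest variance; on this toy data the single retained score per point is essentially its x-coordinate.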

Given a set of points in Euclidean space, the first principal component corresponds to a line that passes through the multidimensional mean and minimizes the sum of squared distances of the points from the line. The second principal component corresponds to the same concept after all correlation with the first principal component has been subtracted out from the points. The singular values (in Σ) are the square roots of the eigenvalues of the matrix X X^T. Each eigenvalue is proportional to the portion of the variance that is correlated with its eigenvector. The sum of all the eigenvalues is equal to the sum of the squared distances of the points from their multidimensional mean. PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible into the first few dimensions; the values in the remaining dimensions therefore tend to be small and may be dropped with minimal loss of information. Finally we obtain the reduced cluster as the output of our query.

III. Result Analysis:

The simulation is carried out in Matlab. As an example, take the query: symptoms of bovine leukemia. First we see the outcome of k-means clustering in Fig. 1 and Fig. 2 (cluster terms include: feeding gouge, dehorning, rhinotracheitis, ataxia, B-cell leukemia, lymphocytes, BLV, palpation, colostrum, provirus, mononucleosis). The k-means cluster output is then given to principal component analysis, which generates the variance matrix and reduces it in further steps until we finally obtain the result of the query. The comparison analysis below gives the performance evaluation of the combined approach against each technique used individually:


Sample size   SAN method   K-means   PCA
2750          0.91         0.87      0.82
4550          0.81         0.75      0.69
7700          0.84         0.65      0.62
10100         0.89         0.73      0.65

As the result analysis depicts, the SAN method's performance is higher than that of the other methods, and as the data set grows the performance ratio decreases for both k-means and principal component analysis used individually.

IV. Conclusion:

In this paper we provided an effective method for information retrieval for bovine disease. The SAN method gives a better solution compared with principal component analysis and k-means used separately. In our earlier work we focused on enhancing Medline and PubMed search using a modified HITS algorithm, and we also experimented with stemming algorithms. We conclude that, among all these methods, the SAN method gave the most effective solution for bovine disease searching.

V. References:

[1] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In LinkKDD '05: Proceedings of the 3rd International Workshop on Link Discovery, pages 36-43, New York, NY, USA, 2005. ACM.
[2] Pedro Domingos and Matt Richardson. Mining the network value of customers. In KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57-66, New York, NY, USA, 2001. ACM.
[3] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 44-54, New York, NY, USA, 2006. ACM Press.
[4] D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a recursive spectral algorithm for clustering from pairwise similarities. Technical Report MIT-LCS-TR-906, MIT LCS, 2003.
[5] Matthew Hurst, Matthew Siegler, and Natalie Glance. On estimating the geographic distribution of social media. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2007), 2007.
[6] M. Newman. Detecting community structure in networks. The European Physical Journal B - Condensed Matter, 38(2):321-330, March 2004.
[7] Jun Zhang, Mark S. Ackerman, and Lada Adamic. Expertise networks in online communities: structure and algorithms. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 221-230, New York, NY, USA, 2007. ACM Press.

