data mining decision tree


Introduction

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes an outcome of the test, and each leaf node holds a class label. The topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer. Each internal node represents a test on an attribute. Each leaf node represents a class.

[Figure: decision tree for the concept buy_computer]

Advantages of Decision Tree

  • It does not require any domain knowledge.
  • It is easy for humans to understand.
  • The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm

In 1980, a machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). He later presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach. These algorithms do not backtrack; the trees are constructed in a top-down, recursive, divide-and-conquer manner.

Generating a decision tree from the training tuples of data partition D
Algorithm : Generate_decision_tree

Input:
Data partition, D, which is a set of training tuples 
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or a splitting subset.

Output:
 A Decision Tree

Method:
create a node N;
if tuples in D are all of the same class, C, then
   return N as a leaf node labeled with the class C;
if attribute_list is empty then
   return N as a leaf node labeled
   with the majority class in D; // majority voting
apply attribute_selection_method(D, attribute_list)
to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
   multiway splits allowed then  // not restricted to binary trees
   attribute_list = attribute_list - splitting_attribute; // remove splitting attribute
for each outcome j of splitting_criterion
   // partition the tuples and grow subtrees for each partition
   let Dj be the set of data tuples in D satisfying outcome j; // a partition
   if Dj is empty then
      attach a leaf labeled with the majority
      class in D to node N;
   else
      attach the node returned by
      Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
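
The pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Quinlan's implementation: it assumes discrete-valued attributes with multiway splits and uses information gain as the attribute selection method. The function names, the `domains` argument (the possible values of each attribute), and the tiny training set are hypothetical.

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy of the class labels; each row is (features_dict, label)."""
    counts = Counter(label for _, label in rows)
    return -sum((c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())

def information_gain(rows, attr):
    """Entropy reduction from splitting rows on attr (assumed selection method)."""
    partitions = {}
    for features, label in rows:
        partitions.setdefault(features[attr], []).append((features, label))
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(rows) - remainder

def majority_class(rows):
    """Most frequent class label in rows (majority voting)."""
    return Counter(label for _, label in rows).most_common(1)[0][0]

def generate_decision_tree(rows, attribute_list, domains):
    """Return a class label (leaf) or {'attr': name, 'children': {value: subtree}}."""
    labels = {label for _, label in rows}
    if len(labels) == 1:                 # all tuples belong to the same class
        return labels.pop()
    if not attribute_list:               # no attributes left: majority voting
        return majority_class(rows)
    best = max(attribute_list, key=lambda a: information_gain(rows, a))
    remaining = [a for a in attribute_list if a != best]  # remove splitting attribute
    node = {"attr": best, "children": {}}
    for value in domains[best]:          # one branch per outcome
        subset = [row for row in rows if row[0][best] == value]
        if not subset:                   # empty partition: majority class in D
            node["children"][value] = majority_class(rows)
        else:
            node["children"][value] = generate_decision_tree(subset, remaining, domains)
    return node

def classify(tree, features):
    """Follow the branches matching features until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][features[tree["attr"]]]
    return tree

# Hypothetical toy data in the spirit of buy_computer.
DOMAINS = {"age": ["youth", "middle_aged", "senior"], "student": ["no", "yes"]}
TRAINING = [
    ({"age": "youth", "student": "no"}, "no"),
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "middle_aged", "student": "no"}, "yes"),
    ({"age": "middle_aged", "student": "yes"}, "yes"),
    ({"age": "senior", "student": "no"}, "no"),
    ({"age": "senior", "student": "yes"}, "yes"),
]
TREE = generate_decision_tree(TRAINING, ["age", "student"], DOMAINS)
print(TREE["attr"], classify(TREE, {"age": "senior", "student": "no"}))
```

On this toy data the attribute `student` has the larger information gain, so it becomes the root test; `age` is tested only on the `student = no` branch.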

Tree Pruning

Tree pruning is performed to remove branches that reflect anomalies in the training data caused by noise or outliers. Pruned trees are smaller and less complex.

Tree Pruning Approaches

There are two approaches to tree pruning:

  • Prepruning – The tree is pruned by halting its construction early.
  • Postpruning – Subtrees are removed from a fully grown tree.

Cost Complexity

The cost complexity of a tree is measured by the following two parameters:

  • Number of leaves in the tree
  • Error rate of the tree
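
One common way to combine these two parameters, used in CART-style cost-complexity pruning, is a weighted sum: cost = error rate + alpha × number of leaves. The sketch below assumes that form; the penalty `alpha`, its default value, and the function names are illustrative assumptions, not values given in the text.

```python
def cost_complexity(error_rate, num_leaves, alpha=0.01):
    """Weighted sum of the two parameters; alpha is an assumed per-leaf penalty."""
    return error_rate + alpha * num_leaves

def should_prune(subtree_error, subtree_leaves, leaf_error, alpha=0.01):
    """Prune when replacing the subtree with a single leaf does not raise the cost."""
    pruned_cost = cost_complexity(leaf_error, 1, alpha)
    kept_cost = cost_complexity(subtree_error, subtree_leaves, alpha)
    return pruned_cost <= kept_cost

# A 5-leaf subtree with 10% error versus a single leaf with 12% error.
print(should_prune(0.10, 5, 0.12))
```

A larger `alpha` favors smaller trees; with `alpha = 0` the comparison reduces to error rate alone.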

project idea(model based image compression of medical images)

Project Idea | (Model based Image Compression of Medical Images)



The project is about providing fast transfer of medical images to and from rural areas where bandwidth is low. The idea is to keep model medical images at all locations (rural and urban). To transfer a patient’s image from one location to another, compute the difference image between the patient’s image and the model image. The difference image has less data to transfer. To further reduce the size of the difference image, use image registration to align the patient’s image with the model image first. The sending side then sends the difference image, and the receiving side adds it to the model image to recover the patient’s image.
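
The send and receive steps described above can be sketched as follows. This is a minimal illustration that assumes grayscale images of equal size stored as nested lists; image registration and the actual compression of the difference image are not shown, and all names and pixel values are hypothetical.

```python
def difference_image(patient, model):
    """Pixel-wise signed difference patient - model (same dimensions assumed)."""
    return [[p - m for p, m in zip(p_row, m_row)]
            for p_row, m_row in zip(patient, model)]

def reconstruct(diff, model):
    """Receiver side: add the difference image back onto the local model image."""
    return [[d + m for d, m in zip(d_row, m_row)]
            for d_row, m_row in zip(diff, model)]

def nonzero_count(image):
    """Number of pixels that differ from the model and so carry information."""
    return sum(1 for row in image for value in row if value != 0)

MODEL = [[10, 10, 10], [20, 20, 20]]      # stored at both locations
PATIENT = [[10, 12, 10], [20, 20, 19]]    # exists only at the sender

DIFF = difference_image(PATIENT, MODEL)   # this is what gets transmitted
RECEIVED = reconstruct(DIFF, MODEL)       # receiver recovers the patient image
print(DIFF, RECEIVED == PATIENT)
```

Because most entries of the difference image are zero when the patient image resembles the model, a general-purpose or specialized compressor has much less to encode.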
Research:
Specialized methods can be used to compress difference images. One such method is discussed in the reference paper below.
Tools:
For a research-oriented compression project, MATLAB can be used. To build a complete application with networking, Java can be used.
Reference:
Inderscience journal paper on Model-based image compression framework for CT and MRI images
If you also wish to showcase your project idea here, please send an email to contribute@geeksforgeeks.org. 

project idea(remote lab assistance)

Project Idea | (Remote Lab Assistance)



The idea is to provide a framework for students and instructors. The framework gives the instructor remote monitoring of the lab and an effective evaluation and grading methodology. It gives students remote login, software access, and problem resolution through help from the teacher. The framework can be easily implemented as a client-server application in Java.
Features:
Instructor Panel: 
    a) Sees icons for all students. Can click on an icon to see what the student is doing.
    b) Can chat with students.
    c) Can detect copying or opening of windows other than the IDE. An icon blinks if there is a sudden change in the picture.
Student Panel: 
    a) Can send a help request to the instructor.
    b) Can chat with the instructor.
Implementation:
We can capture screenshots of all students and send them at fixed intervals (say 2 seconds) to the instructor.
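
Feature (c) of the instructor panel, blinking an icon on a sudden change in the picture, could be approximated by comparing consecutive screenshots. The sketch below assumes grayscale frames stored as nested lists; the `tolerance` and `threshold` values and the function names are illustrative assumptions.

```python
def changed_fraction(prev, curr, tolerance=10):
    """Fraction of pixels whose grayscale value changed by more than tolerance."""
    total = 0
    changed = 0
    for prev_row, curr_row in zip(prev, curr):
        for p, c in zip(prev_row, curr_row):
            total += 1
            if abs(p - c) > tolerance:
                changed += 1
    return changed / total if total else 0.0

def is_sudden_change(prev, curr, threshold=0.5, tolerance=10):
    """Flag the frame when at least `threshold` of the pixels changed."""
    return changed_fraction(prev, curr, tolerance) >= threshold

PREVIOUS = [[0, 0], [0, 0]]
TYPING = [[0, 0], [0, 200]]             # small local change, e.g. a new character
SWITCHED = [[255, 255], [255, 0]]       # most of the screen changed

print(is_sudden_change(PREVIOUS, TYPING), is_sudden_change(PREVIOUS, SWITCHED))
```

Ordinary typing changes only a small part of the screen, so it stays below the threshold, while switching to another window changes most pixels at once.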
Tools:
Java provides rich libraries for networking and image processing. Java and NetBeans can be downloaded from the link below:
http://www.oracle.com/technetwork/java/javase/downloads/index.html
Research:
a) Image compression techniques specialized for images that contain programming text.
b) Feature (c) of the instructor panel, detecting sudden changes in the picture, is an interesting problem to investigate.
Reference:
IEEE Transactions paper on Addressing the Bandwidth Efficiency, Control, and Evaluation Issues in Software Remote Laboratory

project idea(personalized real time update system)

Project Idea | (Personalized real-time update system)



The main goal is to create a framework that delivers updates in real time. The updates can be news updates, emergency traffic alerts, or updates from any social networking website. The updates are personalized based on multiple factors such as the user’s geographical location, the user’s preferences, and social networks, e.g. Facebook friends, Twitter followers, etc.
Features:
Updates can be classified as coming from trustworthy sources, such as news websites and verified Twitter accounts, or from untrustworthy sources, such as friends’ status updates and tweets from unverified Twitter accounts.
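
The classification rule above could be sketched as follows. The `Update` fields and the source type names are hypothetical; a real system would derive them from each provider's data.

```python
from dataclasses import dataclass

@dataclass
class Update:
    source_type: str   # assumed labels: "news_site", "twitter", "friend_status"
    verified: bool     # whether the posting account is verified
    text: str

TRUSTED_SOURCE_TYPES = {"news_site"}

def is_trustworthy(update):
    """News websites and verified Twitter accounts count as trustworthy."""
    if update.source_type in TRUSTED_SOURCE_TYPES:
        return True
    return update.source_type == "twitter" and update.verified

UPDATES = [
    Update("news_site", False, "Road closed on the highway"),
    Update("twitter", True, "Official traffic alert"),
    Update("twitter", False, "I heard the bridge is closed"),
    Update("friend_status", False, "Stuck in traffic"),
]
print([is_trustworthy(u) for u in UPDATES])
```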
Research:
Different data mining and text mining techniques can be researched to get the most relevant updates.
Tools:
For developing the front end, any JavaScript-based framework such as Ext JS or AngularJS can be used. For the back end, PHP can be used to interact with the database.
References:
http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1509

About the author:


Harshit is a technology enthusiast and has a keen interest in programming. He holds a B.Tech. degree in Computer Science from JIIT, Noida and currently works as a Front-end Developer at SAP. He is also a state-level table tennis player. Apart from this, he likes to unwind by watching movies and English sitcoms. He is based out of Delhi and you can reach out to him at https://in.linkedin.com/pub/harshit-jain/2a/129/bb5

project idea(character recognition from image)

Project Idea | ( Character Recognition from Image )



Aim: The aim of this project is to develop a tool that takes an image as input and extracts characters (letters, digits, symbols) from it. The image can be of a handwritten or a printed document. The tool can be used as a form of data entry from printed records.
Tool: This project is based on machine learning. We provide a large data set as input to the software, which learns to recognize the patterns in it. MATLAB or Octave can be used to build this product, but Octave is recommended in the initial stage as it is free and easy to use.
Research: This is an active area of research. Research areas include image processing, natural language processing, artificial intelligence, and machine learning.
Implementation: The implementation of such a tool depends on two factors: feature extraction and the classification algorithm. You can use the various classifiers available online and read about basic feature extraction algorithms. A basic, less accurate version of the product can be implemented in Octave with a limited training data set and simple component analysis. Refer to the links below for more information about implementation and ongoing research.
http://perun.pmf.uns.ac.rs/radovanovic/dmsem/completed/2006/OCR.pdf
http://crypto.stanford.edu/~dwu4/papers/ICDAR2011.pdf 
http://yann.lecun.com/exdb/publis/pdf/matan-90.pdf
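
The two factors named under Implementation, feature extraction and classification, can be illustrated with a deliberately tiny sketch. It assumes 3x3 binary bitmaps, uses the flattened pixels as features, and uses a nearest-neighbour classifier with Hamming distance; the glyph shapes and function names are hypothetical, and real systems use far richer features and models.

```python
def extract_features(bitmap):
    """Flatten a binary bitmap (rows of '#' and '.') into a 0/1 feature vector."""
    return [1 if ch == "#" else 0 for row in bitmap for ch in row]

def hamming(a, b):
    """Number of positions at which two feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def classify_glyph(bitmap, training_set):
    """Return the label of the training example closest to the input bitmap."""
    features = extract_features(bitmap)
    return min(training_set, key=lambda item: hamming(features, item[0]))[1]

# Hypothetical training glyphs, one example per class.
TRAINING_SET = [
    (extract_features([".#.", ".#.", ".#."]), "I"),
    (extract_features(["#..", "#..", "###"]), "L"),
    (extract_features(["###", ".#.", ".#."]), "T"),
]

NOISY_T = ["###", ".#.", ".##"]   # a 'T' with one extra pixel set
print(classify_glyph(NOISY_T, TRAINING_SET))
```

The noisy input differs from the stored 'T' in one pixel, from 'I' in three, and from 'L' in five, so the nearest neighbour is 'T'.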
There are also online tools that recognize characters in an image and convert them to machine-encoded text in doc or txt format, for example http://www.onlineocr.net/
The field is very large; you can learn a lot about the technologies above by contributing to ongoing projects or by creating your own from scratch.
This idea is contributed by Utkarsh Trivedi.

project idea(static code checker for c++) 

Project Idea | (Static Code Checker for C++)



The biggest problem that students face when they join big corporations is the difficulty of writing the high-quality code that these corporations demand. The main reason is that in college they are trained to just make things work somehow, even if it means using dirty hacks. A tool that runs static checks on a given piece of code can help coders in general, and college students in particular, improve the quality of their code to a great extent.
Features:
A static code checker can check code and warn the programmer about violated best practices, possible mistakes, and loopholes without executing the code. For example:

  • Memory leaks
  • Unused variables
  • Undeclared variables
  • Array bounds checks
  • Dead code
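
As a rough illustration of one check from the list above, the sketch below flags unused variables in a C++ snippet. It is a naive, regex-based approximation written in Python, not a real static analyzer: it assumes one simple declaration per statement with a basic built-in type, and it ignores scopes, comments, and string literals. The pattern and function name are hypothetical.

```python
import re

# Matches simple declarations such as "int total = 0;" or "double ratio;".
DECLARATION = re.compile(
    r"\b(?:int|long|float|double|char|bool)\s+([A-Za-z_]\w*)\s*(?:=[^;]*)?;"
)

def unused_variables(source):
    """Return names that are declared but never mentioned anywhere else."""
    unused = []
    for match in DECLARATION.finditer(source):
        name = match.group(1)
        occurrences = re.findall(r"\b%s\b" % re.escape(name), source)
        if len(occurrences) == 1:      # only the declaration itself
            unused.append(name)
    return unused

SAMPLE = """
int main() {
    int total = 0;
    int unused_count = 5;
    double ratio;
    total = total + 1;
    return total;
}
"""
print(unused_variables(SAMPLE))
```

A production checker would work on a parsed syntax tree rather than on raw text, which is what makes checks such as memory leaks and dead code feasible.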


Research:
There are many best practices that should be followed in a language like C++ to ensure that code is of high quality. More research can be done on which best practices, loopholes, and obvious errors the project should take into consideration.
Implementation:
The static code checker could be written as a plugin for an existing IDE such as Eclipse or Code::Blocks (recommended), or it could take the form of a website where you paste your code and run static code checks.
References:
Many static code checkers already exist. For example, two well-known static code checkers for JavaScript are JSLint and JSHint.
http://www.jslint.com/
http://jshint.com/

About the author:


Harshit is a technology enthusiast and has a keen interest in programming. He holds a B.Tech. degree in Computer Science from JIIT, Noida and currently works as a Front-end Developer at SAP. He is also a state-level table tennis player. Apart from this, he likes to unwind by watching movies and English sitcoms. He is based out of Delhi and you can reach out to him at https://in.linkedin.com/pub/harshit-jain/2a/129/bb5

project idea(brain computer interface)

Project Experience | (Brain Computer Interface)



Introduction:
I worked on Brain Computer Interface (BCI) technology under Cybersecurity at the University of North Texas for two months as my summer internship. We worked closely with some Ph.D. students under the supervision of a mentor researcher. It was a research-based project in which we were given the task of discovering new functionality for two existing BCI devices – Neurosky Mindwave and Emotiv EPOC.
Application:
Before the application started, the user was asked to think about a particular number from 0 to 9 for around 30 seconds. We developed an application that flashed random numbers from 0 to 9 on the screen, one per second, for an adjustable duration of 20-30 seconds. While watching the flashing numbers, the user was asked to look for the number that he/she had been thinking about. One of the BCI devices was used to capture EEG values from the user during the test. These EEG values were recorded in a Microsoft Access datasheet along with the brain voltage corresponding to each EEG value. We got approximately 512 EEG values per second, i.e., 512 values for each flash of a number. We used Python to filter the recorded data with a Butterworth filter in order to remove unwanted noise. The application interface and front end were created using C#.
Based on the filtered EEG values, and using Java code, we could identify two essential values: P300 and N400. P300 is a positive peak in EEG amplitude that occurs around 300 milliseconds after a stimulus. The user’s brain would generate a P300 when the user found his/her number flashing on the screen. Across the whole data set, the number with the strongest P300 responses was taken to be the number that the user had been thinking about before the test and looking for during it. Hence, the number in the user’s thoughts could be identified with appreciable accuracy without asking the user to enter it manually. The only requirement is that the user thinks about the number, uninterrupted, for 30 seconds or less and tries to spot that number during the test.
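
The selection step described above, picking the number whose flashes produce the strongest P300 response, might look like the sketch below. It is not the original project code: it assumes the signal has already been filtered and cut into one epoch per flash at 512 samples per second, and it scores each epoch by its peak amplitude in a window around 300 milliseconds after the flash (the usual latency of the P300). The window bounds, function names, and synthetic data are assumptions.

```python
SAMPLE_RATE_HZ = 512  # from the text: about 512 EEG values per second

def p300_score(epoch, start_ms=250, end_ms=400):
    """Peak amplitude within an assumed window around 300 ms after the flash."""
    start = start_ms * SAMPLE_RATE_HZ // 1000
    end = end_ms * SAMPLE_RATE_HZ // 1000
    return max(epoch[start:end])

def guess_digit(epochs_by_digit):
    """epochs_by_digit maps each digit to a list of epochs (lists of samples)."""
    mean_scores = {
        digit: sum(p300_score(epoch) for epoch in epochs) / len(epochs)
        for digit, epochs in epochs_by_digit.items()
    }
    return max(mean_scores, key=mean_scores.get)

def synthetic_epoch(peak):
    """One second of flat signal with a single peak near 300 ms (test data only)."""
    epoch = [0.0] * SAMPLE_RATE_HZ
    epoch[154] = peak           # sample 154 is about 300 ms at 512 Hz
    return epoch

EPOCHS = {
    3: [synthetic_epoch(0.0), synthetic_epoch(0.0)],
    7: [synthetic_epoch(5.0), synthetic_epoch(4.0)],
    9: [synthetic_epoch(1.0)],
}
print(guess_digit(EPOCHS))
```

Averaging over several flashes of the same digit matters in practice, because a single EEG epoch is usually too noisy to show the peak reliably.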
Usage:
This feature could be used in the field of cybersecurity for password protection. An application could ask the user to think about his/her PIN and, after processing the data, grant an authorized user access to the account based on the correct PIN, without the user having to enter the PIN physically anywhere. Such an application may reduce the risk of eavesdropping or hacking.
This article is contributed by Gunjan Soni. If you like GeeksforGeeks and would like to contribute, you can also write an article and mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

project idea(cse webnode)

Project Idea | (CSE Webnode)



The idea behind the framework is to provide a user-friendly environment for sharing knowledge and to give everyone a chance to learn. The intent of this web application is to give students information about their syllabus and previous years’ question papers. The system also provides a user-friendly login and software access.
Features:

  1. View the Syllabus panel.
  2. View the Question Papers panel.


Implementation:
The front end was developed in HTML/CSS with the Bootstrap framework, and the back end in PHP and MySQL.
Tools:
XAMPP is a free and open source cross-platform web server solution stack package developed by Apache Friends, consisting mainly of the Apache HTTP Server, MySQL database, and interpreters for scripts written in the PHP and Perl programming languages.
https://www.apachefriends.org/index.html
Source Code:
The source code for this project is available on GitHub. To view the code, click the following link:
https://github.com/NvThejaswini/CSEWebnode
References:
There are many tutorials for PHP. For example, the following link is a good place to learn PHP and MySQL:
http://www.w3schools.com/php/
XAMPP and its documentation for beginners can be downloaded from the following link:
https://www.apachefriends.org/index.html
 
This article is contributed by N.VenkataThejaswini.

project idea(project approval system)

Project Idea | (Project Approval System)



Academic project management is a major issue faced by many educational institutes, mainly because many institutes do not follow an automated system. College management and staff gather all the project reports and project sources from students and store them physically in some location, typically a library. To overcome this practical problem and make the process easier, we developed a secure intranet application that is useful to each of them.
Features:
Admin panel:

  1. Provide a username and password to each member
  2. Create new users and handle change requests
  3. Can send notifications to all members
  4. Create different types of roles and grant permissions


Head of Department panel:

  1. Can see project details
  2. Approve project according to requirement
  3. Comment and feedback.


Project in-charge: 

  1. Can see project details
  2. Approve project according to requirement
  3. Comment and feedback


Internal guide:

  1. Can see project details
  2. Approve project according to requirement
  3. Comment and feedback


Student Panel:

  1. Can change own profile details and the username/password given by the admin
  2. Can upload any number of project abstracts, synopses, reports, and software code
  3. Can see the project approval stage
  4. Receives a notification by email after successful approval of the project


Note: Each project uploaded by a student is processed in order by the HOD, the project in-charge, and the internal guide. After one authority approves the project, it moves to the next phase, and the student can see the updated status. If the project does not fulfill the requirements in any phase, it is rejected.
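
The approval flow in the note above can be sketched as a small function. The project itself uses JSP and Servlets; this Python sketch only illustrates the logic, and the stage labels, status strings, and function name are hypothetical.

```python
STAGES = ["HOD", "Project in-charge", "Internal guide"]

def review_status(decisions):
    """decisions holds one boolean per completed review, in STAGES order."""
    for stage, approved in zip(STAGES, decisions):
        if not approved:
            return f"Rejected by {stage}"       # rejection in any phase is final
    if len(decisions) < len(STAGES):
        return f"Pending with {STAGES[len(decisions)]}"
    return "Approved"

print(review_status([True]))
print(review_status([True, False]))
print(review_status([True, True, True]))
```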
Login Panel:

  1. Encrypted/decrypted username and password.


Tools: 
JSP, Servlets, AJAX, NetBeans, Gmail API.
Research:
RSA algorithm for encryption and decryption.
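
For orientation, the core RSA arithmetic can be shown with toy numbers. This is a textbook sketch only: it uses very small primes and no padding, so it is not secure, and the function names are illustrative. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
def make_keys(p, q, e):
    """Build a (public, private) key pair from two primes and a public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # modular inverse of e modulo phi
    return (e, n), (d, n)

def encrypt(message, public_key):
    """Encrypt an integer message smaller than n."""
    e, n = public_key
    return pow(message, e, n)

def decrypt(ciphertext, private_key):
    """Recover the integer message from the ciphertext."""
    d, n = private_key
    return pow(ciphertext, d, n)

PUBLIC, PRIVATE = make_keys(61, 53, 17)   # toy primes, for illustration only
CIPHERTEXT = encrypt(65, PUBLIC)
print(PUBLIC, PRIVATE, CIPHERTEXT, decrypt(CIPHERTEXT, PRIVATE))
```

In a real application, a vetted cryptography library with proper padding and key sizes should be used instead of hand-written arithmetic.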
This idea is contributed by Jitendra Singh.

project idea(online course registration)

Project Idea | (Online Course Registration)



The idea is to automate the manual process of course registration. The system provides a number of functions related to course registration for students as well as faculty members. Registration for a course is possible only if the student has paid the fees, i.e., has a valid fee receipt number. Students can log in and view, register for, or drop courses, whereas teachers can log in, view the number of students registered for their courses, add a new course they plan to teach, drop a course they no longer plan to teach, etc.
The entire system has been built using the AngularJS framework.
Features:
Students:
a) Can view/register/drop courses of their semester or previous semesters (in case they have any backlogs and have to repeat a course)
b) Can view all the courses they have registered for at a given time
Faculty:
a) Can view the count and list of students registered for each course they teach
b) Can forward a request for the addition/removal of a course to the admin
Implementation:
Registration for courses is possible only after the fees are paid, so the fee receipt number entered by the student is validated against the bank database.
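
The validation rule above can be sketched as follows. The project itself is built with AngularJS; this Python sketch only illustrates the check, with an in-memory set standing in for the bank database lookup. All names and receipt numbers are hypothetical.

```python
class RegistrationError(Exception):
    """Raised when a registration request fails validation."""

def register(student_id, course, receipt_number, valid_receipts, registrations):
    """Register the student for the course only if the fee receipt is valid."""
    if receipt_number not in valid_receipts:    # stand-in for the bank lookup
        raise RegistrationError("invalid fee receipt number")
    registrations.setdefault(student_id, set()).add(course)
    return registrations[student_id]

VALID_RECEIPTS = {"R-1001", "R-1002"}           # hypothetical receipt numbers
REGISTRATIONS = {}

print(register("s1", "Algorithms", "R-1001", VALID_RECEIPTS, REGISTRATIONS))
try:
    register("s2", "Algorithms", "R-9999", VALID_RECEIPTS, REGISTRATIONS)
except RegistrationError as error:
    print("rejected:", error)
```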
Tools:
AngularJS is a JavaScript framework used for the development and testing of rich internet applications. It extends the HTML vocabulary to create dynamic web pages. You can download AngularJS from here
This idea is contributed by Madhavi Srinivasan.