Data Query Language: Where clause, Order by, Group by

Where clause

The WHERE clause is used to filter records.

It is used to extract only those records that fulfill a specified condition.

WHERE Syntax

  • SELECT column1, column2, …
  • FROM table_name
  • WHERE condition;
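For example, the following query (a sketch assuming a Student_details table with Name and marks columns, the same table used in later examples) returns the names of students who scored more than 50 marks:

SELECT Name
FROM Student_details
WHERE marks > 50;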

Order by

Syntax of Order By in SQL:

  • SELECT column1, column2….
  • FROM table_name
  • ORDER BY column1 ASC/DESC, column2 ASC/DESC;

Example:

Sort all the students in ascending order by the “marks” column.

  • SELECT Name
  • FROM Student_details
  • ORDER BY marks ASC;
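Multiple sort keys can also be combined. For instance, this sketch lists the highest scorers first and breaks ties alphabetically by name:

SELECT Name, marks
FROM Student_details
ORDER BY marks DESC, Name ASC;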

Group by

It is used to arrange similar data into groups. The GROUP BY clause follows the WHERE clause and comes before the ORDER BY clause.

  • SELECT Name, Sum(marks)
  • FROM Student_details
  • GROUP BY Name;
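Because GROUP BY sits between WHERE and ORDER BY, all three clauses can be combined. A sketch, assuming Student_details holds one marks row per student per subject:

SELECT Name, SUM(marks) AS total_marks
FROM Student_details
WHERE marks IS NOT NULL
GROUP BY Name
ORDER BY total_marks DESC;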

DCL commands: Grant, Revoke

Data Control Language (DCL) is used to control which users may retrieve and modify the data stored in the database. Grant and Revoke are the commands of the Data Control Language. DCL is a component of SQL.

Grant:

The SQL GRANT command is used to provide privileges on database objects to a user. It also allows users to grant permissions to other users.

Syntax:

grant privilege_name on object_name

to {user_name | public | role_name}

Here privilege_name is the permission to be granted, object_name is the name of the database object, and user_name is the user who should receive the access; PUBLIC grants the access to all users.
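For example, the following sketch (with a hypothetical user name) grants read and update access on the Student_details table:

GRANT SELECT, UPDATE ON Student_details TO exam_clerk;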

Revoke

The REVOKE command withdraws previously granted user privileges on database objects. It does the opposite of the GRANT command. When a privilege is revoked from a particular user U, the privileges granted by user U to other users are also revoked.

Syntax:

revoke privilege_name on object_name

from {user_name | public | role_name}
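For example, to withdraw the update permission granted in the sketch above:

REVOKE UPDATE ON Student_details FROM exam_clerk;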

DDL commands: Create, Add, Drop

The Structure of Create Table Command

Table name: Student

Column name   Data type   Size
Reg_no        varchar2    10
Name          char        30
DOB           date
Address       varchar2    50
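The corresponding command (a sketch in Oracle-style SQL, since varchar2 is an Oracle data type) is:

CREATE TABLE Student (
    Reg_no  VARCHAR2(10),
    Name    CHAR(30),
    DOB     DATE,
    Address VARCHAR2(50)
);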

The ALTER TABLE Command

By using the ALTER TABLE command, we can modify our existing table.

Adding New Columns

Syntax:

ALTER TABLE <table_name>

         ADD (<NewColumnName> <Data_Type>(<size>), …);

Example:

ALTER TABLE Student ADD (Age number(2), Marks number(3));

The Student table already exists; the above command adds two more columns, Age and Marks, to it.

Drop

Syntax:

DROP TABLE <table_name>

Example:

DROP TABLE Student;

It will destroy the table and all the data recorded in it.


DML Commands: Insert, Delete, Update

Insert

The INSERT INTO statement of SQL is used to insert a new row in a table. There are two ways of using the INSERT INTO statement for inserting rows:

  1. Only values: The first method is to specify only the values of the data to be inserted, without the column names.

INSERT INTO table_name VALUES (value1, value2, value3,…);

table_name: name of the table.

value1, value2, …: values of the first, second, … columns for the new record

  2. Column names and values both: In the second method we specify both the columns which we want to fill and their corresponding values, as shown below:

INSERT INTO table_name (column1, column2, column3,..) VALUES ( value1, value2, value3,..);

table_name: name of the table.

column1, column2, …: names of the columns to be filled.

value1, value2, …: values for the corresponding columns of the new record
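A sketch of both forms against the Student table created earlier (the values are illustrative, and the DATE literal follows ANSI syntax, which may vary by database):

INSERT INTO Student VALUES ('R001', 'Asha', DATE '2001-04-12', 'Pune');

INSERT INTO Student (Reg_no, Name) VALUES ('R002', 'Ravi');

In the second form, the columns that are not listed (DOB and Address here) are left NULL.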

Delete

The DELETE Statement in SQL is used to delete existing records from a table. We can delete a single record or multiple records depending on the condition we specify in the WHERE clause.

Syntax: Basic

DELETE FROM table_name WHERE some_condition;

table_name: name of the table

some_condition: condition used to choose the records to delete.
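For example, a sketch that removes the partially filled record inserted above:

DELETE FROM Student WHERE Reg_no = 'R002';

Omitting the WHERE clause would delete every row in the table, so the condition should be checked carefully.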

Update

The UPDATE statement in SQL is used to update the data of an existing table in a database. We can update single columns as well as multiple columns using the UPDATE statement, as per our requirement.

Basic Syntax

UPDATE table_name SET column1 = value1, column2 = value2,…

WHERE condition;

table_name: name of the table

column1, column2, …: names of the columns to be updated

value1, value2, …: new values for the corresponding columns

condition: condition used to select the rows whose column values need to be updated.
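For example, a sketch that changes one student's address:

UPDATE Student SET Address = 'Mumbai' WHERE Reg_no = 'R001';

As with DELETE, omitting the WHERE clause would update every row in the table.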

TCL Commands: Commit, Roll Back, Save point

The COMMIT command in SQL is used to save all transaction-related changes permanently to the disk. Whenever DML commands such as INSERT, UPDATE and DELETE are used, the changes made by these commands are not permanent until they are committed; before that point, one can easily roll them back. Hence, if we want the changes to be saved permanently to the disk without closing the session, we use the COMMIT command.

Syntax:

COMMIT; 

Example:

We will first select an existing database, i.e., school:

mysql> USE school;

Rollback command

This command restores the database to the last committed state. It is also used with the SAVEPOINT command to jump back to a savepoint in an ongoing transaction.

If we have used the UPDATE command to make some changes to the database and realise that those changes were not required, we can use the ROLLBACK command to undo them, provided they have not been committed with the COMMIT command.

Following is the rollback command’s syntax:

ROLLBACK;

or, to roll back to a particular savepoint:

ROLLBACK TO savepoint_name;

SAVEPOINT command

SAVEPOINT command is used to temporarily save a transaction so that you can rollback to that point whenever required.

Following is the savepoint command’s syntax:

SAVEPOINT savepoint_name;

In short, using this command we can name the different states of our data in any table and then rollback to that state using the ROLLBACK command whenever required.
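Putting the three commands together, a sketch against the Student table used earlier (the savepoint name is hypothetical, and START TRANSACTION follows the MySQL syntax of the mysql> example above):

START TRANSACTION;

UPDATE Student SET Address = 'Delhi' WHERE Reg_no = 'R001';

SAVEPOINT after_update;

DELETE FROM Student WHERE Reg_no = 'R002';

ROLLBACK TO after_update;   -- undoes the DELETE, keeps the UPDATE

COMMIT;                     -- makes the UPDATE permanent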

Introduction to Data analytics software

Data analysis involves sorting through massive amounts of unstructured information and deriving key insights from it. These insights are enormously valuable for decision-making at companies of all sizes.

A quick note here: data analysis and data science are not the same. Although they belong to the same family, data science is typically more advanced (a lot more programming, creating new algorithms, building predictive models, etc.).

Methods:

Regression analysis: A set of statistical processes that allows you to examine the relationship between two or more variables.

Cluster analysis: Organizes data into groups, or clusters, that share common characteristics.

Factor analysis: Condenses several variables into just a few to make data analysis easier.

Data mining: The process of finding trends, patterns, and correlations in large data sets.

Text analysis: Extracts machine-readable information from unstructured text (e.g., PDFs, word-processing documents, emails).

Types of Data analytics software, Open source and Proprietary software

Open-Source Software

It all started with Richard Stallman who developed the GNU project in 1983 which fueled the free software movement which eventually led to the revolutionary open-source software movement.

The movement catapulted the notion of open-source collaboration under which developers and programmers voluntarily agreed to share their source code openly without any restrictions.

The community of people working with the software would allow anyone to study and modify the open-source code for any purpose. The open-source movement broke down the barriers between developers/programmers and software vendors, encouraging everyone to engage in open collaboration. Finally, the label “open-source software” was made official at a strategy session in Palo Alto, California, in 1998 to encourage worldwide acceptance of the new term, which is itself reminiscent of academic freedom.

The idea is to release the software under the open licenses category so that anyone could see, modify, and distribute the source code as deemed necessary.

“Open source” is a certification mark owned by the Open Source Initiative (OSI). The term open-source software refers to software that is developed and tested through open collaboration, meaning anyone with the required knowledge can access the source code, modify it, and distribute their own version of the updated code.

Any software under the open source license is intended to be shared openly among users and redistributed by others as long as the distribution terms are compliant with the OSI’s open source definition. Programmers with access to a program’s source code are allowed to manipulate parts of code by adding or modifying features that would not have worked otherwise.

Proprietary software

The term “Proprietary software” refers to the category of software that is protected by copyright laws and must be licensed before it can be used. Most of the time, you have to pay for proprietary software. That is, you will have to pay for its license before you are allowed to use it.

The purpose of proprietary software is not to facilitate any form of cooperative effort. It is developed purely for the purpose of being utilised by the developer as well as any other users who have purchased a license to do so. The access to proprietary software is restricted, in contrast to the open nature of Open Source software. It is only accessible to those who own it and those who were responsible for its development.

The adaptability of the design is also an important consideration. Proprietary software offers a significantly lower degree of adaptability than open-source software, and there are limitations placed on how it can be utilised. Copyright protection is applied to proprietary software: whoever initially created the source code owns the intellectual property rights associated with it, and it is this copyright that limits its adaptability.

On the other hand, anyone, regardless of their level of technical expertise, can use proprietary software once it is licensed. The software itself, however, is owned not by the general public but by the select number of individuals who have purchased the rights to it and who remain the sole owners of the source code.

Proprietary software list:

  • Windows
  • Microsoft Office
  • macOS
  • Adobe Photoshop
  • Adobe Flash Player
  • iTunes

Open-Source Software vs. Proprietary Software

  • Source code: Open-source software is computer software whose source code is openly available on the internet, and programmers can modify it to add new features and capabilities without any cost. In proprietary software the source code is not publicly available and is protected; only the company that created it can modify it.
  • Development: Open-source software is developed and tested through open collaboration. Proprietary software is developed and tested by the individual or organization that owns it, not by the public.
  • Installation: Open-source software can be installed on any computer. Proprietary software cannot be installed on a computer without a valid license.
  • Licensing: Users do not need an authenticated license to use open-source software. Users need a valid, authenticated license to use proprietary software.
  • Management: Open-source software is managed by an open-source community of developers. Proprietary software is managed by the closed team of individuals or groups that developed it.
  • Flexibility: Open-source software is more flexible and provides more freedom, which encourages innovation. Proprietary software is much less flexible, so its restrictions leave very limited scope for innovation.
  • Cost: Users can get open-source software free of charge. Users must pay to get proprietary software.
  • Intellectual property: Open-source software carries limited intellectual property protections. Proprietary software carries full intellectual property protections.
  • Maintainers: Open-source software is usually developed and maintained by non-profit organizations. Proprietary software is usually developed and maintained by for-profit entities.
  • Examples: Open-source examples include Android, Linux, Firefox, OpenOffice, GIMP, VLC Media Player, etc. Proprietary examples include Windows, macOS, Internet Explorer, Google Earth, Microsoft Office, Adobe Flash Player, Skype, etc.
  • Bug fixes and security: In open-source software, faster bug fixes and better security are available thanks to the community. In proprietary software, the vendor is completely responsible for fixing malfunctions.

Dealing with missing or incomplete data

Missing Completely at Random (MCAR): The fact that a certain value is missing has nothing to do with its hypothetical value or with the values of other variables.

Missing at Random (MAR): Missing at random means that the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data.

Missing not at Random (MNAR): Two possible reasons are that the missing value depends on the hypothetical value (e.g. People with high salaries generally do not want to reveal their incomes in surveys) or missing value is dependent on some other variable’s value (e.g. Let’s assume that females generally don’t want to reveal their ages! Here the missing value in age variable is impacted by gender variable).

Techniques for Handling the Missing Data

The best possible method of handling the missing data is to prevent the problem by well-planning the study and collecting the data carefully. The following are suggested to minimize the amount of missing data in the clinical research.

First, the study design should limit the collection of data to those who are participating in the study. This can be achieved by minimizing the number of follow-up visits, collecting only the essential information at each visit, and developing user-friendly case-report forms.

Second, before the beginning of the clinical research, a detailed documentation of the study should be developed in the form of the manual of operations, which includes the methods to screen the participants, protocol to train the investigators and participants, methods to communicate between the investigators or between the investigators and participants, implementation of the treatment, and procedure to collect, enter, and edit data.

Third, before the start of the participant enrollment, a training should be conducted to instruct all personnel related to the study on all aspects of the study, such as the participant enrollment, collection and entry of data, and implementation of the treatment or intervention.

Fourth, if a small pilot study is performed before the start of the main trial, it may help to identify the unexpected problems which are likely to occur during the study, thus reducing the amount of missing data.

Fifth, the study management team should set a priori targets for the unacceptable level of missing data. With these targets in mind, the data collection at each site should be monitored and reported in as close to real-time as possible during the course of the study.

Sixth, study investigators should identify and aggressively, though not coercively, engage the participants who are at the greatest risk of being lost during follow-up.

Finally, if a patient decides to withdraw from the follow-up, the reasons for the withdrawal should be recorded for the subsequent analysis in the interpretation of the results.

It is not uncommon to have a considerable amount of missing data in a study. One technique of handling the missing data is to use data analysis methods that are robust to the problems caused by the missing data. An analysis method is considered robust to missing data when there is confidence that mild to moderate violations of the assumptions will produce little or no bias or distortion in the conclusions drawn about the population. However, it is not always possible to use such techniques, so a number of alternative ways of handling the missing data have been developed.

Listwise or case deletion

By far the most common approach to the missing data is to simply omit those cases with the missing data and analyze the remaining data. This approach is known as the complete case (or available case) analysis or listwise deletion.

Listwise deletion is the most frequently used method in handling missing data, and thus has become the default option for analysis in most statistical software packages. Some researchers insist that it may introduce bias in the estimation of the parameters. However, if the assumption of MCAR is satisfied, a listwise deletion is known to produce unbiased estimates and conservative results. When the data do not fulfill the assumption of MCAR, listwise deletion may cause bias in the estimates of the parameters.

If there is a large enough sample, where power is not an issue, and the assumption of MCAR is satisfied, the listwise deletion may be a reasonable strategy. However, when there is not a large sample, or the assumption of MCAR is not satisfied, the listwise deletion is not the optimal strategy.
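In SQL terms, listwise deletion amounts to restricting an analysis to complete rows. A sketch, with hypothetical table and column names:

SELECT AVG(systolic_bp) AS mean_bp
FROM patients
WHERE systolic_bp IS NOT NULL
  AND age IS NOT NULL
  AND weight IS NOT NULL;   -- only complete cases enter the estimate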

Pairwise deletion

Pairwise deletion eliminates information only when the particular data-point needed to test a particular assumption is missing. If there is missing data elsewhere in the data set, the existing values are used in the statistical testing. Since a pairwise deletion uses all information observed, it preserves more information than the listwise deletion, which may delete the case with any missing data. This approach presents the following problems:

1) The parameters of the model will be estimated on different sets of data, with different statistics, such as the sample size and standard errors.

2) It can produce an intercorrelation matrix that is not positive definite, which is likely to prevent further analysis.

Pairwise deletion is known to be less biased for MCAR or MAR data, provided the appropriate mechanisms are included as covariates. However, if there are many missing observations, the analysis will be deficient.
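By contrast with the listwise sketch above, a pairwise approach computes each statistic over every row observed for its own variables, so different statistics may rest on different subsets (hypothetical names again):

SELECT AVG(systolic_bp) FROM patients WHERE systolic_bp IS NOT NULL;

SELECT AVG(age) FROM patients WHERE age IS NOT NULL;

-- each average uses a different subset of rows, which is why the
-- resulting statistics can rest on different sample sizes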

Mean substitution

In a mean substitution, the mean value of a variable is used in place of the missing data value for that same variable. This allows the researchers to utilize the collected data in an incomplete dataset. The theoretical background of the mean substitution is that the mean is a reasonable estimate for a randomly selected observation from a normal distribution. However, with missing values that are not strictly random, especially in the presence of a great inequality in the number of missing values for the different variables, the mean substitution method may lead to inconsistent bias. Furthermore, this approach adds no new information but only increases the sample size and leads to an underestimate of the errors. Thus, mean substitution is not generally accepted.
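A sketch of mean substitution in SQL (table and column names are hypothetical; the derived table keeps the statement valid in MySQL, which cannot otherwise read and update the same table):

UPDATE patients
SET weight = (SELECT avg_w
              FROM (SELECT AVG(weight) AS avg_w
                    FROM patients
                    WHERE weight IS NOT NULL) AS t)
WHERE weight IS NULL;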

Regression imputation

Imputation is the process of replacing the missing data with estimated values. Instead of deleting any case that has any missing value, this approach preserves all cases by replacing the missing data with a probable value estimated by other available information. After all missing values have been replaced by this approach, the data set is analyzed using the standard techniques for a complete data.

In regression imputation, the existing variables are used to make a prediction, and then the predicted value is substituted as if it were an actually obtained value. This approach has a number of advantages, because the imputation retains a great deal of data compared with listwise or pairwise deletion and avoids significantly altering the standard deviation or the shape of the distribution. However, as in mean substitution, while regression imputation substitutes a value predicted from other variables, no novel information is added, while the sample size is increased and the standard error is reduced.
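A sketch of simple-regression imputation in MySQL-style SQL, predicting a missing weight from height by ordinary least squares (all names hypothetical; the slope is the sample covariance over the variance of height):

UPDATE patients AS p,
       (SELECT (AVG(height * weight) - AVG(height) * AVG(weight))
               / (AVG(height * height) - AVG(height) * AVG(height)) AS slope,
               AVG(height) AS avg_h,
               AVG(weight) AS avg_w
        FROM patients
        WHERE height IS NOT NULL AND weight IS NOT NULL) AS s
SET p.weight = s.avg_w + s.slope * (p.height - s.avg_h)   -- fitted line
WHERE p.weight IS NULL AND p.height IS NOT NULL;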

Last observation carried forward

In the field of anesthesiology research, many studies are performed with the longitudinal or time-series approach, in which the subjects are repeatedly measured over a series of time-points. One of the most widely used imputation methods in such a case is the last observation carried forward (LOCF). This method replaces every missing value with the last observed value from the same subject. Whenever a value is missing, it is replaced with the last observed value.
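A sketch of LOCF for a repeated-measures table (names are hypothetical, and LIMIT follows MySQL-style syntax): for each missing value, the most recent earlier observation from the same subject is carried forward.

SELECT v.subject_id,
       v.visit_no,
       COALESCE(v.pain_score,
                (SELECT v2.pain_score
                 FROM vitals AS v2
                 WHERE v2.subject_id = v.subject_id
                   AND v2.visit_no < v.visit_no
                   AND v2.pain_score IS NOT NULL
                 ORDER BY v2.visit_no DESC
                 LIMIT 1)) AS pain_score_locf
FROM vitals AS v;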

This method is advantageous as it is easy to understand and communicate between the statisticians and clinicians or between a sponsor and the researcher.

Although simple, this method strongly assumes that the value of the outcome remains unchanged by the missing data, which seems unlikely in many settings (especially in the anesthetic trials). It produces a biased estimate of the treatment effect and underestimates the variability of the estimated result. Accordingly, the National Academy of Sciences has recommended against the uncritical use of the simple imputation, including LOCF and the baseline observation carried forward, stating that:

Single imputation methods like last observation carried forward and baseline observation carried forward should not be used as the primary approach to the treatment of missing data unless the assumptions that underlie them are scientifically justified.

Maximum likelihood

There are a number of strategies using the maximum likelihood method to handle the missing data. In these, the assumption that the observed data are a sample drawn from a multivariate normal distribution is relatively easy to understand. After the parameters are estimated using the available data, the missing data are estimated based on the parameters which have just been estimated.

When there are missing but relatively complete data, the statistics explaining the relationships among the variables may be computed using the maximum likelihood method. That is, the missing data may be estimated by using the conditional distribution of the other variables.

Expectation-Maximization

Expectation-Maximization (EM) is a type of the maximum likelihood method that can be used to create a new data set, in which all missing values are imputed with values estimated by the maximum likelihood methods. This approach begins with the expectation step, during which the parameters (e.g., variances, covariances, and means) are estimated, perhaps using the listwise deletion. Those estimates are then used to create a regression equation to predict the missing data. The maximization step uses those equations to fill in the missing data. The expectation step is then repeated with the new parameters, where the new regression equations are determined to “fill in” the missing data. The expectation and maximization steps are repeated until the system stabilizes, when the covariance matrix for the subsequent iteration is virtually the same as that for the preceding iteration.

An important characteristic of the expectation-maximization imputation is that when the new data set with no missing values is generated, a random disturbance term for each imputed value is incorporated in order to reflect the uncertainty associated with the imputation. However, the expectation-maximization imputation has some disadvantages. This approach can take a long time to converge, especially when there is a large fraction of missing data, and it is too complex to be acceptable to some statisticians. It can also lead to biased parameter estimates and can underestimate the standard error.

For the expectation-maximization imputation method, a predicted value based on the variables that are available for each case is substituted for the missing data. Because a single imputation omits the possible differences among the multiple imputations, a single imputation will tend to underestimate the standard errors and thus overestimate the level of precision. Thus, a single imputation gives the researcher more apparent power than the data in reality.

Multiple imputations

Multiple imputation is another useful strategy for handling the missing data. In a multiple imputation, instead of substituting a single value for each missing data, the missing values are replaced with a set of plausible values which contain the natural variability and uncertainty of the right values.

This approach begins with a prediction of the missing data using the existing data from other variables. The missing values are then replaced with the predicted values, and a full data set, called the imputed data set, is created. This process is repeated several times, producing multiple imputed data sets (hence the term “multiple imputation”). Each imputed data set is then analyzed using the standard statistical procedures for complete data, giving multiple analysis results. Finally, these results are combined to produce a single overall analysis result.

The benefit of the multiple imputation is that in addition to restoring the natural variability of the missing values, it incorporates the uncertainty due to the missing data, which results in a valid statistical inference. Restoring the natural variability of the missing data can be achieved by replacing the missing data with the imputed values which are predicted using the variables correlated with the missing data. Incorporating uncertainty is made by producing different versions of the missing data and observing the variability between the imputed data sets.

Multiple imputation has been shown to produce valid statistical inference that reflects the uncertainty associated with the estimation of the missing data. Furthermore, multiple imputation turns out to be robust to violations of the normality assumptions and produces appropriate results even in the presence of a small sample size or a large amount of missing data.

With the development of novel statistical software, although the statistical principles of multiple imputation may be difficult to understand, the approach may be utilized easily.

Sensitivity analysis

Sensitivity analysis is the study of how the uncertainty in the output of a model can be allocated to the different sources of uncertainty in its inputs.

When analyzing the missing data, additional assumptions on the reasons for the missing data are made, and these assumptions are often applicable to the primary analysis. However, the assumptions cannot be definitively validated for the correctness. Therefore, the National Research Council has proposed that the sensitivity analysis be conducted to evaluate the robustness of the results to the deviations from the MAR assumption.

Data Integration: What, Need, Advantages, Approaches of Data integration

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, both commercial (such as when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example). Data integration appears with increasing frequency as the volume of data (that is, big data) and the need to share existing data explode. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from heterogeneous database systems and transformed into a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining, when analyzing and extracting information from existing databases that can be useful for business intelligence.

A data integration approach is formally defined as a triple <G, S, M>, where:

G stands for the global schema,

S stands for the heterogeneous source schemas,

M stands for the mappings between queries over the source and the global schemas.
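As a concrete illustration in SQL, the mapping M can take the form of a view definition: the view exposes the global schema G over the source tables S (all table and column names below are hypothetical, and CONCAT follows MySQL-style syntax):

CREATE VIEW global_customer AS          -- G: the unified global schema
SELECT id AS customer_id,
       full_name AS name
FROM crm_clients                        -- S: first source schema
UNION ALL
SELECT cust_no,
       CONCAT(first_name, ' ', last_name)
FROM shop_customers;                    -- S: second source schema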

Advantages of Data integration

Increased Revenue Potential

Data integration can help you create new revenue streams or expand into new markets by allowing access to more information in your organization faster.

Cost Reduction

Data integration reduces the need for manual tasks by allowing more steps in your process to become automated. Manual processes are time-consuming, expensive, and prone to human error, so automated data integration can reduce costs by eliminating these workflows.

Improved Efficiency

Through data integration, you are able to automate the steps in your existing processes, which allows employees to focus on more complex work. This increased efficiency can allow you to provide better results for your customers while also increasing employee satisfaction.

Improved Decision Making

Having access to all relevant information about your company or industry can help you make better decisions as a manager, marketer, or entrepreneur.

Improved Data Quality

With the right data integration software, you are able to automatically validate incoming information as well as make changes to existing records. This approach provides more accurate data without requiring employees to spend a lot of time on data entry.

Improved Customer Experience

A simple way to increase customer satisfaction is through providing them access to the information they need as quickly as possible without requiring employees to manually respond to each request.

By integrating your data you can automatically provide customers with deeper insights in a way that’s personalized to their needs. This approach makes the customer experience more efficient and satisfying for both parties.

Increased Innovation

Innovation is an important part of any business strategy but it’s difficult to introduce new concepts when you don’t have access to the information you need.

By integrating your data and allowing employees the ability to easily create new reports, dashboards, and visualizations you are able to provide a faster platform for innovation that allows employees to be more creative.

In addition, because everyone has access to this information it can be used as the basis for discussion, which increases collaboration across departments.

Stronger Customer Relationships

In today’s business world, it is important for companies to have a strong understanding of customer needs and preferences in order to provide them with better products and services.

With data integration, you are able to automatically create a historical record that can be used in conjunction with demographic information to gain a deeper understanding of your customer base and provide them with more personalized insights.

Improved Security

Data integration helps to improve security by streamlining the process of managing user permissions and access to those who need it.

Approaches of Data integration

Middleware Data Integration

In this method of data integration, middleware or software is used to connect applications and transfer the data to databases. It is very handy while integrating legacy systems with newer ones.

Pros:

  • Better data streaming
  • Easier access between systems

Cons:

  • Less access
  • Limited functionality

Manual Data Integration

Manual data integration is the process of integrating all the different data sources without any automation. This is usually done by data managers using custom code and is a great strategy for one-time instances.

Pros:

  • Reduced costs
  • More freedom

Cons:

  • Greater room for error
  • Difficult to scale

Application-Based Integration

In this method, software applications do all the work: they locate, retrieve, and integrate data from different sources and systems. This strategy is great for businesses that work in hybrid cloud environments.

Pros:

  • Easier information exchange
  • Simplified process

Cons:

  • Limited access
  • Inconsistent results
  • Complicated setup

Uniform Access Integration

This method integrates data from multiple, disparate sources and presents it uniformly. Another useful feature of this method is that it allows the data to stay in its original location while doing this. This technique is an optimal approach for organizations that need access to multiple, disparate systems without the cost of creating a copy of the data.

Pros:

  • Low storage requirements
  • Easier access
  • A simplified view of data

Cons:

  • Strained systems
  • Data integrity challenges

Common Storage Integration

This method is similar to uniform access integration, except that it creates a copy of the data in a data warehouse. It is often the best approach for businesses that want to make the most out of their data.

Pros:

  • Increased version control
  • Reduced burden
  • Enhanced data analytics
  • Cleaner data

Cons:

  • High storage costs
  • High maintenance costs

Types of Digital Data: Structured, Semi Structured, Unstructured Data

Structured data:

Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It concerns all data that can be stored in a SQL database in a table with rows and columns. Such data have relational keys and can easily be mapped into pre-designed fields. Today, these are the most processed data and the simplest way to manage information. Example: relational data.

Semi-Structured data:

Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database (though for some kinds of semi-structured data this can be very hard), but the semi-structured form exists to spare that effort. Example: XML data.

Unstructured data:

Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. Alternative platforms exist for storing and managing unstructured data. It is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word documents, PDFs, text, media logs.

Structured Data vs. Semi-Structured Data vs. Unstructured Data

  • Level of organizing: Structured data, as the name suggests, is well organized, so its level of organizing is the highest. Semi-structured data is organized only to some extent, so its level of organizing is lower than structured data but higher than unstructured data. Unstructured data is fully non-organized, so its level of organizing is the lowest.
  • Transaction management: Structured data has mature transaction management and various concurrency techniques, so it is preferred for multitasking processes. In semi-structured data, transactions are not supported by default but can be adapted from the DBMS, and data concurrency is not present. Unstructured data has no transaction management and no concurrency.
  • Flexibility and scalability: Structured data is based on a relational database, so it is schema-dependent and less flexible and scalable. Semi-structured data is more flexible than structured data but less flexible and scalable than unstructured data. Unstructured data has no dependency on any database, so it is the most flexible and scalable of the three.
  • Performance: Structured data supports structured queries that allow complex joins, so its performance is the highest. Semi-structured data supports only queries over anonymous nodes, so its performance is lower than structured data but higher than unstructured data. Unstructured data supports only textual queries, so its performance is the lowest.
  • Technology and means of organization: Structured data is organized by means of relational database tables. Semi-structured data is partially organized by means of XML/RDF (Resource Description Framework). Unstructured data is based on simple character and binary data.
  • Versioning: In structured data, versioning is done over tuples, rows, and tables. In semi-structured data, versioning is done only where tuples or graphs are possible, as only a partial database is supported. In unstructured data, versioning is possible only on the data as a whole, as there is no database support at all.
