The simplest way to archive and search emails

01

In this post, I want to share a new business idea with the entrepreneur community in order to get feedback, comments and thoughts (I would really appreciate them). As a software and enterprise architect, I'm always designing simple, usable (functional) and pretty (a good user experience from the aesthetic point of view) architecture solutions, and refactoring legacy systems to optimize their architecture. I'm fond of optimization problems (in research and in practice), specifically high-performance storage, indexing and processing. Applying customer development and lean startup concepts, my vision is to (re-)segment the problem/solution space of data archiving toward email archiving. So I'm thinking of making my contribution to the computing world by providing an email archiving solution optimized for storage cost-effectiveness, with a simple user interface for organizing and searching emails very easily (this is my unique value proposition). My guess is that the target market (the customer segment in the business model canvas) is individuals and small and medium businesses that mainly need to store/archive emails for legal and other specific purposes over a long period of time. I want to name this service Archivrfy.

Business transactions involve not only enterprise applications and OLTP systems but also a lot of emails, for example to record contracts, sales conversations and agreements, and as evidence of invoices and payment information. Most companies underestimate the effort required to maintain traditional tape backups and to index data so that searches stay simple.

In order to protect and retain the mission-critical data contained in emails, companies need a new and very simple approach to capture, preserve, search and access emails easily. For example, legal departments see email as an essential part of their discovery response strategy, because emails can be presented as evidence in a court of law. The volume of email generated every day is becoming a huge problem, so organizations and individuals can free up their storage space by moving emails to archiving vaults.

You can archive emails at low cost to reduce local storage space and comply with legal requirements. For security, data is encrypted while in transit and preserved in the final vaults under an adjustable retention policy. The service is based on cloud computing technology and targets 99.99% availability. It's scalable, with growth limited only by the resources available in the cloud.
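To put the 99.99% figure in perspective, here is a small arithmetic sketch of how much downtime such a target would allow. It assumes a 365-day year, and the function name `allowed_downtime_minutes` is only illustrative.

```python
# Allowed downtime implied by an availability target (plain arithmetic).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


yearly = allowed_downtime_minutes(99.99)
print(f"99.99% allows about {yearly:.1f} minutes of downtime per year")
print(f"that is about {yearly / 12:.1f} minutes per month")
```

Under these assumptions, the budget is a little under an hour of downtime per year.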

You get a very simple and fast user experience to search (supporting eDiscovery scenarios) and access the archived emails, which improves productivity. Search is based on an optimized algorithm that enriches the content with metadata in order to organize the emails and extend the search criteria dynamically. It's also a flexible technology that integrates simply with existing platforms and exports to several portable email file formats.

In order to support the previous scenarios with scalability, high availability and high performance in mind, we need to design a robust architecture for the Archivrfy service, as shown below.

02

In this article, I've shared my vision of an email archiving solution in terms of its unique value proposition and the underlying technical architecture.

I really appreciate any thoughts and feedback to help me improve my business vision.

I look forward to hearing from you … very soon

Architecture patterns for scalability in the cloud

Scalability is the ability of a system (system == software in the case of computer science) to grow without degradation. That is, if the number of requests (the workload) increases significantly, the quality attributes of the software system, particularly performance, are not negatively impacted, because we can easily add more computing resources.

Common scalability scenarios are requests growing over 30% a month, 500 million page requests a day, a peak rate of ~40k requests per second, and ~3 TB of new data to store a day.
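To get a feeling for what these numbers imply, the following back-of-the-envelope sketch converts them into per-second and per-year figures. It treats each number independently (the text does not say they describe the same system), and the variable names are only illustrative.

```python
# Back-of-the-envelope conversion of the workload figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60

page_requests_per_day = 500_000_000
peak_requests_per_second = 40_000
monthly_growth = 0.30        # 30% more requests every month
new_data_tb_per_day = 3

# Average request rate if the daily traffic were spread evenly.
average_rps = page_requests_per_day / SECONDS_PER_DAY

# How much higher the quoted peak is than that average.
peak_to_average = peak_requests_per_second / average_rps

# Compounded growth after twelve months of 30% monthly growth.
yearly_growth_factor = (1 + monthly_growth) ** 12

# Storage added in one year at a constant daily rate.
new_data_tb_per_year = new_data_tb_per_day * 365

print(f"average rate: {average_rps:,.0f} requests/s")
print(f"peak is {peak_to_average:.1f}x the average")
print(f"yearly growth factor: {yearly_growth_factor:.1f}x")
print(f"new data per year: {new_data_tb_per_year:,} TB")
```

The compounding is the part that is easy to underestimate: 30% a month is roughly a 23-fold increase in a year, which is why capacity has to be easy to add.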

So it’s important to architect/design a system with scalability in mind in advance, because it is a very important aspect and is sometimes very costly to achieve late in the product lifecycle. Today, scalability is not a hard constraint, because we can grow, in theory, without limit and very cheaply using cloud computing resources; the only requirement is to follow a good architecture.

In this article, I will cover several cloud architecture patterns that support scalability. I will follow a logical evolution of the software architecture according to the level of service the product must provide as the workload increases. To make the architecture concrete, I will provide examples based on Amazon AWS.

The simplest architecture pattern is to have the web server/application server software stack and the database server on the same node (AWS EC2), with database backups sent to a highly available storage medium (AWS S3). This infrastructure is in a single availability zone or data center.

01

The next architecture pattern improves the availability, scalability and performance of the system. It's based on the idea of separating dynamically generated data from static data, along with the underlying processing flow. We need to create/use a virtual volume for data storage (AWS EBS) to store dynamic data (mainly for relational databases) independently of the node (AWS EC2) that runs the web/application server stack and the database server. If the main node fails, we can instantiate another node (AWS EC2) with the same configuration (web/application/database servers) and mount the dynamic data volume on this new node. Static data is stored in a highly available storage medium (AWS S3). We also have to take rotating backups of the database and logs, as well as volume snapshots, to the highly available storage medium (AWS S3). This infrastructure is in a single availability zone or data center.

02

The next architecture pattern mainly improves the performance and availability of the system. It's based on the idea of setting up a content delivery network (AWS CloudFront) for static data (text, graphics, scripts, media files, documents, etc.) in order to bring the data closer to the end user and cache the most frequently used content, reducing latency and improving throughput when serving content. AWS CloudFront is distributed across servers around the world. We need to register two domains: one for accessing dynamic data and the other for accessing static data. The web/application/database server infrastructure is in a single availability zone or data center, while the AWS CloudFront servers are in different data centers across the world.

03

The next architecture pattern improves the availability, scalability, reliability, security and performance of the system. It's based on the idea of separating the architecture artifacts and the underlying processing nodes according to their concerns in the whole picture. In order to achieve this goal, we have a multi-layer architecture as described below:

  • Front-end layer. This is the only layer facing the end user. In this layer, we have our public IPs and registered domain names (AWS Route53 or another DNS provider) as well as the configured load balancers (AWS ELB or another technology such as HAProxy hosted on AWS EC2). Requests arrive at this point over a secure channel (SSL, to protect the data in transit over the Internet) and the processing flow is balanced/forwarded according to the workload of the back-end application servers (achieving high availability, scalability and high performance). The load balancers are in different sub-networks than the back-end servers (possibly separated by network elements such as routers and firewalls), so an intruder who breaks into this layer cannot proceed to the inside (achieving security)
  • Application server layer. This is the first layer in the back-end and mainly hosts the farm of web/application servers that process the requests sent by the load balancers (achieving high availability, scalability and high performance). For request processing, we can select from a huge variety of web framework stacks. For example, a common platform stack might be the Apache HTTPD web server in front, serving HTTP requests, plus one or more Tomcat application servers as servlet containers processing business logic. We can also improve performance in this layer by adding caching technology (AWS ElastiCache or another technology such as memcached) to cache/store master data and (pre-)processed results in memory, in order to reduce recurring processing and the database server workload. We can also use database connection pooling technologies (improving performance) such as PgBouncer to maintain open connections to PostgreSQL servers (note: opening connections to database servers is very expensive). For these servers (the nodes in the web farm), I recommend AWS High CPU Extra Large EC2 machines
  • Storage system layer. This is where our data is persisted; it comprises the relational database servers (RDBMS) and the storage platform. In order to process a huge number of transactions from the application layer, we establish a master-slave scheme for the RDBMS: one active (master) server serves the transaction requests and replicates the changes to one or more slave servers at a reasonable frequency to avoid outdated data (achieving high availability, scalability and high performance). The master-slave scheme is well supported by several RDBMS such as Oracle, SQL Server and PostgreSQL. Another configuration that improves performance is to have the applications send their database write requests to the master database, while the read requests go to a load balancer that distributes them across a pool of slave databases (note: for applications that rapidly write and then read the same data object, this may not be the most effective method of database scaling). It's worth noting that a master-master scheme does not scale well: with multiple master databases, each master can modify any data object, which requires either distributed transactions (very costly, and they lock a lot of objects) or transaction replication with latency between masters. Either way, it is extremely difficult to ensure consistency between masters, and thus the integrity of the data. The database server instances should be AWS High Memory Extra Large machines in order to support the workload. Finally, for the storage platform, the idea is to have different volumes (AWS EBS) for each kind of data object (transactional data, partitioning and sharding data, master data, indexes, transaction logs, external files, etc.). For transactional data, we need to provision a high level of IOPS to speed up data requests. Static data is stored in a highly available storage medium (AWS S3). We also have to take rotating backups of the database and logs, as well as volume snapshots, to the highly available storage medium (AWS S3). It's worth formatting the volumes with XFS, which makes snapshots easier to create and improves file system performance
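The read/write split described in the storage layer can be sketched in a few lines. This is a minimal illustration, not a production router: the `ReadWriteRouter` class and the server names are hypothetical, statements are classified only by their first keyword, and no real database connection is opened.

```python
from itertools import cycle


class ReadWriteRouter:
    """Route writes to the master and spread reads over the slave servers."""

    def __init__(self, master: str, replicas: list[str]) -> None:
        self.master = master
        # Fall back to the master when no slave server is configured.
        self._read_targets = cycle(replicas or [master])

    def route(self, sql: str) -> str:
        """Return the name of the server that should run this statement."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            # Reads are distributed round-robin across the pool.
            return next(self._read_targets)
        # Everything else (INSERT, UPDATE, DELETE, DDL) goes to the master.
        return self.master


router = ReadWriteRouter("master-db", ["replica-1", "replica-2"])
targets = [
    router.route("SELECT * FROM customer"),
    router.route("INSERT INTO customer VALUES (1, 'Ann')"),
    router.route("select * from customer_order"),
    router.route("SELECT 1"),
]
print(targets)
```

Because replication has some delay, a read that immediately follows a write may reach a slave that has not received the change yet. That is the caveat noted above for applications that write and then read the same object.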

In general, we need to separate dynamically generated data from static data. Static data is served through CDN mechanisms to reduce latency and improve throughput. In order to implement disaster recovery mechanisms and improve system availability, the idea is to distribute the nodes (load balancers, web/application servers, database servers and memcached servers) across different availability zones or data centers around the world.

The architecture vision is illustrated in the following figure.

04

Another scalability technique at the database level is data sharding. A database shard is a horizontal partition of a database: we take a large database and break it into a number of smaller databases spread across servers. Horizontal partitioning is a design principle whereby the rows of a database table are held separately, rather than the table being split by columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server.

Let's illustrate this concept as shown below. The primary shard table is the customer entity. The customer table is the parent of the shard hierarchy, with the customer_order and order_item entities as child tables. The global tables are the common lookup tables, which have relatively low activity; they are replicated to all shards to avoid cross-shard joins. There are some design concerns to address when you architect your data shards:

  • Generate and assign a unique id to each piece of data
  • Choose the shard id based on the least-used shard, and embed the shard id in the primary key itself
  • Whenever an object needs to be looked up, we parse the object id to get the logical shard id; from this we look up the physical shard id, get a connection pool and run the query there. For example: ObjectId = 32.XXXXXXX maps to logical shard 32, logical shard 32 lives on physical shard 4, and shard 4 lives physically on host2, schema5
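The lookup steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the two mapping tables and the helper names (`make_object_id`, `locate`) are hypothetical, and only the 32 → 4 → host2/schema5 entry comes from the example in the text. A real system would hand back a connection pool for the host rather than a tuple.

```python
from typing import NamedTuple


class ShardLocation(NamedTuple):
    host: str
    schema: str


# Hypothetical mapping tables; only this single entry comes from the text.
LOGICAL_TO_PHYSICAL = {32: 4}
PHYSICAL_SHARDS = {4: ShardLocation(host="host2", schema="schema5")}


def make_object_id(logical_shard: int, local_id: str) -> str:
    """Embed the logical shard id in the key itself, as '<shard>.<local id>'."""
    return f"{logical_shard}.{local_id}"


def locate(object_id: str) -> ShardLocation:
    """Parse the object id and resolve logical shard -> physical shard -> host."""
    logical_part, _, _local_id = object_id.partition(".")
    logical_shard = int(logical_part)
    physical_shard = LOGICAL_TO_PHYSICAL[logical_shard]
    return PHYSICAL_SHARDS[physical_shard]


location = locate(make_object_id(32, "0000123"))
print(location)
```

Keeping the logical-to-physical mapping in its own table means a logical shard can be moved to another host by updating one entry, without rewriting any primary keys.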

05

In this article, I've covered the key principles, techniques and architectures related to the scalability of software systems, specifically from the cloud computing perspective, in order to take advantage of this emerging/well-established technology, which fits very well when growing our business and the underlying applications.

Running Windows Instances on Amazon EC2

In this article, I want to explain step by step how to create and run Windows instances on Amazon EC2.

In order to create the Windows instance, open the AWS Management Console at https://console.aws.amazon.com/console/home and select the EC2 tab. Select the desired region to launch the instance (red circle) and click the Launch Instance button (blue circle) in the following figure.

Figure 1

Next, select the Microsoft Windows Server 2008 R2 with SQL Server Express and IIS image (see Figure 2).

Figure 2

The next page is for setting the instance details, such as the number of instances, the instance type and the availability zone, as shown in the following figure.

Figure 3

After that, there is a set of advanced settings for configuring the instance. When you reach the key pairs page, click Create and Download your Key Pair and save the myfirst_instance_keypair.pem file on your local machine. A key pair (public/private) is used to authenticate you. On Linux instances, the public key is retained in the instance (copied into the .ssh/authorized_keys file in the primary user account's home directory), allowing you to use the private key (downloaded to your local machine) to log in securely over SSH without a password. On a Windows instance such as this one, the private key is instead used to decrypt the Administrator password, as described later in this article. You can have several key pairs, and each key pair requires a name.

Figure 4

The next page allows you to create security groups that define your instance firewall. By default, you will leave the HTTP (80) and Remote Desktop (3389) ports open to the outside. Enter a name and description for the security group (blue circle in the following figure).

Figure 5
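The security group configured in this step can be pictured as a small list of inbound rules. The sketch below is a plain Python data model for illustration only, not the AWS API: the `IngressRule` class, the `is_port_open` helper and the `0.0.0.0/0` (any source) range are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """One inbound firewall rule: which port is open, and to whom."""
    protocol: str
    port: int
    source_cidr: str
    description: str


# The two ports left open in this walkthrough, reachable from any address.
windows_web_rules = [
    IngressRule("tcp", 80, "0.0.0.0/0", "HTTP"),
    IngressRule("tcp", 3389, "0.0.0.0/0", "Remote Desktop"),
]


def is_port_open(rules: list[IngressRule], port: int) -> bool:
    """Return True if any rule allows inbound traffic on the given port."""
    return any(rule.port == port for rule in rules)


for port in (80, 3389, 22):
    state = "open" if is_port_open(windows_web_rules, port) else "closed"
    print(f"port {port}: {state}")
```

Any port that has no rule stays closed, which is why SSH (22) is reported as closed here. Outside of a quick test, it is common to restrict the Remote Desktop rule to your own IP address rather than leaving it open to everyone.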

Finally, you can review all your inputs for the instance and click the Launch button (see Figure 6).

Figure 6

Once the instance is running, you can connect to it using Remote Desktop. One thing to note is that you need the myfirst_instance_keypair.pem file to get the Windows Administrator's password.

In order to get the Windows Administrator's password, select the instance and choose the “Get Windows Admin Password” option (see Figure 7).

Figure 7 

After that, the Retrieve Default Windows Administrator Password pop-up opens. Open your key pair file, copy its entire content into the Private Key text field, and finally click the Decrypt Password button, as shown in Figure 8.

Figure 8

Now we need to open the Remote Desktop Connection program and log in to the instance.

In this article, I've shown how to provision an EC2 Windows instance.