Tuesday, November 15, 2011

Assign base numbers to large set of files using powershell

I had to assign base numbers to a large set of files (5000+ in my case): sort the files by creation time and prefix each name with a sequence number, zero-padded so that all the numbers have the same length. Thanks to PowerShell I could do it in no time. Here is the script.


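# Start the counter at 0, skip directories, and build the new name from the file's Name property
$num = 0; gci | Where-Object { -not $_.PSIsContainer } | Sort-Object CreationTime | ForEach-Object { $num++; $numStr = $num.ToString("0000"); Rename-Item $_.FullName "$numStr$($_.Name)" }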

Wednesday, October 26, 2011

Powershell rename file using regex

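I had a task where I needed to update the names of thousands of files. The problem was to rename files of the form '[0-9]{5} .*' to '[0-9]{5}.*' (that is, remove the white space between the base number and the rest of the file name). Thanks to PowerShell, I could finish it in a couple of minutes. I used grouping and back references for this task: the first group captures the five-digit base number, the second captures the rest of the name, and the replacement joins them without the space. Here is the solution.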





gci | Rename-Item -NewName {$_.Name -replace '^([0-9]{5}) (.*)$','$1$2'}



Sunday, May 22, 2011

What is NoSQL????

NoSQL
NoSQL [1] refers to a class of non-relational, distributed and horizontally scalable databases. It is sometimes also called "NoREL", because what these systems move away from is the relational model rather than SQL as such. In academia, NoSQL is often referred to as "Structured Storage" [2]. Some implementations of NoSQL are

  • Google's Bigtable [3]

  • Amazon's Dynamo [4]

  • Apache's Cassandra [6] (used by Facebook)

  • Apache's HBase [5] (an open source implementation of Bigtable)


The NoSQL movement started in order to overcome the limitations of relational databases. The three most common limitations of relational databases are

  1. Scalability: Relational databases have a fixed schema and traditionally scale only vertically, i.e., as more records are added to the tables, performance is maintained by adding more expensive hardware (more CPUs, more memory, etc.) to the same machine. There is a threshold beyond which a single system cannot be scaled any further, because of the limits of existing technology and cost. So there is a need to distribute the database across multiple machines. However, traditional relational databases were not designed to work over distributed data, and they tend not to work well when distributed.

  2. Availability: Relational databases are usually deployed on a single machine, and if that machine goes down then the entire system goes down.

  3. Complexity: Relational databases organize data in the form of tables and records. Every record in a table is assumed to have the same schema, and a record that fails to satisfy the schema constraints cannot be inserted, which is a major drawback. Also, extending a group of records with additional data means either adding columns to the existing table or creating a new table to hold the additional information. Adding columns wastes storage on records that do not use them, and creating a new table hurts performance because the data then has to be joined.


To address the limitations of relational databases, software engineers have developed NoSQL databases. Each NoSQL database is built to satisfy application requirements that a relational database cannot fulfil, so there is no general-purpose NoSQL database (in the way that Oracle or PostgreSQL is a general-purpose SQL database). NoSQL databases [7] can be broadly classified into three types.

  1. Key-Value stores: Key-Value stores [7] are map-like databases in which the key is the index and the value is the data stored under that index. Examples of Key-Value stores are Amazon's SimpleDB, Uppsala University's Amos II etc. These databases can store structured or unstructured data.

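  2. Column-oriented databases: Instead of storing records in a rigid table with uniform fields, Column-oriented databases [7] keep closely related data together in an extendable column, so a record can be extended with a new column without changing the other records. Examples of Column-oriented databases are Google's Bigtable, Apache's HBase and Apache's Cassandra.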

  3. Document-based databases: Document-based databases [7] store data in the form of document collections. They are designed to scale both horizontally and vertically. Examples of Document-based databases are Apache's CouchDB, 10gen's MongoDB etc.
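
The three data models can be pictured with plain Python structures. This is only a sketch: the dictionaries, field names and the `find` helper below are made up for illustration and are not the API of any of the databases named above.

```python
# Illustrative in-memory stand-ins for the three data models.
# None of these names come from a real database API.

# Key-value store: an opaque value looked up by its key.
key_value_store = {"user:42": b"any serialized blob"}

# Column-oriented store: row key -> column family -> columns.
# Rows in the same family do not need to share the same columns.
column_store = {
    "user:42": {"profile": {"name": "Ada", "city": "London"}},
    "user:43": {"profile": {"name": "Linus"}},
}
# Extending one record with a new column leaves every other row untouched.
column_store["user:43"]["profile"]["email"] = "linus@example.org"

# Document store: a collection of self-describing documents.
document_collection = [
    {"_id": 1, "name": "Ada", "languages": ["English", "French"]},
    {"_id": 2, "name": "Linus", "address": {"city": "Helsinki"}},
]


def find(collection, **criteria):
    """Return the documents whose top-level fields match all criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]
```

For example, `find(document_collection, name="Ada")` returns the first document, even though the two documents do not share the same fields.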


CAP Theorem
The CAP Theorem [11] states that a distributed system cannot provide Consistency, Availability and Partition tolerance all at the same time. The same is true for NoSQL databases. Some implementations of NoSQL (e.g., Cassandra) offer four levels of consistency.

  1. Zero: No consistency.

  2. One: Consistency is ensured on one node, i.e., the operation is confirmed as soon as a single replica has acknowledged it.

  3. Quorum: Consistency is ensured over (n/2 + 1) nodes, where n is the number of replicas and the division is rounded down.

  4. All: Consistency is ensured over all the nodes. The major problem with this level is that if one node goes down the operation cannot complete: the system keeps waiting for a signal from the node that is down, and without a timeout it would wait indefinitely.
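
The arithmetic behind these four levels can be sketched as follows. The function names are hypothetical, and the level names are simply the ones from the list above; this is not Cassandra's client API.

```python
def required_acks(level: str, replicas: int) -> int:
    """Number of replica acknowledgements needed at a consistency level."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    levels = {
        "ZERO": 0,                    # do not wait for any replica
        "ONE": 1,                     # wait for a single replica
        "QUORUM": replicas // 2 + 1,  # wait for a majority
        "ALL": replicas,              # wait for every replica
    }
    return levels[level.upper()]


def tolerated_failures(level: str, replicas: int) -> int:
    """Replicas that may be down while the operation can still complete."""
    return replicas - required_acks(level, replicas)
```

With three replicas, Quorum needs two acknowledgements and survives one failed node, while All needs three and survives none, which is the problem described in the last item.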


Querying NoSQL
As NoSQL databases are implemented specific to an application, the query language depends on the implementation. Most NoSQL databases expose an API for querying the database. Some of them (esp. Cassandra) use Thrift, a framework that generates the client API used to pull and push data. In addition, some (esp. Google's Bigtable and HBase) use a programming model called MapReduce [12] to perform computations that take a considerable amount of time. In MapReduce, the user specifies a map function, which generates a set of intermediate key/value pairs, and a reduce/fold function, which merges all the intermediate values that share the same key.
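
The MapReduce model can be illustrated with a small single-process word count. This is only a sketch of the programming model; the function names are made up for illustration, and a real MapReduce system would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_words(document):
    """Map step: emit an intermediate (word, 1) pair for every word."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: merge all the values emitted for one key."""
    return word, sum(counts)


def map_reduce(documents, map_fn, reduce_fn):
    """Run map over every input, group the pairs by key, then reduce."""
    intermediate = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            intermediate[key].append(value)
    return dict(reduce_fn(key, values) for key, values in intermediate.items())


word_counts = map_reduce(
    ["NoSQL scales out", "SQL scales up"], map_words, reduce_counts
)
```

Here `word_counts["scales"]` is 2, because the word appears once in each input and the reduce step sums the two intermediate values.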

Advantages of NoSQL
NoSQL has many advantages. The major one is Horizontal Scalability.
Horizontal Scalability: Horizontal Scalability is achieved in two ways

  • Functional Scaling: Functional Scaling [9] groups data by function and spreads these functional groups across distributed databases. A functional group may be further divided into sub-groups. For example, a users group can be divided into a male group and a female group.

  • Sharding: Sharding [10] is the process of storing the data of a functional group across databases in the form of chunks. These chunks are referred to as "Shards". The advantages of Sharding are high availability, more write bandwidth, high throughput etc.
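
One common way to decide which shard holds a record is to hash its key. The sketch below assumes that approach; the `shard_for` function and the `ShardedStore` class are hypothetical, and each shard is just an in-memory dictionary standing in for a separate database.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard by hashing the key, so the same key always maps to the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy store that spreads one functional group over several shards."""

    def __init__(self, shard_count: int):
        # Each dict stands in for a separate database server.
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

Because writes for different keys land on different shards, they can be handled by different machines, which is where the extra write bandwidth comes from. A simple modulo scheme like this one moves most keys when the number of shards changes, so it is only a starting point.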

ACID vs. BASE
Relational databases achieve reliability, consistency etc., with the ACID [8] properties. ACID stands for Atomicity, Consistency, Isolation and Durability. Similarly, for NoSQL there are the BASE [9] properties. BASE stands for Basically Available, Soft state, Eventually consistent. Unlike ACID, which ensures consistency at the end of every operation, BASE tolerates temporary inconsistency and lets the database reach a consistent state over the flow of operations. BASE makes no guarantee about how consistent the data is at any given instant, only that the replicas eventually agree once updates stop arriving.
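
Eventual consistency can be illustrated with two replicas that are synchronized after a write. This is a toy sketch: the `Replica` class, the explicit version numbers and the "highest version wins" rule are assumptions made for the example, not how any particular NoSQL database resolves conflicts.

```python
class Replica:
    """Toy replica that keeps the value with the highest version per key."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def synchronize(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


primary, secondary = Replica(), Replica()
primary.write("cart", ["book"], version=1)
stale = secondary.read("cart")   # the write has not reached this replica yet
synchronize(primary, secondary)
fresh = secondary.read("cart")   # after synchronization the replicas agree
```

Between the write and the synchronization a reader of the second replica sees old data (soft state); once the replicas have exchanged updates they return the same value (eventually consistent).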

Disadvantages of NoSQL
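The three major drawbacks of NoSQL are

  • Reliability: most NoSQL databases do not natively support ACID, so they do not offer the degree of reliability that ACID provides.

  • Consistency: most NoSQL databases are not fully consistent at all times. There might be situations where the database enters an inconsistent state.

  • No joins: most NoSQL databases cannot perform joins, so related data has to be combined by the application.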

References
1. NoSQL: http://en.wikipedia.org/wiki/NoSQL.
2. Structured Storage: http://en.wikipedia.org/wiki/Structured_storage.
3. Bigtable: http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/bigtable-osdi06.pdf.
4. Dynamo: http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F07/Dynamo.pdf.
5. HBase: http://hadoop.apache.org/hbase/.
6. Cassandra: http://cassandra.apache.org/.
7. NoSQL: http://www.leavcom.com/pdf/NoSQL.pdf.
8. ACID: http://en.wikipedia.org/wiki/ACID.
9. BASE: http://queue.acm.org/detail.cfm?id=1394128.
10. Sharding: http://highscalability.com/unorthodox-approach-database-design-coming-shard.
11. CAP Theorem: http://people.csail.mit.edu/sethg/pubs/BrewersConjecture-SigAct.pdf.
12. MapReduce: http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/mapreduce-osdi04.pdf.