
Top 5 Big Data Vulnerability Classes


Recently, we were pentesting a data mining and analytics company. The amount of data they handle is phenomenal, and they are planning to move to Big Data. They invited me to write a blog post on the state of the art in Big Data security concerns and challenges, and I happily accepted.

Key Insights on Existing Big Data Architecture

Big Data is fundamentally different from traditional relational databases in terms of requirements and architecture. It is often characterized by the 3Vs: Volume, Velocity and Variety of data. Some of the fundamental differences in Big Data architecture are as follows:

  • Distributed Architecture: Big Data architecture is highly distributed, on the scale of thousands of data and processing nodes. Data is horizontally partitioned, replicated and distributed among the available data nodes. As a result, Big Data architecture is generally highly resilient and fault tolerant.
  • Real-Time, Stream and Continuous Computations: Performing computation in real time and continuously is the next trend in Big Data, alongside the batch processing model supported by Hadoop.
  • Ad-hoc Queries: Big Data enables knowledge workers to create and execute data analysis queries on the fly.
  • Parallel and Powerful Programming Language: The computations performed in Big Data are much more complex, highly parallel and computationally intensive than traditional SQL or PL/SQL queries. For example, Hadoop uses the MapReduce framework to perform computations on data processing nodes. MapReduce programs are typically written in Java.
  • Move the Code: In Big Data, it is easier to move the code to the data than to move the data to the code.
  • Non-Relational Data: In a major departure from traditional relational databases, the data stored in Big Data systems is non-relational. The main advantage of non-relational data is that it can accommodate a large volume and variety of data.
  • Auto-tiering: In Big Data, the hottest data blocks are tiered onto higher-performance media, while the coldest data is sent to lower-cost, high-capacity drives. As a result, it is extremely difficult to know precisely where a given piece of data is located among the available data nodes.
  • Variety of Input Data Sources: Big Data requires collecting data from many sources, such as logs, end-point devices and social media.
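
The MapReduce model mentioned above can be illustrated with a minimal, single-process sketch. It is written in Python rather than Java for brevity and does not use Hadoop at all. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (word, 1) pairs for one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


def run_job(partitions):
    """Run the job over partitions; each partition stands in for one data node."""
    mapped = []
    for partition in partitions:
        for record in partition:
            mapped.extend(map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job([["big data big"], ["data nodes"]]))
```

In a real cluster the map step would run on the node that already holds the partition, which is the "move the code" idea from the list above.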

Finally, there is no silver bullet in Big Data in terms of data model. Hadoop is already outdated and unsuitable for many Big Data problems. Some of the emerging Big Data solutions are as follows:

  • For real-time analytics: Cloudscale, Storm
  • For graph computation: Giraph and Pregel (examples of graph computation include shortest paths and degrees of separation)
  • For low-latency queries over very large data sets: Dremel

Top 5 Big Data Vulnerability Classes

1. Insecure Computation

There are many ways an insecure program can create serious security challenges for a Big Data solution, including:

  • An insecure program can access sensitive data such as personal profiles, ages and credit card numbers.
  • An insecure program can corrupt the data, leading to incorrect results.
  • An insecure program can mount a Denial of Service attack against your Big Data solution, leading to financial loss.

2. End-point Input Validation/Filtering

Big Data collects data from a variety of sources. There are two fundamental challenges in the data collection process:

  • Input Validation: How can we trust the data? What kind of data is untrusted? Which data sources are untrusted?
  • Data Filtering: How can we filter out rogue or malicious data?

The volume of data collected in Big Data makes it difficult to validate and filter data on the fly.

The behavioral aspect of data poses additional challenges for input validation and filtering. Traditional signature-based data filtering may not solve the problem completely. For example, a rogue or malicious data source can insert large amounts of legitimate-looking but incorrect data into the system to influence prediction results.
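
The difference between signature-style validation and a behavioral check can be sketched as follows. This is only an illustration under stated assumptions: the record layout (`source`, `value`), the accepted range of 0 to 1000, and the use of a modified z-score based on the median absolute deviation are all choices made for this example, not something prescribed by any Big Data product.

```python
from statistics import median


def is_well_formed(record):
    """Signature-style check: required fields, types and an assumed value range."""
    value = record.get("value")
    return (
        isinstance(record.get("source"), str)
        and isinstance(value, (int, float))
        and not isinstance(value, bool)
        and 0 <= value <= 1000
    )


def suspicious_sources(records, threshold=3.5):
    """Behavioral check: flag sources whose typical value is far from the other sources."""
    by_source = {}
    for record in records:
        by_source.setdefault(record["source"], []).append(record["value"])
    medians = {source: median(values) for source, values in by_source.items()}
    centre = median(medians.values())
    mad = median(abs(m - centre) for m in medians.values())
    if mad == 0:
        # More than half of the sources agree exactly; flag any that differ.
        return {s for s, m in medians.items() if m != centre}
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return {s for s, m in medians.items() if 0.6745 * abs(m - centre) / mad > threshold}
```

A record such as `{"source": "d", "value": 900}` passes `is_well_formed`, yet source `d` can still be flagged by `suspicious_sources` if every other source reports values near 10. That is the case signature-based filtering alone would miss.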

3. Granular Access Control

Existing Big Data solutions are designed for performance and scalability, with almost no security in mind. Traditional relational databases have fairly comprehensive access control features at the level of users, tables, rows and even individual cells. However, several fundamental challenges prevent Big Data solutions from providing comprehensive access control:

  • Security of Big Data is still an area of ongoing research.
  • The non-relational nature of the data breaks the traditional paradigm of table-, row- or cell-level access control. Current NoSQL databases depend on third-party solutions or application middleware to provide access control.
  • Ad-hoc queries pose an additional access control challenge. For comparison, imagine if end users could submit arbitrary SQL queries directly to a relational database.
  • In many Big Data solutions, access control is disabled by default.
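
As an illustration of access control enforced in application middleware, here is a minimal sketch of field-level filtering for schemaless documents. The roles, field names and the `authorize_view` function are hypothetical. A real deployment would load policy from a managed store and integrate with the data store's own mechanisms where they exist.

```python
# Hypothetical policy: role name -> set of fields that role may read.
POLICY = {
    "analyst": {"age_band", "region"},
    "billing": {"age_band", "region", "card_last4"},
}


def authorize_view(role, document, policy=POLICY):
    """Return only the fields the role may read; unknown roles see nothing."""
    allowed = policy.get(role, set())
    return {key: value for key, value in document.items() if key in allowed}


if __name__ == "__main__":
    doc = {"name": "A. User", "age_band": "30-39", "region": "EU", "card_last4": "4242"}
    print(authorize_view("analyst", doc))
```

The sketch denies by default: a role that is missing from the policy gets an empty view, which is the opposite of shipping with access control switched off.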

4. Insecure Data Storage and Communication

There are multiple challenges related to data storage and communication in Big Data:

  • Data is stored across many distributed data nodes. Authentication, authorization and encryption of data are a challenge at each node.
  • Auto-tiering: Automatic partitioning and movement of data can place sensitive data on a lower-cost, less secure medium.
  • Real-time analytics and continuous computation require low query latency, so encryption and decryption may add unwelcome performance overhead.
  • Secure communication among nodes, middleware and end users is another area of concern.
  • The transaction logs of a Big Data system are themselves big data and should be protected in the same way as the data.
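
The auto-tiering concern above can be made concrete with a small placement sketch that treats sensitivity as a hard constraint and temperature as a preference. The `Tier` and `Block` types, the `hot_threshold` value and the assumption that a higher relative cost means faster media are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cost: int        # relative cost; assumed here to track performance
    encrypted: bool


@dataclass(frozen=True)
class Block:
    block_id: str
    access_count: int
    sensitive: bool


def choose_tier(block, tiers, hot_threshold=100):
    """Pick a tier by temperature, never placing sensitive data on an unencrypted tier."""
    candidates = [t for t in tiers if t.encrypted or not block.sensitive]
    if not candidates:
        raise ValueError("no tier satisfies the sensitivity constraint")
    if block.access_count >= hot_threshold:
        return max(candidates, key=lambda t: t.cost)  # hot data: fastest allowed tier
    return min(candidates, key=lambda t: t.cost)      # cold data: cheapest allowed tier
```

With an encrypted SSD tier, an encrypted disk tier and an unencrypted archive tier, a cold but sensitive block lands on the encrypted disk tier rather than the cheaper archive.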

5. Privacy Preserving Data Mining and Analytics

Monetization of Big Data generally involves data mining and analytics. However, monetizing and sharing Big Data analytics raises many security concerns that must be addressed, including invasion of privacy, invasive marketing and unintentional disclosure of sensitive information.

For example, AOL released anonymized search logs for academic purposes, but users were easily identified by their searches. Netflix faced a similar problem when users in its anonymized data set were identified by correlating their Netflix movie scores with IMDb scores.
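
One common way to reason about this kind of re-identification risk is k-anonymity, which is not mentioned in the incidents above and is introduced here only as an illustration. A data set is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k rows. The column names below are hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


if __name__ == "__main__":
    rows = [
        {"zip": "12345", "age": 30, "rating": 5},
        {"zip": "12345", "age": 30, "rating": 2},
        {"zip": "99999", "age": 45, "rating": 4},
    ]
    print(k_anonymity(rows, ["zip", "age"]))
```

A result of 1 means at least one row is unique on its quasi-identifiers, so anyone who knows those values from another source can single that person out even though names were removed.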



