IGNOU MCS-221 Solved Question Paper PDF Download

The IGNOU MCS-221 Solved Question Paper PDF Download page gathers exam resources for MCS-221 in one place. Here you can find solved IGNOU previous year question papers in PDF format, covering the important questions with detailed answers. All previous year question papers are combined into one PDF, which makes preparation easier.

IGNOU MCS-221 Solved Question Paper in Hindi
IGNOU MCS-221 Solved Question Paper in English
IGNOU Previous Year Solved Question Papers (All Courses)

Whether you are looking for solved IGNOU previous year question papers in English or in Hindi, this page offers both. These solved papers help you understand exam patterns, improve answer writing skills, and build confidence for upcoming exams.

IGNOU MCS-221 Solved Question Paper PDF

IGNOU Previous Year Solved Question Papers

This section provides the IGNOU MCS-221 solved question paper in both Hindi and English. The solved previous year papers include detailed answers to help you understand exam patterns and improve your preparation. You can also access all previous year question papers in one PDF for quick revision before exams.

IGNOU MCS-221 Previous Year Solved Question Paper in Hindi

Q1. (a) एक ब्लॉक डायग्राम की मदद से, डेटा वेयरहाउस की एक्सट्रेक्ट, ट्रांसफॉर्म और लोडिंग (ETL) प्रक्रिया की व्याख्या करें। यह भी चर्चा करें कि यह ELT प्रक्रिया से कैसे अलग है। (b) एक उदाहरण की मदद से, नॉइजी डेटा की समस्या को हल करने के लिए बिनिंग विधि की व्याख्या करें। (c) निम्नलिखित टेक्स्ट ट्रांसफॉर्मेशन तकनीक पर चर्चा करें जो टेक्स्ट वाक्यों को न्यूमेरिक वैक्टर में बदलने में मदद करती है: “बैग-ऑफ-वर्ड्स (BoW)”। एक उदाहरण भी दें। BoW का उपयोग करने के नुकसानों का उल्लेख करें। (d) K-नियरेस्ट नेबर्स एल्गोरिथम लिखें और समझाएं। इसके फायदे और नुकसान का भी उल्लेख करें।

Ans.

(a) एक्सट्रेक्ट, ट्रांसफॉर्म और लोड (ETL) प्रक्रिया

ETL एक डेटा इंटीग्रेशन प्रक्रिया है जो कई स्रोतों से डेटा एकत्र करती है, इसे एक सुसंगत और प्रयोग करने योग्य प्रारूप में बदलती है, और फिर इसे एक गंतव्य सिस्टम, आमतौर पर एक डेटा वेयरहाउस, में लोड करती है। यह डेटा वेयरहाउसिंग का एक मूलभूत घटक है।

ETL प्रक्रिया के चरण:

एक्सट्रेक्ट (Extract): इस चरण में, विभिन्न स्रोतों से डेटा निकाला जाता है। ये स्रोत रिलेशनल डेटाबेस (जैसे Oracle, SQL सर्वर), फ्लैट फाइलें (जैसे CSV, XML), NoSQL डेटाबेस या API हो सकते हैं। डेटा को स्रोत सिस्टम पर न्यूनतम प्रभाव के साथ कुशलतापूर्वक निकाला जाता है।
ट्रांसफॉर्म (Transform): यह सबसे जटिल चरण है। निकाले गए कच्चे डेटा को साफ, मान्य और रूपांतरित किया जाता है। सामान्य परिवर्तनों में शामिल हैं:
- क्लीनिंग: असंगत डेटा को ठीक करना या हटाना (जैसे “M”, “Male” को “Male” में मानकीकृत करना)।
- फ़िल्टरिंग: केवल आवश्यक डेटा का चयन करना।
- एग्रीगेशन: डेटा को सारांशित करना (जैसे, दैनिक बिक्री को मासिक बिक्री में बदलना)।
- जॉइनिंग: कई स्रोतों से संबंधित डेटा को मिलाना।
यह परिवर्तन एक अलग स्टेजिंग क्षेत्र में होता है ताकि स्रोत और गंतव्य सिस्टम पर बोझ न पड़े।
लोड (Load): रूपांतरित डेटा को अंतिम गंतव्य, यानी डेटा वेयरहाउस में लोड किया जाता है। लोडिंग दो तरीकों से की जा सकती है:
- फुल लोड: सभी डेटा को वेयरहाउस में लोड किया जाता है, आमतौर पर पहली बार।
- इन्क्रिमेंटल लोड: केवल नए या बदले हुए डेटा को लोड किया जाता है, जो प्रक्रिया को तेज और अधिक कुशल बनाता है।

ब्लॉक डायग्राम:

[स्रोत 1, स्रोत 2, …] –> [ एक्सट्रेक्शन इंजन ] –> [ स्टेजिंग एरिया ] –> [ ट्रांसफॉर्मेशन इंजन (क्लीनिंग, एग्रीगेशन, आदि)] –> [ लोडिंग इंजन ] –> [ डेटा वेयरहाउस ]

ETL बनाम ELT

ELT (एक्सट्रेक्ट, लोड, ट्रांसफॉर्म) एक वैकल्पिक दृष्टिकोण है जहां डेटा को पहले निकाला जाता है, फिर सीधे गंतव्य सिस्टम (जैसे क्लाउड डेटा वेयरहाउस) में लोड किया जाता है, और फिर उस गंतव्य सिस्टम की प्रोसेसिंग पावर का उपयोग करके रूपांतरित किया जाता है।

मुख्य अंतर:

परिवर्तन का स्थान: ETL में, परिवर्तन एक अलग स्टेजिंग सर्वर पर होता है। ELT में, परिवर्तन लक्ष्य डेटा वेयरहाउस के भीतर होता है।
डेटा लोडिंग: ETL केवल रूपांतरित और अक्सर संरचित डेटा लोड करता है। ELT कच्चे डेटा को लोड करने की अनुमति देता है।
उपयुक्तता: ETL पारंपरिक, ऑन-प्रिमाइसेस डेटा वेयरहाउस के लिए बेहतर है। ELT क्लाउड-आधारित, स्केलेबल डेटा वेयरहाउस और डेटा लेक के लिए आदर्श है जो विशाल कंप्यूटिंग शक्ति प्रदान करते हैं।
प्रदर्शन: ELT बड़े डेटासेट के लिए तेज हो सकता है क्योंकि यह लक्ष्य सिस्टम की समानांतर प्रसंस्करण क्षमताओं का लाभ उठाता है।

(b) नॉइजी डेटा के लिए बिनिंग विधि

नॉइजी डेटा में अर्थहीन डेटा, त्रुटियां या आउटलायर्स होते हैं। यह माप त्रुटियों, डेटा प्रविष्टि समस्याओं या अन्य मुद्दों के कारण हो सकता है। बिनिंग एक डेटा स्मूथिंग तकनीक है जिसका उपयोग नॉइजी डेटा के प्रभाव को कम करने के लिए किया जाता है। इसमें डेटा मानों को छोटी, असतत श्रेणियों या “बिन” में विभाजित करना शामिल है।

बिनिंग विधि के चरण:

सॉर्टिंग: पहले, डेटा मानों को आरोही क्रम में सॉर्ट करें।
विभाजन: सॉर्ट किए गए डेटा को लगभग समान आकार के कई “बिन” में विभाजित करें।
स्मूथिंग: प्रत्येक बिन में मानों को एक प्रतिनिधि मान से बदलें। सामान्य तकनीकें हैं:
- बिन मीन्स द्वारा स्मूथिंग: प्रत्येक बिन में सभी मानों को बिन के औसत (mean) से बदलें।
- बिन मीडियन द्वारा स्मूथिंग: प्रत्येक बिन में सभी मानों को बिन के माध्यिका (median) से बदलें।
- बिन बाउंड्री द्वारा स्मूथिंग: प्रत्येक बिन में मानों को निकटतम सीमा मान (न्यूनतम या अधिकतम) से बदलें।

उदाहरण:

मान लीजिए कि हमारे पास आयु का निम्नलिखित नॉइजी डेटा है: 4, 8, 15, 21, 21, 24, 25, 28, 34

चरण 1 और 2: डेटा पहले से ही सॉर्ट किया हुआ है। आइए इसे समान गहराई (equal-depth) वाले 3 बिनों में विभाजित करें, जिनमें से प्रत्येक में 3 मान हों:

बिन 1: 4, 8, 15
बिन 2: 21, 21, 24
बिन 3: 25, 28, 34

चरण 3: स्मूथिंग

बिन मीन्स द्वारा स्मूथिंग:
- बिन 1 का माध्य: (4+8+15)/3 = 9. तो बिन 1 बन जाता है: 9, 9, 9
- बिन 2 का माध्य: (21+21+24)/3 = 22. तो बिन 2 बन जाता है: 22, 22, 22
- बिन 3 का माध्य: (25+28+34)/3 = 29. तो बिन 3 बन जाता है: 29, 29, 29
स्मूथ किया हुआ डेटा: 9, 9, 9, 22, 22, 22, 29, 29, 29
बिन बाउंड्री द्वारा स्मूथिंग:
- बिन 1: 4, 8, 15 -> 4, 4, 15 (8, 4 के करीब है; 8, 15 से दूर है)
- बिन 2: 21, 21, 24 -> 21, 21, 24 (कोई बदलाव नहीं क्योंकि प्रत्येक मान पहले से ही बिन की किसी सीमा के बराबर है)
- बिन 3: 25, 28, 34 -> 25, 25, 34 (28, 25 के करीब है; 28, 34 से दूर है)
स्मूथ किया हुआ डेटा: 4, 4, 15, 21, 21, 24, 25, 25, 34
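
The worked example above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: the function names are hypothetical, bins are equal-depth, and a value that is equally far from both boundaries is sent to the lower one (the text does not specify a tie rule).

```python
def equal_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(ages, depth=3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)        # bin means: 9, 22 and 29
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by bin medians would follow the same pattern, with the mean replaced by the middle value of each sorted bin.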

(c) बैग-ऑफ-वर्ड्स (BoW)

बैग-ऑफ-वर्ड्स (BoW) प्राकृतिक भाषा प्रसंस्करण (NLP) में उपयोग की जाने वाली एक टेक्स्ट प्रतिनिधित्व तकनीक है। यह टेक्स्ट (जैसे एक वाक्य या एक दस्तावेज़) को एक संख्यात्मक वेक्टर में परिवर्तित करती है। यह मॉडल टेक्स्ट में शब्दों के क्रम या व्याकरण की परवाह नहीं करता है, लेकिन यह शब्दों की आवृत्ति (frequency) को बनाए रखता है। यह इसे शब्दों का एक “बैग” बनाता है, जहां शब्दों के बीच संबंध खो जाता है।

प्रक्रिया:

शब्दावली निर्माण (Vocabulary Creation): पूरे कॉर्पस (सभी दस्तावेज़ों का संग्रह) से सभी अद्वितीय शब्दों की एक सूची बनाएं।
वेक्टर निर्माण (Vector Creation): प्रत्येक दस्तावेज़ के लिए, एक वेक्टर बनाएं जिसका आकार शब्दावली के आकार के बराबर हो। प्रत्येक वेक्टर तत्व उस दस्तावेज़ में संबंधित शब्द की आवृत्ति (या उपस्थिति) का प्रतिनिधित्व करता है।

उदाहरण:

निम्नलिखित दो वाक्यों पर विचार करें:

S1: “The cat sat on the mat.”

S2: “The dog ate the cat.”

चरण 1: शब्दावली निर्माण

(सामान्य शब्दों जैसे ‘the’ को हटाने और केस को सामान्य करने के बाद)

शब्दावली: {cat, sat, on, mat, dog, ate}

चरण 2: वेक्टर निर्माण

प्रत्येक वाक्य को अब इस शब्दावली के आधार पर एक वेक्टर द्वारा दर्शाया जाएगा:

S1 वेक्टर: [1, 1, 1, 1, 0, 0] (cat:1, sat:1, on:1, mat:1, dog:0, ate:0)
S2 वेक्टर: [1, 0, 0, 0, 1, 1] (cat:1, sat:0, on:0, mat:0, dog:1, ate:1)

ये वैक्टर अब मशीन लर्निंग मॉडल के लिए इनपुट के रूप में उपयोग किए जा सकते हैं।
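
The two steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the helper names are hypothetical, and the stop-word set contains only "the", mirroring the assumption made in the example.

```python
import re

STOP_WORDS = {"the"}  # assumption: only "the" is dropped, as in the example


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def build_vocabulary(sentences):
    """Collect the unique words of the corpus in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in vocab:
                vocab.append(word)
    return vocab


def vectorize(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = tokenize(sentence)
    return [tokens.count(word) for word in vocab]


s1 = "The cat sat on the mat."
s2 = "The dog ate the cat."
vocab = build_vocabulary([s1, s2])
v1 = vectorize(s1, vocab)
v2 = vectorize(s2, vocab)
print(vocab)  # ['cat', 'sat', 'on', 'mat', 'dog', 'ate']
print(v1)     # [1, 1, 1, 1, 0, 0]
print(v2)     # [1, 0, 0, 0, 1, 1]

# Word order is lost: both sentences get the same vector.
order_vocab = build_vocabulary(["man bites dog", "dog bites man"])
same = vectorize("man bites dog", order_vocab) == vectorize("dog bites man", order_vocab)
print(same)   # True
```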

BoW के नुकसान (Drawbacks):

अर्थ और संदर्भ का अभाव (Loss of Semantics and Context): BoW शब्दों के बीच के संबंध और वाक्य के अर्थ को खो देता है। उदाहरण के लिए, “man bites dog” और “dog bites man” के BoW प्रतिनिधित्व समान होंगे, जबकि उनके अर्थ बहुत अलग हैं।
शब्द क्रम की उपेक्षा (Ignores Word Order): चूँकि यह शब्दों को एक “बैग” के रूप में मानता है, यह व्याकरण और वाक्य-विन्यास को पूरी तरह से नजरअंदाज कर देता है।
उच्च आयामीता (High Dimensionality): एक बड़ी शब्दावली के साथ, परिणामी वैक्टर बहुत बड़े और विरल (sparse) हो जाते हैं (अधिकांश तत्व शून्य होते हैं), जिससे कम्प्यूटेशनल समस्याएं हो सकती हैं।
आउट-ऑफ-वोकैबुलरी (OOV) शब्द: मॉडल उन शब्दों को संभाल नहीं सकता जो प्रशिक्षण के दौरान शब्दावली में नहीं थे।

(d) K-नियरेस्ट नेबर्स (KNN) एल्गोरिथम

K-नियरेस्ट नेबर्स (KNN) एक पर्यवेक्षित मशीन लर्निंग एल्गोरिथम है जिसका उपयोग वर्गीकरण (classification) और प्रतिगमन (regression) दोनों कार्यों के लिए किया जा सकता है। यह एक गैर-पैरामीट्रिक और आलसी (lazy) एल्गोरिथम है।

गैर-पैरामीट्रिक: इसका मतलब है कि यह अंतर्निहित डेटा वितरण के बारे में कोई धारणा नहीं बनाता है।
आलसी: इसका मतलब है कि यह प्रशिक्षण चरण के दौरान कोई मॉडल नहीं बनाता है। यह केवल प्रशिक्षण डेटासेट को संग्रहीत करता है और सभी गणनाएं भविष्यवाणी के समय करता है।

एल्गोरिथम:

एक नए, अनदेखे डेटा बिंदु का वर्गीकरण करने के लिए, KNN निम्नलिखित कदम उठाता है:

K का मान चुनें: एक पूर्णांक K चुनें, जो पड़ोसियों की संख्या है जिन पर विचार किया जाएगा। यह एक हाइपरपैरामीटर है।
दूरी की गणना करें: नए डेटा बिंदु और प्रशिक्षण डेटासेट में प्रत्येक डेटा बिंदु के बीच की दूरी की गणना करें। सामान्य दूरी मेट्रिक्स में यूक्लिडियन दूरी, मैनहट्टन दूरी या मिन्कोव्स्की दूरी शामिल हैं।
K निकटतम पड़ोसियों को पहचानें: गणना की गई दूरियों के आधार पर K सबसे छोटे दूरी वाले डेटा बिंदुओं (पड़ोसियों) का चयन करें।
भविष्यवाणी करें:
- वर्गीकरण (Classification) के लिए: नए डेटा बिंदु को K पड़ोसियों में सबसे आम वर्ग (majority vote) में असाइन करें।
- प्रतिगमन (Regression) के लिए: नए डेटा बिंदु का मान K पड़ोसियों के मानों का औसत (average) या माध्यिका (median) होता है।
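
The four steps above can be sketched as follows. This is a brute-force illustration using Euclidean distance; the function names and the toy data points are assumptions made for the example, and ties in the majority vote are resolved in favour of the class seen first among the neighbours.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k, task="classification"):
    """train is a list of (features, target) pairs; returns the prediction for query."""
    # Steps 2 and 3: compute all distances and keep the k closest training points.
    neighbours = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    targets = [target for _, target in neighbours]
    # Step 4: majority vote for classification, mean for regression.
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]
    return sum(targets) / len(targets)


# Hypothetical toy data: two well-separated classes.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
label_near_a = knn_predict(train, (2, 2), k=3)
label_near_b = knn_predict(train, (6, 5), k=3)

train_reg = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]
value = knn_predict(train_reg, (2,), k=3, task="regression")
print(label_near_a, label_near_b, value)  # A B 20.0
```

Because every prediction scans the whole training set, this sketch also makes the prediction-time cost of KNN visible.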

KNN के लाभ (Advantages):

सरलता: एल्गोरिथम को समझना और लागू करना बहुत सरल है।
कोई प्रशिक्षण चरण नहीं: चूंकि यह एक आलसी एल्गोरिथम है, इसलिए इसे मॉडल बनाने के लिए प्रशिक्षण समय की आवश्यकता नहीं होती है।
लचीलापन: यह आसानी से बहु-वर्गीय वर्गीकरण और प्रतिगमन के लिए अनुकूल हो सकता है।
गैर-रैखिक डेटा के लिए अच्छा: यह जटिल निर्णय सीमाएं बना सकता है और गैर-रैखिक डेटा के साथ अच्छी तरह से काम कर सकता है।

KNN के नुकसान (Disadvantages):

उच्च कम्प्यूटेशनल लागत: भविष्यवाणी का समय लंबा होता है क्योंकि इसे प्रत्येक भविष्यवाणी के लिए पूरे प्रशिक्षण सेट से दूरी की गणना करनी पड़ती है।
मेमोरी गहन: इसे पूरे प्रशिक्षण डेटासेट को मेमोरी में संग्रहीत करने की आवश्यकता होती है।
K के मान के प्रति संवेदनशीलता: प्रदर्शन K के मान पर बहुत अधिक निर्भर करता है। एक बहुत छोटा K शोर के प्रति संवेदनशील होता है, और एक बहुत बड़ा K वर्गों के बीच की सीमा को धुंधला कर सकता है।
आयामों का अभिशाप (Curse of Dimensionality): यह उच्च-आयामी डेटा के साथ अच्छा प्रदर्शन नहीं करता है क्योंकि दूरियां कम सार्थक हो जाती हैं।
फ़ीचर स्केलिंग के प्रति संवेदनशील (Sensitive to Feature Scaling): विभिन्न पैमानों वाली विशेषताओं (features) को सामान्यीकृत करने की आवश्यकता होती है, अन्यथा बड़ी रेंज वाली विशेषताएं दूरी की गणना पर हावी हो जाएंगी।

Q2. (a) डेटा माइनिंग में क्लस्टरिंग तकनीक क्या है? एक उदाहरण दें। निम्नलिखित क्लस्टरिंग विधियों को संक्षेप में समझाएं: (i) घनत्व-आधारित विधि (ii) बाधा-आधारित विधि (b) एक विशिष्ट NLP प्रणाली में उपयोग किए जाने वाले निम्नलिखित टेक्स्ट प्रीप्रोसेसिंग चरणों की संक्षेप में व्याख्या करें: (i) सेगमेंटेशन (ii) टोकनाइजेशन (iii) स्टॉप वर्ड्स को हटाना (iv) स्टेमिंग

Ans.

(a) क्लस्टरिंग तकनीक

डेटा माइनिंग में क्लस्टरिंग एक अनसुपरवाइज्ड लर्निंग तकनीक है। इसका उद्देश्य डेटा पॉइंट्स के एक सेट को कई समूहों या “क्लस्टर्स” में विभाजित करना है, ताकि एक ही क्लस्टर के डेटा पॉइंट्स दूसरे क्लस्टर्स के डेटा पॉइंट्स की तुलना में एक-दूसरे से अधिक समान हों। समानता को आमतौर पर दूरी के मीट्रिक के आधार पर मापा जाता है, जैसे यूक्लिडियन दूरी। क्लस्टरिंग में, डेटा के लिए कोई पूर्वनिर्धारित लेबल नहीं होते हैं; एल्गोरिथ्म डेटा में प्राकृतिक समूहों को स्वयं खोजता है।

उदाहरण: एक मार्केटिंग टीम ग्राहकों के व्यवहार के आधार पर उन्हें अलग-अलग खंडों में बांटने के लिए क्लस्टरिंग का उपयोग कर सकती है। उदाहरण के लिए, एक क्लस्टर “उच्च-खर्च करने वाले, बार-बार आने वाले ग्राहक” का हो सकता है, जबकि दूसरा “बजट-सचेत, कभी-कभार खरीदारी करने वाले” का हो सकता है। यह कंपनियों को प्रत्येक समूह के लिए लक्षित मार्केटिंग अभियान बनाने में मदद करता है।

क्लस्टरिंग विधियाँ:

(i) घनत्व-आधारित विधि (Density-based method):

घनत्व-आधारित क्लस्टरिंग विधियाँ डेटा पॉइंट्स के घनत्व के आधार पर क्लस्टर बनाती हैं। ये विधियाँ मानती हैं कि क्लस्टर उच्च घनत्व वाले क्षेत्र होते हैं जो कम घनत्व वाले क्षेत्रों द्वारा अलग किए जाते हैं। ये विधियाँ मनमाने आकार के क्लस्टर खोज सकती हैं और नॉइज़ (आउटलायर्स) को संभालने में बहुत प्रभावी होती हैं।

मुख्य विचार: एक बिंदु एक क्लस्टर का हिस्सा होता है यदि उसके पड़ोस में एक निश्चित संख्या में अन्य बिंदु (एक न्यूनतम घनत्व) होते हैं।
उदाहरण एल्गोरिथ्म: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)। DBSCAN को दो मापदंडों की आवश्यकता होती है: `epsilon` (एक बिंदु के पड़ोस की त्रिज्या) और `min_points` (एक बिंदु को कोर बिंदु मानने के लिए `epsilon` त्रिज्या के भीतर आवश्यक बिंदुओं की न्यूनतम संख्या)। यह कोर पॉइंट्स, बॉर्डर पॉइंट्स और नॉइज़ पॉइंट्स की पहचान करता है, जिससे गैर-गोलाकार आकार के क्लस्टर बन सकते हैं।
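
A compact, unoptimised sketch of DBSCAN in plain Python may help make the roles of `epsilon` and `min_points` concrete. The function names, the sample points, and the convention that a point counts itself as part of its own neighbourhood are assumptions of this sketch; production libraries use spatial indexes instead of the quadratic scan shown here.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indexes of all points within eps of points[i] (the point itself included)."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]


def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE          # may later be relabelled as a border point
            continue
        cluster += 1                   # points[i] is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is also core: keep expanding
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (5, 15)]
labels = dbscan(points, eps=1.5, min_points=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```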

(ii) बाधा-आधारित विधि (Constraint-based method):

बाधा-आधारित क्लस्टरिंग पारंपरिक क्लस्टरिंग का एक विस्तार है जो क्लस्टरिंग प्रक्रिया में डोमेन ज्ञान या उपयोगकर्ता की प्राथमिकताओं को शामिल करता है। ये बाधाएं मार्गदर्शन करती हैं कि एल्गोरिथ्म डेटा को कैसे समूहित करता है, जिससे परिणाम अधिक प्रासंगिक और उपयोगी बनते हैं।

बाधाओं के प्रकार:
- मस्ट-लिंक (Must-link): यह निर्दिष्ट करता है कि दो डेटा पॉइंट्स को एक ही क्लस्टर में होना चाहिए। उदाहरण: एक ही व्यक्ति के दो अलग-अलग ग्राहक रिकॉर्ड।
- कैनॉट-लिंक (Cannot-link): यह निर्दिष्ट करता है कि दो डेटा पॉइंट्स को अलग-अलग क्लस्टर में होना चाहिए। उदाहरण: प्रतिस्पर्धी कंपनियों से संबंधित डेटा।
- आकार या संख्या संबंधी बाधाएं: क्लस्टरों के आकार, संख्या या बिंदुओं की कुल संख्या पर सीमाएं।
उपयोग: यह तब उपयोगी होता है जब विशुद्ध रूप से डेटा-संचालित क्लस्टरिंग व्यावसायिक तर्क या ज्ञात तथ्यों के विपरीत परिणाम देती है। बाधाओं को शामिल करके, अंतिम क्लस्टरिंग उपयोगकर्ता की अपेक्षाओं के साथ बेहतर ढंग से संरेखित होती है।
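
One way such constraints can be enforced is to check every proposed assignment against the must-link and cannot-link pairs before accepting it. The sketch below shows only that check, not a full clustering algorithm; the function name, record identifiers, and data are hypothetical.

```python
def violates_constraints(point, cluster, assignment, must_link, cannot_link):
    """Return True if putting `point` into `cluster` breaks any pairwise constraint.

    assignment maps already-clustered points to their cluster ids.
    """
    def partner(pair):
        # The other member of the pair, or None if `point` is not in it.
        a, b = pair
        if a == point:
            return b
        if b == point:
            return a
        return None

    for pair in must_link:
        other = partner(pair)
        if other in assignment and assignment[other] != cluster:
            return True   # the must-link partner sits in a different cluster
    for pair in cannot_link:
        other = partner(pair)
        if other in assignment and assignment[other] == cluster:
            return True   # the cannot-link partner sits in the same cluster
    return False


# Hypothetical records: rec1/rec2 belong to one person, rivalA/rivalB are competitors.
assignment = {"rec1": 0, "rivalA": 1}
must_link = [("rec1", "rec2")]
cannot_link = [("rivalA", "rivalB")]

print(violates_constraints("rec2", 1, assignment, must_link, cannot_link))    # True
print(violates_constraints("rec2", 0, assignment, must_link, cannot_link))    # False
print(violates_constraints("rivalB", 1, assignment, must_link, cannot_link))  # True
```

A constrained clustering loop would call a check like this for each candidate cluster and pick the nearest one that returns False.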

(b) टेक्स्ट प्रीप्रोसेसिंग चरण

टेक्स्ट प्रीप्रोसेसिंग किसी भी प्राकृतिक भाषा प्रसंस्करण (NLP) प्रणाली में एक महत्वपूर्ण पहला कदम है। इसका उद्देश्य कच्चे टेक्स्ट डेटा को साफ करना और इसे एक ऐसे प्रारूप में बदलना है जो मशीन लर्निंग मॉडल के लिए उपयुक्त हो। यहाँ कुछ सामान्य चरण दिए गए हैं:

(i) सेगमेंटेशन (Segmentation):

यह टेक्स्ट के एक बड़े हिस्से को छोटी, सार्थक इकाइयों में तोड़ने की प्रक्रिया है। आमतौर पर, इसका मतलब है एक दस्तावेज़ को अलग-अलग वाक्यों में विभाजित करना। वाक्य सीमा का पता लगाना महत्वपूर्ण है क्योंकि वाक्य अक्सर विचार की एक पूरी इकाई का प्रतिनिधित्व करते हैं। यह विराम चिह्नों जैसे कि पूर्ण विराम (.), प्रश्न चिह्न (?), और विस्मयादिबोधक चिह्न (!) की पहचान करके किया जाता है। संक्षिप्त रूपों (जैसे “Dr.”) में आने वाला बिंदु वाक्य का अंत नहीं होता, इसलिए ऐसे मामलों को अलग से संभालना पड़ता है।

उदाहरण: “Dr. Smith lives in the U.S. Isn’t that interesting?” को दो खंडों में विभाजित किया जाएगा: “Dr. Smith lives in the U.S.” और “Isn’t that interesting?”।

(ii) टोकनाइजेशन (Tokenization):

सेगमेंटेशन के बाद, टोकनाइजेशन प्रत्येक वाक्य को अलग-अलग शब्दों या “टोकन” में तोड़ता है। ये टोकन विश्लेषण की मूल इकाइयाँ बन जाते हैं। टोकनाइजेशन आमतौर पर स्पेस और विराम चिह्नों के आधार पर किया जाता है।

उदाहरण: वाक्य “The cat sat on the mat” को टोकन में विभाजित किया जाएगा: [“The”, “cat”, “sat”, “on”, “the”, “mat”]।

(iii) स्टॉप वर्ड्स को हटाना (Removal of stop words):

स्टॉप वर्ड्स बहुत आम शब्द होते हैं जो अक्सर किसी वाक्य के अर्थ में बहुत कम या कोई जानकारी नहीं जोड़ते हैं (जैसे “a”, “an”, “the”, “in”, “is”)। इन शब्दों को हटाने से डेटा का आकार कम हो जाता है और मॉडल को अधिक सार्थक शब्दों पर ध्यान केंद्रित करने में मदद मिलती है। कई NLP लाइब्रेरी (जैसे NLTK, spaCy) में स्टॉप वर्ड्स की एक पूर्वनिर्धारित सूची होती है, जिसे विशिष्ट एप्लिकेशन के लिए अनुकूलित किया जा सकता है।

उदाहरण: टोकन की सूची [“The”, “cat”, “sat”, “on”, “the”, “mat”] से स्टॉप वर्ड्स (“The”, “on”, “the”) को हटाने के बाद, यह बन जाएगी: [“cat”, “sat”, “mat”]।

(iv) स्टेमिंग (Stemming):

स्टेमिंग एक शब्द को उसके मूल या “स्टेम” रूप में कम करने की प्रक्रिया है। इसका उद्देश्य संबंधित शब्दों (जैसे “running”, “runs”) को एक ही टोकन (“run”) में मैप करना है। यह शब्दावली के आकार को कम करता है और यह सुनिश्चित करता है कि एक ही मूल अवधारणा वाले शब्दों को समान माना जाए। स्टेमिंग एक नियम-आधारित प्रक्रिया है जो अक्सर शब्द के अंत से प्रत्यय (suffixes) को काट देती है, जिससे कभी-कभी ऐसे स्टेम बन सकते हैं जो वास्तविक शब्द नहीं होते हैं (जैसे “studies” -> “studi”)। एक और परिष्कृत तकनीक लेमेटाइजेशन (Lemmatization) है, जो शब्द के अर्थ को ध्यान में रखकर उसे उसके शब्दकोश रूप (lemma) में बदलती है; “ran” -> “run” जैसे अनियमित रूपों के लिए इसी की आवश्यकता होती है, क्योंकि केवल प्रत्यय काटने से वे नहीं संभलते।

उदाहरण: स्टेमिंग का उपयोग करके, शब्द “computing”, “computer”, “computed” सभी “comput” में कम हो सकते हैं।
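
The four preprocessing steps can be chained in a small sketch using only the standard library. Everything here is a deliberately naive assumption: the abbreviation list, the stop-word list, and the suffix-stripping `stem` function are hypothetical stand-ins, not the Porter stemmer or any library's actual rules. In particular, an abbreviation that also ends a sentence (such as "U.S." in the example above) would need a smarter rule than the one shown, so the sample text avoids that case.

```python
import re

ABBREVIATIONS = {"dr.", "mr.", "mrs."}            # hypothetical, tiny list
STOP_WORDS = {"a", "an", "the", "in", "is", "on"}  # hypothetical, tiny list
SUFFIXES = ("ing", "ed", "er", "es", "s")          # naive suffix rules


def segment(text):
    """Split text into sentences at ., ? or !, skipping known abbreviations."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".?!" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", sentence)


def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


text = "Dr. Smith adopted a cat. The cat sat on the mat!"
sentences = segment(text)
tokens = tokenize(sentences[1])
content_words = remove_stop_words(tokens)
stems = [stem(w) for w in ("computing", "computer", "computed", "studies")]
print(sentences)      # ['Dr. Smith adopted a cat.', 'The cat sat on the mat!']
print(tokens)         # ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(content_words)  # ['cat', 'sat', 'mat']
print(stems)          # ['comput', 'comput', 'comput', 'studi']
```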

Q3. (a) एक डेटा वेयरहाउस में डायमेंशनल मॉडलिंग के स्नोफ्लेक और फैक्ट कंस्टेलेशन स्कीमा की व्याख्या प्रत्येक के लिए एक उदाहरण की मदद से करें। (b) प्रत्येक के लिए एक ब्लॉक डायग्राम के साथ निम्नलिखित प्रकार की OLAP आर्किटेक्चर का संक्षिप्त वर्णन करें: (i) ROLAP आर्किटेक्चर (ii) MOLAP आर्किटेक्चर

Ans.

(a) स्नोफ्लेक और फैक्ट कंस्टेलेशन स्कीमा

डायमेंशनल मॉडलिंग एक डेटाबेस डिजाइन तकनीक है जिसका उपयोग डेटा वेयरहाउस में किया जाता है। इसका लक्ष्य डेटा को इस तरह से संरचित करना है कि प्रश्नों को समझना और निष्पादित करना आसान हो।

1. स्नोफ्लेक स्कीमा (Snowflake Schema)

स्नोफ्लेक स्कीमा, स्टार स्कीमा का एक तार्किक विस्तार है। स्टार स्कीमा में, एक केंद्रीय फैक्ट टेबल सीधे कई डायमेंशन टेबल से जुड़ी होती है। स्नोफ्लेक स्कीमा में, कुछ डायमेंशन टेबल को नॉर्मलाइज किया जाता है, जिससे वे कई संबंधित टेबलों में विभाजित हो जाती हैं। यह रिडंडेंसी को कम करता है और डेटा इंटेग्रिटी में सुधार करता है, लेकिन इसके कारण प्रश्नों में अधिक जॉइन की आवश्यकता होती है, जो प्रदर्शन को प्रभावित कर सकता है। इसकी संरचना एक स्नोफ्लेक (बर्फ के टुकड़े) जैसी दिखती है, इसीलिए इसका यह नाम है।

उदाहरण:

एक सेल्स डेटा वेयरहाउस पर विचार करें।

फैक्ट टेबल: Sales (TransactionID, ProductKey, LocationKey, TimeKey, UnitsSold, Revenue)
डायमेंशन टेबल (स्टार स्कीमा में):
- DimProduct (ProductKey, ProductName, Category)
- DimLocation (LocationKey, Street, City, State, Country)
- DimTime (TimeKey, Day, Month, Year)

स्नोफ्लेक स्कीमा में,

DimLocation टेबल को नॉर्मलाइज किया जा सकता है:

DimLocation (LocationKey, Street, CityKey)
DimCity (CityKey, CityName, StateKey)
DimState (StateKey, StateName, CountryName)

इस प्रकार, Sales फैक्ट टेबल DimLocation से जुड़ती है, जो DimCity से जुड़ती है, और वह DimState से जुड़ती है, जिससे एक स्नोफ्लेक जैसी संरचना बनती है।
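
The join chain described above can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the product and time dimensions are omitted for brevity, and all sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimState (StateKey INTEGER PRIMARY KEY, StateName TEXT, CountryName TEXT);
CREATE TABLE DimCity (CityKey INTEGER PRIMARY KEY, CityName TEXT,
                      StateKey INTEGER REFERENCES DimState(StateKey));
CREATE TABLE DimLocation (LocationKey INTEGER PRIMARY KEY, Street TEXT,
                          CityKey INTEGER REFERENCES DimCity(CityKey));
-- Fact table reduced to the location dimension for this sketch.
CREATE TABLE Sales (TransactionID INTEGER PRIMARY KEY,
                    LocationKey INTEGER REFERENCES DimLocation(LocationKey),
                    UnitsSold INTEGER, Revenue REAL);

-- Hypothetical sample rows.
INSERT INTO DimState VALUES (1, 'Maharashtra', 'India'), (2, 'Karnataka', 'India');
INSERT INTO DimCity VALUES (10, 'Mumbai', 1), (11, 'Pune', 1), (12, 'Bengaluru', 2);
INSERT INTO DimLocation VALUES (100, 'MG Road', 10), (101, 'FC Road', 11),
                               (102, 'Brigade Road', 12);
INSERT INTO Sales VALUES (1, 100, 5, 500.0), (2, 101, 3, 300.0), (3, 102, 4, 400.0);
""")

# Revenue per state needs three joins in the snowflake layout;
# a star schema with a denormalised DimLocation would need only one.
rows = conn.execute("""
SELECT s.StateName, SUM(f.Revenue)
FROM Sales f
JOIN DimLocation l ON f.LocationKey = l.LocationKey
JOIN DimCity c ON l.CityKey = c.CityKey
JOIN DimState s ON c.StateKey = s.StateKey
GROUP BY s.StateName
ORDER BY s.StateName
""").fetchall()
print(rows)  # [('Karnataka', 400.0), ('Maharashtra', 800.0)]
```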

2. फैक्ट कंस्टेलेशन स्कीमा (Fact Constellation Schema)

फैक्ट कंस्टेलेशन स्कीमा, जिसे गैलेक्सी स्कीमा (Galaxy Schema) भी कहा जाता है, में कई फैक्ट टेबल होती हैं जो कुछ डायमेंशन टेबल को साझा करती हैं। यह एक जटिल संरचना है जो कई व्यावसायिक प्रक्रियाओं को एक ही डेटा वेयरहाउस मॉडल में एकीकृत करने की अनुमति देती है। अनिवार्य रूप से, यह कई स्टार स्कीमाओं का संग्रह है जो डायमेंशन टेबल साझा करते हैं।

उदाहरण:

एक रिटेल कंपनी के लिए एक मॉडल पर विचार करें जो बिक्री (sales) और शिपिंग (shipping) दोनों को ट्रैक करना चाहती है।

फैक्ट टेबल 1 (Sales): SalesFact (TimeKey, ProductKey, StoreKey, UnitsSold, Revenue)
फैक्ट टेबल 2 (Shipping): ShippingFact (TimeKey, ProductKey, ShipperKey, UnitsShipped, ShippingCost)

साझा डायमेंशन टेबल:

DimTime (TimeKey, Day, Month, Year)
DimProduct (ProductKey, ProductName, Brand)

अन्य डायमेंशन टेबल:

DimStore (StoreKey, StoreName, City) – केवल SalesFact से जुड़ा है।
DimShipper (ShipperKey, ShipperName) – केवल ShippingFact से जुड़ा है।

इस मॉडल में, SalesFact और ShippingFact दोनों DimTime और DimProduct डायमेंशन को साझा करते हैं। यह विश्लेषकों को उन प्रश्नों को करने की अनुमति देता है जो दोनों व्यावसायिक प्रक्रियाओं को पार करते हैं, जैसे “पिछले महीने एक विशिष्ट उत्पाद की कितनी इकाइयाँ बिकीं और कितनी भेजी गईं?”।

(b) OLAP आर्किटेक्चर

ऑनलाइन एनालिटिकल प्रोसेसिंग (OLAP) एक तकनीक है जो उपयोगकर्ताओं को बहु-आयामी दृष्टिकोण से डेटा का विश्लेषण करने में सक्षम बनाती है। OLAP सिस्टम को उनकी अंतर्निहित डेटा स्टोरेज आर्किटेक्चर के आधार पर वर्गीकृत किया जाता है।

(i) ROLAP (Relational OLAP) आर्किटेक्चर

ROLAP आर्किटेक्चर में, डेटा को एक मानक रिलेशनल डेटाबेस मैनेजमेंट सिस्टम (RDBMS) में स्टोर किया जाता है। OLAP सर्वर रिलेशनल डेटाबेस के ऊपर एक मध्यस्थ के रूप में कार्य करता है, जो उपयोगकर्ता के बहु-आयामी प्रश्नों को मानक SQL प्रश्नों में अनुवादित करता है। समरी और एग्रीगेट टेबल को अक्सर प्रदर्शन में सुधार के लिए प्री-कैलकुलेट और स्टोर किया जाता है, लेकिन विस्तृत डेटा रिलेशनल प्रारूप में ही रहता है।

ब्लॉक डायग्राम:

[ यूजर इंटरफेस ] <–> [ ROLAP सर्वर (मेटाडेटा मैनेजर, क्वेरी ट्रांसलेटर)] <–> [ रिलेशनल डेटाबेस (RDBMS) ]

|— [ डेटा वेयरहाउस (फैक्ट और डायमेंशन टेबल)]

विशेषताएँ:

स्केलेबिलिटी: RDBMS की क्षमताओं के कारण बड़ी मात्रा में डेटा को संभाल सकता है।
डेटा स्टोरेज: डेटा को अलग से कॉपी करने की आवश्यकता नहीं है, जिससे डेटा रिडंडेंसी कम होती है।
प्रदर्शन: जटिल प्रश्नों के लिए धीमा हो सकता है क्योंकि एग्रीगेशन की गणना रन-टाइम पर की जाती है।
लचीलापन: SQL का उपयोग करने के कारण अधिक लचीला है।

(ii) MOLAP (Multidimensional OLAP) आर्किटेक्चर

MOLAP आर्किटेक्चर में, डेटा को एक विशेष बहु-आयामी डेटाबेस (MDDB) या “क्यूब” में संग्रहीत किया जाता है। यह क्यूब डेटा का एक प्री-एग्रीगेटेड, अनुकूलित दृश्य है जिसे रिलेशनल डेटा स्रोत से निकाला जाता है। चूँकि सभी संभावित एग्रीगेशन और पदानुक्रम पहले से ही गणना और संग्रहीत कर लिए जाते हैं, MOLAP बहुत तेज़ क्वेरी प्रदर्शन प्रदान करता है। स्लाइसिंग, डाइसिंग और ड्रिल-डाउन जैसे ऑपरेशन अत्यंत तीव्र होते हैं।

ब्लॉक डायग्राम:

[ यूजर इंटरफेस ] <–> [ MOLAP सर्वर (क्यूब इंजन)] <–> [ बहु-आयामी क्यूब (MDDB) ]

|— (डेटा स्रोत से लोड) –> [ डेटा वेयरहाउस (RDBMS) ]

विशेषताएँ:

प्रदर्शन: विश्लेषण और क्वेरी के लिए बहुत तेज़ है क्योंकि डेटा प्री-एग्रीगेटेड होता है।
डेटा स्टोरेज: रिडंडेंट स्टोरेज की आवश्यकता होती है क्योंकि डेटा को क्यूब में कॉपी और संग्रहीत किया जाता है, जिससे स्टोरेज की आवश्यकता बढ़ जाती है।
स्केलेबिलिटी: ROLAP की तुलना में कम स्केलेबल है; बड़े क्यूब्स को प्रबंधित करना मुश्किल हो सकता है (“क्यूब एक्सप्लोजन” समस्या)।
विश्लेषण क्षमता: जटिल विश्लेषणात्मक गणनाओं के लिए बेहतर अनुकूल है।
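
The trade-off between the two architectures can be illustrated with a toy sketch: the ROLAP-style function aggregates detailed rows at query time, while the MOLAP-style cube pre-computes a total for every combination of dimension values and then answers by lookup. The fact rows, dimension names, and function names are assumptions made for this example; real OLAP servers are far more elaborate.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "city", "year")

# Hypothetical detailed fact rows.
facts = [
    {"product": "pen", "city": "Delhi", "year": 2023, "revenue": 100},
    {"product": "pen", "city": "Mumbai", "year": 2023, "revenue": 150},
    {"product": "book", "city": "Delhi", "year": 2024, "revenue": 300},
    {"product": "pen", "city": "Delhi", "year": 2024, "revenue": 50},
]


def rolap_query(rows, **filters):
    """ROLAP style: scan the detailed rows and aggregate at query time."""
    return sum(r["revenue"] for r in rows
               if all(r[d] == v for d, v in filters.items()))


def build_cube(rows):
    """MOLAP style: pre-aggregate every subset of dimensions once, up front."""
    cube = defaultdict(int)
    for r in rows:
        for n in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, n):
                key = tuple((d, r[d]) for d in dims)
                cube[key] += r["revenue"]
    return cube


def molap_query(cube, **filters):
    """Answer by a single lookup in the pre-computed cube."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return cube.get(key, 0)


cube = build_cube(facts)
print(rolap_query(facts, product="pen"), molap_query(cube, product="pen"))  # 300 300
print(molap_query(cube, product="pen", city="Delhi"))                       # 150
print(molap_query(cube, year=2024))                                         # 350
print(molap_query(cube))                                                    # 600
```

Each fact row touches 2^3 = 8 cube cells here; with more dimensions that count doubles per dimension, which is the storage growth behind the "cube explosion" problem mentioned above.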

Q4. (a) एक ब्लॉक डायग्राम के साथ रियल-टाइम डेटा वेयरहाउस (RTDW) आर्किटेक्चर की व्याख्या करें। उदाहरण दें जहां इस प्रकार के डेटा वेयरहाउस उपयोगी हैं। (b) डेटा मार्ट्स क्या हैं? वे डेटा वेयरहाउस से कैसे भिन्न हैं? डेटा मार्ट्स के डिजाइन पर चर्चा करें।

Ans.

(a) रियल-टाइम डेटा वेयरहाउस (RTDW) आर्किटेक्चर

एक रियल-टाइम डेटा वेयरहाउस (RTDW), जिसे एक्टिव डेटा वेयरहाउस भी कहा जाता है, एक ऐसा डेटा वेयरहाउस है जो ऑपरेशनल सिस्टम से डेटा को लगभग तुरंत (सेकंडों या मिनटों में) प्राप्त और एकीकृत करता है। यह पारंपरिक डेटा वेयरहाउस के विपरीत है, जो आमतौर पर बैचों में (अक्सर रात में) अपडेट होते हैं। RTDW का लक्ष्य निर्णय लेने वालों को अप-टू-द-मिनट डेटा प्रदान करना है, जिससे वे तेजी से बदलती व्यावसायिक स्थितियों पर तुरंत प्रतिक्रिया दे सकें।

RTDW आर्किटेक्चर:

एक RTDW आर्किटेक्चर को ऑपरेशनल स्रोतों से डेटा को न्यूनतम विलंबता के साथ पकड़ने, रूपांतरित करने और लोड करने के लिए डिज़ाइन किया गया है।

ब्लॉक डायग्राम:

[ ऑपरेशनल सिस्टम (OLTP) ] –> [ डेटा कैप्चर मैकेनिज्म (जैसे CDC, मैसेज क्यू)] –> [ रियल-टाइम ETL/ELT पाइपलाइन (स्ट्रीम प्रोसेसिंग इंजन जैसे Kafka Streams, Spark Streaming)] –> [ रियल-टाइम डेटा स्टोर ] –> [ एनालिटिक्स/BI टूल्स और डैशबोर्ड ]

घटक:

डेटा कैप्चर: डेटा को स्रोतों से लगभग तुरंत पकड़ा जाता है। इसके लिए चेंज डेटा कैप्चर (CDC) तकनीकों का उपयोग किया जाता है जो डेटाबेस लॉग को पढ़ते हैं, या मैसेजिंग क्यू (जैसे Kafka) का उपयोग किया जाता है जहां स्रोत एप्लिकेशन इवेंट्स प्रकाशित करते हैं।
रियल-टाइम डेटा इंटीग्रेशन: डेटा को एक स्ट्रीम प्रोसेसिंग इंजन में भेजा जाता है जो इसे फ्लाई पर रूपांतरित और समृद्ध करता है। यह पारंपरिक बैच ETL के विपरीत है।
डेटा स्टोरेज और सर्विंग: रूपांतरित डेटा को एक ऐसे वेयरहाउस में लोड किया जाता है जो तेजी से लिखने और पढ़ने दोनों का समर्थन करता है। यह अक्सर पारंपरिक वेयरहाउस और एक तेज, रियल-टाइम लेयर का एक हाइब्रिड होता है।
एनालिटिक्स और विज़ुअलाइज़ेशन: उपयोगकर्ता रियल-टाइम डैशबोर्ड और BI टूल्स के माध्यम से नवीनतम डेटा तक पहुँच प्राप्त करते हैं, जो वेयरहाउस से लगातार अपडेट होते हैं।

RTDW के उपयोग के उदाहरण:

धोखाधड़ी का पता लगाना (Fraud Detection): वित्तीय संस्थान क्रेडिट कार्ड लेनदेन को वास्तविक समय में विश्लेषण करने के लिए RTDW का उपयोग करते हैं ताकि धोखाधड़ी वाले पैटर्न का तुरंत पता लगाया जा सके और उन्हें ब्लॉक किया जा सके।
डायनामिक प्राइसिंग (Dynamic Pricing): ई-कॉमर्स और एयरलाइन कंपनियाँ बाजार की मांग, प्रतियोगी मूल्य निर्धारण और ग्राहक व्यवहार के आधार पर वास्तविक समय में कीमतों को समायोजित करने के लिए RTDW का उपयोग करती हैं।
इन्वेंटरी प्रबंधन: खुदरा विक्रेता अपनी आपूर्ति श्रृंखला को अनुकूलित करने, स्टॉक-आउट से बचने और ओवरस्टॉकिंग को कम करने के लिए वास्तविक समय में इन्वेंट्री स्तरों की निगरानी करते हैं।
ग्राहक संबंध प्रबंधन (CRM): कॉल सेंटर के एजेंटों के पास ग्राहक के साथ बातचीत करते समय उनकी सबसे हाल की गतिविधियों और खरीद के बारे में नवीनतम जानकारी होती है, जिससे वे बेहतर सेवा प्रदान कर सकते हैं।

(b) डेटा मार्ट्स

डेटा मार्ट एक डेटा वेयरहाउस का एक सबसेट है जो एक विशिष्ट व्यावसायिक लाइन, विभाग या विषय क्षेत्र पर केंद्रित होता है। जबकि एक एंटरप्राइज डेटा वेयरहाउस पूरे संगठन के लिए डेटा संग्रहीत करता है, एक डेटा मार्ट केवल एक विशिष्ट समूह के उपयोगकर्ताओं की आवश्यकताओं को पूरा करता है। उदाहरण के लिए, एक कंपनी के पास बिक्री, विपणन और वित्त के लिए अलग-अलग डेटा मार्ट हो सकते हैं।

डेटा मार्ट बनाम डेटा वेयरहाउस:

| विशेषता | डेटा मार्ट | डेटा वेयरहाउस |
|---|---|---|
| स्कोप | विभाग-विशिष्ट (जैसे बिक्री, विपणन) | एंटरप्राइज-वाइड (पूरा संगठन) |
| विषय | एकल विषय या व्यावसायिक प्रक्रिया | अनेक विषय |
| डेटा स्रोत | कुछ स्रोत, या केंद्रीय डेटा वेयरहाउस से | अनेक, विविध स्रोत |
| आकार | छोटा (<100 GB) | बड़ा (100 GB से कई टेराबाइट्स या पेटाबाइट्स) |
| उपयोगकर्ता | एक विभाग के भीतर सीमित उपयोगकर्ता | पूरे संगठन में अनेक उपयोगकर्ता |
| विकास समय | कम (कुछ महीने) | लंबा (कई महीने से लेकर वर्षों तक) |

डेटा मार्ट्स का डिज़ाइन:

डेटा मार्ट्स को डिजाइन करने के दो मुख्य दृष्टिकोण हैं:

टॉप-डाउन दृष्टिकोण (Dependent Data Marts):
- इस दृष्टिकोण में, पहले एक एंटरप्राइज-वाइड डेटा वेयरहाउस बनाया जाता है।
- फिर, इस केंद्रीय वेयरहाउस से डेटा निकालकर विशिष्ट डेटा मार्ट बनाए जाते हैं।
- इन डेटा मार्ट्स को आश्रित (dependent) कहा जाता है क्योंकि वे केंद्रीय डेटा वेयरहाउस पर निर्भर करते हैं।
- लाभ: यह डेटा की संगति और एकीकरण सुनिश्चित करता है। सभी विभाग “सत्य के एक ही संस्करण” से काम कर रहे होते हैं।
बॉटम-अप दृष्टिकोण (Independent Data Marts):
- इस दृष्टिकोण में, डेटा मार्ट्स को सीधे ऑपरेशनल स्रोतों से डेटा लेकर स्वतंत्र रूप से बनाया जाता है, बिना किसी केंद्रीय डेटा वेयरहाउस के।
- प्रत्येक विभाग अपनी जरूरतों के अनुसार अपना डेटा मार्ट बनाता है।
- लाभ: तेजी से कार्यान्वयन और प्रारंभिक लागत कम होती है।
- नुकसान: इससे “डेटा साइलो” बन सकते हैं, जहां विभिन्न डेटा मार्ट्स में असंगत डेटा और परिभाषाएं होती हैं, जिससे एंटरप्राइज-वाइड विश्लेषण करना मुश्किल हो जाता है।

आमतौर पर, डेटा मार्ट के भीतर डेटा को स्टार स्कीमा का उपयोग करके संरचित किया जाता है, क्योंकि यह सरल, समझने में आसान होता है और विशिष्ट विभाग की विश्लेषणात्मक आवश्यकताओं के लिए कुशल क्वेरी प्रदर्शन प्रदान करता है।

Q5. निम्नलिखित पर संक्षिप्त नोट्स लिखें: 4×5=20 (a) डेटा लेक (b) स्टार स्कीमा (c) मार्केट बास्केट एनालिसिस (d) क्लाउड डेटा वेयरहाउस

Ans.

(a) डेटा लेक (Data Lake)

एक डेटा लेक एक केंद्रीकृत भंडार है जो आपको अपने सभी संरचित (structured), अर्ध-संरचित (semi-structured), और असंरचित (unstructured) डेटा को किसी भी पैमाने पर संग्रहीत करने की अनुमति देता है। डेटा वेयरहाउस के विपरीत, जहां डेटा को संग्रहीत करने से पहले एक पूर्वनिर्धारित स्कीमा में रूपांतरित किया जाता है, डेटा लेक डेटा को उसके मूल, कच्चे प्रारूप में संग्रहीत करता है। इस दृष्टिकोण को “स्कीमा-ऑन-रीड” कहा जाता है, जिसका अर्थ है कि डेटा को केवल तभी संरचित और रूपांतरित किया जाता है जब विश्लेषण के लिए उसकी आवश्यकता होती है। यह अधिक लचीलापन प्रदान करता है और डेटा वैज्ञानिकों को मशीन लर्निंग और उन्नत एनालिटिक्स जैसे कार्यों के लिए विविध डेटासेट का उपयोग करने की अनुमति देता है। डेटा लेक अक्सर Hadoop (HDFS) या क्लाउड स्टोरेज (जैसे Amazon S3, Azure Blob Storage) पर बनाए जाते हैं।

(b) स्टार स्कीमा (Star Schema)

स्टार स्कीमा डेटा वेयरहाउस में डायमेंशनल मॉडलिंग का सबसे सरल और सबसे आम रूप है। इसकी संरचना एक तारे जैसी दिखती है, जिसमें केंद्र में एक बड़ी फैक्ट टेबल होती है जो कई छोटी डायमेंशन टेबलों से घिरी होती है।

फैक्ट टेबल: इसमें संख्यात्मक व्यावसायिक माप (measures) होते हैं, जैसे ‘बिक्री राशि’, ‘बेची गई इकाइयाँ’, और डायमेंशन टेबलों के लिए फॉरेन की (foreign keys)।
डायमेंशन टेबल: इसमें वर्णनात्मक, विशेषता-आधारित जानकारी (attributes) होती है, जैसे ‘उत्पाद का नाम’, ‘शहर’, ‘तारीख’। ये टेबल डी-नॉर्मलाइज्ड होती हैं, जिसका अर्थ है कि वे अनावश्यकता को स्वीकार करती हैं ताकि क्वेरी प्रदर्शन में सुधार हो सके।

स्टार स्कीमा को समझना आसान होता है और यह BI टूल्स द्वारा उत्पन्न प्रश्नों के लिए तेज़ प्रदर्शन प्रदान करता है क्योंकि इसमें कम जॉइन की आवश्यकता होती है।

(c) मार्केट बास्केट एनालिसिस (Market Basket Analysis)

मार्केट बास्केट एनालिसिस एक डेटा माइनिंग तकनीक है जिसका उपयोग यह पता लगाने के लिए किया जाता है कि ग्राहक की खरीदारी में कौन सी वस्तुएँ अक्सर एक साथ खरीदी जाती हैं। यह रिटेल उद्योग में बहुत लोकप्रिय है और इसका लक्ष्य वस्तुओं के बीच एसोसिएशन रूल्स (association rules) को खोजना है। एक क्लासिक उदाहरण है “जो ग्राहक ब्रेड खरीदते हैं, वे मक्खन भी खरीदते हैं”। इस विश्लेषण के परिणाम का उपयोग स्टोर लेआउट को अनुकूलित करने, क्रॉस-सेलिंग और अप-सेलिंग अवसरों की पहचान करने और लक्षित विपणन अभियान बनाने के लिए किया जा सकता है। इस विश्लेषण को करने के लिए सबसे प्रसिद्ध एल्गोरिदम में से एक Apriori एल्गोरिदम है, जो नियमों को खोजने के लिए सपोर्ट, कॉन्फिडेंस और लिफ्ट जैसे मापदंडों का उपयोग करता है।

(d) क्लाउड डेटा वेयरहाउस (Cloud Data Warehouse)

एक क्लाउड डेटा वेयरहाउस एक डेटाबेस है जिसे क्लाउड कंप्यूटिंग प्लेटफॉर्म पर होस्ट किया जाता है और “सेवा के रूप में” (as-a-service) प्रदान किया जाता है। पारंपरिक ऑन-प्रिमाइसेस डेटा वेयरहाउस के विपरीत, जिन्हें महंगे हार्डवेयर और रखरखाव की आवश्यकता होती है, क्लाउड डेटा वेयरहाउस प्रदाता द्वारा पूरी तरह से प्रबंधित किए जाते हैं। प्रमुख उदाहरणों में Amazon Redshift, Google BigQuery, और Snowflake शामिल हैं।

मुख्य लाभ:

स्केलेबिलिटी: आवश्यकतानुसार कंप्यूट और स्टोरेज संसाधनों को आसानी से बढ़ा या घटा सकते हैं।
लागत-प्रभावशीलता: आप केवल उपयोग किए गए संसाधनों के लिए भुगतान करते हैं (pay-as-you-go), जिससे अग्रिम पूंजी निवेश कम हो जाता है।
प्रदर्शन: बड़े पैमाने पर समानांतर प्रसंस्करण (MPP) आर्किटेक्चर का उपयोग करके जटिल प्रश्नों को तेजी से निष्पादित करने के लिए डिज़ाइन किया गया है।
प्रबंधित सेवा: प्रदाता हार्डवेयर प्रबंधन, सॉफ्टवेयर अपडेट और सुरक्षा का ध्यान रखता है, जिससे आईटी टीमों पर बोझ कम होता है।

IGNOU MCS-221 Previous Year Solved Question Paper in English

Q1. (a) With the help of a block diagram, explain the Extract, Transform and Load (ETL) process of a Data Warehouse. Also, discuss how it is different from the ELT process. (b) With the help of an example, explain the Binning method for solving the problem of noisy data. (c) Discuss the following text transformation technique that helps convert text sentences into numeric vectors: “Bag-of-Words (BoW)”. Also give an example. Mention the drawbacks of using BoW. (d) Write and explain the K-Nearest Neighbours algorithm. Also mention its advantages and disadvantages.

Ans.

(a) Extract, Transform, and Load (ETL) Process

ETL is a data integration process that collects data from multiple sources, converts it into a consistent and usable format, and then loads it into a destination system, typically a Data Warehouse. It is a fundamental component of data warehousing.

Stages of the ETL Process:

Extract: In this phase, data is extracted from various source systems. These sources can be relational databases (e.g., Oracle, SQL Server), flat files (e.g., CSV, XML), NoSQL databases, or APIs. The data is extracted efficiently with minimal impact on the source systems.
Transform: This is the most complex stage. The extracted raw data is cleaned, validated, and transformed. Common transformations include:
- Cleaning: Fixing or removing inconsistent data (e.g., standardizing “M”, “Male” to “Male”).
- Filtering: Selecting only the required data.
- Aggregation: Summarizing data (e.g., aggregating daily sales into monthly sales).
- Joining: Combining related data from multiple sources.
This transformation takes place in a separate staging area to avoid burdening the source and destination systems.
Load: The transformed data is loaded into the final destination, the Data Warehouse. Loading can be done in two ways:
- Full Load: All data is loaded into the warehouse, typically for the first time.
- Incremental Load: Only new or changed data is loaded, which makes the process faster and more efficient.

Block Diagram:

[Source 1, Source 2, …] –> [ Extraction Engine ] –> [ Staging Area ] –> [ Transformation Engine (Cleaning, Aggregation, etc.)] –> [ Loading Engine ] –> [ Data Warehouse ]
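
The three stages can be sketched end to end with Python's standard library. This is an illustration only: the CSV content, column names, table name, and function names are all hypothetical, an in-memory SQLite database stands in for the data warehouse, and plain Python objects stand in for the staging area.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical source extract: daily sales as CSV text.
SOURCE_CSV = """date,customer_gender,amount
2024-01-05,M,100
2024-01-20,Male,50
2024-02-03,F,70
2024-02-10,,30
"""

GENDER_MAP = {"M": "Male", "Male": "Male", "F": "Female", "Female": "Female"}


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean, filter and aggregate outside the warehouse."""
    monthly = defaultdict(float)
    for row in rows:
        gender = GENDER_MAP.get(row["customer_gender"])   # cleaning
        if gender is None:                                 # filtering out bad rows
            continue
        month = row["date"][:7]                            # e.g. "2024-01"
        monthly[(month, gender)] += float(row["amount"])   # daily -> monthly
    return [(m, g, total) for (m, g), total in sorted(monthly.items())]


def load(records, conn):
    """Load: write only the transformed records into the warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monthly_sales (month TEXT, gender TEXT, total REAL)")
    conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), warehouse)
loaded = warehouse.execute(
    "SELECT * FROM monthly_sales ORDER BY month, gender").fetchall()
print(loaded)  # [('2024-01', 'Male', 150.0), ('2024-02', 'Female', 70.0)]
```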

ETL vs. ELT

ELT (Extract, Load, Transform) is an alternative approach where data is first extracted, then loaded directly into the destination system (like a cloud data warehouse), and then transformed using the processing power of that destination system.

Key Differences:

Location of Transformation: In ETL, transformation happens in a separate staging server. In ELT, transformation happens within the target data warehouse.
Data Loading: ETL loads only transformed and often structured data. ELT allows for the loading of raw data.
Suitability: ETL is better for traditional, on-premises data warehouses. ELT is ideal for cloud-based, scalable data warehouses and data lakes that offer vast computing power.
Performance: ELT can be faster for large datasets as it leverages the parallel processing capabilities of the target system.

(b) Binning Method for Noisy Data

Noisy data contains meaningless data, errors, or outliers. It can be caused by measurement errors, data entry problems, or other issues. Binning is a data smoothing technique used to minimize the effect of noisy data. It involves partitioning data values into a small number of discrete ranges or “bins.”

Steps of the Binning Method:

Sorting: First, sort the data values in ascending order.
Partitioning: Partition the sorted data into a number of “bins”, each holding approximately the same number of values (equal-depth, or equal-frequency, partitioning).
Smoothing: Replace the values in each bin with a representative value. Common techniques are:
- Smoothing by bin means: Replace all values in each bin with the bin’s average (mean).
- Smoothing by bin medians: Replace all values in each bin with the bin’s median.
- Smoothing by bin boundaries: Replace the values in each bin with the nearest boundary value (minimum or maximum).

Example:

Suppose we have the following noisy data for age: 4, 8, 15, 21, 21, 24, 25, 28, 34

Steps 1 & 2: The data is already sorted. Let’s partition it into 3 bins (depth of 3):

Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Step 3: Smoothing

Smoothing by bin means:
- Mean of Bin 1: (4+8+15)/3 = 9. So Bin 1 becomes: 9, 9, 9
- Mean of Bin 2: (21+21+24)/3 = 22. So Bin 2 becomes: 22, 22, 22
- Mean of Bin 3: (25+28+34)/3 = 29. So Bin 3 becomes: 29, 29, 29
Smoothed data: 9, 9, 9, 22, 22, 22, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 8, 15 -> 4, 4, 15 (8 is closer to 4 than 15)
- Bin 2: 21, 21, 24 -> 21, 21, 24 (no change, as every value is already a boundary value)
- Bin 3: 25, 28, 34 -> 25, 25, 34 (28 is closer to 25 than 34)
Smoothed data: 4, 4, 15, 21, 21, 24, 25, 25, 34
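
The worked example can be reproduced with a short Python sketch. The function names are illustrative, and sending ties to the lower boundary is an assumption (no tie occurs in this data).

```python
def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Assumption: a value exactly halfway goes to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(ages, 3)
print(smooth_by_means(bins))       # bin means 9, 22, 29
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```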

(c) Bag-of-Words (BoW)

Bag-of-Words (BoW) is a text representation technique used in Natural Language Processing (NLP). It converts a piece of text (like a sentence or a document) into a numerical vector. This model disregards the order or grammar of words in the text but maintains their multiplicity (frequency). This makes it a “bag” of words, where the relationship between words is lost.

Process:

Vocabulary Creation: Create a list of all unique words from the entire corpus (collection of all documents).
Vector Creation: For each document, create a vector that is the size of the vocabulary. Each vector element represents the frequency (or presence) of the corresponding word in that document.

Example:

Consider the following two sentences: S1: “The cat sat on the mat.” S2: “The dog ate the cat.”

Step 1: Vocabulary Creation (after removing the common word ‘the’ and normalizing case). Vocabulary: {cat, sat, on, mat, dog, ate}

Step 2: Vector Creation. Each sentence is now represented by a vector based on this vocabulary:

S1 vector: [1, 1, 1, 1, 0, 0] (cat:1, sat:1, on:1, mat:1, dog:0, ate:0)
S2 vector: [1, 0, 0, 0, 1, 1] (cat:1, sat:0, on:0, mat:0, dog:1, ate:1)

These vectors can now be used as input for machine learning models.
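
A minimal Python sketch of these two steps follows. It assumes, as in the example, that only “the” is removed as a stop word and that the vocabulary keeps words in order of first appearance; the function names are illustrative.

```python
import re

STOP_WORDS = {"the"}  # assumption: only "the" is dropped, as in the example

def tokenize(sentence):
    """Lowercase the sentence, keep alphabetic words, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def build_vocabulary(sentences):
    """Collect unique words in order of first appearance across the corpus."""
    vocab = []
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in vocab:
                vocab.append(word)
    return vocab

def vectorize(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = tokenize(sentence)
    return [tokens.count(word) for word in vocab]

s1 = "The cat sat on the mat."
s2 = "The dog ate the cat."
vocab = build_vocabulary([s1, s2])
print(vocab)                 # ['cat', 'sat', 'on', 'mat', 'dog', 'ate']
print(vectorize(s1, vocab))  # [1, 1, 1, 1, 0, 0]
print(vectorize(s2, vocab))  # [1, 0, 0, 0, 1, 1]
```

A word that is not in `vocab` simply contributes nothing to the vector, which is the out-of-vocabulary limitation in practice.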

Drawbacks of BoW:

Loss of Semantics and Context: BoW loses the relationship between words and the meaning of the sentence. For example, “man bites dog” and “dog bites man” have identical BoW representations, while their meanings are very different.
Ignores Word Order: Since it treats words as a “bag,” it completely ignores grammar and syntax.
High Dimensionality: With a large vocabulary, the resulting vectors become very large and sparse (most elements are zero), which can lead to computational problems.
Out-of-Vocabulary (OOV) Words: The model cannot handle words that were not in the vocabulary during training.

(d) K-Nearest Neighbours (KNN) Algorithm

K-Nearest Neighbours (KNN) is a supervised machine learning algorithm that can be used for both classification and regression tasks. It is a non-parametric and lazy algorithm.

Non-parametric: This means it makes no assumptions about the underlying data distribution.
Lazy: This means it does not build a model during the training phase. It simply stores the training dataset and performs all calculations at the time of prediction.

The Algorithm:

To classify a new, unseen data point, KNN takes the following steps:

Choose a value for K: Select an integer K, which is the number of neighbours to be considered. This is a hyperparameter.
Calculate Distance: Calculate the distance between the new data point and every data point in the training dataset. Common distance metrics include Euclidean distance, Manhattan distance, or Minkowski distance.
Identify K-Nearest Neighbours: Select the K data points (neighbours) with the smallest calculated distances.
Make a Prediction:
- For Classification: Assign the new data point to the class that is most common among the K neighbours (majority vote).
- For Regression: The value for the new data point is the average or median of the values of the K neighbours.
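
The steps above can be sketched in plain Python. This is an illustrative version that assumes Euclidean distance and a simple mean for regression; the function names and the sample data are made up.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, query, k):
    """Return the k training items (features, target) closest to the query."""
    return sorted(train, key=lambda item: euclidean(item[0], query))[:k]

def knn_classify(train, query, k):
    """Classification: majority vote among the k nearest neighbours."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Regression: mean of the k nearest neighbours' target values."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
         ((6.0, 6.0), "B"), ((7.0, 7.0), "B"), ((6.5, 8.0), "B")]
print(knn_classify(train, (2.0, 2.0), k=3))  # A
print(knn_classify(train, (6.0, 7.0), k=3))  # B
```

Note that every prediction sorts the whole training set by distance, which is where the prediction-time cost of KNN comes from.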

Advantages of KNN:

Simplicity: The algorithm is very simple to understand and implement.
No Training Phase: As it’s a lazy algorithm, it requires no training time to build a model.
Flexibility: It can easily be adapted for multi-class classification and regression.
Good for Non-linear Data: It can form complex decision boundaries and works well with non-linear data.

Disadvantages of KNN:

High Computational Cost: Prediction time is long because it needs to compute distances to the entire training set for each prediction.
Memory Intensive: It needs to store the entire training dataset in memory.
Sensitivity to the value of K: Performance is highly dependent on the value of K. A very small K is sensitive to noise, and a very large K can blur the boundary between classes.
Curse of Dimensionality: It does not perform well with high-dimensional data as distances become less meaningful.
Sensitive to Feature Scaling: Features with different scales need to be normalized, or else features with larger ranges will dominate the distance calculation.
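
The effect of feature scaling can be seen in a small sketch. The ages, incomes, and labels below are invented for illustration, and min-max normalization is only one of several possible scaling choices.

```python
import math

def min_max_scale(rows):
    """Build a function that rescales each feature to [0, 1] using the ranges in `rows`."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(row):
        return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                     for v, lo, hi in zip(row, lows, highs))
    return scale

def nearest_label(points, labels, query):
    """Label of the single nearest point (1-NN with Euclidean distance)."""
    distances = [math.dist(p, query) for p in points]
    return labels[distances.index(min(distances))]

# Hypothetical (age, income) data: income has a far larger numeric range than age.
points = [(25, 40000), (30, 90000), (60, 42000), (65, 88000)]
labels = ["young", "young", "senior", "senior"]
query = (28, 88500)

raw_prediction = nearest_label(points, labels, query)
scale = min_max_scale(points)
scaled_prediction = nearest_label([scale(p) for p in points], labels, scale(query))
print(raw_prediction, scaled_prediction)  # senior young
```

Without scaling, the income difference decides the nearest neighbour almost on its own; after scaling, the 28-year-old is matched with the 30-year-old.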

Q2. (a) What is clustering technique in Data Mining? Give an example. Explain the following clustering methods briefly: (i) Density-based method (ii) Constraint-based method (b) Explain briefly the following text preprocessing stages used in a typical NLP system: (i) Segmentation (ii) Tokenization (iii) Removal of stop words (iv) Stemming

Ans.

(a) Clustering Technique

In data mining, clustering is an unsupervised learning technique. Its objective is to partition a set of data points into several groups or “clusters”, such that data points in the same cluster are more similar to each other than to those in other clusters. Similarity is typically measured based on a distance metric, like Euclidean distance. In clustering, there are no predefined labels for the data; the algorithm discovers the natural groupings in the data on its own.

Example: A marketing team can use clustering to segment customers into different groups based on their behaviour. For instance, one cluster might be “high-spending, frequent visitors,” while another could be “budget-conscious, occasional shoppers.” This helps companies to create targeted marketing campaigns for each group.

Clustering Methods:

(i) Density-based method: Density-based clustering methods form clusters based on the density of data points. These methods assume that clusters are regions of high density separated by regions of low density. They can discover clusters of arbitrary shapes and are very effective at handling noise (outliers).

Core Idea: A point is part of a cluster if its neighborhood contains a certain number of other points (a minimum density).
Example Algorithm: DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN requires two parameters: `epsilon` (the radius of a point’s neighborhood) and `min_points` (the minimum number of points required within the `epsilon` radius for a point to be considered a core point). It identifies core points, border points, and noise points, allowing it to form non-spherical clusters.
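
A compact Python sketch of DBSCAN is given here for illustration. It assumes Euclidean distance and counts a point as part of its own neighborhood; `epsilon` and `min_points` follow the parameter names used above, and the sample points are made up.

```python
import math

NOISE = -1

def region_query(points, i, epsilon):
    """Indices of all points within `epsilon` of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= epsilon]

def dbscan(points, epsilon, min_points):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, epsilon)
        if len(neighbours) < min_points:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        labels[i] = cluster              # i is a core point: start a new cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, epsilon)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)   # j is also a core point: keep growing
        cluster += 1
    return labels

points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (5, 15)]
print(dbscan(points, epsilon=1.5, min_points=3))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```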

(ii) Constraint-based method: Constraint-based clustering is an extension of traditional clustering that incorporates domain knowledge or user preferences into the clustering process. These constraints guide how the algorithm groups the data, making the results more relevant and useful.

Types of Constraints:
- Must-link: Specifies that two data points must belong to the same cluster. Example: Two different customer records belonging to the same person.
- Cannot-link: Specifies that two data points must belong to different clusters. Example: Data related to competing companies.
- Size or Cardinality Constraints: Limits on the number of clusters or on the number of points allowed in each cluster.
Usage: It is useful when purely data-driven clustering yields results that are counterintuitive to business logic or known facts. By including constraints, the final clustering aligns better with user expectations.
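
As an illustration, the following Python sketch checks a finished cluster assignment against such constraints. It is a validation helper under assumed names, not a constrained clustering algorithm, and the record identifiers are hypothetical.

```python
from collections import Counter

def violated_constraints(assignment, must_link=(), cannot_link=(), max_size=None):
    """List every constraint that the given point -> cluster assignment breaks."""
    violations = []
    for a, b in must_link:
        if assignment[a] != assignment[b]:
            violations.append(("must-link", a, b))
    for a, b in cannot_link:
        if assignment[a] == assignment[b]:
            violations.append(("cannot-link", a, b))
    if max_size is not None:
        for cluster, size in Counter(assignment.values()).items():
            if size > max_size:
                violations.append(("size", cluster, size))
    return violations

# Hypothetical assignment of records to cluster ids.
assignment = {"cust_1": 0, "cust_1_dup": 1, "acme": 2, "rival": 2, "cust_2": 0}
problems = violated_constraints(
    assignment,
    must_link=[("cust_1", "cust_1_dup")],   # same person, must share a cluster
    cannot_link=[("acme", "rival")],        # competitors, must be kept apart
)
print(problems)
```

A constraint-based algorithm would consult checks of this kind while assigning points, instead of after the fact.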

(b) Text Preprocessing Stages

Text preprocessing is a critical first step in any Natural Language Processing (NLP) system. Its purpose is to clean and convert raw text data into a format that is suitable for a machine learning model. Here are some common stages:

(i) Segmentation: This is the process of breaking down a large body of text into smaller, meaningful units. Typically, this means dividing a document into individual sentences. Sentence boundary detection is important because sentences often represent a complete unit of thought. It relies on punctuation marks such as periods (.), question marks (?), and exclamation marks (!), but a period does not always end a sentence, so abbreviations such as “Dr.” and “U.S.” must be recognized and not treated as boundaries. Example: “Dr. Smith lives in the U.S. Isn’t that interesting?” would be segmented into two sentences: “Dr. Smith lives in the U.S.” and “Isn’t that interesting?”.

(ii) Tokenization: Following segmentation, tokenization breaks each sentence down into individual words or “tokens”. These tokens become the basic units of analysis. Tokenization is usually done based on spaces and punctuation. Example: The sentence “The cat sat on the mat” would be tokenized into: [“The”, “cat”, “sat”, “on”, “the”, “mat”].

(iii) Removal of stop words: Stop words are very common words that often add little to no information to the meaning of a sentence (e.g., “a”, “an”, “the”, “in”, “is”). Removing these words reduces the size of the data and helps the model to focus on more meaningful words. Each NLP library has a predefined list of stop words, which can be customized for a specific application. Example: After removing stop words (“The”, “on”, “the”) from the list of tokens [“The”, “cat”, “sat”, “on”, “the”, “mat”], it would become: [“cat”, “sat”, “mat”].

(iv) Stemming: Stemming is the process of reducing a word to its root or “stem” form. The goal is to map related word forms (e.g., “running”, “runs”) to the same token (“run”). This reduces the size of the vocabulary and ensures that words with the same core concept are treated as the same. Stemming is a rule-based process that chops suffixes off the end of a word, which can sometimes result in stems that are not actual words (e.g., “studies” -> “studi”) and cannot handle irregular forms such as “ran”. A more sophisticated technique is Lemmatization, which considers the meaning of the word to reduce it to its dictionary form (lemma), so that “ran” becomes “run”.

Example: Using stemming, the words “computing”, “computer”, “computed” might all be reduced to “comput”.
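
The four stages can be chained in a small Python sketch. Everything here is a deliberately naive stand-in: the regular expressions, the stop-word list, and the suffix rules are assumptions for illustration, and real systems use an NLP library's segmenter and an established stemmer such as Porter's.

```python
import re

STOP_WORDS = {"a", "an", "the", "in", "is", "on"}  # small illustrative list
SUFFIXES = ("ing", "ed", "es", "s")                # toy stemming rules, not Porter's

def segment(text):
    """Split after '.', '?' or '!' followed by whitespace.

    Naive on purpose: abbreviations such as "Dr." would be split incorrectly.
    """
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

def tokenize(sentence):
    """Keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[A-Za-z']+", sentence)

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Run all four stages and return one list of stems per sentence."""
    return [[stem(t) for t in remove_stop_words(tokenize(s))] for s in segment(text)]

print(preprocess("The cat sat on the mat. Is the dog jumping?"))
# [['cat', 'sat', 'mat'], ['dog', 'jump']]
```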

Q3. (a) Explain snowflake and fact constellation schemas of dimensional modeling in a Data Warehouse with the help of an example for each. (b) Describe briefly the following types of OLAP architectures along with a block diagram for each: (i) ROLAP architecture (ii) MOLAP architecture

Ans.

(a) Snowflake and Fact Constellation Schemas

Dimensional modeling is a database design technique used in data warehouses. Its goal is to structure data in a way that is easy to understand and performant for queries.

1. Snowflake Schema

The snowflake schema is a logical extension of the star schema. In a star schema, a central fact table is directly linked to several dimension tables. In a snowflake schema, some of the dimension tables are normalized, breaking them down into multiple, related tables. This reduces redundancy and improves data integrity, but it requires more joins in queries, which can impact performance. Its structure resembles a snowflake, hence the name.

Example: Consider a sales data warehouse.

Fact Table: Sales (TransactionID, ProductKey, LocationKey, TimeKey, UnitsSold, Revenue)
Dimension Tables (in a Star Schema):
- DimProduct (ProductKey, ProductName, Category)
- DimLocation (LocationKey, Street, City, State, Country)
- DimTime (TimeKey, Day, Month, Year)

In a Snowflake Schema, the DimLocation table could be normalized:

DimLocation (LocationKey, Street, CityKey)
DimCity (CityKey, CityName, StateKey)
DimState (StateKey, StateName, CountryName)

Thus, the Sales fact table joins to DimLocation, which joins to DimCity, which joins to DimState, creating a snowflake-like structure.
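
The extra joins can be seen in a small sketch that uses Python's built-in `sqlite3` module. The table and column names follow the example above, but the Sales table is reduced to the location-related columns and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimState (StateKey INTEGER PRIMARY KEY, StateName TEXT, CountryName TEXT);
CREATE TABLE DimCity (CityKey INTEGER PRIMARY KEY, CityName TEXT,
                      StateKey INTEGER REFERENCES DimState);
CREATE TABLE DimLocation (LocationKey INTEGER PRIMARY KEY, Street TEXT,
                          CityKey INTEGER REFERENCES DimCity);
-- Simplified fact table: ProductKey and TimeKey are omitted in this sketch.
CREATE TABLE Sales (TransactionID INTEGER PRIMARY KEY,
                    LocationKey INTEGER REFERENCES DimLocation,
                    UnitsSold INTEGER, Revenue REAL);

INSERT INTO DimState VALUES (1, 'Maharashtra', 'India'), (2, 'Karnataka', 'India');
INSERT INTO DimCity VALUES (10, 'Mumbai', 1), (11, 'Pune', 1), (12, 'Bengaluru', 2);
INSERT INTO DimLocation VALUES (100, 'MG Road', 10), (101, 'FC Road', 11),
                               (102, 'Brigade Road', 12);
INSERT INTO Sales VALUES (1, 100, 2, 200.0), (2, 101, 1, 150.0), (3, 102, 4, 400.0);
""")

# Revenue by state needs three joins in the snowflake design.
revenue_by_state = conn.execute("""
    SELECT st.StateName, SUM(s.Revenue)
    FROM Sales s
    JOIN DimLocation l ON s.LocationKey = l.LocationKey
    JOIN DimCity c     ON l.CityKey = c.CityKey
    JOIN DimState st   ON c.StateKey = st.StateKey
    GROUP BY st.StateName
    ORDER BY st.StateName
""").fetchall()
print(revenue_by_state)  # [('Karnataka', 400.0), ('Maharashtra', 350.0)]
```

With the de-normalized star-schema DimLocation, the same question would need only one join, at the cost of repeating state and country names on every location row.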

2. Fact Constellation Schema

A fact constellation schema, also known as a galaxy schema, contains multiple fact tables that share some dimension tables. It’s a more complex structure that allows multiple business processes to be integrated into a single data warehouse model. Essentially, it is a collection of multiple star schemas that share dimension tables.

Example: Consider a model for a retail company that wants to track both sales and shipping.

Fact Table 1 (Sales): SalesFact (TimeKey, ProductKey, StoreKey, UnitsSold, Revenue)
Fact Table 2 (Shipping): ShippingFact (TimeKey, ProductKey, ShipperKey, UnitsShipped, ShippingCost)

Shared Dimension Tables:

DimTime (TimeKey, Day, Month, Year)
DimProduct (ProductKey, ProductName, Brand)

Other Dimension Tables:

DimStore (StoreKey, StoreName, City) – connected only to SalesFact.
DimShipper (ShipperKey, ShipperName) – connected only to ShippingFact.

In this model, both SalesFact and ShippingFact share the DimTime and DimProduct dimensions. This allows analysts to ask questions that cross both business processes, such as “How many units of a specific product were sold and how many were shipped last month?”.
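
A cross-process question of this kind can be sketched with Python's built-in `sqlite3` module. The tables follow the example above, but DimTime is left out and TimeKey is assumed to encode the month as YYYYMM; all rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, ProductName TEXT, Brand TEXT);
CREATE TABLE SalesFact (TimeKey INTEGER, ProductKey INTEGER, StoreKey INTEGER,
                        UnitsSold INTEGER, Revenue REAL);
CREATE TABLE ShippingFact (TimeKey INTEGER, ProductKey INTEGER, ShipperKey INTEGER,
                           UnitsShipped INTEGER, ShippingCost REAL);

INSERT INTO DimProduct VALUES (1, 'Pen', 'Acme'), (2, 'Notebook', 'Acme');
INSERT INTO SalesFact VALUES (202401, 1, 1, 10, 50.0), (202401, 1, 2, 5, 25.0),
                             (202401, 2, 1, 3, 90.0);
INSERT INTO ShippingFact VALUES (202401, 1, 1, 12, 6.0), (202401, 2, 1, 3, 4.0);
""")

# Each fact table is aggregated separately and lined up on the shared dimension.
QUERY = """
SELECT p.ProductName,
       (SELECT COALESCE(SUM(s.UnitsSold), 0) FROM SalesFact s
         WHERE s.ProductKey = p.ProductKey AND s.TimeKey = :month) AS sold,
       (SELECT COALESCE(SUM(sh.UnitsShipped), 0) FROM ShippingFact sh
         WHERE sh.ProductKey = p.ProductKey AND sh.TimeKey = :month) AS shipped
FROM DimProduct p
ORDER BY p.ProductName
"""
sold_vs_shipped = conn.execute(QUERY, {"month": 202401}).fetchall()
print(sold_vs_shipped)  # [('Notebook', 3, 3), ('Pen', 15, 12)]
```

The two fact tables are deliberately not joined to each other: a direct join on ProductKey would pair every sales row with every shipping row and inflate both totals.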

(b) OLAP Architectures

Online Analytical Processing (OLAP) is a technology that enables users to analyze data from multiple dimensions. OLAP systems are categorized based on their underlying data storage architecture.

(i) ROLAP (Relational OLAP) Architecture

In a ROLAP architecture, data is stored in a standard Relational Database Management System (RDBMS). The OLAP server acts as an intermediary layer on top of the relational database, translating the user’s multidimensional queries into standard SQL queries. Summary and aggregate tables are often pre-calculated and stored to improve performance, but the detailed data remains in the relational format.

Block Diagram: [ User Interface ] <–> [ ROLAP Server (Metadata Manager, Query Translator)] <–> [ Relational Database (RDBMS), which holds the Data Warehouse (Fact and Dimension Tables)]

Characteristics:

Scalability: Can handle large volumes of data due to the capabilities of RDBMS.
Data Storage: No need to copy data separately, reducing data redundancy.
Performance: Can be slower for complex queries, as any aggregation not covered by a pre-calculated summary table is computed at run-time.
Flexibility: More flexible due to the use of SQL.
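
The query-translation role of a ROLAP server can be sketched as follows. This is a toy translator under assumed names (`to_sql`, a `sales` table with `region`, `year`, and `revenue` columns); a real server would also consult metadata and any available summary tables.

```python
import sqlite3

def to_sql(measure, dimensions, table="sales"):
    """Translate a 'measure by dimensions' request into a GROUP BY query.

    Identifiers are interpolated directly, so they must come from trusted
    metadata, never from raw user input.
    """
    dims = ", ".join(dimensions)
    return (f"SELECT {dims}, SUM({measure}) FROM {table} "
            f"GROUP BY {dims} ORDER BY {dims}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("North", 2023, 100.0), ("North", 2024, 150.0),
    ("South", 2023, 80.0), ("South", 2023, 20.0),
])

# The same relational table answers both requests; aggregation happens at query time.
by_region = conn.execute(to_sql("revenue", ["region"])).fetchall()
by_region_year = conn.execute(to_sql("revenue", ["region", "year"])).fetchall()
print(by_region)       # [('North', 250.0), ('South', 100.0)]
print(by_region_year)  # [('North', 2023, 100.0), ('North', 2024, 150.0), ('South', 2023, 100.0)]
```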

(ii) MOLAP (Multidimensional OLAP) Architecture

In a MOLAP architecture, data is stored in a proprietary multidimensional database (MDDB) or “cube”. This cube is a pre-aggregated, optimized view of data that is extracted from a relational data source. Since aggregations along the dimension hierarchies are calculated and stored in advance, MOLAP provides very fast query performance. Operations like slicing, dicing, and drill-down are extremely quick.

Block Diagram: [ User Interface ] <–> [ MOLAP Server (Cube Engine)] <–> [ Multidimensional Cube (MDDB) ] <– (loaded from) – [ Data Warehouse (RDBMS) ]

Characteristics:

Performance: Very fast for analysis and queries because data is pre-aggregated.
Data Storage: Requires redundant storage as data is copied and stored in the cube, increasing storage requirements.
Scalability: Less scalable than ROLAP; large cubes can be difficult to manage (the “cube explosion” problem).
Analysis Capability: Better suited for complex analytical calculations.
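
Pre-aggregation can be illustrated with a tiny in-memory “cube” in Python. The dimension names and rows are invented, and a nested dictionary stands in for a real multidimensional store.

```python
from itertools import combinations

DIMENSIONS = ("region", "year", "product")

def build_cube(rows, measure):
    """Pre-compute the summed measure for every combination of dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            cuboid = {}
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] = cuboid.get(key, 0) + row[measure]
            cube[dims] = cuboid
    return cube

rows = [
    {"region": "North", "year": 2023, "product": "Pen", "revenue": 100},
    {"region": "North", "year": 2024, "product": "Pen", "revenue": 150},
    {"region": "South", "year": 2023, "product": "Book", "revenue": 80},
    {"region": "South", "year": 2023, "product": "Pen", "revenue": 20},
]
cube = build_cube(rows, "revenue")

# Queries become dictionary lookups instead of scans over the detail rows.
print(cube[("region",)][("North",)])              # 250
print(cube[("region", "year")][("South", 2023)])  # 100
print(cube[()][()])                               # 350 (grand total)
print(len(cube))                                  # 8 cuboids for 3 dimensions
```

The number of cuboids doubles with every added dimension, which is the storage growth behind the scalability concern noted above.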

Q4. (a) Explain Real-Time Data Warehouse (RTDW) architecture along with a block diagram. Give examples where these types of data warehouses are useful. (b) What are Data Marts? How are they different from a Data Warehouse? Discuss the design of the Data Marts.

Ans.

(a) Real-Time Data Warehouse (RTDW) Architecture

A Real-Time Data Warehouse (RTDW), also called an Active Data Warehouse, is a data warehouse that receives and integrates data from operational systems almost instantly (within seconds or minutes). This is in contrast to traditional data warehouses, which are typically updated in batches (often overnight). The goal of an RTDW is to provide decision-makers with up-to-the-minute data, enabling them to react immediately to rapidly changing business conditions.

RTDW Architecture:

An RTDW architecture is designed to capture, transform, and load data from operational sources with minimal latency.

Block Diagram: [ Operational Systems (OLTP) ] –> [ Data Capture Mechanism (e.g., CDC, Message Queue)] –> [ Real-Time ETL/ELT Pipeline (Stream Processing Engines like Kafka Streams, Spark Streaming)] –> [ Real-Time Data Store ] –> [ Analytics/BI Tools & Dashboards ]

Components:

Data Capture: Data is captured from sources near-instantly. This often uses Change Data Capture (CDC) techniques that read database logs, or messaging queues (like Kafka) where source applications publish events.
Real-time Data Integration: The data is fed into a stream processing engine that transforms and enriches it on the fly. This is unlike traditional batch ETL.
Data Storage and Serving: The transformed data is loaded into a warehouse that supports both fast writes and reads. This is often a hybrid of a traditional warehouse and a faster, real-time layer.
Analytics and Visualization: Users access the latest data through real-time dashboards and BI tools that continuously query the warehouse for updates.
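
The flow of change events into a continuously updated store can be sketched in Python. The event format (`op`, `order_id`, `product`, `amount`) and the `RealTimeStore` class are hypothetical, and a plain list stands in for the CDC feed or message queue.

```python
from collections import defaultdict

class RealTimeStore:
    """Keeps a running revenue total per product, updated one event at a time."""

    def __init__(self):
        self.orders = {}                              # current state of each order
        self.revenue_by_product = defaultdict(float)  # aggregate served to dashboards

    def apply(self, event):
        op, order_id = event["op"], event["order_id"]
        # Updates and deletes first retract the old contribution of the order.
        if op in ("update", "delete") and order_id in self.orders:
            old = self.orders.pop(order_id)
            self.revenue_by_product[old["product"]] -= old["amount"]
        # Inserts and updates then add the new contribution.
        if op in ("insert", "update"):
            self.orders[order_id] = {"product": event["product"],
                                     "amount": event["amount"]}
            self.revenue_by_product[event["product"]] += event["amount"]

# Hypothetical change events captured from an operational database.
events = [
    {"op": "insert", "order_id": 1, "product": "Pen", "amount": 10.0},
    {"op": "insert", "order_id": 2, "product": "Book", "amount": 30.0},
    {"op": "update", "order_id": 1, "product": "Pen", "amount": 12.0},
    {"op": "delete", "order_id": 2},
]

store = RealTimeStore()
for event in events:
    store.apply(event)   # the aggregate is current after every single event

print(dict(store.revenue_by_product))  # {'Pen': 12.0, 'Book': 0.0}
```

A batch ETL job would instead recompute these totals from the full order table at intervals, so a dashboard would lag by up to one batch cycle.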

Examples of RTDW Usage:

Fraud Detection: Financial institutions use RTDWs to analyze credit card transactions in real time to detect and block fraudulent patterns instantly.
Dynamic Pricing: E-commerce and airline companies use RTDWs to adjust prices in real time based on market demand, competitor pricing, and customer behavior.
Inventory Management: Retailers monitor inventory levels in real-time to optimize their supply chain, avoid stock-outs, and reduce overstocking.
Customer Relationship Management (CRM): Call center agents have the latest information about a customer’s most recent activities and purchases while interacting with them, enabling better service.

(b) Data Marts

A Data Mart is a subset of a data warehouse that is focused on a specific line of business, department, or subject area. While an enterprise data warehouse stores data for the entire organization, a data mart serves the needs of a specific group of users. For example, a company might have separate data marts for Sales, Marketing, and Finance.

Data Mart vs. Data Warehouse:

Characteristic	Data Mart	Data Warehouse
Scope	Departmental (e.g., Sales, Marketing)	Enterprise-wide (Entire Organization)
Subject	Single subject or business process	Multiple subjects
Data Source	Few sources, or from a central DW	Multiple, heterogeneous sources
Size	Smaller (<100 GB)	Large (100 GB to many terabytes or petabytes)
Users	Limited users within a department	Many users across the organization
Development Time	Shorter (a few months)	Longer (many months to years)

Design of Data Marts:

There are two main approaches to designing data marts:

Top-Down Approach (Dependent Data Marts):
- In this approach, an enterprise-wide data warehouse is built first.
- Specific data marts are then created by extracting data from this central warehouse.
- These data marts are called dependent because they rely on the central data warehouse.
- Advantage: This ensures consistency and integration of data. All departments are working from a “single version of the truth.”
Bottom-Up Approach (Independent Data Marts):
- In this approach, data marts are built independently by pulling data directly from operational sources, without a central data warehouse.
- Each department builds its own data mart to meet its own needs.
- Advantage: Faster implementation and lower initial cost.
- Disadvantage: This can lead to “data silos,” where different data marts have inconsistent data and definitions, making enterprise-wide analysis difficult.

Typically, the data within a data mart is structured using a star schema, as it is simple, easy to understand, and provides efficient query performance for the specific department’s analytical needs.
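
As a rough illustration of the top-down approach, a dependent data mart can be thought of as a subject-specific extract of the central warehouse. The rows, the `subject` field, and the function name below are hypothetical.

```python
def build_dependent_mart(warehouse_rows, subject):
    """Copy only the rows for one subject area out of the central warehouse."""
    return [dict(row) for row in warehouse_rows if row["subject"] == subject]

# Hypothetical enterprise warehouse holding rows for several subject areas.
warehouse = [
    {"subject": "sales", "region": "North", "amount": 100.0},
    {"subject": "finance", "account": "Payroll", "amount": 900.0},
    {"subject": "sales", "region": "South", "amount": 40.0},
]

sales_mart = build_dependent_mart(warehouse, "sales")
print(sales_mart)
```

Because every mart is derived from the same warehouse rows, definitions stay consistent across departments. An independent mart would instead run its own extraction against the operational sources.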

Q5. Write short notes on the following: 4×5=20 (a) Data Lake (b) Star Schema (c) Market Basket Analysis (d) Cloud Data Warehouse

Ans.

(a) Data Lake

A data lake is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. Unlike a data warehouse, where data is transformed into a predefined schema before being stored, a data lake stores data in its native, raw format. This approach is called “schema-on-read,” meaning the data is only structured and transformed when it is needed for analysis. This provides greater flexibility and allows data scientists to use diverse datasets for tasks like machine learning and advanced analytics. Data lakes are often built on Hadoop (HDFS) or cloud storage (e.g., Amazon S3, Azure Blob Storage).
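
Schema-on-read can be illustrated with a short Python sketch. The raw records and field names are invented; the point is only that structure is applied when the data is read, not when it is stored.

```python
import json

# Hypothetical raw zone: records are stored exactly as they arrived.
RAW_ZONE = [
    '{"user": "u1", "event": "click", "ts": "2024-01-01T10:00:00"}',
    '{"user": "u2", "event": "purchase", "amount": 25.5}',
    'not valid json',
]

def read_with_schema(raw_records, fields):
    """Apply a schema at read time: parse each record and project the requested fields."""
    rows = []
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # unreadable records are skipped, not rejected at load
        if not isinstance(record, dict):
            continue
        rows.append({field: record.get(field) for field in fields})
    return rows

# Two different analyses read the same raw data with two different schemas.
purchases = read_with_schema(RAW_ZONE, ["user", "amount"])
activity = read_with_schema(RAW_ZONE, ["user", "event"])
print(purchases)
print(activity)
```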

(b) Star Schema

The star schema is the simplest and most common form of dimensional modeling in a data warehouse. Its structure resembles a star, with a large central fact table surrounded by several smaller dimension tables.

Fact Table: Contains the numerical business measurements (measures), such as ‘sales amount’, ‘units sold’, and foreign keys to the dimension tables.
Dimension Tables: Contain the descriptive, attribute-based information, like ‘product name’, ‘city’, ‘date’. These tables are de-normalized, meaning they accept redundancy to improve query performance.

The star schema is easy to understand and provides fast performance for queries generated by BI tools because it requires fewer joins.

(c) Market Basket Analysis

Market Basket Analysis is a data mining technique used to discover which items are frequently purchased together in a customer’s shopping transaction. It is very popular in the retail industry and aims to find association rules between items. A classic example is “customers who buy bread also tend to buy butter.” The results of this analysis can be used to optimize store layouts, identify cross-selling and up-selling opportunities, and create targeted marketing campaigns. One of the most famous algorithms to perform this analysis is the Apriori algorithm, which uses measures like Support, Confidence, and Lift to find the rules.
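
The three measures can be computed directly for a single rule, as in the following Python sketch. It covers only the measures, not the Apriori candidate-generation procedure itself, and the transactions are made up.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter"},
]

rule = ({"bread"}, {"butter"})
print(round(support(transactions, rule[0] | rule[1]), 2))  # 0.6
print(round(confidence(transactions, *rule), 2))           # 0.75
print(round(lift(transactions, *rule), 2))                 # 1.25
```

A lift above 1 suggests that bread buyers purchase butter more often than shoppers in general, which is what makes the rule interesting.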

(d) Cloud Data Warehouse

A cloud data warehouse is a database hosted on a cloud computing platform and delivered “as-a-service.” Unlike traditional on-premises data warehouses that require expensive hardware and maintenance, cloud data warehouses are fully managed by the provider. Leading examples include Amazon Redshift, Google BigQuery, and Snowflake.

Key benefits:

Scalability: Can easily scale up or down compute and storage resources as needed.
Cost-effectiveness: You pay only for the resources you use (pay-as-you-go), reducing upfront capital investment.
Performance: Designed to execute complex queries rapidly using massively parallel processing (MPP) architectures.
Managed Service: The provider takes care of hardware management, software updates, and security, reducing the burden on IT teams.
