본문 바로가기

빅데이터

스트림 프로세싱 with Faust - Windows

728x90

Faust는 Windowing을 쉽게 지원할 수 있는 기능을 가지고 있다. Windowing을 통해 지난 10분간 데이터의 분석 혹은 매 5분마다 1시간 간격의 데이터 분석과 같은 내용을 수행할 수 있다. 이번 포스팅에서는 Windowing을 사용하는 방법에 대해 알아보고자 한다.

 

Faust는 Hopping, Tumbling window를 지원한다.

Tumbling Windows

Tumbling window는 중복되지 않은 데이터에 대해 일정간격으로 분석할때 사용된다.

from dataclasses import asdict, dataclass
from datetime import timedelta
import json
import random

import faust


@dataclass
class ClickEvent(faust.Record):
    email: str
    timestamp: str
    uri: str
    number: int


app = faust.App("exercise7", broker="kafka://localhost:9092")
clickevents_topic = app.topic("com.udacity.streams.clickevents", value_type=ClickEvent)

uri_summary_table = app.Table("uri_summary", default=int).tumbling(
    timedelta(seconds=10)
) # Tumbling window 설정, 간격:10초


@app.agent(clickevents_topic)
async def clickevent(clickevents):
    async for ce in clickevents.group_by(ClickEvent.uri):
        uri_summary_table[ce.uri] += ce.number
        print(f"{ce.uri}: {uri_summary_table[ce.uri].current()}")


if __name__ == "__main__":
    app.main()

위 코드는 'com.udacity.streams.clickevents' topic에서 나온 데이터를 기준으로 10초 간격동안 uri의 group by당 개수를 uri_summary_table에 등록하여 print를 수행한다.

root@2f57ea8fd2c8:/home/workspace# python exercise6.7.solution.py worker
┌ƒaµS† v1.7.4─┬──────────────────────────────────────────┐
│ id          │ exercise7                                │
│ transport   │ [URL('kafka://localhost:9092')]          │
│ store       │ memory:                                  │
│ web         │ http://localhost:6066/                   │
│ log         │ -stderr- (warn)                          │
│ pid         │ 815                                      │
│ hostname    │ 2f57ea8fd2c8                             │
│ platform    │ CPython 3.7.3 (Linux x86_64)             │
│ drivers     │                                          │
│   transport │ aiokafka=1.0.6                           │
│   web       │ aiohttp=3.6.2                            │
│ datadir     │ /home/workspace/exercise7-data           │
│ appdir      │ /home/workspace/exercise7-data/v1        │
└─────────────┴──────────────────────────────────────────┘
starting➢ ◣
 😊
[2019-11-22 01:31:40,569: WARNING]: https://www.smith.com/main/tags/home/: 783 
[2019-11-22 01:31:40,625: WARNING]: https://robinson.com/app/search.php: 148 
[2019-11-22 01:31:40,627: WARNING]: http://martinez.com/tags/terms/: 267 
[2019-11-22 01:31:40,639: WARNING]: https://www.gomez.com/blog/category/main/register.html: 753 
[2019-11-22 01:31:40,641: WARNING]: https://sullivan-hamilton.com/tags/category/: 489 
[2019-11-22 01:31:40,642: WARNING]: https://eaton-butler.com/blog/wp-content/tags/about.php: 989 
[2019-11-22 01:31:40,643: WARNING]: https://weaver.com/post/: 281 
[2019-11-22 01:31:40,644: WARNING]: https://marquez.com/faq/: 202 
[2019-11-22 01:31:40,646: WARNING]: https://bradley.com/main.htm: 237 
[2019-11-22 01:31:40,646: WARNING]: https://rios.com/main/tags/tag/privacy.html: 177 
[2019-11-22 01:31:40,647: WARNING]: https://www.robinson-combs.com/: 327 
[2019-11-22 01:31:40,648: WARNING]: https://west.com/register.html: 771 
[2019-11-22 01:31:40,649: WARNING]: http://reyes.org/: 296 
[2019-11-22 01:31:40,653: WARNING]: http://www.mcdaniel.com/categories/wp-content/post.php: 662 
[2019-11-22 01:31:40,664: WARNING]: http://oliver-williams.com/author/: 718 
[2019-11-22 01:31:40,665: WARNING]: https://www.martinez-haney.com/index.asp: 15 
[2019-11-22 01:31:40,667: WARNING]: http://www.alvarez-smith.info/wp-content/terms.php: 517 
[2019-11-22 01:31:40,668: WARNING]: https://www.garcia-smith.com/tag/categories/post/: 839 
[2019-11-22 01:31:40,670: WARNING]: https://petersen.biz/: 199 
[2019-11-22 01:31:40,671: WARNING]: https://guzman-zimmerman.com/posts/posts/search/homepage.jsp: 847 
[2019-11-22 01:31:40,672: WARNING]: https://thomas.com/wp-content/blog/post/: 842 
[2019-11-22 01:31:40,673: WARNING]: http://clark.net/: 599 
[2019-11-22 01:31:40,674: WARNING]: http://www.mcclure.info/: 929 
[2019-11-22 01:31:40,675: WARNING]: http://barrett-martin.net/category/search/main/about/: 402 
[2019-11-22 01:31:40,676: WARNING]: https://www.farmer-deleon.com/posts/main/terms/: 202 
[2019-11-22 01:31:40,677: WARNING]: https://thomas.com/wp-content/blog/post/: 1748 
[2019-11-22 01:31:40,678: WARNING]: https://larson.info/faq.htm: 661 
[2019-11-22 01:31:40,678: WARNING]: https://kline.biz/main.htm: 7 
[2019-11-22 01:31:40,759: WARNING]: https://nichols-bailey.info/tags/blog/author/: 44 
[2019-11-22 01:31:40,762: WARNING]: https://www.jones-gomez.biz/homepage/: 39 
[2019-11-22 01:31:40,762: WARNING]: https://hood.com/search.html: 807 
[2019-11-22 01:31:40,763: WARNING]: http://johnson.com/tag/explore/category/: 414 
[2019-11-22 01:31:40,764: WARNING]: https://www.davis.info/register.asp: 677 
[2019-11-22 01:31:40,766: WARNING]: http://www.wright.net/register/: 291 
[2019-11-22 01:31:40,767: WARNING]: https://www.allen.info/blog/main/main/: 724 
[2019-11-22 01:31:40,768: WARNING]: http://wilkins.com/: 242 
[2019-11-22 01:31:40,769: WARNING]: http://www.moran-french.org/terms/: 357 
[2019-11-22 01:31:40,770: WARNING]: https://bailey.org/: 124 
[2019-11-22 01:31:40,771: WARNING]: http://chavez-thompson.info/main/: 190 
[2019-11-22 01:31:40,773: WARNING]: http://barrett-martin.net/category/search/main/about/: 1221 
[2019-11-22 01:31:40,774: WARNING]: https://www.lopez.net/category/index/: 959 
[2019-11-22 01:31:40,775: WARNING]: https://www.price.org/list/categories/blog/terms.htm: 682 
[2019-11-22 01:31:40,776: WARNING]: http://www.spears.net/home.html: 424 

 

Hopping Windows

Hopping window는 지난 x시간 동안의 데이터를 y간격을 반복하여 데이터를 분석하는 것이다. y간격은 x시간보다 길 수 없으며 일부 데이터는 2개의 window간에 중복되어 처리된다. 

from dataclasses import asdict, dataclass
from datetime import timedelta
import json
import random

import faust


@dataclass
class ClickEvent(faust.Record):
    email: str
    timestamp: str
    uri: str
    number: int


app = faust.App("exercise8", broker="kafka://localhost:9092")
clickevents_topic = app.topic("com.udacity.streams.clickevents", value_type=ClickEvent)

uri_summary_table = app.Table("uri_summary", default=int).hopping(
    size=timedelta(minutes=1),
    step=timedelta(seconds=10)
) # hopping window선언을 위한 size와 step


@app.agent(clickevents_topic)
async def clickevent(clickevents):
    async for ce in clickevents.group_by(ClickEvent.uri):
        uri_summary_table[ce.uri] += ce.number
        print(f"{ce.uri}: {uri_summary_table[ce.uri].current()}")


if __name__ == "__main__":
    app.main()

위 코드는 10초 간격으로 1분 동안의 데이터에 대해 분석하는 코드이다.

root@68f789ccd9f1:/home/workspace# python exercise6.8.solution.py worker
┌ƒaµS† v1.7.4─┬──────────────────────────────────────────┐
│ id          │ exercise8                                │
│ transport   │ [URL('kafka://localhost:9092')]          │
│ store       │ memory:                                  │
│ web         │ http://localhost:6066/                   │
│ log         │ -stderr- (warn)                          │
│ pid         │ 818                                      │
│ hostname    │ 68f789ccd9f1                             │
│ platform    │ CPython 3.7.3 (Linux x86_64)             │
│ drivers     │                                          │
│   transport │ aiokafka=1.0.6                           │
│   web       │ aiohttp=3.6.2                            │
│ datadir     │ /home/workspace/exercise8-data           │
│ appdir      │ /home/workspace/exercise8-data/v1        │
└─────────────┴──────────────────────────────────────────┘
starting➢ ◠
 😊
[2019-11-22 01:38:32,252: WARNING]: https://wise.com/: 953 
[2019-11-22 01:38:32,317: WARNING]: https://www.coleman-edwards.com/: 214 
[2019-11-22 01:38:32,321: WARNING]: http://www.calderon.com/index/: 559 
[2019-11-22 01:38:32,323: WARNING]: https://khan-butler.info/app/about.html: 938 
[2019-11-22 01:38:32,326: WARNING]: http://jackson.org/category/index/: 739 
[2019-11-22 01:38:32,335: WARNING]: http://www.benton.com/explore/privacy.html: 377 
[2019-11-22 01:38:32,340: WARNING]: http://herman.biz/search/: 871 
[2019-11-22 01:38:32,342: WARNING]: https://burch-holt.com/post/: 335 
[2019-11-22 01:38:32,344: WARNING]: https://www.molina-rivera.com/: 198 
[2019-11-22 01:38:32,346: WARNING]: http://www.maldonado.info/main.php: 462 
[2019-11-22 01:38:32,348: WARNING]: http://www.patterson.com/app/list/register.asp: 542 
[2019-11-22 01:38:32,352: WARNING]: https://www.west.info/main/categories/posts/faq.htm: 676 
[2019-11-22 01:38:32,355: WARNING]: http://white-miller.com/search/homepage/: 259 
[2019-11-22 01:38:32,357: WARNING]: https://smith.com/login/: 498 
[2019-11-22 01:38:32,360: WARNING]: http://mays.com/: 715 
[2019-11-22 01:38:32,361: WARNING]: https://roberts.com/login.htm: 446 
[2019-11-22 01:38:32,363: WARNING]: https://everett-lee.com/author/: 857 
[2019-11-22 01:38:32,365: WARNING]: https://cruz.com/wp-content/app/login/: 250 
[2019-11-22 01:38:32,368: WARNING]: http://www.lee.com/privacy.htm: 862 
[2019-11-22 01:38:32,370: WARNING]: https://wallace.com/post.php: 456 
[2019-11-22 01:38:32,372: WARNING]: https://www.atkins.com/search.jsp: 995 
[2019-11-22 01:38:32,374: WARNING]: https://www.chan.com/search/main/blog/category/: 659 
[2019-11-22 01:38:32,375: WARNING]: https://www.wright.com/category/login/: 461 
[2019-11-22 01:38:32,377: WARNING]: https://martinez.biz/home/: 877 
[2019-11-22 01:38:32,378: WARNING]: http://www.keith.com/terms/: 873 
[2019-11-22 01:38:32,380: WARNING]: http://www.hamilton-lee.net/: 735 
[2019-11-22 01:38:32,383: WARNING]: http://rice.org/index/: 946 
[2019-11-22 01:38:32,385: WARNING]: http://www.carpenter.com/categories/blog/posts/author/: 497 
[2019-11-22 01:38:32,387: WARNING]: https://chang.com/blog/explore/blog/terms/: 861 
[2019-11-22 01:38:32,388: WARNING]: https://adams-brown.biz/explore/tag/tag/post.php: 293 
[2019-11-22 01:38:32,390: WARNING]: https://ayala.com/: 622 
[2019-11-22 01:38:32,392: WARNING]: https://www.wilson-harris.com/category/wp-content/terms.php: 790 
[2019-11-22 01:38:32,395: WARNING]: https://erickson-hamilton.com/login.htm: 405 
[2019-11-22 01:38:32,396: WARNING]: http://www.herrera-navarro.info/explore/explore/login/: 234 
[2019-11-22 01:38:32,400: WARNING]: https://www.miller.com/homepage/: 45 
[2019-11-22 01:38:32,402: WARNING]: https://www.morales.com/search.htm: 196 
[2019-11-22 01:38:32,404: WARNING]: http://hardin.org/index.htm: 314 
[2019-11-22 01:38:32,406: WARNING]: http://www.davenport-ryan.com/: 976 
[2019-11-22 01:38:32,407: WARNING]: https://moore.info/main/wp-content/categories/search.html: 231 
[2019-11-22 01:38:32,409: WARNING]: http://ross-gutierrez.org/: 569 
[2019-11-22 01:38:32,410: WARNING]: https://jordan-oliver.net/tags/search/search/about.html: 895 
[2019-11-22 01:38:32,412: WARNING]: http://www.smith.info/search/wp-content/category/privacy.php: 488 
[2019-11-22 01:38:32,415: WARNING]: http://www.garcia.com/: 984 
[2019-11-22 01:38:32,416: WARNING]: http://ibarra-reese.com/blog/wp-content/posts/homepage/: 964 
[2019-11-22 01:38:32,418: WARNING]: http://www.hopkins.com/: 946 
[2019-11-22 01:38:32,420: WARNING]: http://www.stephens-barnes.biz/about/: 389 
[2019-11-22 01:38:32,421: WARNING]: https://www.molina-rivera.com/: 762 
[2019-11-22 01:38:32,422: WARNING]: http://www.reynolds.com/search/app/app/index/: 141 
[2019-11-22 01:38:32,425: WARNING]: https://www.dyer.info/explore/search/main/main/: 23 
[2019-11-22 01:38:32,426: WARNING]: http://www.white-carpenter.com/explore/main/posts/home/: 627 
[2019-11-22 01:38:32,428: WARNING]: https://juarez.com/categories/main/: 796 
[2019-11-22 01:38:32,431: WARNING]: https://www.robles.com/app/homepage.html: 22 
[2019-11-22 01:38:32,437: WARNING]: https://www.cline-wilson.biz/main.html: 909 
[2019-11-22 01:38:32,448: WARNING]: http://www.fuentes-holland.com/blog/category/blog/category/: 811 
[2019-11-22 01:38:32,450: WARNING]: http://www.maldonado.info/main.php: 595 
[2019-11-22 01:38:32,452: WARNING]: http://bell.com/explore/posts/wp-content/category/: 261 

 

728x90

태그